The field of emotion recognition in artificial intelligence focuses on enabling machines to comprehend and respond to the range of emotions experienced by humans. This paper presents a novel approach that integrates Convolutional Neural Networks (CNNs) across audio and visual modalities. The study employs the RAVDESS dataset to train two distinct models for the analysis of video and audio data. For audio pre-processing, signal-processing techniques are applied to extract relevant features and capture fundamental acoustic characteristics. The audio features are fed into a one-dimensional CNN architecture, enabling the model to learn complex patterns and representations from the audio domain. For video pre-processing, facial-feature extraction algorithms isolate the essential facial regions. To capture the temporal dynamics of facial expressions, the video frames are compressed, converted to grayscale, and then analyzed with a three-dimensional CNN. The fusion stage concatenates the outputs of the audio and visual models, and the fused features are passed to a softmax layer, yielding a robust emotion recognition system.
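The fusion step described above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the embedding sizes (128 for the audio branch, 256 for the video branch) and the random weights are hypothetical placeholders, while the eight output classes correspond to the eight emotion categories in RAVDESS (neutral, calm, happy, sad, angry, fearful, disgust, surprised).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the class logits.
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical branch outputs: in the full system these would come from
# the 1-D CNN (audio) and 3-D CNN (video) described above.
audio_emb = rng.standard_normal(128)   # audio-branch feature vector
video_emb = rng.standard_normal(256)   # video-branch feature vector

# Fusion: concatenate the two feature vectors into one joint representation.
fused = np.concatenate([audio_emb, video_emb])  # shape (384,)

# Final classification layer over the 8 RAVDESS emotion classes
# (weights here are random placeholders, not trained parameters).
n_classes = 8
W = rng.standard_normal((n_classes, fused.size)) * 0.01
b = np.zeros(n_classes)
probs = softmax(W @ fused + b)   # one probability per emotion class
```

In a trained system the argmax of `probs` would be the predicted emotion; here the weights are random, so only the shapes and the fusion mechanics are meaningful.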