The field of emotion recognition in artificial intelligence focuses on enabling machines to comprehend and respond to the range of emotions experienced by humans. This paper presents a novel approach that integrates Convolutional Neural Networks (CNNs) across audio and visual modalities. The study employs the RAVDESS dataset to train two distinct models for the analysis of video and audio data. For audio pre-processing, signal-processing techniques are applied to extract relevant features and capture fundamental acoustic characteristics. The audio features are fed into a one-dimensional CNN architecture, enabling the model to learn complex patterns and representations from the audio domain. For video pre-processing, facial-feature extraction algorithms isolate the essential facial regions. To capture the temporal dynamics of facial expressions, the video frames are compressed, converted to grayscale, and then analyzed with a three-dimensional CNN. The fusion stage concatenates the outputs of the audio and visual models, and the fused features are passed to a softmax layer, yielding a robust emotion recognition system.
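The fusion step described above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the embedding sizes (128 for the audio branch, 256 for the video branch) and the random weights are hypothetical placeholders, while the eight output classes correspond to the eight emotion categories in RAVDESS (neutral, calm, happy, sad, angry, fearful, disgust, surprised).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the class logits.
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical branch outputs: in the full system these would come from
# the 1-D CNN (audio) and 3-D CNN (video) described above.
audio_emb = rng.standard_normal(128)   # audio-branch feature vector
video_emb = rng.standard_normal(256)   # video-branch feature vector

# Fusion: concatenate the two feature vectors into one joint representation.
fused = np.concatenate([audio_emb, video_emb])  # shape (384,)

# Final classification layer over the 8 RAVDESS emotion classes
# (weights here are random placeholders, not trained parameters).
n_classes = 8
W = rng.standard_normal((n_classes, fused.size)) * 0.01
b = np.zeros(n_classes)
probs = softmax(W @ fused + b)   # one probability per emotion class
```

In a trained system the argmax of `probs` would be the predicted emotion; here the weights are random, so only the shapes and the fusion mechanics are meaningful.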