Prominent methods based on Tacotron2 and subsequent models have improved the quality of synthesized speech. However, most data-driven Text-To-Speech (TTS) synthesis methods aim only for reasonable neutral prosody, so the synthesized speech lacks expressiveness. In this paper, we propose a method that fuses acoustic and textual emotional features to produce more vivid and realistic speech. Specifically, to obtain acoustic features, two acoustic encoders are used to extract utterance-level and phoneme-level vectors from the target speech, respectively. To obtain objective sentiment features of the text, a sentiment analysis model is used to extract a sentiment vector from the text, which is then expanded. The expanded vector is feature-fused with the output vector of the acoustic model. Experimental results on the LJSpeech dataset show that the proposed method achieves MOS scores of 3.63 for naturalness and 3.45 for expressiveness, and an SMOS similarity score of 4.14.
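To make the fusion step concrete, the sketch below shows one plausible realization: an utterance-level sentiment vector is expanded along the time axis to match the acoustic model's output sequence, concatenated with it, and projected back to the original dimension. This is a minimal illustration in PyTorch, not the authors' implementation; the module name, dimensions, and concatenation-then-projection scheme are assumptions for demonstration only.

```python
# Illustrative sketch (not the paper's code): feature-fusing an expanded
# text-sentiment vector with acoustic-model output. All names and
# dimensions are hypothetical.
import torch
import torch.nn as nn

class SentimentFusion(nn.Module):
    def __init__(self, acoustic_dim=256, sentiment_dim=64):
        super().__init__()
        # Project the concatenated features back to the acoustic dimension
        # so downstream decoder layers remain unchanged.
        self.proj = nn.Linear(acoustic_dim + sentiment_dim, acoustic_dim)

    def forward(self, acoustic_out, sentiment_vec):
        # acoustic_out:  (batch, time, acoustic_dim)  -- acoustic model output
        # sentiment_vec: (batch, sentiment_dim)       -- one vector per utterance
        # "Expand" the utterance-level sentiment vector along the time axis
        # so it aligns with every frame/phoneme position.
        expanded = sentiment_vec.unsqueeze(1).expand(-1, acoustic_out.size(1), -1)
        fused = torch.cat([acoustic_out, expanded], dim=-1)  # feature fusion
        return self.proj(fused)

# Usage with dummy tensors
fusion = SentimentFusion()
acoustic = torch.randn(2, 100, 256)   # e.g., encoder output sequence
sentiment = torch.randn(2, 64)        # utterance-level sentiment embedding
out = fusion(acoustic, sentiment)     # shape: (2, 100, 256)
```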