There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of the fundamental frequency ($f_{\mathrm{o}}$) and speech rate (SR). To this end, we propose Harmonic-Net and Harmonic-Net+, which introduce two extensions into the HiFi-GAN generator. The first is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to $f_{\mathrm{o}}$. The second is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which flexibly changes its receptive field depending on the input $f_{\mathrm{o}}$, allowing the upsampling-based HiFi-GAN generator to handle large fluctuations in $f_{\mathrm{o}}$. The proposed explicit input of $f_{\mathrm{o}}$-dependent excitation signals and the LW-PDCNNs are expected to realize high-quality synthesis under the normal, $f_{\mathrm{o}}$-conversion, and SR-conversion conditions. Experiments on unseen-speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to $f_{\mathrm{o}}$ achieves higher synthesis quality than conventional methods under all (i.e., normal, $f_{\mathrm{o}}$-conversion, and SR-conversion) conditions.