Speech synthesis converts the text of an input utterance into a speech signal that contains phonemes, words, and utterances. Existing speech synthesis methods treat the utterance as a whole, which makes it difficult to accurately synthesize speech signals of different lengths. In this paper, we analyze the hierarchical relationships embedded in speech signals, design a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and propose a speech synthesis model based on a hierarchical text-speech Conformer. First, the model constructs a hierarchical text encoder according to the length of the input text signal, comprising phoneme-level, word-level, and utterance-level text encoders. The text encoder at each level describes text information of a different length and uses the Conformer's attention mechanism to learn the relationships between features at different time steps within a signal of that length. The hierarchical text encoder identifies the information that needs to be emphasized at each length within the utterance and extracts text features at different lengths effectively, which alleviates the uncertainty in the duration of the synthesized speech signal. Second, the hierarchical speech encoder comprises phoneme-level, word-level, and utterance-level speech encoders. At each level, the speech encoder uses the text features as the query vectors of the Conformer and the speech features as its key and value vectors, in order to extract the matching relationship between text features and speech features. The hierarchical speech encoder and the text-speech matching relationship together alleviate the inaccurate synthesis of speech signals of different lengths. The proposed hierarchical text-speech encoder can be flexibly embedded into a variety of existing decoders and provides more reliable synthesis results through the complementarity between text and speech. Experiments on two datasets, LJSpeech and LibriTTS, show that the mel cepstral distortion of the proposed method is lower than that of existing speech synthesis methods.
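The text-to-speech matching step described above, with text features as queries and speech features as keys and values, can be illustrated with plain scaled dot-product cross-attention. The following is only a minimal sketch under stated assumptions, not the model proposed in the paper: it omits the learned projections, multiple heads, and convolution modules of a real Conformer block, and the names `cross_attention`, `text_feats`, and `speech_feats` are hypothetical. It uses only the Python standard library so that the arithmetic stays visible.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_feats, speech_feats):
    """Scaled dot-product attention with text as queries, speech as keys/values.

    text_feats:   list of text-unit vectors (queries), e.g. one per phoneme
    speech_feats: list of speech-frame vectors (used as both keys and values)
    Returns (outputs, weights): one speech summary vector and one row of
    attention weights per text unit.
    Assumption: no learned projection matrices, unlike a trained model.
    """
    d = len(text_feats[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in text_feats:
        # Similarity between this text unit and every speech frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in speech_feats]
        w = softmax(scores)
        # Weighted average of the speech frames for this text unit.
        out = [sum(wj * v[i] for wj, v in zip(w, speech_feats))
               for i in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: 2 text units, 3 speech frames, feature size 2.
text = [[1.0, 0.0], [0.0, 1.0]]
speech = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
outputs, weights = cross_attention(text, speech)
```

In this toy input the first text unit places most of its attention on the first speech frame, while the second splits its attention evenly between the two identical later frames. Applying the same operation to phoneme-level, word-level, and utterance-level segments would give one set of matching weights per level, which is the hierarchical idea the abstract describes.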