Prominent methods based on Tacotron2 and subsequent models have improved the quality of synthesized speech. However, most data-driven Text-To-Speech (TTS) synthesis methods aim only for reasonable neutral prosody, so the synthesized speech lacks expressiveness. In this paper, we propose a method that fuses acoustic and textual emotional features to produce more vivid and realistic speech. Specifically, to obtain acoustic features, two acoustic encoders are used to extract utterance-level and phoneme-level vectors from the target speech, respectively. To obtain objective sentiment features of the text, a sentiment analysis model is used to extract a sentiment vector from the text, which is then expanded. The expanded vector is feature-fused with the output vector of the acoustic model. Experimental results on the LJSpeech dataset show that the proposed method achieves MOS scores of 3.63 for naturalness and 3.45 for expressiveness, and an SMOS similarity score of 4.14.
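To make the fusion step concrete, the sketch below shows one plausible realization: an utterance-level sentiment vector is expanded along the time axis to match the acoustic model's output sequence, concatenated with it, and projected back to the original dimension. This is a minimal illustration in PyTorch, not the authors' implementation; the module name, dimensions, and concatenation-then-projection scheme are assumptions for demonstration only.

```python
# Illustrative sketch (not the paper's code): feature-fusing an expanded
# text-sentiment vector with acoustic-model output. All names and
# dimensions are hypothetical.
import torch
import torch.nn as nn

class SentimentFusion(nn.Module):
    def __init__(self, acoustic_dim=256, sentiment_dim=64):
        super().__init__()
        # Project the concatenated features back to the acoustic dimension
        # so downstream decoder layers remain unchanged.
        self.proj = nn.Linear(acoustic_dim + sentiment_dim, acoustic_dim)

    def forward(self, acoustic_out, sentiment_vec):
        # acoustic_out:  (batch, time, acoustic_dim)  -- acoustic model output
        # sentiment_vec: (batch, sentiment_dim)       -- one vector per utterance
        # "Expand" the utterance-level sentiment vector along the time axis
        # so it aligns with every frame/phoneme position.
        expanded = sentiment_vec.unsqueeze(1).expand(-1, acoustic_out.size(1), -1)
        fused = torch.cat([acoustic_out, expanded], dim=-1)  # feature fusion
        return self.proj(fused)

# Usage with dummy tensors
fusion = SentimentFusion()
acoustic = torch.randn(2, 100, 256)   # e.g., encoder output sequence
sentiment = torch.randn(2, 64)        # utterance-level sentiment embedding
out = fusion(acoustic, sentiment)     # shape: (2, 100, 256)
```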