In today's highly interactive human-computer environment, speech synthesis is used in many scenarios, and the demands on the prosody of synthesized speech keep rising, so prosody-controllable models have become a research hotspot. Most current prosody-controllable models generate reference features with a separate neural network, but this approach requires training a more complex neural network model and needs reference audio to achieve explicit prosody control. This paper proposes a prosody-control solution based on an end-to-end acoustic model, addressing the inability of existing models to precisely control pitch at the word level. The proposed model includes a pitch-control module that obtains duration information from the MFA alignment tool and adjusts the pitch of each word using word-level pitch control values together with that duration information. The acoustic model is improved by introducing pitch control during the generation of acoustic features and, in combination with the decoder, generates more robust audio. In addition, to adjust the overall pitch of the audio, the pitch values of all frames are multiplied by a fixed coefficient. Furthermore, this paper also proposes a 48 kHz ultra-high-fidelity audio model, obtained by increasing the spectral parameter dimensions and the upsampling factor.
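The word-level and global pitch adjustments described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `apply_pitch_control` and the multiplicative form of the word-level control values are assumptions; it only shows how word-level scales could be expanded to frame level using the durations from alignment, then combined with a fixed global coefficient.

```python
import numpy as np

def apply_pitch_control(frame_pitch, word_durations, word_scales, global_scale=1.0):
    """Adjust frame-level pitch per word, then apply a global coefficient.

    frame_pitch:    per-frame pitch values (Hz); length must equal sum(word_durations)
    word_durations: number of frames per word (e.g. derived from MFA alignment)
    word_scales:    multiplicative pitch control value for each word (assumed form)
    global_scale:   fixed coefficient applied to all frames to shift overall pitch
    """
    frame_pitch = np.asarray(frame_pitch, dtype=float)
    if len(word_durations) != len(word_scales):
        raise ValueError("need one control value per word")
    if sum(word_durations) != len(frame_pitch):
        raise ValueError("durations must cover all frames")
    # Expand word-level control values to frame level using the durations.
    frame_scales = np.repeat(np.asarray(word_scales, dtype=float), word_durations)
    # Apply per-word control, then the fixed global coefficient.
    return frame_pitch * frame_scales * global_scale

# Example: raise the pitch of the first word by 50%, leave the second unchanged.
adjusted = apply_pitch_control(
    frame_pitch=[100.0, 100.0, 200.0, 200.0],
    word_durations=[2, 2],
    word_scales=[1.5, 1.0],
)
# → [150.0, 150.0, 200.0, 200.0]
```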