Deep learning approaches have significantly advanced fundamental tasks in natural language processing (NLP), such as text-to-speech and speech-to-text translation. In this study, we combine Inception V3 and LSTM (Long Short-Term Memory) neural networks to propose an ensemble technique for text-to-speech and speech-to-text translation. Inception V3 is a state-of-the-art convolutional neural network (CNN) architecture known for its ability to extract rich and meaningful features from images. By adapting Inception V3 for acoustic feature extraction from speech signals, we exploit its capability to capture information relevant to speech-to-text translation. We train the Inception V3 model on a large dataset of speech samples, allowing it to learn discriminative acoustic patterns. To handle the sequential nature of text-to-speech translation, we incorporate an LSTM, a kind of recurrent neural network (RNN), into our ensemble. LSTMs excel at capturing long-term dependencies in sequences, which makes them well suited to producing natural, cohesive-sounding speech from textual input. The LSTM model is trained on a parallel corpus of text and corresponding speech signals, leveraging the power of sequential modelling. We combine the predictions of the Inception V3 and LSTM models through an ensemble approach, in which the outputs of both models are averaged or fused to generate the final translations. This ensemble strategy exploits the complementary strengths of CNNs and RNNs, leading to improved translation accuracy and quality. Experiments conducted on benchmark datasets show that our ensemble method for text-to-speech and speech-to-text translation is effective: the ensemble achieves significant gains in accuracy and fluency over the individual models. Furthermore, our approach is robust across different languages and input variations. In conclusion, our proposed ensemble of Inception V3 and LSTM models presents a powerful solution for translation from text to speech and from speech to text. By leveraging the strengths of CNNs and RNNs, we achieve superior performance in accurately transcribing speech and generating natural-sounding speech from text.
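
To make the fusion step concrete, the following is a minimal PyTorch sketch of the averaging-based ensemble described above. The vocabulary size, feature dimensions, LSTM depth, and fusion weight are illustrative assumptions, not values reported here; the paper's own training pipeline is not shown.

```python
import torch
import torch.nn as nn
import torchvision.models as tvm

NUM_CLASSES = 40  # hypothetical output vocabulary size (assumption)

# CNN branch: Inception V3 adapted to spectrogram "images".
# Inception V3 expects 3-channel 299x299 inputs, so mel-spectrograms
# would be resized/tiled to that shape before being fed in.
cnn = tvm.inception_v3(weights=None, aux_logits=False, num_classes=NUM_CLASSES)
cnn.eval()

# RNN branch: an LSTM over per-frame acoustic features.
class LSTMBranch(nn.Module):
    def __init__(self, n_feats=80, hidden=256, n_classes=NUM_CLASSES):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                # x: (batch, time, n_feats)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # classify from the final time step

rnn = LSTMBranch()
rnn.eval()

# Ensemble fusion: average the softmax outputs of both branches.
# alpha=0.5 gives the plain average mentioned in the abstract; other
# weightings are one common "fused" alternative.
def ensemble_predict(spectrogram_img, frame_feats, alpha=0.5):
    with torch.no_grad():
        p_cnn = torch.softmax(cnn(spectrogram_img), dim=-1)
        p_rnn = torch.softmax(rnn(frame_feats), dim=-1)
        fused = alpha * p_cnn + (1.0 - alpha) * p_rnn
    return fused.argmax(dim=-1)

# Example with random stand-in inputs.
img = torch.randn(1, 3, 299, 299)   # a spectrogram rendered as an image
feats = torch.randn(1, 120, 80)     # 120 frames of 80-dim acoustic features
print(ensemble_predict(img, feats))
```

Averaging class probabilities rather than raw logits keeps the two branches on a comparable scale; in practice the fusion weight would be tuned on a validation set.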