Human-Computer Interaction (HCI) relies on accurate identification of emotions in speech. Speech Emotion Recognition (SER) analyzes voice signals to classify the speaker's emotional state. English-language SER has been studied extensively, whereas Bangla SER remains comparatively unexplored. This study integrates a one-dimensional convolutional neural network (1D-CNN) with long short-term memory (LSTM) layers, followed by a fully connected network, for SER. This design captures the complementary local and temporal features that speech emotion classification requires. To improve the robustness of the training data, we apply Additive White Gaussian Noise (AWGN), time stretching, and pitch shifting as augmentation. The features analyzed in this study are Mel-frequency cepstral coefficients (MFCCs), Mel-spectrograms, zero-crossing rate (ZCR), chromagrams, and root mean square (RMS) energy. In our model, 1D-CNN blocks extract local patterns, while LSTM layers capture long-range temporal dependencies. The model is evaluated using training and testing loss curves, confusion matrices, recall, precision, F1-score, and accuracy. We assess it on two benchmark datasets: the SUST Bangla Emotional Speech Corpus (SUBESCO) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Experimental results show that the proposed BSER model is more robust than baseline models on both datasets. This work advances Bangla SER research and demonstrates that our hybrid model can reliably detect and classify emotions in speech input.
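As a minimal sketch of two of the ingredients mentioned above, the snippet below implements AWGN augmentation at a target signal-to-noise ratio and the zero-crossing rate feature in plain NumPy. The function names and the SNR parameterization are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(signal)
    return np.mean(signs[1:] != signs[:-1])

# Example: a 440 Hz sine tone sampled at 16 kHz for one second.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_awgn(clean, snr_db=20)  # augmented copy at ~20 dB SNR
```

A pure tone at frequency f crosses zero about 2f times per second, so the ZCR of the clean tone above is roughly 2 × 440 / 16000 ≈ 0.055; MFCCs, Mel-spectrograms, and chromagrams would typically come from a dedicated audio library rather than hand-rolled code.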