Language detection and localization in audio is an essential task in Natural Language Processing because it is the starting point of the NLP pipeline: once the language in an audio recording has been detected and localized, the recording can be segmented and passed to downstream tasks such as speech-to-text conversion, sentiment analysis, and speech recognition in that particular language. It also supports goals such as the targeted extraction of specific references from an audio recording, regardless of their language. Our work is a step towards automatically generating summaries from any given audio. We propose a novel approach to detecting and localizing the language in audio using a hybrid CNN-LSTM architecture. We achieved approximately 93% accuracy in detecting a language and classifying it in the local context, on both native languages (Urdu, Sindhi, and Pushto) and international languages (English and Arabic).