We propose and study a novel multi-temporal CNN architecture for end-to-end audio-scene classification (ASC) from the raw audio signal. Conventional CNNs use fixed-size kernels (whether for image or 1-D signal classification), which corresponds to applying a filter bank in which each filter has a fixed time-frequency resolution (i.e., a fixed-duration impulse response and a fixed-bandwidth frequency response) and, importantly, a single specific time-frequency trade-off. In contrast, to allow for multiple time-frequency resolutions, we use a multi-temporal CNN architecture with multiple kernel branches (up to 12), each of a different length. This yields multiple filter banks with different time-frequency resolutions that process the input raw audio signal and produce feature maps corresponding to different time-frequency trade-offs (e.g., ranging from very narrow-band to very wide-band spectrographic maps in fine steps of time-frequency resolution). Applied to end-to-end audio-scene classification, this architecture offers consistent and significant performance gains (e.g., 11-15% absolute accuracy improvement for the 12-branch multi-temporal case) over the conventional single-temporal CNN, and it also outperforms state-of-the-art results for this task.
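The core idea — parallel branches whose kernels have different lengths, so each branch realizes a different time-frequency trade-off on the raw waveform — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name, kernel lengths, filter counts, and random filter initialisation below are all illustrative assumptions; a trained network would learn the filter coefficients.

```python
import numpy as np

def multi_temporal_features(signal, kernel_lengths=(16, 64, 256),
                            filters_per_branch=4, seed=0):
    """Hypothetical sketch of a multi-temporal front end.

    Each branch is a filter bank whose kernel length sets its
    time-frequency trade-off: short kernels give fine time / coarse
    frequency resolution (wide-band maps), long kernels the reverse
    (narrow-band maps). Filters here are random stand-ins for
    learned CNN weights.
    """
    rng = np.random.default_rng(seed)
    branches = []
    for length in kernel_lengths:
        kernels = rng.standard_normal((filters_per_branch, length))
        # 'same'-mode convolution keeps the time axis aligned across branches
        fmap = np.stack([np.convolve(signal, k, mode="same") for k in kernels])
        # ReLU non-linearity, as is standard after a convolutional layer
        branches.append(np.maximum(fmap, 0.0))
    return branches  # one (filters_per_branch, len(signal)) map per branch

# Usage: three branches of feature maps from one raw-audio segment
x = np.random.default_rng(1).standard_normal(1000)
maps = multi_temporal_features(x)
```

In a full classifier, the per-branch feature maps would be passed through further convolutional and pooling layers and then merged (e.g., by concatenation) before the classification head.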