Steady-state visual evoked potentials (SSVEPs) are widely used in brain-computer interfaces (BCIs) because of their stability and high performance. However, extracting sufficiently rich features for target recognition within a deep learning framework remains difficult. To address this, this paper introduces a multi-scale spatio-temporal convolutional neural network (MS-STCNN). The network applies multi-scale spatial and temporal filters to both the reference signals and the SSVEP signals, which yields richer feature information and improves model performance. SSVEP detection is performed by extracting spatio-temporal information at different scales and then combining, with distinct weights, the per-category correlation scores produced by the final correlation analysis layer. The effectiveness of the proposed method is evaluated experimentally on the Benchmark dataset. The results show that the proposed method significantly improves SSVEP classification performance.
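
The weighted combination of per-scale correlation scores can be illustrated with a small NumPy sketch. It is not the MS-STCNN itself, and it rests on several assumptions that are not in the paper: a single-channel signal, fixed moving-average kernels standing in for learned multi-scale temporal filters, sine-cosine templates as reference signals, and hand-picked scale weights in place of learned ones. All function names and parameter values here are hypothetical.

```python
import numpy as np

def reference_signals(freq, n_samples, fs, n_harmonics=3):
    """Sine-cosine templates for one stimulus frequency (assumed reference form)."""
    t = np.arange(n_samples) / fs
    rows = []
    for h in range(1, n_harmonics + 1):
        rows.append(np.sin(2 * np.pi * h * freq * t))
        rows.append(np.cos(2 * np.pi * h * freq * t))
    return np.stack(rows)  # shape: (2 * n_harmonics, n_samples)

def smooth(x, k):
    """Moving-average filter of length k; a fixed stand-in for a learned temporal filter."""
    return np.convolve(x, np.ones(k) / k, mode="same")

def correlation_score(x, ref):
    """Largest absolute Pearson correlation between x and any row of ref."""
    x = x - x.mean()
    ref = ref - ref.mean(axis=1, keepdims=True)
    num = ref @ x
    den = np.linalg.norm(ref, axis=1) * np.linalg.norm(x)
    return float(np.max(np.abs(num / den)))

def classify(x, freqs, fs, kernel_sizes=(1, 3, 5), weights=(0.5, 0.3, 0.2)):
    """Score every candidate frequency at every scale, then combine scales by weight."""
    scores = np.zeros((len(kernel_sizes), len(freqs)))
    for i, k in enumerate(kernel_sizes):
        xs = smooth(x, k)  # filter the SSVEP signal at this scale
        for j, f in enumerate(freqs):
            # filter the reference signals with the same kernel
            ref = np.stack([smooth(r, k) for r in reference_signals(f, x.size, fs)])
            scores[i, j] = correlation_score(xs, ref)
    combined = np.asarray(weights, dtype=float) @ scores  # shape: (n_classes,)
    return int(np.argmax(combined)), combined
```

For example, a one-second synthetic 10 Hz signal sampled at 250 Hz can be passed as `classify(x, [8, 10, 12, 15], fs=250)`. In the actual network the filters and combination weights would be learned from data rather than fixed by hand.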