Brain-computer interfaces (BCIs) based on EEG have attracted extensive research attention worldwide, with motor imagery (MI), mental arithmetic (MA), and P300 event-related potentials among the most commonly used paradigms. The Vision Transformer (ViT) is a recent Transformer model with stronger global processing capability than Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In this study, we propose a hybrid CNN-Transformer model that first applies a CNN to convolve the EEG signals in time and space and then applies a ViT for global processing. The model is optimized using 10 runs of 10-fold cross-validation ($10 \times 10$-fold) and validated on a publicly available dataset of 29 subjects. Final accuracies of 87.23% and 90.79% were achieved on the MI and MA tasks, respectively, which are higher than the classification accuracies reported in the compared literature.
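To make the evaluation protocol concrete, the following is a minimal sketch of how 10 runs of 10-fold cross-validation could be organized. It is an illustration under stated assumptions, not the implementation used in this study: the function names `repeated_kfold` and `mean_accuracy`, the `evaluate` callback, the seed, and the trial count are all hypothetical, and the sketch does not assume whether splits are drawn per subject or across subjects.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_runs=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for n_runs reshuffled k-fold splits.

    Hypothetical helper: the partitioning scheme is an assumption, not the
    procedure confirmed by the study.
    """
    rng = random.Random(seed)
    for run in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # draw a new random partition for each run
        # Strided slicing gives folds whose sizes differ by at most one.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            # Training set: every sample that is not in the held-out fold.
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield run, fold, train_idx, test_idx


def mean_accuracy(n_samples, evaluate, **kwargs):
    """Average the score returned by `evaluate(train_idx, test_idx)` over all splits.

    `evaluate` is a hypothetical callback that would train a model on the
    training indices and return its accuracy on the test indices.
    """
    scores = [
        evaluate(train_idx, test_idx)
        for _, _, train_idx, test_idx in repeated_kfold(n_samples, **kwargs)
    ]
    return sum(scores) / len(scores)
```

With the default settings, each trial is held out exactly once per run, so the reported figure is the mean over 100 train/test splits. Reshuffling between runs reduces the dependence of that mean on any single random partition.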