Eye movement recognition has emerged as a pivotal research focus, particularly in fields such as human-computer interaction, healthcare diagnostics, and adaptive technologies, owing to its potential to enhance quality of life, especially for people with physical impairments. However, deep learning models that rely on non-intrusive cameras to recognize and classify eye movements are hindered by environmental, physiological, and technical factors, including unpredictable lighting, noise, head movements, and inherent differences among individuals. In response to these challenges, this study presents an in-depth comparison of the ViT vit-base-patch16-224-in21k model against traditional deep learning models, ResNet18 and AlexNet, all adapted and optimized for our collected dataset of diverse eye movements from eight participants, captured under varied environmental and physiological conditions. The evaluation criteria were accuracy, inference time, and memory footprint. The findings indicate that the ViT model delivers a balanced performance, effectively handling the intricacies of the multi-class eye movement dataset while remaining efficient at inference: ViT and ResNet18 were roughly equally accurate, but ViT was faster, while ResNet18 used less memory; AlexNet was less accurate, with speed and memory use between the two. ViT achieved an average inference time of 0.0588 seconds per image, making it promising for latency-sensitive applications. This study underscores the importance of weighing both predictive performance and computational demands when choosing models for eye movement recognition and offers insights to guide future research.
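As a minimal illustration of the kind of measurement referred to above, the sketch below loads the vit-base-patch16-224-in21k checkpoint through the Hugging Face transformers API and times a single forward pass. The class count, image path, and single-image timing are illustrative assumptions, not the paper's actual pipeline.

```python
import time
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Hypothetical number of eye-movement classes; the actual label set is study-specific.
NUM_CLASSES = 4

# Load the pretrained ViT backbone named in the abstract and attach a fresh
# classification head sized for the eye-movement classes.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=NUM_CLASSES,
)
model.eval()

# "eye_frame.jpg" is a placeholder for one camera frame of a participant's eye.
image = Image.open("eye_frame.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Time a single forward pass, analogous to the per-image inference time reported above.
with torch.no_grad():
    start = time.perf_counter()
    logits = model(**inputs).logits
    elapsed = time.perf_counter() - start

predicted_class = logits.argmax(-1).item()
print(f"predicted class index: {predicted_class}, inference time: {elapsed:.4f} s")
```

In practice, per-image timings are usually averaged over many frames after a few warm-up passes, since the first forward pass tends to be slower than steady-state inference.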