Egocentric gaze estimation is a challenging and important task with promising applications in areas such as human-computer interaction and AR/VR. In this work, we propose a novel model based on the Video Swin Transformer architecture. By introducing a local inductive bias, our model extracts essential local features from first-person videos during the windowed self-attention computation. In addition, we approximate global context modeling within the gaze region using a shifted-window approach. We evaluate our approach on EGTEA Gaze+, a publicly available dataset of egocentric activity videos. Experimental results demonstrate that our model achieves state-of-the-art performance.
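The windowed and shifted-window attention pattern referred to above follows the general Swin Transformer scheme. As a rough illustration only (not the paper's actual model, which operates on video and includes the attention computation itself), the sketch below shows how a 2-D feature map can be partitioned into non-overlapping windows, and how a cyclic shift of half the window size re-partitions the map so that subsequent layers connect neighboring windows; the window size `w = 4` and the toy 8x8 input are arbitrary choices for this example.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w, w, C); self-attention would
    then be computed independently within each window (local inductive bias).
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w, w, C)

def shifted_window_partition(x, w):
    """Cyclically shift the map by w // 2 before partitioning (the Swin-style
    shift trick), so that successive layers exchange information across
    window boundaries and approximate global context."""
    shifted = np.roll(x, shift=(-(w // 2), -(w // 2)), axis=(0, 1))
    return window_partition(shifted, w)

# Toy 8x8 single-channel feature map.
x = np.arange(64, dtype=float).reshape(8, 8, 1)
regular = window_partition(x, 4)          # 4 windows of shape (4, 4, 1)
shifted = shifted_window_partition(x, 4)  # same shape, shifted by 2
print(regular.shape, shifted.shape)       # (4, 4, 4, 1) (4, 4, 4, 1)
```

In the full architecture, regular and shifted partitions alternate between consecutive transformer blocks, which is what lets purely local attention propagate information globally over depth.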