A person’s gaze reveals where their interest or attention lies in a social scene. Estimating gaze is therefore essential in multiple domains (e.g., security, psychology, and medical diagnosis). We propose a two-stage framework, the Gaze Visual Attention Estimator (GazeVAE). In the first stage, we train a 3D gaze direction estimator on the GazeFollow dataset using pseudo labels to produce a field of view. We then decompose the 3D direction into a 2D image-plane gaze and a depth-channel gaze to obtain a depth mask image. In the second stage, we concatenate the scene image, the stage-one outputs, and the head position to predict the location of the gaze target. We further propose a novel equivalent loss to reduce the angular error. We train the model from scratch, except for the off-the-shelf depth network. Our model outperforms the baseline in AUC and achieves competitive results on the GazeFollow and VideoAttentionTarget datasets without pretraining.
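As a rough illustration of the stage-one decomposition, the sketch below splits a unit 3D gaze vector into its image-plane and depth-channel components and gates the scene's depth map by the sign of the depth component. The function names, the sign convention (positive z pointing away from the camera), and the `tol` slack term are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def decompose_gaze(gaze_3d: np.ndarray):
    """Split a 3D gaze vector (gx, gy, gz) in camera coordinates into a
    normalized 2D image-plane direction and a scalar depth-channel
    component (hypothetical helper)."""
    gaze_3d = gaze_3d / np.linalg.norm(gaze_3d)
    plane = gaze_3d[:2]
    norm = np.linalg.norm(plane)
    gaze_2d = plane / norm if norm > 1e-6 else plane
    return gaze_2d, gaze_3d[2]

def depth_mask(depth_map: np.ndarray, head_depth: float,
               depth_gaze: float, tol: float = 0.1) -> np.ndarray:
    """Keep scene pixels whose depth is consistent with the depth-channel
    gaze: farther than the head when looking deeper into the scene,
    nearer when looking toward the camera. `tol` is an assumed slack."""
    if depth_gaze >= 0:  # looking away from the camera
        return (depth_map >= head_depth - tol).astype(np.float32)
    return (depth_map <= head_depth + tol).astype(np.float32)

# Example: a gaze pointing right and slightly into the scene.
gaze_2d, depth_gaze = decompose_gaze(np.array([0.8, 0.1, 0.6]))
mask = depth_mask(np.random.rand(224, 224), head_depth=0.5,
                  depth_gaze=depth_gaze)
```

In stage two, a mask like this, together with the field-of-view map and head position, would be concatenated channel-wise with the scene image before being fed to the target-prediction network.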