Apathy is characterized by symptoms such as reduced emotional response, lack of motivation, and limited social interaction. Current methods for apathy diagnosis require the patient's presence in a clinic and time-consuming clinical interviews, which are costly and inconvenient for both patients and clinical staff, hindering, among other things, large-scale diagnostics. In this work, we propose a novel spatio-temporal framework for apathy classification, designed to analyze facial dynamics and emotion in videos. Specifically, we divide each video into smaller clips and extract associated facial-dynamics and emotion-based features. Statistical representations (descriptors) computed per feature and per clip serve as input to the proposed Gated Recurrent Unit (GRU) architecture. Temporal representations of individual features, obtained at the lower layers of the architecture, are combined at deeper layers to form the final feature set for apathy classification. Extensive experiments show that fusing emotion and facial-dynamics characteristics in the proposed deep bi-directional GRU achieves an accuracy of 95.34% in apathy classification.
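The pipeline described above (per-clip descriptors, one bi-directional GRU per feature stream, fusion at a deeper layer, binary classification) can be illustrated with a minimal NumPy sketch. All dimensions, the untrained random weights, and the logistic classification head are illustrative assumptions, not the paper's actual architecture or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(d_in, d_h):
    # Three gate parameter sets: update z, reset r, candidate h~.
    s = 1.0 / np.sqrt(d_h)
    W = rng.uniform(-s, s, (3, d_in, d_h))
    U = rng.uniform(-s, s, (3, d_h, d_h))
    b = np.zeros((3, d_h))
    return W, U, b

def gru_step(x, h, params):
    W, U, b = params
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])            # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])            # reset gate
    h_cand = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2]) # candidate state
    return (1 - z) * h + z * h_cand

def bigru_encode(seq, fwd, bwd, d_h):
    """Run one GRU forward and one backward over the clip descriptors;
    concatenate the two final hidden states into a (2*d_h,) code."""
    hf = np.zeros(d_h)
    for x in seq:
        hf = gru_step(x, hf, fwd)
    hb = np.zeros(d_h)
    for x in seq[::-1]:
        hb = gru_step(x, hb, bwd)
    return np.concatenate([hf, hb])

# Hypothetical sizes: 16 clips per video, 32-dim emotion descriptors,
# 48-dim facial-dynamics descriptors, 24 hidden units per direction.
T, d_emo, d_dyn, d_h = 16, 32, 48, 24
emo_seq = rng.standard_normal((T, d_emo))  # stand-in emotion descriptors
dyn_seq = rng.standard_normal((T, d_dyn))  # stand-in dynamics descriptors

emo_code = bigru_encode(emo_seq, init_gru(d_emo, d_h), init_gru(d_emo, d_h), d_h)
dyn_code = bigru_encode(dyn_seq, init_gru(d_dyn, d_h), init_gru(d_dyn, d_h), d_h)

# Deeper-layer fusion: concatenate the per-feature temporal codes,
# then a logistic head gives P(apathetic) for the whole video.
fused = np.concatenate([emo_code, dyn_code])  # shape (4*d_h,)
w = rng.standard_normal(fused.shape[0]) * 0.1
p_apathy = sigmoid(fused @ w)
print(fused.shape, 0.0 <= p_apathy <= 1.0)
```

In a trained system the descriptors would come from facial-landmark and emotion-recognition front ends and the weights would be learned end to end; the sketch only shows how two feature streams are encoded separately in time and merged before the final decision.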