Automated emotion recognition (AER) has a growing number of applications, ranging from behavior analysis in assistive robotics and e-learning to depression and pain estimation in healthcare. Systems for multimodal AER typically outperform unimodal approaches due to the complementary and redundant semantic information across modalities such as visual, audio, language, and physiological signals. In practice, however, only a subset of these modalities is available at inference time, and using multiple modalities increases system complexity. This paper focuses on video-based AER and aims to enhance the accuracy of unimodal systems by leveraging the Learning Under Privileged Information (LUPI) paradigm with information from multiple modalities. Without loss of generality, this study considers the audio modality as privileged information (only available during training) and introduces a new multimodal-to-unimodal privileged knowledge distillation (PKD) framework. The teacher network is a multimodal AER architecture that processes audio-visual information and distills the learned knowledge to a unimodal visual student network. We validate our proposed multimodal PKD method on the challenging RECOLA and Affwild2 datasets for video-based AER, using weak and strong baseline AER architectures as well as joint cross-attention fusion methods. The proposed method improves the average concordance correlation coefficient (CCC) by 8% (absolute) on the RECOLA dataset and by 2% on the arousal dimension of the Affwild2 dataset. The code is available at multimodal-pkd.
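The training objective implied by the abstract can be illustrated with a minimal sketch, assuming PyTorch: a frozen audio-visual teacher guides a visual-only student, which is trained with a supervised CCC loss plus a feature-matching distillation term. The module names (`AudioVisualTeacher`, `VisualStudent`, `pkd_step`), the `LazyLinear` backbones, the additive fusion (a stand-in for the paper's joint cross-attention), the MSE distillation loss, and the weight `alpha` are all illustrative assumptions, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn

def ccc_loss(pred, target, eps=1e-8):
    """1 - concordance correlation coefficient, a common AER regression loss."""
    pred_mean, target_mean = pred.mean(), target.mean()
    pred_var, target_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - target_mean)).mean()
    ccc = 2 * cov / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1 - ccc

class AudioVisualTeacher(nn.Module):
    """Multimodal teacher: sees both modalities during training."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.visual = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, 1)  # one affect dimension (e.g., arousal)
    def forward(self, frames, audio):
        feats = self.visual(frames) + self.audio(audio)  # stand-in for cross-attention fusion
        return self.head(feats).squeeze(-1), feats

class VisualStudent(nn.Module):
    """Unimodal student: only the visual modality is available at inference."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, 1)
    def forward(self, frames):
        feats = self.backbone(frames)
        return self.head(feats).squeeze(-1), feats

def pkd_step(student, teacher, frames, audio, labels, alpha=0.5):
    """One PKD training step: audio acts as privileged (training-only) input."""
    with torch.no_grad():                     # teacher is frozen during distillation
        _, t_feats = teacher(frames, audio)
    s_pred, s_feats = student(frames)         # student never sees audio
    loss_task = ccc_loss(s_pred, labels)      # supervised regression loss
    loss_distill = nn.functional.mse_loss(s_feats, t_feats)  # match teacher features
    return loss_task + alpha * loss_distill
```

At test time only `VisualStudent` is used, so the deployed system stays unimodal while still benefiting from the privileged audio stream seen by the teacher during training.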