Facial expression recognition has enormous potential for downstream applications by revealing users’ emotional status when they interact with digital content. Previous studies have used cameras or wearable sensors for expression recognition, but these approaches raise considerable privacy concerns or impose the burden of extra devices. Moreover, the recognition performance of camera-based methods deteriorates when users wear masks. In this paper, we propose FacER, an active acoustic facial expression recognition system. As a software solution on a smartphone, FacER avoids the extra cost of external microphone arrays. Facial expression features are extracted by modeling the echoes of emitted near-ultrasound signals between the earpiece speaker and the 3D facial contour. Besides isolating a range of background noises, FacER is designed to identify different expressions across various users with a limited set of training data. To achieve this, we propose a contrastive external attention-based model to learn expression features that are consistent across users. Extensive experiments with 20 volunteers, with and without masks, show that FacER recognizes 6 common facial expressions with more than 85% accuracy, outperforming the state-of-the-art acoustic sensing approach by 10% in various real-life scenarios. FacER thus provides a more robust and convenient solution for recognizing facial expressions.
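To make the "external attention" component concrete, the following is a minimal NumPy sketch of the generic external-attention primitive (attention computed against small learnable memory matrices `Mk` and `Mv` with double normalization, rather than against the input's own keys and values). The matrix names, dimensions, and random initialization here are illustrative assumptions; the abstract does not specify the paper's actual architecture or its contrastive loss.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def external_attention(x, Mk, Mv):
    """Generic external attention over a feature sequence.

    x:  (n, d) input features (e.g., acoustic echo features per frame)
    Mk: (d, s) external memory key matrix (learnable; illustrative here)
    Mv: (s, d) external memory value matrix (learnable; illustrative here)
    """
    attn = softmax(x @ Mk, axis=0)                           # normalize over tokens
    attn = attn / (1e-9 + attn.sum(axis=1, keepdims=True))   # double normalization
    return attn @ Mv                                         # (n, d) output features

# Shape check with random data standing in for real acoustic features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Mk = rng.normal(size=(8, 4))
Mv = rng.normal(size=(4, 8))
out = external_attention(x, Mk, Mv)
```

Because the memory matrices are shared across all inputs rather than derived from each sample, this primitive can encode dataset-level regularities, which is one way a model could learn expression features that transfer across users.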