Facial Action Unit (AU) detection is an important task for recognizing emotion from facial movements. In this paper, we propose a novel algorithm that uses identity-labeled face images to address identity-based intra-class variation in AU detection: the appearance of the same AU varies significantly across subjects, which causes existing methods to perform poorly in cross-domain scenarios where the training and test datasets are dissimilar. The proposed method is a cascade of two networks, one for each of two sub-tasks: face clustering and AU detection. The face clustering network, trained on a large dataset of identity-annotated face images, learns a transformation that extracts identity-dependent image features, which the second network uses to predict AU labels. The cascade is jointly trained on AU- and identity-annotated datasets that contain numerous subjects to improve the method’s applicability. Experimental results show that the proposed method achieves state-of-the-art AU detection performance on the benchmark datasets BP4D, UNBC-McMaster, and DISFA.