The expressions of the human face are produced by the contraction of facial muscles. The most widely used and accepted standard for describing all visible changes on the face is the Facial Action Coding System (FACS). In this paper, Vision Transformer (ViT) and Perceiver attention mechanisms are individually employed to detect Action Units (AUs) from the whole face, with different patch sizes, on two spontaneous datasets (DISFA, BP4D) and one in-the-wild dataset (EmotioNet); the same attention mechanisms are then applied to patches cropped around facial landmarks to examine the resulting improvement in AU detection. The experiments show that on the whole-face inputs, ViT and Perceiver match, and most of the time outperform, state-of-the-art AU detection methods. However, the most significant performance increase is observed when only the landmark patches are used as the input sequence to both networks.
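The landmark-patch input described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the patch size, border padding strategy, and landmark coordinates are all illustrative assumptions.

```python
import numpy as np

def landmark_patches(image, landmarks, patch_size=16):
    """Crop a square patch of side `patch_size` centered at each landmark
    and flatten it into a token, producing a (num_landmarks, patch_size**2 * C)
    sequence suitable for a transformer-style encoder."""
    half = patch_size // 2
    # Pad the image so patches near the border stay full-sized (assumed strategy).
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="edge")
    tokens = []
    for (x, y) in landmarks:
        x, y = int(round(x)), int(round(y))
        # In padded coordinates, rows y:y+patch_size correspond to a window
        # roughly centered on the original landmark pixel.
        patch = padded[y:y + patch_size, x:x + patch_size, :]
        tokens.append(patch.reshape(-1))
    return np.stack(tokens)

# Toy example: a 112x112 RGB face crop with 5 hypothetical landmark positions.
img = np.random.rand(112, 112, 3)
pts = [(30, 40), (80, 40), (56, 60), (40, 85), (72, 85)]
seq = landmark_patches(img, pts, patch_size=16)
print(seq.shape)  # (5, 768): one flattened 16x16x3 token per landmark
```

Feeding only these landmark-centered tokens, rather than a regular grid over the whole face, concentrates the attention mechanism's capacity on the regions where AU-related muscle movements occur.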