Mixed reality head-mounted displays track only the poses of the head and hands, yet reconstructing the user's full body from these signals has become an important mode of human-computer interaction. Unfortunately, the quality of users' virtual representation and experience is limited by the high reconstruction error that simple transformer architectures incur on such sparse input. In this paper, we present a novel model, named Dual Attention Poser, which learns full-body reconstruction with high accuracy. The proposed model consists of three key modules: a dual-path attention encoder extracts features from the sparse signals; a cross-attention mixer fuses the representations of the two paths; and an attention-gated MLP decoder decodes the hidden features through an attention gate. Experiments on the AMASS dataset show that Dual Attention Poser reduces reconstruction error by up to 18.2% compared with state-of-the-art methods.
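The three-module pipeline described above can be sketched in NumPy. This is an illustrative toy, not the paper's implementation: the hidden dimension, the 18-feature head-and-hands input layout, the summation fusion in the mixer, and the 22-joint 6D-rotation output are all assumptions, and random projections stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear(x, d_out):
    # random projection standing in for a learned weight matrix
    W = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return x @ W

def attention(q, k, v):
    # scaled dot-product attention over the time axis
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def self_attn_path(x, d):
    # one encoder path: self-attention on its own projections of the input
    return attention(linear(x, d), linear(x, d), linear(x, d))

T, D = 8, 16                        # frames, hidden dim (illustrative sizes)
x = rng.standard_normal((T, 18))    # head + two hands, 6 DoF each (assumed layout)

# 1) dual-path attention encoder: two parallel self-attention streams
p1 = self_attn_path(x, D)
p2 = self_attn_path(x, D)

# 2) cross-attention mixer: each path queries the other, fused by summation
m1 = attention(linear(p1, D), linear(p2, D), linear(p2, D))
m2 = attention(linear(p2, D), linear(p1, D), linear(p1, D))
h = m1 + m2

# 3) attention-gated MLP decoder: a sigmoid gate modulates the MLP features
gate = 1.0 / (1.0 + np.exp(-linear(h, D)))
mlp = np.maximum(linear(h, D), 0.0)   # ReLU hidden layer
pose = linear(gate * mlp, 22 * 6)     # 22 body joints x 6D rotation (assumed)

print(pose.shape)  # (8, 132): one full-body pose estimate per frame
```

In a trained model the `linear` projections would be learned end-to-end, and the gate lets the decoder suppress hidden features that are uninformative for a given frame.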