Human pose estimation in crowded scenes supports many computer vision tasks, such as video surveillance, action recognition, and human–computer interaction. Although existing methods achieve good results in sparse scenes, their accuracy drops in crowded scenes because human bodies are densely distributed and occlude one another. A single Intersection-over-Union (IoU) threshold cannot filter redundant boxes effectively across different congestion levels; thus, IoU similarity-based non-maximum suppression (NMS) is used to address the erroneous suppression of adjacent target boxes in crowded scenes. Meanwhile, the Transformer structure, which has a natural advantage in modeling the relationships between keypoint pairs, is introduced to predict the tag heatmap in the keypoint detection stage, reducing the interference of noise keypoints on keypoint detection. To make better use of the tag information in distinguishing main keypoints from interfering keypoints, a keypoint refinement algorithm based on tag information is proposed to filter out noise keypoints. In experiments, the proposed method outperforms previous methods on the CrowdPose test set.
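
To make the single-threshold problem concrete, the sketch below implements standard greedy NMS, the baseline that the abstract describes as ineffective, and not the proposed IoU similarity-based variant. The function names `iou` and `greedy_nms`, the box coordinates, and the scores are all hypothetical values chosen for illustration: boxes 0 and 1 stand for two people close together, and box 2 is a duplicate detection of the first person.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold):
    """Standard greedy NMS with one fixed IoU threshold (baseline only)."""
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        # Drop every remaining box that overlaps the kept box too much.
        order = rest[overlaps <= iou_threshold]
    return keep

# Hypothetical detections (illustrative values, not from the paper).
boxes = np.array([
    [0, 0, 100, 200],    # person A
    [40, 0, 140, 200],   # person B, standing close to A
    [5, 0, 105, 200],    # duplicate detection of person A
], dtype=float)
scores = np.array([0.9, 0.8, 0.7])

low = greedy_nms(boxes, scores, 0.3)    # strict threshold
mid = greedy_nms(boxes, scores, 0.5)    # moderate threshold
high = greedy_nms(boxes, scores, 0.95)  # lenient threshold
print(low, mid, high)
```

With the strict threshold, person B is suppressed along with the duplicate; with the lenient one, the duplicate survives. A moderate value happens to work for this toy input, but it depends on how much true neighbours overlap, which varies with the congestion level.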