Constrained by the design of the pretext task and the use of a single decoder, current Masked Image Modeling methods fail to learn satisfactory global semantic representations for downstream tasks and struggle to discern contextual information. To tackle these issues, we propose a masked image modeling framework based on momentum contrast. First, we introduce a multi-scale hidden-layer fusion module to capture fine-grained features. Then, a dense reconstruction decoder is designed to capture contextual information through reconstruction, while a global mapping decoder models global semantic information. In addition, the target branch incorporates a momentum encoder, whose momentum-based parameter update prevents model collapse; it extracts features from the remaining image regions to stabilize the model. Finally, a contrastive loss is computed between the global features and the features obtained from the target branch. Experimental results on four benchmark datasets show that the proposed method effectively improves classification accuracy in both fine-tuning and linear classification, demonstrating its strong generalization and representation capabilities.
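The momentum-encoder update and the contrastive objective between the two branches can be sketched as follows. This is a minimal NumPy illustration of the general mechanism, not the paper's implementation; the function names, the EMA coefficient `m`, and the temperature are our own assumptions.

```python
import numpy as np

def momentum_update(online_params, target_params, m=0.996):
    """EMA update of the target (momentum) encoder: target <- m*target + (1-m)*online.
    The target branch is never updated by gradients, which helps prevent collapse."""
    return [m * t + (1 - m) * o for o, t in zip(online_params, target_params)]

def contrastive_loss(q, k, temperature=0.2):
    """InfoNCE-style loss between query features q (online/global branch) and
    key features k (target branch); matching rows are positive pairs. q, k: (N, D)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # L2-normalize features
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature                     # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # positives on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
loss_aligned = contrastive_loss(q, q)                       # identical views: low loss
loss_random = contrastive_loss(q, rng.normal(size=(8, 16))) # unrelated keys: high loss
```

Here the loss pulls each global feature toward its target-branch counterpart and pushes it away from the other samples in the batch, while the EMA update keeps the target encoder a slowly moving average of the online encoder.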