Clustering individual cells or spots based on their gene expression profiles in a spatial context is a powerful approach to uncovering the underlying biological diversity and relationships among cells. The rich information in spatial transcriptomics data demands sophisticated algorithms that effectively integrate gene expression, cell position, and tissue image data for accurate cell or spot clustering. This work proposes a Multi-View Comparative Learning method for clustering Spatial Transcriptomics data (MVCLST). MVCLST first constructs two data views from gene expression profiles, spatial coordinates of cells, and image features. It then employs four encoders to capture the common and private features of the two views, together with a contrastive learning loss that encourages effective interaction between the views and enforces feature consistency. The shared and private features from both views are fused through corresponding decoders, and the learned features are clustered with the Leiden algorithm. We evaluate MVCLST on a human dorsolateral prefrontal cortex dataset; it outperforms other state-of-the-art methods in most cases, and the clusters it identifies align closely with manual annotations and established neuroscience definitions.
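To make the pipeline described above concrete, the following is a minimal sketch, not the authors' implementation: two views are encoded by shared and private linear encoders (stand-ins for MVCLST's four encoders), an InfoNCE-style contrastive loss aligns the shared features across views, and the features are fused for downstream clustering. All dimensions, the random data, and the linear encoders are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not from the paper): 6 spots, 10-dim view features, 4-dim latent.
n_spots, d_view, d_latent = 6, 10, 4

# Two constructed views (e.g. expression+coordinates vs. expression+image); random stand-ins.
view1 = rng.normal(size=(n_spots, d_view))
view2 = rng.normal(size=(n_spots, d_view))

def encode(x, w):
    """Linear encoder used as a stand-in for the model's encoders."""
    return x @ w

# Four encoders: one shared and one private weight matrix per view (random init here).
w_shared1 = rng.normal(size=(d_view, d_latent))
w_private1 = rng.normal(size=(d_view, d_latent))
w_shared2 = rng.normal(size=(d_view, d_latent))
w_private2 = rng.normal(size=(d_view, d_latent))

s1, p1 = encode(view1, w_shared1), encode(view1, w_private1)
s2, p2 = encode(view2, w_shared2), encode(view2, w_private2)

def contrastive_loss(a, b, tau=0.5):
    """InfoNCE-style loss: the same spot in the two views forms the positive pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T / tau  # (n, n) cosine-similarity matrix scaled by temperature
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # positives sit on the diagonal

loss = contrastive_loss(s1, s2)

# Fuse shared and private features; the result would feed a graph-based
# clustering step such as Leiden in the full pipeline.
fused = np.concatenate([s1 + s2, p1, p2], axis=1)
```

In the actual method the encoders are learned networks and the fused representation is produced by decoders; this sketch only illustrates the data flow of views, shared/private splits, and the contrastive objective.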