Visual Question Answering (VQA) has emerged as a significant area of research in recent years. A considerable proportion of current studies focus on two-dimensional images, which are prone to spatial ambiguities arising from changes in viewpoint, occlusion, and reprojection. Real-world human-computer interaction scenarios, however, are typically three-dimensional, underscoring the practical value of research on 3D question answering. Existing VQA models based on 3D point clouds can comprehend 3D scenes and answer a wide array of complex questions, but new challenges have arisen. Specifically, 3D scenes, represented as point clouds, and their associated questions belong to two distinct modalities; these modalities differ substantially, making alignment challenging, and potentially correlated features are often overlooked. To address this issue, we propose a novel self-supervised learning method for question answering in realistic 3D scenes. For the first time, we introduce contrastive learning into a 3D question answering model, employing 3D Cross-Modal Contrastive Learning to align 3D scenes with their associated questions, thereby reducing the heterogeneity gap between the two modalities and facilitating the extraction of relevant features. Furthermore, we utilize a Deep Interactive Attention Network to guide the attention of visual information, enhancing the deep integration of information from both modalities. Extensive experiments on the ScanQA dataset show that our 3D Self-Supervised Question Answering (3DSSQA) method achieves an accuracy of 24.75% on the primary Exact Match at 1 (EM@1) metric, outperforming current state-of-the-art models.
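
To make the cross-modal alignment idea concrete, the sketch below shows a minimal, symmetric InfoNCE-style contrastive objective between pooled scene and question embeddings, in the spirit of the 3D Cross-Modal Contrastive Learning described above. This is an illustrative assumption rather than the paper's exact formulation: the function name, temperature value, and embedding shapes are all hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(scene_emb: torch.Tensor,
                                 question_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling matched scene/question pairs together.

    scene_emb, question_emb: (batch, dim) pooled features from the two
    encoders (e.g., a point-cloud encoder and a question encoder).
    Matched pairs share the same batch index; all other in-batch pairs
    serve as negatives. This is a generic sketch, not the paper's code.
    """
    # L2-normalize so dot products become cosine similarities.
    scene = F.normalize(scene_emb, dim=-1)
    question = F.normalize(question_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = scene @ question.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: scene->question and question->scene.
    loss_s2q = F.cross_entropy(logits, targets)
    loss_q2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2q + loss_q2s)

# Hypothetical usage with random stand-ins for encoder outputs:
scene_emb = torch.randn(8, 256)     # pooled 3D scene features
question_emb = torch.randn(8, 256)  # pooled question features
loss = cross_modal_contrastive_loss(scene_emb, question_emb)
```

Minimizing such a loss pushes embeddings of a scene and its own question together while pushing apart mismatched pairs, which is one standard way to shrink the heterogeneity gap between point-cloud and language features before fusion.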