Visual Question Answering (VQA) has emerged as a significant area of research in recent years. A considerable proportion of current studies focus on two-dimensional images, which are prone to spatial ambiguities arising from changes in viewpoint, occlusion, and reprojection. Real-world human-computer interaction scenarios, however, are typically three-dimensional, underscoring the practical value of research on 3D question answering. Existing VQA models based on 3D point clouds can comprehend 3D scenes and answer a wide array of complex questions, but new challenges have arisen. Specifically, 3D scenes, represented as point clouds, and their associated questions belong to two distinct modalities; these modalities differ substantially, making alignment challenging, and potentially correlated features are often overlooked. To address this issue, we propose a novel self-supervised learning method for question answering in realistic 3D scenes. For the first time, we introduce contrastive learning into a 3D question answering model, employing 3D Cross-Modal Contrastive Learning to align 3D scenes with their associated questions, thereby reducing the heterogeneity gap between the two modalities and facilitating the extraction of relevant features. Furthermore, we utilize a Deep Interactive Attention Network to guide the attention of visual information, enhancing the deep integration of information from both modalities. Extensive experiments on the ScanQA dataset show that our 3D Self-Supervised Question Answering (3DSSQA) method achieves an accuracy of 24.75% on the primary Exact Match at 1 (EM@1) metric, outperforming current state-of-the-art models.
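
To make the cross-modal alignment idea concrete, the sketch below shows a minimal, symmetric InfoNCE-style contrastive objective between pooled scene and question embeddings, in the spirit of the 3D Cross-Modal Contrastive Learning described above. This is an illustrative assumption rather than the paper's exact formulation: the function name, temperature value, and embedding shapes are all hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(scene_emb: torch.Tensor,
                                 question_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling matched scene/question pairs together.

    scene_emb, question_emb: (batch, dim) pooled features from the two
    encoders (e.g., a point-cloud encoder and a question encoder).
    Matched pairs share the same batch index; all other in-batch pairs
    serve as negatives. This is a generic sketch, not the paper's code.
    """
    # L2-normalize so dot products become cosine similarities.
    scene = F.normalize(scene_emb, dim=-1)
    question = F.normalize(question_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = scene @ question.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: scene->question and question->scene.
    loss_s2q = F.cross_entropy(logits, targets)
    loss_q2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2q + loss_q2s)

# Hypothetical usage with random stand-ins for encoder outputs:
scene_emb = torch.randn(8, 256)     # pooled 3D scene features
question_emb = torch.randn(8, 256)  # pooled question features
loss = cross_modal_contrastive_loss(scene_emb, question_emb)
```

Minimizing such a loss pushes embeddings of a scene and its own question together while pushing apart mismatched pairs, which is one standard way to shrink the heterogeneity gap between point-cloud and language features before fusion.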