Compositionality enables humans to learn new concepts from their components and the way they combine. Spatial relationships describe how objects are arranged in the real world. They are challenging to recognize because they have no characteristic shape, yet they convey essential information about an image. Recent vision-language models such as CLIP [1] are pre-trained to align images and text. To better understand CLIP's grasp of spatial relationships, we conduct comprehensive experiments, evaluating and fine-tuning the model on a block-world dataset. Our findings indicate that CLIP performs poorly at recognizing spatial relationships, and that fine-tuned variants capture some relationships present in the training data but struggle to generalize to unseen combinations. To improve CLIP's performance, we implement several modifications. First, we combine CLIP with other models: a symbolic model that explicitly encodes relationships, and pre-trained language models such as BERT [2] and T5 [3], aiming to supply additional information and strengthen CLIP's ability to bind objects to relationships. Second, we test two distinct learning strategies: an auxiliary classifier and a novel compositional contrastive loss. The results show that these methods also struggle to generalize: most capture only the relationships seen during training, and those that score higher on the validation and generalization sets still perform relatively poorly on the training set. Finally, we analyze these methods and discuss the types of relationships the models struggle with, shedding light on potential areas for further improvement.
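For context, the standard objective that CLIP is pre-trained with (and that the compositional loss above builds on) is a symmetric contrastive InfoNCE loss over matched image-text pairs. The following is a minimal NumPy sketch of that standard objective, not the paper's compositional variant, whose details are not given in this abstract; the function name and `temperature` default are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss in the style of CLIP.

    image_emb, text_emb: (n, d) arrays where row i of each is a matched pair.
    This is an illustrative sketch, not the authors' implementation.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (n, n) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise log-softmax; the targets are the diagonal (matched pairs).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetrize over the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned embeddings the diagonal dominates the similarity matrix and the loss approaches zero; permuting the text rows so pairs no longer match drives it up, which is the signal the contrastive objective trains on.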