Compositionality enables humans to learn new concepts from their components and the way they combine. Spatial relationships describe how objects are arranged in the real world. They are challenging to recognize because they have no characteristic shape, yet they convey essential information about an image. Recent vision-language models such as CLIP [1] are pre-trained to align images and text. To better understand CLIP's grasp of spatial relationships, we conduct comprehensive experiments, evaluating and fine-tuning the model on a block-world dataset. Our findings indicate that CLIP performs poorly at recognizing spatial relationships, and that fine-tuned variants capture some relationships present in the training data but struggle to generalize to unseen combinations. To improve CLIP's performance, we implement several modifications. First, we combine CLIP with other models: a symbolic model that explicitly encodes relationships, and pre-trained language models such as BERT [2] and T5 [3], aiming to supply additional information and strengthen CLIP's ability to bind objects to relationships. Second, we test two distinct learning strategies: an auxiliary classifier and a novel compositional contrastive loss. The results show that these methods also struggle to generalize: most capture only the relationships seen during training, and those that score higher on the validation and generalization sets still perform relatively poorly on the training set. Finally, we analyze these methods and discuss the types of relationships the models struggle with, shedding light on potential areas for further improvement.
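For context, the standard objective that CLIP is pre-trained with (and that the compositional loss above builds on) is a symmetric contrastive InfoNCE loss over matched image-text pairs. The following is a minimal NumPy sketch of that standard objective, not the paper's compositional variant, whose details are not given in this abstract; the function name and `temperature` default are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss in the style of CLIP.

    image_emb, text_emb: (n, d) arrays where row i of each is a matched pair.
    This is an illustrative sketch, not the authors' implementation.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (n, n) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise log-softmax; the targets are the diagonal (matched pairs).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Symmetrize over the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned embeddings the diagonal dominates the similarity matrix and the loss approaches zero; permuting the text rows so pairs no longer match drives it up, which is the signal the contrastive objective trains on.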