Attentive Mask CLIP

Resource Type: Conference
Authors: Yang, Yifan; Huang, Weiquan; Wei, Yixuan; Peng, Houwen; Jiang, Xinyang; Jiang, Huiqiang; Wei, Fangyun; Wang, Yin; Hu, Han; Qiu, Lili; Yang, Yuqing
Source: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2759-2769, Oct. 2023
Subject: Computing and Processing; Signal Processing and Analysis; Training; Computer vision; Correlation; Image coding; Costs; Codes; Semantics; Language
ISSN: 2380-7504

Online Access

Full Text (IEEE)

Abstract

In vision-language modeling, image token removal is an efficient augmentation technique that reduces the cost of encoding image features. CLIP-style models, however, have been found to be negatively impacted by this technique. We hypothesize that removing a large portion of image tokens may inadvertently destroy the semantic information associated with a given text description, resulting in misaligned paired data in CLIP training. To address this issue, we propose an attentive token removal approach, which retains a small number of tokens that have a strong semantic correlation to the corresponding text description. The correlation scores are dynamically evaluated through an EMA-updated vision encoder. Our method, termed attentive mask CLIP, outperforms both the original CLIP and a CLIP variant with random token removal while reducing training time. Our approach also enables efficient multi-view contrastive learning. Experimentally, by training ViT-B on the YFCC-15M dataset, our approach achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification and 62.7/42.1 and 38.0/23.2 I2T/T2I retrieval accuracy on Flickr30K and MS COCO, outperforming SLIP by +1.1%, +5.5/+0.9, and +4.4/+1.3, respectively, while being 2.30× faster. An efficient version of our approach runs 1.16× faster than the plain CLIP model while achieving significant gains of +5.3%, +11.3/+8.0, and +9.5/+4.9 on these benchmarks, respectively. Code will be released at https://github.com/microsoft/A-CLIP.
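
The two mechanisms named in the abstract, keeping only the most relevant image tokens and maintaining an EMA-updated encoder, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes per-token relevance scores are already available (the abstract does not say how they are computed), and the function names `select_attentive_tokens` and `ema_update`, the `keep_ratio` parameter, and the momentum value are all hypothetical.

```python
import numpy as np


def select_attentive_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens of each image (illustrative only).

    tokens: array of shape (batch, num_tokens, dim)
    scores: array of shape (batch, num_tokens); assumed to be relevance
            scores produced elsewhere, e.g. by an EMA-updated encoder
    keep_ratio: hypothetical fraction of tokens to retain
    """
    _, num_tokens, _ = tokens.shape
    num_keep = max(1, int(round(num_tokens * keep_ratio)))
    # Indices of the num_keep highest-scoring tokens per image.
    top_idx = np.argsort(-scores, axis=1)[:, :num_keep]
    # Sort the kept indices so tokens stay in their original order.
    top_idx = np.sort(top_idx, axis=1)
    kept = np.take_along_axis(tokens, top_idx[:, :, None], axis=1)
    return kept, top_idx


def ema_update(ema_params, online_params, momentum=0.99):
    """Exponential moving average of parameters stored in plain dicts."""
    return {
        name: momentum * ema_params[name] + (1.0 - momentum) * online_params[name]
        for name in ema_params
    }
```

In a training loop of this shape, the EMA copy would score the tokens, the online encoder would process only the kept tokens, and `ema_update` would run after each optimizer step; the actual schedule and score definition are given in the paper and its code release.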
