학술논문

Home

자료검색

학술논문

검색결과 돌아가기

검색화면

내보내기 프린트

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Resource Type: Conference
Authors: Yuan, Hangjie; Zhang, Shiwei; Wang, Xiang; Albanie, Samuel; Pan, Yining; Feng, Tao; Jiang, Jianwen; Ni, Dong; Zhang, Yingya; Zhao, Deli
Source: 2023 IEEE/CVF International Conference on Computer Vision (ICCV) ICCV Computer Vision (ICCV), 2023 IEEE/CVF International Conference on. :21592-21604 Oct, 2023
Subject: Computing and Processing
Signal Processing and Analysis
Computer vision
Humanities
Object detection
Computer architecture
Logic gates
Cognition
Encoding
Language
ISSN: 2380-7504

Online Access

Full Text (IEEE)

초록

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.

공지

DAU Library

학술논문

요약정보

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Online Access

초록