In this paper, we propose a refined approach to visual grounding, the task of locating the object in an image that is most relevant to a natural language query. The model must comprehend the query, identify the salient objects in the image, and then localise the target object by predicting its bounding box. Our model adopts an end-to-end modulated detector that uses a transformer-based architecture for early fusion of the text and image modalities. State-of-the-art models for vision-language problems commonly suffer from large numbers of trainable parameters, long training times, and high computational complexity. Our work tackles these problems by reducing the parameter count and computational complexity of the model while also shortening training. Using our techniques, we reduce training time by about 12% on the Flickr30k and MSCOCO datasets with only a minor loss in accuracy. This result demonstrates that accuracy and computational efficiency can be balanced effectively in vision-language tasks. Our approach extends readily to related problems such as visual question answering and image captioning.
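
To make the early-fusion idea concrete, the following is a minimal PyTorch sketch in which projected image features and embedded query tokens are concatenated and processed jointly by a shared transformer encoder. The `EarlyFusionEncoder` class, its dimensions, layer counts, and the toy box-decoding rule are illustrative assumptions, not the configuration reported in this paper.

```python
# Minimal sketch of transformer-based early fusion for visual grounding.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Hypothetical module: fuses image and text features in one encoder."""

    def __init__(self, d_model=256, nhead=8, num_layers=6,
                 img_feat_dim=2048, vocab_size=30522):
        super().__init__()
        # Project CNN backbone features and embed query tokens
        # into a shared d_model-dimensional space.
        self.img_proj = nn.Linear(img_feat_dim, d_model)
        self.txt_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Bounding-box head: (cx, cy, w, h), normalised to [0, 1].
        self.bbox_head = nn.Linear(d_model, 4)

    def forward(self, img_feats, txt_tokens):
        # img_feats:  (B, N_img, img_feat_dim) flattened backbone features
        # txt_tokens: (B, N_txt) token ids of the language query
        img = self.img_proj(img_feats)
        txt = self.txt_embed(txt_tokens)
        # Early fusion: concatenate both modalities *before* encoding,
        # so every layer attends jointly over image and text positions.
        fused = self.encoder(torch.cat([img, txt], dim=1))
        # Toy decoding rule: pool the fused image positions into one box.
        pooled = fused[:, :img.size(1)].mean(dim=1)
        return self.bbox_head(pooled).sigmoid()

model = EarlyFusionEncoder()
boxes = model(torch.randn(2, 49, 2048), torch.randint(0, 30522, (2, 12)))
print(boxes.shape)  # torch.Size([2, 4])
```

Because fusion happens at the encoder input rather than through late cross-attention between separate streams, every self-attention layer can condition image features on the query and vice versa, which is what allows a single shared stack to stay comparatively small.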