Vision Transformers inherently capture data-dependent, long-range dependencies, achieving a number of remarkable results against their contemporary competitors, CNNs. To alleviate the excessive computational burden, previous methods apply local operations (e.g., convolution, local attention) in the high-resolution stages. Although these designs are efficient for learning local relations, especially in the highly redundant early stages, they inevitably sacrifice non-locality and are constrained by a limited receptive field. In this paper, we present an effective hybrid-style vision backbone, dubbed C2SFormer, which is explicitly built with dynamic convolution and self-attention to undertake local and global interactions, respectively. We adopt two homogeneous modules whose structure follows that of typical Transformer blocks. For local relation learning, we take a parallel multi-scale design with additive aggregation as a simple but effective idea, named MS-SCDC. For global context modeling, we leverage the efficient factorized self-attention mechanism proposed in CoaT and apply it with MS-SCDC in a cross-stacking manner over the high-resolution stages. We further introduce a general approach for multi-scale learning in Transformer-based modules, named MS-MHSA. Experiments conducted on a variety of general-purpose vision tasks demonstrate the superiority of the proposed model.
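
To illustrate the parallel multi-scale, additively aggregated local design mentioned above, the following PyTorch sketch shows one plausible form of such a block. It is an assumption-laden illustration, not the paper's implementation: the kernel sizes, the substitution of plain depthwise convolutions for the dynamic convolution, and names such as `MultiScaleLocalBlock` are all hypothetical.

```python
# Minimal sketch of a parallel multi-scale local block with additive aggregation,
# in the spirit of MS-SCDC. Plain depthwise convolutions stand in for the paper's
# dynamic convolution; kernel sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class MultiScaleLocalBlock(nn.Module):
    """Parallel depthwise convolutions at several scales, summed additively."""

    def __init__(self, dim: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)  # one depthwise conv per scale
            for k in kernel_sizes
        )
        self.norm = nn.BatchNorm2d(dim)
        self.proj = nn.Conv2d(dim, dim, 1)  # pointwise fusion after aggregation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Additive aggregation of the parallel multi-scale branches.
        local = sum(branch(x) for branch in self.branches)
        return x + self.proj(self.norm(local))  # residual connection, Transformer-style


if __name__ == "__main__":
    block = MultiScaleLocalBlock(dim=64)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```

In this sketch the branches share the input resolution and channel width, so additive aggregation is a parameter-free fusion; a pointwise projection then mixes channels before the residual addition.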