The proliferation of videos captured by sensor-based cameras has driven the development of human action recognition (HAR). As a fundamental video task in human–computer interaction devices, HAR aims to identify human actions in video clips, where lightweight networks are crucial. In this field, convolutional neural networks (CNNs) and transformers have shown great potential for feature representation in Euclidean space, but they ignore the more flexible structure of non-Euclidean manifolds. To address this issue, we interpret a video as a set of graph nodes and propose a Video Tube Graph network (VT-Grapher) for the action recognition task. As the first lightweight graph neural network (GNN) for RGB-based action recognition, VT-Grapher contains three main components: 1) three spatial–temporal learning strategies that effectively mine the relationships between video visual features and semantics, among which the tube-in-embedding spatial–temporal (TE-ST) strategy achieves the best balance between performance and computational cost; 2) a video tube generation block with a temporal center loss, which generates multi-granularity video tubes based on temporal similarity and pushes apart video tubes with low semantic similarity; and 3) an adversarial self-distillation method that enhances the multi-granularity information aggregation capability of VT-Grapher. The proposed VT-Grapher works in a plug-and-play manner and can be integrated with vision GNNs such as ViG and MobileViG. Extensive experiments on the Mini-Kinetics (Top-1 76.1%), Kinetics-400 (Top-1 73.7%), UCF101 (Acc 94.5%), and multimodal Northwestern-UCLA (N-UCLA) (Top-1 99.7%) datasets demonstrate the effectiveness of VT-Grapher.