Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
- Resource Type
- Conference
- Authors
- Feng, Zishun; Tu, Ming; Xia, Rui; Wang, Yuxuan; Krishnamurthy, Ashok
- Source
- 2020 IEEE International Conference on Big Data (Big Data), pp. 5671-5672, Dec. 2020
- Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Engineering Profession
Geoscience
Signal Processing and Analysis
Training
Visualization
Conferences
Data visualization
Big Data
Task analysis
Videos
self-supervised learning
multimodal representation learning
large scale video understanding
- Language
English
- Abstract
Humans understand videos through both the visual and audio aspects of the data. In this work, we present a self-supervised cross-modal representation learning approach based on audio-visual correspondence (AVC) for videos in the wild. After the learning stage, we explore retrieval with the learned representations in both cross-modal and intra-modal settings. We evaluate our approach on the VGGSound dataset [1], where it achieves promising results.
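The abstract describes learning a shared audio-visual embedding space via AVC and then performing cross-modal and intra-modal retrieval with the learned representations. Below is a minimal sketch of that retrieval stage; the encoders are replaced by fixed random projections purely for illustration, and all names, dimensions, and dataset sizes are assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the learned encoders: in an AVC setup these would be deep
# networks trained to map corresponding audio and video clips close together
# in a shared space. Here we use random projections just to show the mechanics.
D_AUDIO, D_VIDEO, D_EMBED = 128, 512, 64
W_audio = rng.standard_normal((D_AUDIO, D_EMBED))
W_video = rng.standard_normal((D_VIDEO, D_EMBED))

def embed(x, W):
    """Project features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Toy "dataset": 10 clips, each with paired audio and visual features.
audio_feats = rng.standard_normal((10, D_AUDIO))
video_feats = rng.standard_normal((10, D_VIDEO))

za = embed(audio_feats, W_audio)   # audio embeddings, shape (10, 64)
zv = embed(video_feats, W_video)   # video embeddings, shape (10, 64)

# Cross-modal retrieval: rank all video clips against an audio query by
# cosine similarity (a dot product of unit vectors).
query = za[0]
cross_ranking = np.argsort(-(zv @ query))

# Intra-modal retrieval works the same way within a single modality;
# the query's own embedding trivially ranks first (cosine similarity 1).
intra_ranking = np.argsort(-(za @ query))
```

With trained encoders, the cross-modal ranking would place the video clip paired with the query audio near the top; with random projections the ranking is of course meaningless, which is exactly the gap the self-supervised AVC training closes.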