Acoustics-Text Dual-Modal Joint Representation Learning for Cover Song Identification
- Resource Type
- Conference
- Authors
- Gu, Yanmei; Li, Jing; Zhou, Jiayi; Wang, Zhiming; Zhu, Huijia
- Source
- 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1-8, Dec. 2023
- Subject
- Signal Processing and Analysis
Representation learning
Training
Measurement
Conferences
Self-supervised learning
Data models
Task analysis
Cover Song Identification
dual-encoder architecture
multi-modal representation learning
joint training
- Language
Cover Song Identification (CSI) is an important and challenging task in Music Information Retrieval (MIR). This paper investigates multi-modal audio and text features in the music domain and proposes two improvements to model performance for CSI. First, a dual-encoder architecture learns a joint embedding of the audio and the corresponding song title. Second, a multi-modal representation learning strategy jointly optimizes classification and metric learning losses in the audio modality and a contrastive learning loss across the audio and text modalities. Experimental results demonstrate that the method efficiently learns more robust multi-modal representations for cover songs than a single audio encoder and achieves state-of-the-art results on CSI tasks.
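The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not name the exact loss functions, weights, or hyperparameters, so everything below is an assumption made for illustration: softmax cross-entropy stands in for the classification loss, a triplet margin loss for the metric learning loss, and a symmetric InfoNCE loss for the audio-text contrastive loss. The function names, the margin, the temperature, and the weights `w_cls`, `w_metric`, and `w_con` are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch (assumed classification loss)."""
    total = sum(log_softmax(row)[y] for row, y in zip(logits, labels))
    return -total / len(labels)


def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are left unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Assumed metric learning loss: pull a cover towards its anchor,
    push a different song at least `margin` further away."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Assumed symmetric InfoNCE loss: the i-th audio embedding is
    paired with the i-th song-title embedding."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Scaled cosine similarity between every audio clip and every title.
    sims = [[sum(x * y for x, y in zip(u, w)) / temperature for w in t] for u in a]
    sims_t = [list(col) for col in zip(*sims)]  # title-to-audio direction
    targets = list(range(len(a)))
    return 0.5 * (cross_entropy(sims, targets) + cross_entropy(sims_t, targets))


def joint_loss(logits, labels, triplet, audio_embs, text_embs,
               w_cls=1.0, w_metric=1.0, w_con=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    anchor, positive, negative = triplet
    return (w_cls * cross_entropy(logits, labels)
            + w_metric * triplet_loss(anchor, positive, negative)
            + w_con * contrastive_loss(audio_embs, text_embs))
```

In this sketch the classification and triplet terms use only audio embeddings, while the contrastive term is the only one that ties the audio encoder to the text encoder, which mirrors the split between the audio modality and the audio-text modality in the abstract.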