Cross-Modal Learning for CTC-Based ASR: Leveraging CTC-Bertscore and Sequence-Level Training
- Resource Type
- Conference
- Authors
- Lee, Mun-Hak; Lee, Sang-Eon; Choi, Ji-Eun; Chang, Joon-Hyuk
- Source
- 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1-8, Dec. 2023
- Subject
- Signal Processing and Analysis
- Training
- Conferences
- Machine learning
- Brain modeling
- Linear programming
- Data models
- Biological neural networks
- Speech recognition
- Connectionist temporal classification
- BERT
- Cross-modal learning
- Language
Because neural networks readily overfit the training set, neural-network-based speech recognition models are vulnerable to shifts in the prior data distribution and to unseen words. Studies have therefore sought to overcome this problem by using language models trained on relatively easy-to-obtain unpaired corpora. In this paper, we present a new training method that uses BERT to improve the performance of a connectionist temporal classification (CTC)-based ASR model. The proposed method follows a cross-modal learning scenario and induces the CTC model to better embed contextual information through an auxiliary objective function that operates at the sequence level. We applied the proposed method to fine-tune a pre-trained wav2vec 2.0 model with CTC loss and confirmed that it improves the generalization performance of the ASR model.
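To make the abstract's idea concrete, the following is a minimal sketch of a BERTScore-style sequence-level score and its use as an auxiliary term alongside CTC loss. It assumes contextual token embeddings (e.g. from BERT) are already available as arrays; the function names `bertscore_f1` and `joint_loss` and the weight `lam` are illustrative, not taken from the paper, and the paper's actual CTC-BERTScore formulation may differ.

```python
import numpy as np

def bertscore_f1(cand: np.ndarray, ref: np.ndarray) -> float:
    """BERTScore-style F1 between two token-embedding sequences.

    cand: (m, d) contextual embeddings of the hypothesis tokens
    ref:  (n, d) contextual embeddings of the reference tokens
    """
    # Normalize rows so that dot products are cosine similarities.
    c = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    r = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = c @ r.T  # (m, n) pairwise cosine similarity matrix

    # Greedy matching: each candidate token takes its best reference
    # match (precision) and vice versa (recall), then combine as F1.
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    return float(2 * precision * recall / (precision + recall))

def joint_loss(ctc_loss: float, cand_emb: np.ndarray,
               ref_emb: np.ndarray, lam: float = 0.5) -> float:
    """Hypothetical combined objective: CTC loss plus a sequence-level
    penalty that grows as the BERTScore-style similarity drops."""
    return ctc_loss + lam * (1.0 - bertscore_f1(cand_emb, ref_emb))
```

In a real training loop this auxiliary term would be computed on differentiable embeddings of the CTC model's hypothesis so that its gradient encourages contextually plausible transcriptions; the NumPy version here only illustrates the scoring arithmetic.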