Distant, Multichannel Speech Recognition Using Microphone Array Coding and Cloud-Based Beamforming with a Self-Attention Channel Combinator
- Resource Type
- Conference
- Authors
- Sharma, Dushyant; Jones, Daniel; Kruchinin, Stanislav; Gong, Rong; Naylor, Patrick A.
- Source
- 2023 57th Asilomar Conference on Signals, Systems, and Computers, pp. 1415-1419, Oct. 2023
- Subject
- Communication, Networking and Broadcast Technologies
- Components, Circuits, Devices and Systems
- Computing and Processing
- Signal Processing and Analysis
- Speech codecs
- Direction-of-arrival estimation
- Array signal processing
- Speech coding
- Frequency-domain analysis
- Estimation
- Transfer functions
- Language
- ISSN
- 2576-2303
Distant Automatic Speech Recognition (ASR) holds the promise of a more natural human-machine interface, and using multiple microphones to acquire speech in such settings often improves ASR accuracy. The benefit comes from capturing spatial information, which can be used to enhance the speech and to estimate the direction of arrival (DOA) of the sound. Current ASR systems are based on end-to-end models that require considerable computational resources and are typically deployed in the cloud, which calls for a codec to reduce the transmission bandwidth. We present a multichannel speech coding scheme specifically adapted for microphone array signals. Unlike typical speech codecs, this scheme preserves the phase relationships between the signals so that the spatial information can be exploited in the cloud. We explore the use of a frequency-domain relative transfer function estimator as part of the codec. We also explore a modified discrete cosine transform (MDCT) based Self-Attention Channel Combinator (SACC) front-end for ASR and show that the time-domain signal after SACC processing exhibits significant improvements in C50, the clarity index that compares early to late reverberant energy with a 50 ms boundary. Furthermore, we show that preprocessing the array signals with a de-reverberation method leads to a lower word error rate (WER) and to more accurate DOA estimation.
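
The abstract does not describe how the frequency-domain relative transfer function (RTF) estimator works, so the following is only a minimal sketch of one textbook approach, not the authors' method: the RTF of a channel with respect to a reference microphone is approximated per frequency bin as the frame-averaged cross-spectrum divided by the frame-averaged auto-spectrum of the reference. The function names (`stft_frames`, `estimate_rtf`, `delay_from_rtf`), the frame length, the hop size, and the synthetic two-microphone signal are all assumptions made for illustration. The phase of the estimate carries the inter-channel delay, which is the kind of spatial information that DOA estimation relies on.

```python
import numpy as np


def stft_frames(x, frame_len=512, hop=256):
    """Return Hann-windowed one-sided spectra, shape (n_frames, frame_len // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def estimate_rtf(reference, channel, frame_len=512, hop=256, eps=1e-12):
    """Estimate the RTF per bin as averaged cross-spectrum / averaged auto-spectrum."""
    x_ref = stft_frames(reference, frame_len, hop)
    x_ch = stft_frames(channel, frame_len, hop)
    cross = np.mean(x_ch * np.conj(x_ref), axis=0)  # cross-power spectrum
    auto = np.mean(np.abs(x_ref) ** 2, axis=0)      # reference auto-power spectrum
    return cross / (auto + eps)


def delay_from_rtf(rtf, frame_len=512):
    """Recover a pure inter-channel delay (in samples) from the RTF phase slope."""
    # Skip the DC and Nyquist bins, whose phase is restricted to 0 or pi.
    phase = np.unwrap(np.angle(rtf[1:-1]))
    bins = np.arange(1, len(rtf) - 1)
    slope = np.polyfit(bins, phase, 1)[0]
    # A delay of d samples gives phase -2*pi*k*d/frame_len at bin k.
    return -slope * frame_len / (2.0 * np.pi)


# Synthetic example (assumed data): the second microphone receives the
# reference signal attenuated by 0.5 and delayed by 3 samples.
rng = np.random.default_rng(0)
mic_ref = rng.standard_normal(16000)
mic_2 = 0.5 * np.roll(mic_ref, 3)

rtf = estimate_rtf(mic_ref, mic_2)
print("mean |RTF|:", float(np.mean(np.abs(rtf))))
print("estimated delay (samples):", float(delay_from_rtf(rtf)))
```

In this sketch the magnitude of the estimate approximates the relative gain and the phase slope approximates the relative delay. A codec that discarded or distorted inter-channel phase would corrupt the delay estimate, which is why a coding scheme aimed at cloud-side beamforming needs to preserve phase relationships.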