MAAS: Multi-modal Assignation for Active Speaker Detection
- Resource Type
- Conference
- Authors
- Alcazar, Juan Leon; Heilbron, Fabian Caba; Thabet, Ali K.; Ghanem, Bernard
- Source
- 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 265-274, Oct. 2021
- Subject
- Computing and Processing; Visualization; Computer vision; Benchmark testing; Feature extraction; Data structures; Task analysis; Vision + other modalities; Video analysis and understanding
- Language
- ISSN
- 2380-7504
Active speaker detection requires a mindful integration of multi-modal cues. Current methods focus on modeling and fusing short-term audio-visual features for individual speakers, often at the frame level. We present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem and provides a straightforward strategy, in which independent visual features (speakers) in the scene are assigned to a previously detected speech event. Our experiments show that a small graph data structure built from local information can approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art performance on the AVA-ActiveSpeaker dataset with a mAP of 88.8%.
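To build intuition for the assignment idea in the abstract, the following is a minimal toy sketch: each detected face contributes a visual feature node connected to a single audio node, and the speech event is assigned to the most compatible face by cosine similarity. This is a hedged illustration only; the function name, the 4-d features, and the similarity choice are invented for the example and do not reproduce the MAAS graph network itself.

```python
import numpy as np

def assign_speech_event(audio_feat, speaker_feats):
    """Assign a detected speech event to the candidate speaker whose
    visual feature is most similar (cosine similarity) to the audio
    feature. Toy stand-in for an instantaneous audio-visual assignment."""
    audio = audio_feat / np.linalg.norm(audio_feat)
    scores = []
    for v in speaker_feats:
        v = v / np.linalg.norm(v)
        scores.append(float(audio @ v))  # cosine similarity edge weight
    return int(np.argmax(scores)), scores

# Made-up 4-d features for two candidate speakers in one frame.
audio = np.array([1.0, 0.0, 0.5, 0.0])
faces = [
    np.array([0.9, 0.1, 0.4, 0.0]),  # speaking face: similar to the audio
    np.array([0.0, 1.0, 0.0, 0.2]),  # silent face: dissimilar
]
idx, scores = assign_speech_event(audio, faces)
print(idx)  # index of the face assigned to the speech event
```

In the paper's setting this local, per-frame graph is then extended across time; the toy version above only captures the instantaneous assignment step.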