Martin Heckmann, "Audio-visual Word Prominence Detection from Clean and Noisy Speech", Computer Speech and Language, 2018.

Abstract
In this paper we introduce the audio-visual detection of word prominence and investigate how the additional visual information can be used to increase robustness in the presence of acoustic background noise. We evaluate the detection performance for each modality individually and also perform experiments using feature and decision fusion. Our experiments are based on a corpus of 11 English speakers which contains, in addition to the speech signal, videos of the speakers' head region. We capture the speakers' rigid head movements by tracking their nose. As an additional visual feature we use a 2D DCT calculated from the mouth region. The results show that both the rigid head movements and the movements inside the mouth region can be used to discriminate prominent from non-prominent words. Based on the visual features alone we obtain an Equal Error Rate (EER) of approx. 20% when averaged over all speakers. For clean speech, combining the visual and the acoustic features yields only a small improvement over audio-only detection. To simulate background noise we added 4 different noise types at varying SNR levels to the acoustic stream. As the extraction of prosodic cues from noisy speech is little researched, we put our results into perspective by juxtaposing them with those of a comparable speech recognition experiment. The results indicate that word prominence detection is quite robust against additional background noise. Despite this, audio-visual fusion leads to marked improvements for detection from noisy speech: we observe relative reductions of the EER of more than 40%.
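
Since the results are reported as Equal Error Rates and relative EER reductions, a minimal sketch of how such figures can be computed may help in reading them. It assumes each word receives a scalar detector score, with higher values meaning "more likely prominent", and a binary reference label. The function names, the threshold sweep, and the example numbers are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold.

    scores: detector outputs, higher = more likely prominent (assumption).
    labels: 1 for prominent words, 0 for non-prominent words.
    Returns the mean of the two error rates at the threshold where they
    are closest to each other.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]       # scores of prominent words
    neg = scores[~labels]      # scores of non-prominent words

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        miss_rate = np.mean(pos < t)          # prominent words rejected
        false_alarm_rate = np.mean(neg >= t)  # non-prominent words accepted
        gap = abs(false_alarm_rate - miss_rate)
        if gap < best_gap:
            best_gap = gap
            eer = (false_alarm_rate + miss_rate) / 2.0
    return eer

def relative_reduction(eer_baseline, eer_new):
    """Relative EER reduction of a new system with respect to a baseline."""
    return (eer_baseline - eer_new) / eer_baseline

# Toy data, made up for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On a finite test set the two error rates rarely coincide exactly, which is why the sketch returns their average at the closest operating point; interpolating between neighbouring thresholds is an equally common choice. With `relative_reduction`, a hypothetical drop in EER from 0.25 to 0.125 corresponds to a relative reduction of 50%.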