XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HJ Han, M Anwar, J Pino, WN Hsu, M Carpuat… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech recognition and translation systems perform poorly on noisy inputs, which are
frequent in realistic environments. Augmenting these systems with visual signals has the …

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

A Rouditchenko, Y Gong, S Thomas, L Karlinsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in
noise. Since videos are harder to obtain than audio, the video training data of AVSR models …

SpeechQE: Estimating the Quality of Direct Speech Translation

HJ Han, K Duh, M Carpuat - arXiv preprint arXiv:2410.21485, 2024 - arxiv.org
Recent advances in automatic quality estimation for machine translation have exclusively
focused on written language, leaving the speech modality underexplored. In this work, we …