XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
Speech recognition and translation systems perform poorly on noisy inputs, which are
frequent in realistic environments. Augmenting these systems with visual signals has the …
frequent in realistic environments. Augmenting these systems with visual signals has the …
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in
noise. Since videos are harder to obtain than audio, the video training data of AVSR models …
noise. Since videos are harder to obtain than audio, the video training data of AVSR models …
SpeechQE: Estimating the Quality of Direct Speech Translation
Recent advances in automatic quality estimation for machine translation have exclusively
focused on written language, leaving the speech modality underexplored. In this work, we …
focused on written language, leaving the speech modality underexplored. In this work, we …