Audio-visual fine-tuning of audio-only ASR models

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HJ Han, M Anwar, J Pino, WN Hsu, M Carpuat… - arXiv preprint arXiv …, 2024 - arxiv.org

Speech recognition and translation systems perform poorly on noisy inputs, which are
frequent in realistic environments. Augmenting these systems with visual signals has the …

Cited by 3

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

A Rouditchenko, Y Gong, S Thomas, L Karlinsky… - arXiv preprint arXiv …, 2024 - arxiv.org

Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in
noise. Since videos are harder to obtain than audio, the video training data of AVSR models …

Cited by 1

SpeechQE: Estimating the Quality of Direct Speech Translation

HJ Han, K Duh, M Carpuat - arXiv preprint arXiv:2410.21485, 2024 - arxiv.org

Recent advances in automatic quality estimation for machine translation have exclusively
focused on written language, leaving the speech modality underexplored. In this work, we …