Deep spoken keyword spotting: An overview
Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams
and has become a fast-growing technology thanks to the paradigm shift introduced by deep …
Deep audio-visual learning: A survey
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …
Learning audio-visual speech representation by masked multimodal cluster prediction
Video recordings of speech contain correlated audio and visual information, providing a
strong signal for speech representation learning from the speaker's lip movements and the …
Automatic speech recognition: a survey
Recently great strides have been made in the field of automatic speech recognition (ASR) by
using various deep learning techniques. In this study, we present a thorough comparison …
Visual speech recognition for multiple languages in the wild
Visual speech recognition (VSR) aims to recognize the content of speech based on lip
movements, without relying on the audio stream. Advances in deep learning and the …
End-to-end audio-visual speech recognition with conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and
Convolution-augmented transformer (Conformer), that can be trained in an end-to-end …
Auto-avsr: Audio-visual speech recognition with automatic labels
Audio-visual speech recognition has received a lot of attention due to its robustness against
acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech …
Robust self-supervised audio-visual speech recognition
Audio-based automatic speech recognition (ASR) degrades significantly in noisy
environments and is particularly vulnerable to interfering speech, as the model cannot …
Sub-word level lip reading with visual attention
KR Prajwal, T Afouras… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The goal of this paper is to learn strong lip reading models that can recognise speech in
silent videos. Most prior works deal with the open-set visual speech recognition problem by …
Audio-visual speech and gesture recognition by sensors of mobile devices
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …