Deep learning for visual speech analysis: A survey
Visual speech, referring to the visual domain of speech, has attracted increasing attention
due to its wide applications, such as public security, medical treatment, military defense, and …
Auto-AVSR: Audio-visual speech recognition with automatic labels
Audio-visual speech recognition has received a lot of attention due to its robustness against
acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech …
Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where audio inputs and visual inputs are both corrupted, which is not …
Audio-Visual Efficient Conformer for robust speech recognition
End-to-end Automatic Speech Recognition (ASR) systems based on neural
networks have seen large improvements in recent years. The availability of large scale hand …
VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …
u-HuBERT: Unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality
While audio-visual speech models can yield superior performance and robustness
compared to audio-only models, their development and adoption are hindered by the lack of …
Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …
Jointly learning visual and auditory speech representations from raw data
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and
auditory speech representations. Our pre-training objective involves encoding masked …
SynthVSR: Scaling up visual speech recognition with synthetic supervision
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on
increasingly large amounts of video data, while the publicly available transcribed video …
MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition
Multi-media communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …