Deep learning for visual speech analysis: A survey

C Sheng, G Kuang, L Bai, C Hou, Y Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual speech, referring to the visual domain of speech, has attracted increasing attention
due to its wide applications, such as public security, medical treatment, military defense, and …

Auto-AVSR: Audio-visual speech recognition with automatic labels

P Ma, A Haliassos, A Fernandez-Lopez… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Audio-visual speech recognition has received a lot of attention due to its robustness against
acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech …

Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring

J Hong, M Kim, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where both audio and visual inputs are corrupted, which is not …

Audio-visual efficient conformer for robust speech recognition

M Burchi, R Timofte - Proceedings of the IEEE/CVF Winter …, 2023 - openaccess.thecvf.com
Abstract End-to-end Automatic Speech Recognition (ASR) systems based on neural
networks have seen large improvements in recent years. The availability of large scale hand …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

u-HuBERT: Unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality

WN Hsu, B Shi - Advances in Neural Information Processing …, 2022 - proceedings.neurips.cc
While audio-visual speech models can yield superior performance and robustness
compared to audio-only models, their development and adoption are hindered by the lack of …

Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

Jointly learning visual and auditory speech representations from raw data

A Haliassos, P Ma, R Mira, S Petridis… - arXiv preprint arXiv …, 2022 - arxiv.org
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and
auditory speech representations. Our pre-training objective involves encoding masked …

SynthVSR: Scaling up visual speech recognition with synthetic supervision

X Liu, E Lakomkin, K Vougioukas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on
increasingly large amounts of video data, while the publicly available transcribed video …

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimedia communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …