Deep learning for visual speech analysis: A survey

C Sheng, G Kuang, L Bai, C Hou, Y Guo… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual speech, referring to the visual domain of speech, has attracted increasing attention
due to its wide applications, such as public security, medical treatment, military defense, and …

Auto-AVSR: Audio-visual speech recognition with automatic labels

P Ma, A Haliassos, A Fernandez-Lopez… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Audio-visual speech recognition has received a lot of attention due to its robustness against
acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech …

Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring

J Hong, M Kim, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where both audio and visual inputs are corrupted, which is not …

Audio-visual efficient conformer for robust speech recognition

M Burchi, R Timofte - Proceedings of the IEEE/CVF Winter …, 2023 - openaccess.thecvf.com
Abstract End-to-end Automatic Speech Recognition (ASR) systems based on neural
networks have seen large improvements in recent years. The availability of large scale hand …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

u-HuBERT: Unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality

WN Hsu, B Shi - Advances in Neural Information Processing …, 2022 - proceedings.neurips.cc
While audio-visual speech models can yield superior performance and robustness
compared to audio-only models, their development and adoption are hindered by the lack of …

Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

Jointly learning visual and auditory speech representations from raw data

A Haliassos, P Ma, R Mira, S Petridis… - arXiv preprint arXiv …, 2022 - arxiv.org
We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and
auditory speech representations. Our pre-training objective involves encoding masked …

SynthVSR: Scaling up visual speech recognition with synthetic supervision

X Liu, E Lakomkin, K Vougioukas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on
increasingly large amounts of video data, while the publicly available transcribed video …

MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition

X Cheng, T Jin, R Huang, L Li, W Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimedia communications facilitate global interaction among people. However, despite
researchers exploring cross-lingual translation techniques such as machine translation and …