Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring

J Hong, M Kim, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation in which both audio and visual inputs are corrupted, which is not …

Distinguishing homophenes using multi-head visual-audio memory for lip reading

M Kim, JH Yeo, YM Ro - Proceedings of the AAAI conference on …, 2022 - ojs.aaai.org
Recognizing speech from silent lip movements, which is called lip reading, is a challenging
task due to 1) the inherent information insufficiency of lip movement to fully represent the …

Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

Speaker-adaptive lip reading with user-dependent padding

M Kim, H Kim, YM Ro - European Conference on Computer Vision, 2022 - Springer
Lip reading aims to predict speech based on lip movements alone. As it focuses on visual
information to model the speech, its performance is inherently sensitive to personal lip …

Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …

DiffV2S: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding

J Choi, J Hong, YM Ro - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recent research has demonstrated impressive results in video-to-speech synthesis which
involves reconstructing speech solely from visual input. However, previous works have …

SVTS: scalable video-to-speech synthesis

R Mira, A Haliassos, S Petridis… - arXiv preprint …, 2022 - opus.bibliothek.uni-augsburg.de
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip
movements into the corresponding audio. This task has received an increasing amount of …

Intelligible lip-to-speech synthesis with speech units

J Choi, M Kim, YM Ro - arXiv preprint arXiv:2305.19603, 2023 - arxiv.org
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing
intelligible speech from a silent lip movement video. Specifically, to complement the …

Lip-to-speech synthesis in the wild with multi-task learning

M Kim, J Hong, YM Ro - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
Recent studies have shown impressive performance in Lip-to-speech synthesis, which aims to
reconstruct speech from visual information alone. However, they have suffered from …

Visual context-driven audio feature enhancement for robust end-to-end audio-visual speech recognition

J Hong, M Kim, D Yoo, YM Ro - arXiv preprint arXiv:2207.06020, 2022 - arxiv.org
This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech
Recognition (AVSR) system. To this end, we propose Visual Context-driven Audio Feature …