Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring

J Hong, M Kim, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where both audio and visual inputs are corrupted, which is not …

Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge

M Kim, JH Yeo, J Choi, YM Ro - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …

DiffV2S: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding

J Choi, J Hong, YM Ro - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Recent research has demonstrated impressive results in video-to-speech synthesis, which
involves reconstructing speech solely from visual input. However, previous works have …

Intelligible lip-to-speech synthesis with speech units

J Choi, M Kim, YM Ro - arXiv preprint arXiv:2305.19603, 2023 - arxiv.org
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing
intelligible speech from a silent lip movement video. Specifically, to complement the …

Visual speech recognition for low-resource languages with automatic labels from whisper model

JH Yeo, M Kim, S Watanabe, YM Ro - arXiv preprint arXiv:2309.08535, 2023 - arxiv.org
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple
languages, especially for low-resource languages that have a limited number of labeled …

Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens

M Kim, J Choi, S Maiti, JH Yeo… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In this paper, we propose methods to build a powerful and efficient Image-to-Speech
captioning (Im2Sp) model. To this end, we start by importing the rich knowledge related to …

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

JH Kim, J Kim, JS Chung - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also
known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many …

Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper

JH Yeo, M Kim, S Watanabe… - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple
languages, especially for low-resource languages that have a limited number of labeled …

Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction

Z Mu, X Yang - arXiv preprint arXiv:2404.12725, 2024 - arxiv.org
The integration of visual cues has revitalized the performance of the target speech extraction
task, elevating it to the forefront of the field. Nevertheless, this multimodal learning paradigm …

Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation

S Lei, X Cheng, M Lyu, J Hu, J Tan, R Liu… - Proceedings of the …, 2024 - aclanthology.org
In the field of speech synthesis, there is a growing emphasis on employing multimodal
speech to enhance robustness. A key challenge in this area is the scarcity of datasets that …