Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring
This paper deals with Audio-Visual Speech Recognition (AVSR) under a multimodal input
corruption situation where both audio and visual inputs are corrupted, which is not …
Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …
DiffV2S: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding
Recent research has demonstrated impressive results in video-to-speech synthesis which
involves reconstructing speech solely from visual input. However, previous works have …
Intelligible lip-to-speech synthesis with speech units
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework for synthesizing
intelligible speech from a silent lip movement video. Specifically, to complement the …
Visual speech recognition for low-resource languages with automatic labels from Whisper model
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple
languages, especially for low-resource languages that have a limited number of labeled …
Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens
In this paper, we propose methods to build a powerful and efficient Image-to-Speech
captioning (Im2Sp) model. To this end, we start by importing the rich knowledge related to …
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
The goal of this work is to reconstruct high quality speech from lip motions alone, a task also
known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many …
Visual Speech Recognition for Languages with Limited Labeled Data Using Automatic Labels from Whisper
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple
languages, especially for low-resource languages that have a limited number of labeled …
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Z Mu, X Yang - arXiv preprint arXiv:2404.12725, 2024 - arxiv.org
The integration of visual cues has revitalized the performance of the target speech extraction
task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm …
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation
In the field of speech synthesis, there is a growing emphasis on employing multimodal
speech to enhance robustness. A key challenge in this area is the scarcity of datasets that …