Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring
This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input
corruption situation where audio inputs and visual inputs are both corrupted, which is not …
Distinguishing homophenes using multi-head visual-audio memory for lip reading
Recognizing speech from silent lip movement, which is called lip reading, is a challenging
task due to 1) the inherent information insufficiency of lip movement to fully represent the …
Lip reading for low-resource languages by learning and combining general speech knowledge and language-specific knowledge
This paper proposes a novel lip reading framework, especially for low-resource languages,
which has not been well addressed in the previous literature. Since low-resource languages …
Speaker-adaptive lip reading with user-dependent padding
Lip reading aims to predict speech based on lip movements alone. As it focuses on visual
information to model the speech, its performance is inherently sensitive to personal lip …
Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques
SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …
DiffV2S: Diffusion-based video-to-speech synthesis with vision-guided speaker embedding
Recent research has demonstrated impressive results in video-to-speech synthesis which
involves reconstructing speech solely from visual input. However, previous works have …
SVTS: scalable video-to-speech synthesis
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip
movements into the corresponding audio. This task has received an increasing amount of …
Intelligible lip-to-speech synthesis with speech units
In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework, for synthesizing
intelligible speech from a silent lip movement video. Specifically, to complement the …
Lip-to-speech synthesis in the wild with multi-task learning
Recent studies have shown impressive performance in Lip-to-speech synthesis that aims to
reconstruct speech from visual information alone. However, they have been suffering from …
Visual context-driven audio feature enhancement for robust end-to-end audio-visual speech recognition
This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech
Recognition (AVSR) system. To this end, we propose Visual Context-driven Audio Feature …