Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques
SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …
Learning to dub movies via hierarchical prosody models
Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as
visual voice clone, V2C) task aims to generate speeches that match the speaker's emotion …
UMMAFormer: A universal multimodal-adaptive transformer framework for temporal forgery localization
The emergence of artificial intelligence-generated content (AIGC) has raised concerns about
the authenticity of multimedia content in various fields. However, existing research for …
Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation
This paper studies the task of speech reconstruction from ultrasound tongue images and
optical lip videos recorded in a silent speaking mode, where people only activate their intra …
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to
generate natural and intelligible speech directly from silent talking face videos. While recent …
Towards Accurate Lip-to-Speech Synthesis in-the-Wild
In this paper, we introduce a novel approach to address the task of synthesizing speech from
silent videos of any in-the-wild speaker solely based on lip movements. The traditional …
NPVForensics: Jointing non-critical phonemes and visemes for deepfake detection
Deepfake technologies empowered by deep learning are rapidly evolving, creating new
security concerns for society. Existing multimodal detection methods usually capture audio …
Lip2Speech: lightweight multi-speaker speech reconstruction with Gabor features
In environments characterised by noise or the absence of audio signals, visual cues, notably
facial and lip movements, serve as valuable substitutes for missing or corrupted speech …
FARV: Leveraging Facial and Acoustic Representation in Vocoder For Video-to-Speech Synthesis
Y Liu, Y Fang, Z Lin - openreview.net
In this paper, we introduce FARV, a vocoder specifically designed for Video-to-Speech
(V2S) synthesis, which integrates both facial embeddings and acoustic units to generate …