Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …

Learning to dub movies via hierarchical prosody models

G Cong, L Li, Y Qi, ZJ Zha, Q Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as
visual voice clone, V2C) task aims to generate speeches that match the speaker's emotion …

UMMAFormer: A universal multimodal-adaptive transformer framework for temporal forgery localization

R Zhang, H Wang, M Du, H Liu, Y Zhou… - Proceedings of the 31st …, 2023 - dl.acm.org
The emergence of artificial intelligence-generated content (AIGC) has raised concerns about
the authenticity of multimedia content in various fields. However, existing research for …

Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation

RC Zheng, Y Ai, ZH Ling - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
This paper studies the task of speech reconstruction from ultrasound tongue images and
optical lip videos recorded in a silent speaking mode, where people only activate their intra …

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

J Choi, JH Kim, J Li, JS Chung, S Liu - arXiv preprint arXiv:2411.19486, 2024 - arxiv.org
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to
generate natural and intelligible speech directly from silent talking face videos. While recent …

Towards Accurate Lip-to-Speech Synthesis in-the-Wild

S Hegde, R Mukhopadhyay, CV Jawahar… - Proceedings of the 31st …, 2023 - dl.acm.org
In this paper, we introduce a novel approach to address the task of synthesizing speech from
silent videos of any in-the-wild speaker solely based on lip movements. The traditional …

NPVForensics: Jointing non-critical phonemes and visemes for deepfake detection

Y Chen, Y Yu, R Ni, Y Zhao, H Li - arXiv preprint arXiv:2306.06885, 2023 - arxiv.org
Deepfake technologies empowered by deep learning are rapidly evolving, creating new
security concerns for society. Existing multimodal detection methods usually capture audio …

Lip2Speech: lightweight multi-speaker speech reconstruction with Gabor features

Z Dong, Y Xu, A Abel, D Wang - Applied Sciences, 2024 - mdpi.com
In environments characterised by noise or the absence of audio signals, visual cues, notably
facial and lip movements, serve as valuable substitutes for missing or corrupted speech …

FARV: Leveraging Facial and Acoustic Representation in Vocoder For Video-to-Speech Synthesis

Y Liu, Y Fang, Z Lin - openreview.net
In this paper, we introduce FARV, a vocoder specifically designed for Video-to-Speech
(V2S) synthesis, which integrates both facial embeddings and acoustic units to generate …