Analyzing lower half facial gestures for lip reading applications: Survey on vision techniques

SJ Preethi - Computer Vision and Image Understanding, 2023 - Elsevier
Lip reading has gained popularity due to the proliferation of emerging real-world
applications. This article provides a comprehensive review of benchmark datasets available …

Learning to dub movies via hierarchical prosody models

G Cong, L Li, Y Qi, ZJ Zha, Q Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as
visual voice clone, V2C) task aims to generate speeches that match the speaker's emotion …

UMMAFormer: A universal multimodal-adaptive transformer framework for temporal forgery localization

R Zhang, H Wang, M Du, H Liu, Y Zhou… - Proceedings of the 31st …, 2023 - dl.acm.org
The emergence of artificial intelligence-generated content (AIGC) has raised concerns about
the authenticity of multimedia content in various fields. However, existing research for …

Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation

RC Zheng, Y Ai, ZH Ling - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
This paper studies the task of speech reconstruction from ultrasound tongue images and
optical lip videos recorded in a silent speaking mode, where people only activate their intra …

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

J Choi, JH Kim, J Li, JS Chung, S Liu - arXiv preprint arXiv:2411.19486, 2024 - arxiv.org
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to
generate natural and intelligible speech directly from silent talking face videos. While recent …

Towards Accurate Lip-to-Speech Synthesis in-the-Wild

S Hegde, R Mukhopadhyay, CV Jawahar… - Proceedings of the 31st …, 2023 - dl.acm.org
In this paper, we introduce a novel approach to address the task of synthesizing speech from
silent videos of any in-the-wild speaker solely based on lip movements. The traditional …

NPVForensics: Jointing non-critical phonemes and visemes for deepfake detection

Y Chen, Y Yu, R Ni, Y Zhao, H Li - arXiv preprint arXiv:2306.06885, 2023 - arxiv.org
Deepfake technologies empowered by deep learning are rapidly evolving, creating new
security concerns for society. Existing multimodal detection methods usually capture audio …

Lip2Speech: lightweight multi-speaker speech reconstruction with Gabor features

Z Dong, Y Xu, A Abel, D Wang - Applied Sciences, 2024 - mdpi.com
In environments characterised by noise or the absence of audio signals, visual cues, notably
facial and lip movements, serve as valuable substitutes for missing or corrupted speech …

FARV: Leveraging Facial and Acoustic Representation in Vocoder For Video-to-Speech Synthesis

Y Liu, Y Fang, Z Lin - openreview.net
In this paper, we introduce FARV, a vocoder specifically designed for Video-to-Speech
(V2S) synthesis, which integrates both facial embeddings and acoustic units to generate …