ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration

WN Hsu, T Remez, B Shi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Prior works on improving speech quality with visual input typically study each type of
auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present …

Learning to dub movies via hierarchical prosody models

G Cong, L Li, Y Qi, ZJ Zha, Q Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as
visual voice clone, V2C) task aims to generate speech that matches the speaker's emotion …

A Systematic Literature Review: Facial Expression and Lip Movement Synchronization of an Audio Track

MH Alshahrani, MS Maashi - IEEE Access, 2024 - ieeexplore.ieee.org
This systematic literature review (SLR) explores the topic of Facial Expression and Lip
Movement Synchronization of an Audio Track in the context of Automatic Dubbing. This SLR …

Data-Driven Advancements in Lip Motion Analysis: A Review

S Torrie, A Sumsion, DJ Lee, Z Sun - Electronics, 2023 - mdpi.com
This work reviews the dataset-driven advancements that have occurred in the area of lip
motion analysis, particularly visual lip-reading and visual lip motion authentication, in the …

M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing

Y Liu, LF Wei, X Qian, TH Zhang, SL Chen… - Pattern Recognition …, 2024 - Elsevier
Dubbing refers to the procedure of recording characters by professional voice actors in films
and games. It is more expressive and immersive than conventional Text-to-Speech (TTS) …

StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing

G Cong, Y Qi, L Li, A Beheshti, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate
speech that aligns well with the video in both time and emotion, based on the tone of a …

MCDubber: Multimodal Context-Aware Expressive Video Dubbing

Y Zhao, Z Jia, R Liu, D Hu, F Bao, G Gao - arXiv preprint arXiv:2408.11593, 2024 - arxiv.org
Automatic Video Dubbing (AVD) aims to take the given script and generate speech that
aligns with lip motion and prosody expressiveness. Current AVD models mainly utilize visual …

High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

J Lu, B Sisman, M Zhang, H Li - arXiv preprint arXiv:2306.17005, 2023 - arxiv.org
The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video
given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have …

V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

J Choi, JH Kim, J Li, JS Chung, S Liu - arXiv preprint arXiv:2411.19486, 2024 - arxiv.org
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to
generate natural and intelligible speech directly from silent talking face videos. While recent …