ReVISE: Self-supervised speech resynthesis with visual input for universal and generalized speech regeneration
Prior works on improving speech quality with visual input typically study each type of
auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present …
Learning to dub movies via hierarchical prosody models
Given a piece of text, a video clip, and reference audio, the movie dubbing (also known as
visual voice cloning, V2C) task aims to generate speech that matches the speaker's emotion …
A Systematic Literature Review: Facial Expression and Lip Movement Synchronization of an Audio Track
MH Alshahrani, MS Maashi - IEEE Access, 2024 - ieeexplore.ieee.org
This systematic literature review (SLR) explores the topic of Facial Expression and Lip
Movement Synchronization of an Audio Track in the context of Automatic Dubbing. This SLR …
Data-Driven Advancements in Lip Motion Analysis: A Review
This work reviews the dataset-driven advancements that have occurred in the area of lip
motion analysis, particularly visual lip-reading and visual lip motion authentication, in the …
M3TTS: Multi-modal text-to-speech of multi-scale style control for dubbing
Dubbing refers to the procedure in which professional voice actors record the voices of characters in films
and games. It is more expressive and immersive than conventional Text-to-Speech (TTS) …
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate
speech that aligns well with the video in both time and emotion, based on the tone of a …
MCDubber: Multimodal Context-Aware Expressive Video Dubbing
Automatic Video Dubbing (AVD) aims to take a given script and generate speech that
aligns with the video's lip motion and expressive prosody. Current AVD models mainly utilize visual …
High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units
The goal of Automatic Voice Over (AVO) is to generate speech in sync with a silent video
given its text script. Recent AVO frameworks built upon text-to-speech synthesis (TTS) have …
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to
generate natural and intelligible speech directly from silent talking face videos. While recent …