NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset
The detection and localization of highly realistic deepfake audio-visual content are
challenging even for the most advanced state-of-the-art methods. While most of the research …
HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis
Large language models (LLM)-based speech synthesis has been widely adopted in zero-
shot speech synthesis. However, they require a large-scale data and possess the same …
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice
cloning capabilities, requiring only a few seconds of unseen speaker voice prompts …
1M-Deepfakes Detection Challenge
The detection and localization of deepfake content, particularly when small fake segments
are seamlessly mixed with real videos, remains a significant challenge in the field of digital …
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
Laughter is one of the most expressive and natural aspects of human speech, conveying
emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the …
LLMs Meet Multimodal Generation and Editing: A Survey
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
The long speech sequence has been troubling language models (LM) based TTS
approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a …
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …