Autoregressive speech synthesis without vector quantization
We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …
E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …
WavLLM: Towards robust and adaptive speech large language model
The recent advancements in large language models (LLMs) have revolutionized the field of
natural language processing, progressively broadening their scope to multimodal …
BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big
Adaptive Streamable TTS with Emergent abilities …
LLMs Meet Multimodal Generation and Editing: A Survey
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …
UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …
Multi-modal adversarial training for zero-shot voice cloning
A text-to-speech (TTS) model trained to reconstruct speech given text tends towards
predictions that are close to the average characteristics of a dataset, failing to model the …
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …
Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition
Language models (LMs) have long been used to improve results of automatic speech
recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error …
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …