Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MELLE, a novel continuous-valued token-based language modeling approach
for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel …

E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS

SE Eskimez, X Wang, M Thakker, C Li, CH Tsai… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …

WavLLM: Towards robust and adaptive speech large language model

S Hu, L Zhou, S Liu, S Chen, H Hao, J Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent advancements in large language models (LLMs) have revolutionized the field of
natural language processing, progressively broadening their scope to multimodal …

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data

M Łajszczak, G Cámbara, Y Li, F Beyhan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big
Adaptive Streamable TTS with Emergent abilities …

LLMs Meet Multimodal Generation and Editing: A Survey

Y He, Z Liu, J Chen, Z Tian, H Liu, X Chi, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the recent advancement in large language models (LLMs), there is a growing interest in
combining LLMs with multimodal learning. Previous surveys of multimodal large language …

UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - ACM Multimedia …, 2024 - openreview.net
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …

Multi-modal adversarial training for zero-shot voice cloning

J Janiczek, D Chong, D Dai, A Faria, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
A text-to-speech (TTS) model trained to reconstruct speech given text tends towards
predictions that are close to the average characteristics of a dataset, failing to model the …

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …

Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

Z Gu, T Likhomanenko, H Bai, E McDermott… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) have long been used to improve results of automatic speech
recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error …

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Z Jin, Y Yang, M Shi, W Kang, X Yang, Z Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
The evolving speech processing landscape is increasingly focused on complex scenarios
like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions …