Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers.

S Bengesi, H El-Sayed, MK Sarker, Y Houkpati… - IEEE …, 2024 - ieeexplore.ieee.org
The launch of ChatGPT in 2022 garnered global attention, marking a significant milestone in
the Generative Artificial Intelligence (GAI) field. While GAI has been in effect for the past …

Generative artificial intelligence in learning analytics: Contextualising opportunities and challenges through the learning analytics cycle

L Yan, R Martinez-Maldonado, D Gasevic - Proceedings of the 14th …, 2024 - dl.acm.org
Generative artificial intelligence (GenAI), exemplified by ChatGPT, Midjourney, and other
state-of-the-art large language models and diffusion models, holds significant potential for …

Understanding diffusion objectives as the elbo with simple data augmentation

D Kingma, R Gao - Advances in Neural Information …, 2024 - proceedings.neurips.cc
To achieve the highest perceptual quality, state-of-the-art diffusion models are optimized
with objectives that typically look very different from the maximum likelihood and the …

Speechx: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

Prompttts 2: Describing and generating voices with text prompt

Y Leng, Z Guo, K Shen, X Tan, Z Ju, Y Liu, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …

Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Generative pre-training for speech with flow matching

AH Liu, M Le, A Vyas, B Shi, A Tjandra… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative models have gained more and more attention in recent years for their
remarkable success in tasks that required estimating and sampling data distribution to …

Adriver-i: A general world model for autonomous driving

F Jia, W Mao, Y Liu, Y Zhao, Y Wen, C Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Typically, autonomous driving adopts a modular design, which divides the full stack into
perception, prediction, planning and control parts. Though interpretable, such modular …