NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers.
The launch of ChatGPT in 2022 garnered global attention, marking a significant milestone in
the Generative Artificial Intelligence (GAI) field. While GAI has been in effect for the past …
Generative artificial intelligence in learning analytics: Contextualising opportunities and challenges through the learning analytics cycle
Generative artificial intelligence (GenAI), exemplified by ChatGPT, Midjourney, and other
state-of-the-art large language models and diffusion models, holds significant potential for …
Understanding diffusion objectives as the ELBO with simple data augmentation
To achieve the highest perceptual quality, state-of-the-art diffusion models are optimized
with objectives that typically look very different from the maximum likelihood and the …
SpeechX: Neural codec language model as a versatile speech transformer
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
Seamless: Multilingual Expressive and Streaming Speech Translation
Large-scale automatic speech translation systems today lack key features that help machine-mediated
communication feel seamless when compared to human-to-human dialogue. In …
PromptTTS 2: Describing and generating voices with text prompt
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …
NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
Generative pre-training for speech with flow matching
Generative models have gained more and more attention in recent years for their
remarkable success in tasks that required estimating and sampling data distribution to …
ADriver-I: A general world model for autonomous driving
Typically, autonomous driving adopts a modular design, which divides the full stack into
perception, prediction, planning and control parts. Though interpretable, such modular …