Voicebox: Text-guided multilingual universal speech generation at scale
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …
UniAudio: An audio foundation model toward universal audio generation
Large language models (LLMs) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
SpeechX: Neural codec language model as a versatile speech transformer
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
SoundStorm: Efficient parallel audio generation
We present SoundStorm, a model for efficient, non-autoregressive audio generation.
SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional …
StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …
Efficient neural music generation
Recent progress in music generation has been remarkably advanced by the state-of-the-art
MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse …
Seamless: Multilingual Expressive and Streaming Speech Translation
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …
Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias
Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in
achieving timbre and speech style generalization, particularly in zero-shot TTS. However …
PromptTTS 2: Describing and generating voices with text prompt
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …