Neural codec language models are zero-shot text to speech synthesizers

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier

The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Cited by 150 (all 6 versions)

Transformers in the real world: A survey on nlp applications

N Patwardhan, S Marrone, C Sansone - Information, 2023 - mdpi.com

The field of Natural Language Processing (NLP) has undergone a significant transformation
with the introduction of Transformers. From the first introduction of this technology in 2017 …

Cited by 63 (all 5 versions)

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2023 - proceedings.neurips.cc

A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

Cited by 395 (all 5 versions)

Simple and controllable music generation

J Copet, F Kreuk, I Gat, T Remez… - Advances in …, 2024 - proceedings.neurips.cc

We tackle the task of conditional music generation. We introduce MusicGen, a single
Language Model (LM) that operates over several streams of compressed discrete music …

Cited by 307 (all 9 versions)

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - arXiv preprint arXiv …, 2023 - arxiv.org

For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …

Cited by 488 (all 4 versions)

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc

Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

Cited by 182 (all 8 versions)

High-fidelity audio compression with improved RVQGAN

R Kumar, P Seetharaman, A Luebs… - Advances in Neural …, 2024 - proceedings.neurips.cc

Abstract Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality neural …

Cited by 160 (all 5 versions)

Speak, read and prompt: High-fidelity text-to-speech with minimal supervision

E Kharitonov, D Vincent, Z Borsos… - Transactions of the …, 2023 - direct.mit.edu

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained
with minimal supervision. By combining two types of discrete speech representations, we …

Cited by 152 (all 5 versions)

Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Cited by 155 (all 3 versions)

Combined scaling for zero-shot transfer learning

H Pham, Z Dai, G Ghiasi, K Kawaguchi, H Liu, AW Yu… - Neurocomputing, 2023 - Elsevier

Recent developments in multimodal training methodologies, including CLIP and ALIGN,
obviate the necessity for individual data labeling. These approaches utilize pairs of data and …

Cited by 168 (all 5 versions)