UniAudio: An audio foundation model toward universal audio generation

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

Cited by 142

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Cited by 10

Towards audio language modeling - an overview

H Wu, X Chen, YC Lin, K Chang, HL Chung… - arXiv preprint arXiv …, 2024 - arxiv.org

Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …

Cited by 20

LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Z Du, J Wang, Q Chen, Y Chu, Z Gao, Z Li, K Hu… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …

Cited by 41

Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org

Audio is an essential part of our life, but creating it often requires expertise and is
time-consuming. Research communities have made great progress over the past year advancing …

Cited by 78

NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Cited by 105

FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

Cited by 12

E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS

SE Eskimez, X Wang, M Thakker, C Li, CH Tsai… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully
non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and …

Cited by 14

Multi-modal and multi-agent systems meet rationality: A survey

B Jiang, Y Xie, X Wang, WJ Su, CJ Taylor… - ICML 2024 Workshop …, 2024 - openreview.net

Rationality is characterized by logical thinking and decision-making that align with evidence
and logical rules. This quality is essential for effective problem-solving, as it ensures that …

Cited by 11

ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech

J Shi, J Tian, Y Wu, J Jung, JQ Yip… - arXiv preprint arXiv …, 2024 - arxiv.org

Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …

Cited by 5