Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …
recent years, yet the limited size of existing audio-language datasets poses challenges for …
Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
Towards audio language modeling-an overview
Neural audio codecs are initially introduced to compress audio data into compact codes to
reduce transmission latency. Researchers recently discovered the potential of codecs as …
reduce transmission latency. Researchers recently discovered the potential of codecs as …
Lauragpt: Listen, attend, understand, and regenerate audio with gpt
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …
on various natural language processing tasks, and have shown great potential as …
Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
consuming. Research communities have made great progress over the past year advancing …
Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
Flashspeech: Efficient zero-shot speech synthesis
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …
by language models and diffusion models. However, the generation process of both …
E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …
Multi-modal and multi-agent systems meet rationality: A survey
Rationality is characterized by logical thinking and decision-making that align with evidence
and logical rules. This quality is essential for effective problem-solving, as it ensures that …
and logical rules. This quality is essential for effective problem-solving, as it ensures that …
Espnet-codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …
addition to signal compression capabilities, discrete codecs have also been found to …