Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Adapting Fréchet audio distance for generative music evaluation

A Gui, H Gamper, S Braun… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Fréchet Audio Distance (FAD) is commonly …

Training audio captioning models without audio

S Deshmukh, B Elizalde… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Automated Audio Captioning (AAC) is the task of generating natural language descriptions
given an audio stream. A typical AAC system requires manually curated training data of …

Audio Dialogues: Dialogues dataset for audio and music understanding

A Goel, Z Kong, R Valle, B Catanzaro - arXiv preprint arXiv:2404.07616, 2024 - arxiv.org
Existing datasets for audio understanding primarily focus on single-turn interactions (i.e.,
audio captioning, audio question answering) for describing audio in natural language, thus …

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-
producing events and scenes. Recent advances in computing power and dataset creation …

Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey

P Sahoo, P Meharia, A Ghosh, S Saha, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

H Bukhari, S Deshmukh, H Dhamyal, B Raj… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech Emotion Recognition (SER) has been traditionally formulated as a classification
task. However, emotions are generally a spectrum whose distribution varies from situation to …

Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependent

M Tailleur, J Lee, M Lagrange, K Choi… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper explores whether considering alternative domain-specific embeddings to
calculate the Fréchet Audio Distance (FAD) metric can help the FAD to correlate better with …

On the audio hallucinations in large audio-video language models

T Nishimura, S Nakada, M Kondo - arXiv preprint arXiv:2401.09774, 2024 - arxiv.org
Large audio-video language models can generate descriptions for both video and audio.
However, they sometimes ignore audio content, producing audio descriptions solely reliant …

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Y Wang, R Hu, R Huang, Z Hong, R Li, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and
naturalness, yet they lack the capability to control the style attributes of the synthesized …