Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Adapting Fréchet audio distance for generative music evaluation

A Gui, H Gamper, S Braun… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Fréchet Audio Distance (FAD) is commonly …

Training audio captioning models without audio

S Deshmukh, B Elizalde… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Automated Audio Captioning (AAC) is the task of generating natural language descriptions
given an audio stream. A typical AAC system requires manually curated training data of …

Audio Dialogues: Dialogues dataset for audio and music understanding

A Goel, Z Kong, R Valle, B Catanzaro - arXiv preprint arXiv:2404.07616, 2024 - arxiv.org
Existing datasets for audio understanding primarily focus on single-turn interactions (i.e.,
audio captioning, audio question answering) for describing audio in natural language, thus …

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-
producing events and scenes. Recent advances in computing power and dataset creation …

Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey

P Sahoo, P Meharia, A Ghosh, S Saha, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …

SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

H Bukhari, S Deshmukh, H Dhamyal, B Raj… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech Emotion Recognition (SER) has been traditionally formulated as a classification
task. However, emotions are generally a spectrum whose distribution varies from situation to …

Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependent

M Tailleur, J Lee, M Lagrange, K Choi… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper explores whether considering alternative domain-specific embeddings to
calculate the Fréchet Audio Distance (FAD) metric can help the FAD to correlate better with …

On the audio hallucinations in large audio-video language models

T Nishimura, S Nakada, M Kondo - arXiv preprint arXiv:2401.09774, 2024 - arxiv.org
Large audio-video language models can generate descriptions for both video and audio.
However, they sometimes ignore audio content, producing audio descriptions solely reliant …

Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Y Wang, R Hu, R Huang, Z Hong, R Li, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and
naturalness, yet they lack the capability to control the style attributes of the synthesized …