Retrieval-Augmented Generation for AI-Generated Content: A Survey
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …
Adapting Frechet Audio Distance for Generative Music Evaluation
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …
Training audio captioning models without audio
S Deshmukh, B Elizalde… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Automated Audio Captioning (AAC) is the task of generating natural language descriptions
given an audio stream. A typical AAC system requires manually curated training data of …
Audio Dialogues: Dialogues dataset for audio and music understanding
Existing datasets for audio understanding primarily focus on single-turn interactions (i.e.,
audio captioning, audio question answering) for describing audio in natural language, thus …
Audio-Language Datasets of Scenes and Events: A Survey
Audio-language models (ALMs) process sounds to provide a linguistic description of sound-
producing events and scenes. Recent advances in computing power and dataset creation …
Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey
The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …
SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios
Speech Emotion Recognition (SER) has been traditionally formulated as a classification
task. However, emotions are generally a spectrum whose distribution varies from situation to …
Correlation of Fréchet Audio Distance With Human Perception of Environmental Audio Is Embedding Dependant
This paper explores whether considering alternative domain-specific embeddings to
calculate the Fréchet Audio Distance (FAD) metric can help the FAD to correlate better with …
On the audio hallucinations in large audio-video language models
T Nishimura, S Nakada, M Kondo - arXiv preprint arXiv:2401.09774, 2024 - arxiv.org
Large audio-video language models can generate descriptions for both video and audio.
However, they sometimes ignore audio content, producing audio descriptions solely reliant …
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and
naturalness, yet they lack the capability to control the style attributes of the synthesized …