Training audio captioning models without audio

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org

The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Cited by 44

Zero-Shot Audio Captioning Using Soft and Hard Prompts

Y Zhang, X Xu, R Du, H Liu, Y Dong, ZH Tan… - arXiv preprint arXiv …, 2024 - arxiv.org

In traditional audio captioning methods, a model is usually trained in a fully supervised
manner using a human-annotated dataset containing audio-text pairs and then evaluated on …

EDTC: enhance depth of text comprehension in automated audio captioning

L Tan, Y Cao, Y Zhou - arXiv preprint arXiv:2402.17259, 2024 - arxiv.org

Modality discrepancies have perpetually posed significant challenges within the realm of
Automated Audio Captioning (AAC) and across all multi-modal domains. Facilitating models …

Dissecting Temporal Understanding in Text-to-Audio Retrieval

AM Oncescu, JF Henriques, AS Koepke - ACM Multimedia 2024 - openreview.net

Recent advancements in machine learning have fueled research on multimodal interactions,
such as for instance text-to-video and text-to-audio retrieval tasks. These tasks require …

Domain Adaptation for Contrastive Audio-Language Models

S Deshmukh, R Singh, B Raj - arXiv preprint arXiv:2402.09585, 2024 - arxiv.org

Audio-Language Models (ALM) aim to be general-purpose audio models by providing
zero-shot capabilities at test time. The zero-shot performance of ALM improves by using suitable …

Cited by 2