AudioCaps: Generating captions for audios in the wild

The power of generative AI: A review of requirements, models, input–output formats, evaluation metrics, and challenges

A Bandi, PVSR Adapa, YEVPK Kuchi - Future Internet, 2023 - mdpi.com

Generative artificial intelligence (AI) has emerged as a powerful technology with numerous
applications in various domains. There is a need to identify the requirements and evaluation …

Cited by 158 · Related articles · All 7 versions

Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer

Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Cited by 48 · Related articles · All 11 versions

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present ImageBind, an approach to learn a joint embedding across six different
modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Cited by 559 · Related articles · All 7 versions

NExT-GPT: Any-to-any multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Cited by 272 · Related articles · All 4 versions

AudioLDM: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

Cited by 357 · Related articles · All 7 versions

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press

Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Cited by 199 · Related articles · All 7 versions

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

Cited by 315 · Related articles · All 5 versions

CLAP: Learning audio concepts from natural language supervision

B Elizalde, S Deshmukh, M Al Ismail… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org

Mainstream machine listening models are trained to learn audio concepts under the
paradigm of one class label to many recordings focusing on one task. Learning under such …

Cited by 285 · Related articles · All 3 versions

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2024 - proceedings.neurips.cc

We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Cited by 96 · Related articles · All 8 versions

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 52 · Related articles · All 3 versions