The power of generative ai: A review of requirements, models, input–output formats, evaluation metrics, and challenges

A Bandi, PVSR Adapa, YEVPK Kuchi - Future Internet, 2023 - mdpi.com
Generative artificial intelligence (AI) has emerged as a powerful technology with numerous
applications in various domains. There is a need to identify the requirements and evaluation …

Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Imagebind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Next-gpt: Any-to-any multimodal llm

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Audioldm: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

Clap learning audio concepts from natural language supervision

B Elizalde, S Deshmukh, M Al Ismail… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Mainstream machine listening models are trained to learn audio concepts under the
paradigm of one class label to many recordings focusing on one task. Learning under such …

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2024 - proceedings.neurips.cc
Abstract We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …