Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
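
The title names prompt-enhanced diffusion for text-to-audio, but the snippet is cut off before any detail. As a reference point, here is a minimal, generic sketch of the text-conditioned denoising-diffusion training objective such models typically build on (noise schedule, forward noising, epsilon-prediction loss). The `Denoiser` module and tensor sizes are placeholders, not the paper's actual architecture.

```python
# Generic text-conditioned diffusion training step (epsilon-prediction objective).
# Illustrative sketch only; not Make-An-Audio's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, 0)  # cumulative signal-retention factor

class Denoiser(nn.Module):
    """Placeholder for the conditional U-Net / transformer denoiser."""
    def __init__(self, latent_dim=64, text_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + text_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, latent_dim))
    def forward(self, x_t, t, text_emb):
        t_feat = t.float().unsqueeze(-1) / T                 # normalized timestep
        return self.net(torch.cat([x_t, text_emb, t_feat], dim=-1))

def diffusion_loss(denoiser, x0, text_emb):
    """Sample a timestep, corrupt x0 with noise, and regress the added noise."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps               # forward (noising) process
    return F.mse_loss(denoiser(x_t, t, text_emb), eps)

# Toy usage with random "audio latents" and "text embeddings".
model = Denoiser()
loss = diffusion_loss(model, torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
```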

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …
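
The entry describes contrastive language-audio pretraining. As a concrete reference point, below is a minimal sketch of the symmetric contrastive (CLIP-style InfoNCE) objective that aligns paired audio and text embeddings in a shared space. The encoders are toy placeholders, not the paper's actual audio/text backbones, and the feature-fusion and keyword-to-caption steps from the title are not shown.

```python
# CLIP-style symmetric contrastive loss between audio and text embeddings.
# Minimal sketch; encoders are placeholders, not the paper's actual models.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)   # unit-norm embeddings

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio-text pairs are positives,
    all other pairs in the batch serve as negatives."""
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: a batch of 16 paired (audio features, caption features).
audio_enc, text_enc = ToyEncoder(512), ToyEncoder(768)
loss = clap_contrastive_loss(audio_enc(torch.randn(16, 512)),
                             text_enc(torch.randn(16, 768)))
```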

Pengi: An audio language model for audio tasks

S Deshmukh, B Elizalde, R Singh… - Advances in Neural …, 2023 - proceedings.neurips.cc
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-
Supervised Learning and Zero-Shot Learning techniques. These approaches have led to …
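
The snippet stops before describing the model itself. One common recipe for an "audio language model" of this kind (and, to my understanding, roughly the pattern Pengi follows) is to map audio into a sequence of prefix embeddings that condition a causal language model, which then produces the task output as generated text. The sketch below shows that prefix-conditioning pattern; the module sizes, the stand-in Transformer, and the names are illustrative, not Pengi's actual components.

```python
# Audio-as-prefix conditioning for a language model (hedged sketch only).
import torch
import torch.nn as nn

class AudioPrefixLM(nn.Module):
    def __init__(self, audio_dim=512, lm_dim=256, prefix_len=8, vocab=1000):
        super().__init__()
        # Map a pooled audio feature to `prefix_len` pseudo-token embeddings.
        self.prefix_mapper = nn.Linear(audio_dim, prefix_len * lm_dim)
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.token_emb = nn.Embedding(vocab, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for a causal LM
        self.head = nn.Linear(lm_dim, vocab)

    def forward(self, audio_feat, token_ids):
        b = audio_feat.size(0)
        prefix = self.prefix_mapper(audio_feat).view(b, self.prefix_len, self.lm_dim)
        text = self.token_emb(token_ids)
        h = self.lm(torch.cat([prefix, text], dim=1))   # audio prefix + text tokens
        return self.head(h[:, self.prefix_len:])        # next-token logits for the text part

# Toy usage: batch of pooled audio features and tokenized task prompts.
model = AudioPrefixLM()
logits = model(torch.randn(4, 512), torch.randint(0, 1000, (4, 12)))
```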

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …
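
The snippet only states the goal (a general representation model extensible to new modalities). A common way to realize this, and roughly the pattern I understand ONE-PEACE to follow, is lightweight modality-specific adapters feeding a shared Transformer body, so supporting a new modality mostly means adding a new adapter. The sketch below illustrates that pattern; it is not the released 4B-parameter architecture.

```python
# Shared-backbone / per-modality-adapter pattern (illustrative sketch only).
import torch
import torch.nn as nn

class SharedBackboneModel(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # One small adapter per modality; adding a modality = adding an adapter.
        self.adapters = nn.ModuleDict({
            "audio":  nn.Linear(128, d_model),
            "text":   nn.Linear(300, d_model),
            "vision": nn.Linear(768, d_model),
        })
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # shared weights

    def forward(self, x, modality):
        tokens = self.adapters[modality](x)       # modality-specific projection
        return self.backbone(tokens).mean(dim=1)  # pooled general representation

model = SharedBackboneModel()
audio_repr = model(torch.randn(2, 50, 128), "audio")
text_repr = model(torch.randn(2, 20, 300), "text")
```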

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …
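
The title indicates the dataset is built by having ChatGPT rewrite noisy, weakly labelled audio metadata into clean captions; the snippet does not show the procedure, so below is only a hedged sketch of what such a rewriting-and-filtering pipeline could look like. `call_llm` is a hypothetical stand-in for an actual chat-completion client, and the prompt wording and length filter are illustrative, not the paper's.

```python
# Hedged sketch of LLM-assisted caption cleaning for weakly labelled audio metadata.
# `call_llm` is a hypothetical placeholder, not a real API; plug in your own client.
from typing import List

PROMPT_TEMPLATE = (
    "Rewrite the following raw description of a sound recording into one short, "
    "fluent English caption describing only the audio content. Remove names, "
    "timestamps, and anything that cannot be heard.\n\nRaw description: {raw}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical chat-completion call. Here it just echoes the raw text
    so the sketch runs offline; replace with a real client in practice."""
    return prompt.rsplit("Raw description: ", 1)[-1]

def clean_captions(raw_descriptions: List[str]) -> List[str]:
    captions = []
    for raw in raw_descriptions:
        caption = call_llm(PROMPT_TEMPLATE.format(raw=raw))
        # Simple length-based sanity filter; real pipelines apply stricter checks.
        if 3 <= len(caption.split()) <= 30:
            captions.append(caption.strip())
    return captions

print(clean_captions(["dog barking twice, recorded by J. Smith 2019"]))
```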

Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session

LM Heller, B Elizalde, B Raj, S Deshmukh - arXiv preprint arXiv …, 2023 - arxiv.org
Machine Listening, as usually formalized, attempts to perform a task that is, from our
perspective, fundamentally human-performable, and performed by humans. Current …

Natural language supervision for general-purpose audio representations

B Elizalde, S Deshmukh, H Wang - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Audio-Language models jointly learn multimodal text and audio representations that enable
Zero-Shot inference. Models rely on the encoders to create powerful representations of the …
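
The snippet mentions zero-shot inference from jointly learned audio and text representations. Concretely, zero-shot classification with a CLAP-style dual encoder scores an audio clip against text prompts built from the candidate class names and picks the closest one. A minimal sketch of that inference step follows; the linear "encoders" are untrained placeholders standing in for the paper's actual models.

```python
# Zero-shot audio classification with a CLAP-style dual encoder (sketch only).
import torch
import torch.nn.functional as F
from torch import nn

# Placeholder (untrained) projections standing in for the audio/text encoders.
audio_encoder = nn.Linear(512, 128)
text_encoder = nn.Linear(768, 128)

def zero_shot_classify(audio_feats, class_text_feats, class_names):
    """Embed both sides, L2-normalize, and pick the most similar class prompt."""
    a = F.normalize(audio_encoder(audio_feats), dim=-1)
    t = F.normalize(text_encoder(class_text_feats), dim=-1)
    sims = a @ t.t()                     # cosine similarities, shape (n_audio, n_class)
    return [class_names[i] for i in sims.argmax(dim=-1)]

class_names = ["dog bark", "siren", "rain"]
# In practice each class is wrapped in a prompt such as "this is a sound of rain"
# and tokenized; random features here keep the sketch self-contained.
preds = zero_shot_classify(torch.randn(4, 512), torch.randn(3, 768), class_names)
```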

FLAP: Fast language-audio pre-training

CF Yeh, PY Huang, V Sharma, SW Li… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that
efficiently and effectively learns aligned audio and language representations through …
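
The snippet is cut off before saying how the alignment is learned. FLAP's stated angle is efficiency, which, as I understand it, comes largely from dropping a large fraction of audio spectrogram patches before encoding and then applying the usual contrastive audio-text objective on the surviving tokens. The sketch below shows that masking-plus-contrastive pattern with placeholder encoders, not the actual FLAP implementation.

```python
# Masked-audio contrastive alignment (efficiency sketch, placeholder modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_drop_patches(patches, keep_ratio=0.25):
    """Keep a random subset of audio patches per example (the efficiency trick)."""
    b, n, d = patches.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :n_keep]          # random subset of indices
    return patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

class PatchEncoder(nn.Module):
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
    def forward(self, x):                                       # x: (b, n, in_dim)
        return F.normalize(self.proj(x).mean(dim=1), dim=-1)    # pooled, unit-norm

def contrastive_loss(a, t, temperature=0.07):
    logits = a @ t.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

audio_enc, text_enc = PatchEncoder(80), PatchEncoder(300)
audio_patches = torch.randn(16, 256, 80)     # e.g. spectrogram patches
text_tokens = torch.randn(16, 32, 300)       # paired caption token features
loss = contrastive_loss(audio_enc(random_drop_patches(audio_patches)),
                        text_enc(text_tokens))
```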

CoLLAT: On adding fine-grained audio understanding to language models using token-level locked-language tuning

DAR Silva, S Whitehead… - Advances in Neural …, 2024 - proceedings.neurips.cc
Humans can easily understand various audio concepts, but conventional audio
classification models fail due to their inability to predict unseen classes during training. To …
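
The snippet motivates the problem (unseen classes) but stops before the method; the title's "locked-language tuning" suggests keeping a pretrained language backbone frozen and training only the audio side to land in its embedding space. Below is a hedged sketch of that locked-tuning pattern with a standard in-batch contrastive objective; the modules are placeholders, and the pooling choice is mine (the paper's token-level grounding is only noted in a comment).

```python
# Locked-language tuning sketch: freeze the text encoder, train the audio encoder
# to align with its embeddings. Illustrative placeholders throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_encoder = nn.Linear(300, 128)          # stand-in for a pretrained language model
for p in text_encoder.parameters():
    p.requires_grad = False                 # "locked": language weights never update

audio_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 128))

def locked_contrastive_loss(audio_feats, text_token_feats, temperature=0.07):
    """In-batch contrastive loss; only the audio encoder receives gradients.
    The paper reportedly works with token-level (not pooled) text embeddings
    for fine-grained grounding; mean pooling here keeps the sketch simple."""
    a = F.normalize(audio_encoder(audio_feats), dim=-1)                   # (b, d)
    t = F.normalize(text_encoder(text_token_feats).mean(dim=1), dim=-1)   # pooled tokens
    logits = a @ t.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = locked_contrastive_loss(torch.randn(8, 80), torch.randn(8, 16, 300))
loss.backward()   # gradients flow only into the audio encoder
```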

Learning tri-modal embeddings for zero-shot soundscape mapping

S Khanal, S Sastry, A Dhakal, N Jacobs - arXiv preprint arXiv:2309.10667, 2023 - arxiv.org
We focus on the task of soundscape mapping, which involves predicting the most probable
sounds that could be perceived at a particular geographic location. We utilise recent state-of …
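
The snippet is truncated, but the title describes learning tri-modal embeddings for zero-shot soundscape mapping. A natural reading is three encoders (audio, text, and a geographic/overhead-imagery modality) pulled into one shared space with pairwise contrastive losses, so a location embedding can be matched against sound or text queries at inference time. The sketch below illustrates that setup; the encoders are placeholders and the choice of third modality is my assumption.

```python
# Tri-modal contrastive alignment sketch (audio / text / location-imagery).
# Encoders and the choice of third modality are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def embed(encoder, x):
    return F.normalize(encoder(x), dim=-1)

def pair_loss(a, b, temperature=0.07):
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

audio_enc = nn.Linear(512, 128)    # placeholder audio encoder
text_enc = nn.Linear(768, 128)     # placeholder caption encoder
geo_enc = nn.Linear(1024, 128)     # placeholder satellite-image/location encoder

a = embed(audio_enc, torch.randn(16, 512))
t = embed(text_enc, torch.randn(16, 768))
g = embed(geo_enc, torch.randn(16, 1024))

# Pairwise alignment pulls all three modalities into one shared space.
loss = pair_loss(a, g) + pair_loss(t, g) + pair_loss(a, t)

# Zero-shot soundscape mapping: rank candidate sounds for one location embedding.
ranking = (g[0] @ a.t()).argsort(descending=True)
```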