Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …
Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation
Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …
Pengi: An audio language model for audio tasks
S Deshmukh, B Elizalde, R Singh… - Advances in Neural …, 2023 - proceedings.neurips.cc
In the domain of audio processing, Transfer Learning has facilitated the rise of Self-
Supervised Learning and Zero-Shot Learning techniques. These approaches have led to …
One-peace: Exploring one general representation model toward unlimited modalities
In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …
Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …
Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session
Machine Listening, as usually formalized, attempts to perform a task that is, from our
perspective, fundamentally human-performable, and performed by humans. Current …
Natural language supervision for general-purpose audio representations
Audio-Language models jointly learn multimodal text and audio representations that enable
Zero-Shot inference. Models rely on the encoders to create powerful representations of the …
Flap: Fast language-audio pre-training
We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that
efficiently and effectively learns aligned audio and language representations through …
CoLLAT: on adding fine-grained audio understanding to language models using token-level locked-language tuning
DAR Silva, S Whitehead… - Advances in Neural …, 2024 - proceedings.neurips.cc
Humans can easily understand various audio concepts, but conventional audio
classification models fail due to their inability to predict unseen classes during training. To …
Learning tri-modal embeddings for zero-shot soundscape mapping
We focus on the task of soundscape mapping, which involves predicting the most probable
sounds that could be perceived at a particular geographic location. We utilise recent state-of …