Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Beyond the status quo: A contemporary survey of advances and challenges in audio captioning

X Xu, Z Xie, M Wu, K Yu - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Automated audio captioning (AAC), a task that mimics human perception as well as
innovatively links audio processing and natural language processing, has overseen much …

ACTUAL: Audio captioning with caption feature space regularization

Y Zhang, H Yu, R Du, ZH Tan, W Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Audio captioning aims at describing the content of audio clips with human language. Due to
the ambiguity of audio content, different people may perceive the same audio clip differently …

Graph attention for automated audio captioning

F Xiao, J Guan, Q Zhu, W Wang - IEEE Signal Processing …, 2023 - ieeexplore.ieee.org
State-of-the-art audio captioning methods typically use the encoder-decoder structure with
pretrained audio neural networks (PANNs) as encoders for feature extraction. However, the …

EnCLAP: Combining neural audio codec and audio-text joint embedding for automated audio captioning

J Kim, J Jung, J Lee, SH Woo - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs
two acoustic representation models, EnCodec and CLAP, along with a pretrained language …

Towards generating diverse audio captions via adversarial training

X Mei, X Liu, J Sun, MD Plumbley - IEEE/ACM transactions on …, 2024 - ieeexplore.ieee.org
Automated audio captioning is a cross-modal translation task for describing the content of
audio clips with natural language sentences. This task has attracted increasing attention and …

A novel plant type, leaf disease and severity identification framework using CNN and transformer with multi-label method

B Yang, M Li, F Li, Y Wang, Q Liang, R Zhao, C Li… - Scientific Reports, 2024 - nature.com
The growth of plants is threatened by numerous diseases. Accurate and timely identification
of these diseases is crucial to prevent disease spreading. Many deep learning-based …

Synth-AC: Enhancing audio captioning with synthetic supervision

F Xiao, Q Zhu, J Guan, X Liu, H Liu, K Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Data-driven approaches hold promise for audio captioning. However, the development of
audio captioning methods can be biased due to the limited availability and quality of text …

Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing

X Sun, X Wang, Q Liu, X Zhou - IEEE Signal Processing Letters, 2024 - ieeexplore.ieee.org
The weakly-supervised audio-visual video parsing (AVVP) task aims to parse a video into
temporal events and predict their modality-specific categories. Current works primarily focus …

Generating Accurate and Diverse Audio Captions through Variational Autoencoder Framework

Y Zhang, R Du, ZH Tan, W Wang… - IEEE Signal Processing …, 2024 - ieeexplore.ieee.org
Generating both diverse and accurate descriptions is an essential goal in the audio
captioning task. Traditional methods mainly focus on improving the accuracy of the …