Automated audio captioning: An overview of recent progress and new challenges
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …
Beyond the status quo: A contemporary survey of advances and challenges in audio captioning
Automated audio captioning (AAC), a task that mimics human perception as well as
innovatively links audio processing and natural language processing, has overseen much …
ACTUAL: Audio captioning with caption feature space regularization
Audio captioning aims at describing the content of audio clips with human language. Due to
the ambiguity of audio content, different people may perceive the same audio clip differently …
Graph attention for automated audio captioning
State-of-the-art audio captioning methods typically use the encoder-decoder structure with
pretrained audio neural networks (PANNs) as encoders for feature extraction. However, the …
EnCLAP: Combining neural audio codec and audio-text joint embedding for automated audio captioning
We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs
two acoustic representation models, EnCodec and CLAP, along with a pretrained language …
Towards generating diverse audio captions via adversarial training
Automated audio captioning is a cross-modal translation task for describing the content of
audio clips with natural language sentences. This task has attracted increasing attention and …
A novel plant type, leaf disease and severity identification framework using CNN and transformer with multi-label method
The growth of plants is threatened by numerous diseases. Accurate and timely identification
of these diseases is crucial to prevent disease spreading. Many deep learning-based …
Synth-AC: Enhancing audio captioning with synthetic supervision
Data-driven approaches hold promise for audio captioning. However, the development of
audio captioning methods can be biased due to the limited availability and quality of text …
Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing
The weakly-supervised audio-visual video parsing (AVVP) task aims to parse a video into
temporal events and predict their modality-specific categories. Current works primarily focus …
Generating Accurate and Diverse Audio Captions through Variational Autoencoder Framework
Generating both diverse and accurate descriptions is an essential goal in the audio
captioning task. Traditional methods mainly focus on improving the accuracy of the …