Guiding image captioning models toward more specific captions

S Kornblith, L Li, Z Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Image captioning is conventionally formulated as the task of generating captions that match
the conditional distribution of reference image-caption pairs. However, reference captions in …

Segment and caption anything

X Huang, J Wang, Y Tang, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability
to generate regional captions. SAM presents strong generalizability to segment anything …

Large language model based generative error correction: A challenge and baselines for speech recognition, speaker tagging, and emotion recognition

CHH Yang, T Park, Y Gong, Y Li, Z Chen, YT Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Given recent advances in generative AI technology, a key question is how large language
models (LLMs) can enhance acoustic modeling tasks using text decoding results from a …

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Y Ge, X Zeng, JS Huffman, TY Lin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Existing automatic captioning methods for visual content face challenges such as lack of
detail, content hallucination, and poor instruction following. In this work, we propose …

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

N Moratelli, D Caffagni, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
The conventional training approach for image captioning involves pre-training a network
using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to …

ALOHa: A New Measure for Hallucination in Captioning Models

S Petryk, DM Chan, A Kachinthaya, H Zou… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite recent advances in multimodal pre-training for visual description, state-of-the-art
models still produce captions containing errors, such as hallucinating objects not present in …

Toward Automatic Relevance Judgment using Vision–Language Models for Image–Text Retrieval Evaluation

JH Yang, J Lin - arXiv preprint arXiv:2408.01363, 2024 - arxiv.org
Vision–Language Models (VLMs) have demonstrated success across diverse applications,
yet their potential to assist in relevance judgments remains uncertain. This paper assesses …

Automatic audio captioning with encoder fusion, multi-layer aggregation, and large language model enriched summarization

J Jung, D Zhang, HCH Yang, SL Wu, DM Chan, Z Kong… - 2024 - dcase.community
In this report, we describe our submission to Track 6 of the DCASE 2024 challenge for the
task of Automated Audio Captioning (AAC). The submitted models utilize an encoder …

Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

B Hu, B Ray, A Leung, A Summerville, D Joy… - arXiv preprint arXiv …, 2024 - arxiv.org
In difficult decision-making scenarios, it is common to have conflicting opinions among
expert human decision-makers as there may not be a single right answer. Such decisions …

Fluent and Accurate Image Captioning with a Self-Trained Reward Model

N Moratelli, M Cornia, L Baraldi… - … Conference on Pattern …, 2025 - Springer
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has
been a classical strategy for promoting caption quality at the sequence level. This approach …