Guiding image captioning models toward more specific captions
S Kornblith, L Li, Z Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Image captioning is conventionally formulated as the task of generating captions that match
the conditional distribution of reference image-caption pairs. However, reference captions in …
Segment and caption anything
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability
to generate regional captions. SAM presents strong generalizability to segment anything …
Large language model based generative error correction: A challenge and baselines for speech recognition, speaker tagging, and emotion recognition
Given recent advances in generative AI technology, a key question is how large language
models (LLMs) can enhance acoustic modeling tasks using text decoding results from a …
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Existing automatic captioning methods for visual content face challenges such as lack of
detail, content hallucination, and poor instruction following. In this work we propose …
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
The conventional training approach for image captioning involves pre-training a network
using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to …
ALOHa: A New Measure for Hallucination in Captioning Models
Despite recent advances in multimodal pre-training for visual description, state-of-the-art
models still produce captions containing errors, such as hallucinating objects not present in …
Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation
Vision-Language Models (VLMs) have demonstrated success across diverse applications,
yet their potential to assist in relevance judgments remains uncertain. This paper assesses …
Automatic audio captioning with encoder fusion, multi-layer aggregation, and large language model enriched summarization
In this report, we describe our submission to Track 6 of the DCASE 2024 challenge for the
task of Automated Audio Captioning (AAC). The submitted models utilize an encoder …
Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain
In difficult decision-making scenarios, it is common to have conflicting opinions among
expert human decision-makers as there may not be a single right answer. Such decisions …
Fluent and Accurate Image Captioning with a Self-Trained Reward Model
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has
been a classical strategy for promoting caption quality at the sequence level. This approach …