HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …
Generative Region-Language Pretraining for Open-Ended Object Detection
In recent research, significant attention has been devoted to the open-vocabulary object
detection task, aiming to generalize beyond the limited number of classes labeled during …
Cultural and linguistic diversity improves visual representations
Computer vision often treats perception as objective, and this assumption gets reflected in
the way that datasets are collected and models are trained. For instance, image descriptions …
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …
Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …
HICEScore: A Hierarchical Metric for Image Captioning Evaluation
Image captioning evaluation metrics can be divided into two categories, reference-based
metrics and reference-free metrics. However, reference-based approaches may struggle to …
Benchmarking and Improving Detail Image Caption
Image captioning has long been regarded as a fundamental task in visual understanding.
Recently, however, little large vision-language model (LVLM) research discusses the model's …
Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction
Q Shen, H Lin, H Liu, Z Lin… - 2024 International Joint …, 2024 - ieeexplore.ieee.org
Given a text and its related image, the multimodal relation extraction (MRE) task aims at
predicting the correct semantic relation between two entities in the input text. Though certain …
Multimodal scene-graph matching for cheapfakes detection
MT Nguyen, QT Nguyen, MS Dao, BT Nguyen - 2024 - researchsquare.com
The development of technology and social media platforms has led to the proliferation of
fake news, including the cheapfakes problem. Cheapfakes can be produced easily and …
Linear Alignment of Vision-language Models for Image Captioning
Recently, vision-language models like CLIP have advanced the state of the art in a variety of
multi-modal tasks including image captioning and caption evaluation. Many approaches …