HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Q Yu, J Li, L Wei, L Pang, W Ye, B Qin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) tuned on machine-generated
instruction-following data have demonstrated remarkable performance in various multimodal …

Generative Region-Language Pretraining for Open-Ended Object Detection

C Lin, Y Jiang, L Qu, Z Yuan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In recent research, significant attention has been devoted to the open-vocabulary object
detection task, aiming to generalize beyond the limited number of classes labeled during …

Cultural and linguistic diversity improves visual representations

A Ye, S Santy, JD Hwang, AX Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Computer vision often treats perception as objective, and this assumption gets reflected in
the way that datasets are collected and models are trained. For instance, image descriptions …

MeaCap: Memory-Augmented Zero-shot Image Captioning

Z Zeng, Y Xie, H Zhang, C Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …

Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

D Qi, H Zhao, Z Wei, S Li - arXiv preprint arXiv:2406.10839, 2024 - arxiv.org
Despite recent advances in the general visual instruction-following ability of Multimodal
Large Language Models (MLLMs), they still struggle with critical problems when required to …

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Z Zeng, J Sun, H Zhang, T Wen, Y Su, Y Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Image captioning evaluation metrics can be divided into two categories: reference-based
metrics and reference-free metrics. However, reference-based approaches may struggle to …

Benchmarking and Improving Detail Image Caption

H Dong, J Li, B Wu, J Wang, Y Zhang, H Guo - arXiv preprint arXiv …, 2024 - arxiv.org
Image captioning has long been regarded as a fundamental task in visual understanding.
Recently, however, little large vision-language model (LVLM) research discusses the model's …

Exploiting Visual Relation and Multi-Grained Knowledge for Multimodal Relation Extraction

Q Shen, H Lin, H Liu, Z Lin… - 2024 International Joint …, 2024 - ieeexplore.ieee.org
Given a text and its related image, the multimodal relation extraction (MRE) task aims at
predicting the correct semantic relation between two entities in the input text. Though certain …

Multimodal scene-graph matching for cheapfakes detection

MT Nguyen, QT Nguyen, MS Dao, BT Nguyen - 2024 - researchsquare.com
The development of technology and social media platforms has led to the proliferation of
fake news, including the cheapfakes problem. Cheapfakes can be produced easily and …

Linear Alignment of Vision-language Models for Image Captioning

Recently, vision-language models like CLIP have advanced the state of the art in a variety of
multi-modal tasks including image captioning and caption evaluation. Many approaches …