ViCLEVR: A visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese

KV Tran, HP Phan, K Van Nguyen, NLT Nguyen - Multimedia Systems, 2024 - Springer
In recent years, visual question answering (VQA) has gained significant attention for its
diverse applications, including intelligent car assistance and aiding the visually impaired …

Semi-supervised image captioning by adversarially propagating labeled data

DJ Kim, TH Oh, J Choi, IS Kweon - IEEE Access, 2024 - ieeexplore.ieee.org
We present a novel data-efficient semi-supervised framework to improve the generalization
of image captioning models. Constructing a large-scale labeled image captioning dataset is …

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

J Woo, H Ryu, Y Jang, JW Cho, JS Chung - Proceedings of the 32nd …, 2024 - dl.acm.org
Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match
text queries. Recent studies in VTG employ cross-attention to correlate visual frames and …

IFCap: Image-like retrieval and frequency-based entity filtering for zero-shot captioning

S Lee, SW Kim, T Kim, DJ Kim - arXiv preprint arXiv:2409.18046, 2024 - arxiv.org
Recent advancements in image captioning have explored text-only training methods to
overcome the limitations of paired image-text data. However, existing text-only training …

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Y Oh, JW Cho, DJ Kim, IS Kweon, J Kim - arXiv preprint arXiv:2410.05210, 2024 - arxiv.org
In this paper, we propose a new method to enhance compositional understanding in
pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot …