ViCLEVR: A visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese
In recent years, visual question answering (VQA) has gained significant attention for its
diverse applications, including intelligent car assistance, aiding visually impaired …
Semi-supervised image captioning by adversarially propagating labeled data
We present a novel data-efficient semi-supervised framework to improve the generalization
of image captioning models. Constructing a large-scale labeled image captioning dataset is …
Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding
Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match
text queries. Recent studies in VTG employ cross-attention to correlate visual frames and …
IFCap: Image-like retrieval and frequency-based entity filtering for zero-shot captioning
S Lee, SW Kim, T Kim, DJ Kim - arXiv preprint arXiv:2409.18046, 2024
Recent advancements in image captioning have explored text-only training methods to
overcome the limitations of paired image-text data. However, existing text-only training …
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
In this paper, we propose a new method to enhance compositional understanding in pre-
trained vision and language models (VLMs) without sacrificing performance in zero-shot …