How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
LXMERT: Learning cross-modality encoder representations from transformers
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …
Unified vision-language pre-training for image captioning and VQA
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …
OK-VQA: A visual question answering benchmark requiring external knowledge
Visual Question Answering (VQA) in its ideal form lets us study reasoning in the
joint space of vision and language and serves as a proxy for the AI task of scene …
Towards VQA models that can read
Studies have shown that a dominant class of questions asked by visually impaired users on
images of their surroundings involves reading text in the image. But today's VQA models can …
In defense of grid features for visual question answering
Popularized as 'bottom-up' attention, bounding box (or region) based visual features have
recently surpassed vanilla grid-based convolutional features as the de facto standard for …
PromptCap: Prompt-guided task-aware image captioning
Knowledge-based visual question answering (VQA) involves questions that require world
knowledge beyond the image to yield the correct answer. Large language models (LMs) like …
Relation-aware graph attention network for visual question answering
In order to answer semantically-complicated questions about an image, a Visual Question
Answering (VQA) model needs to fully understand the visual scene in the image, especially …
LaTr: Layout-aware transformer for scene-text VQA
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …
TAP: Text-aware pre-training for Text-VQA and Text-Caption
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …