How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
LXMERT: Learning cross-modality encoder representations from transformers
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …
Unified vision-language pre-training for image captioning and VQA
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …
OK-VQA: A visual question answering benchmark requiring external knowledge
Visual Question Answering (VQA) in its ideal form lets us study reasoning in the
joint space of vision and language and serves as a proxy for the AI task of scene …
Towards VQA models that can read
Studies have shown that a dominant class of questions asked by visually impaired users on
images of their surroundings involves reading text in the image. But today's VQA models can …
In defense of grid features for visual question answering
Popularized as 'bottom-up' attention, bounding box (or region) based visual features have
recently surpassed vanilla grid-based convolutional features as the de facto standard for …
PromptCap: Prompt-guided task-aware image captioning
Knowledge-based visual question answering (VQA) involves questions that require world
knowledge beyond the image to yield the correct answer. Large language models (LMs) like …
Relation-aware graph attention network for visual question answering
In order to answer semantically-complicated questions about an image, a Visual Question
Answering (VQA) model needs to fully understand the visual scene in the image, especially …
LaTr: Layout-aware transformer for scene-text VQA
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …
TAP: Text-aware pre-training for Text-VQA and Text-Caption
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …