From strings to things: Knowledge-enabled VQA model that can read and reason

A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective

C Chen, Y Wu, Q Dai, HY Zhou, M Xu… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org

Graph Neural Networks (GNNs) have gained momentum in graph representation learning
and boosted the state of the art in a variety of areas, such as data mining (e.g., social network …

Cited by 54 Related articles All 3 versions

An analysis of graph convolutional networks and recent datasets for visual question answering

AA Yusuf, F Chong, M Xianling - Artificial Intelligence Review, 2022 - Springer

Graph neural network is a deep learning approach widely applied on structural and non-
structural scenarios due to its substantial performance and interpretability recently. In a non …

Cited by 37 Related articles All 5 versions

DocVQA: A dataset for VQA on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

Cited by 427 Related articles All 8 versions

LaTr: Layout-aware transformer for scene-text VQA

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com

We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

Cited by 96 Related articles All 7 versions

Room-and-object aware knowledge reasoning for remote embodied referring expression

C Gao, J Chen, S Liu, L Wang… - Proceedings of the …, 2021 - openaccess.thecvf.com

The Remote Embodied Referring Expression (REVERIE) is a recently raised task
that requires an agent to navigate to and localise a referred remote object according to a …

Cited by 80 Related articles All 6 versions

Boosting visual question answering with context-aware knowledge aggregation

G Li, X Wang, W Zhu - Proceedings of the 28th ACM International …, 2020 - dl.acm.org

Given an image and a natural language question, Visual Question Answering (VQA) aims at
answering the textual question correctly. Most VQA approaches in literature targets at finding …

Cited by 84 Related articles All 3 versions

Open-vocabulary object detection with an open corpus

J Wang, H Zhang, H Hong, X Jin… - Proceedings of the …, 2023 - openaccess.thecvf.com

Existing open vocabulary object detection (OVD) works expand the object detector toward
open categories by replacing the classifier with the category text embeddings and optimizing …

Cited by 9 Related articles All 3 versions

Room-object entity prompting and reasoning for embodied referring expression

C Gao, S Liu, J Chen, L Wang, Q Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Given a high-level instruction, the task of Embodied Referring Expression (REVERIE)
requires an embodied agent to localise a remote referred object via navigating in the …

Cited by 10 Related articles All 6 versions

Beyond OCR + VQA: Involving OCR into the flow for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang - Proceedings of the 29th ACM …, 2021 - dl.acm.org

Text-based visual question answering (TextVQA) requires analyzing both the visual contents
and texts in an image to answer a question, which is more practical than general visual …

Cited by 39 Related articles All 2 versions

Learning situation hyper-graphs for video question answering

A Urooj, H Kuehne, B Wu, K Chheu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Answering questions about complex situations in videos requires not only capturing of the
presence of actors, objects, and their relations, but also the evolution of these relationships …

Cited by 12 Related articles All 7 versions