Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …

Don't take the easy way out: Ensemble based methods for avoiding known dataset biases

C Clark, M Yatskar, L Zettlemoyer - arXiv preprint arXiv:1909.03683, 2019 - arxiv.org
State-of-the-art models often make use of superficial patterns in the data that do not
generalize well to out-of-domain or adversarial settings. For example, textual entailment …

Rubi: Reducing unimodal biases for visual question answering

R Cadene, C Dancette, M Cord… - Advances in neural …, 2019 - proceedings.neurips.cc
Visual Question Answering (VQA) is the task of answering questions about an
image. Some VQA models often exploit unimodal biases to provide the correct answer …

Habitat-web: Learning embodied object-search strategies from human demonstrations at scale

R Ramrakhya, E Undersander… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a large-scale study of imitating human demonstrations on tasks that require a
virtual robot to search for objects in new environments: (1) ObjectGoal Navigation (e.g., 'find …

Embodied question answering in photorealistic environments with point cloud perception

E Wijmans, S Datta, O Maksymets… - Proceedings of the …, 2019 - openaccess.thecvf.com
To help bridge the gap between internet vision-style problems and the goal of vision for
embodied perception, we instantiate a large-scale navigation task--Embodied Question …

Challenges and prospects in vision and language research

K Kafle, R Shrestha, C Kanan - Frontiers in Artificial Intelligence, 2019 - frontiersin.org
Language grounded image understanding tasks have often been proposed as a method for
evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of …

Do neural dialog systems use the conversation history effectively? an empirical study

C Sankar, S Subramanian, C Pal, S Chandar… - arXiv preprint arXiv …, 2019 - arxiv.org
Neural generative models have become increasingly popular when building
conversational agents. They offer flexibility, can be easily adapted to new domains, and …

Bayesian relational memory for semantic visual navigation

Y Wu, Y Wu, A Tamar, S Russell… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce a new memory architecture, Bayesian Relational Memory (BRM), to improve
the generalization ability for semantic visual navigation agents in unseen environments …

Vision-language navigation: a survey and taxonomy

W Wu, T Chang, X Li, Q Yin, Y Hu - Neural Computing and Applications, 2024 - Springer
Vision-language navigation (VLN) tasks require an agent to follow language instructions
from a human guide to navigate in previously unseen environments using visual …

Multiviz: Towards visualizing and understanding multimodal models

PP Liang, Y Lyu, G Chhablani, N Jain, Z Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
The promise of multimodal models for real-world applications has inspired research in
visualizing and understanding their internal mechanics with the end goal of empowering …