Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …

Don't take the easy way out: Ensemble based methods for avoiding known dataset biases

C Clark, M Yatskar, L Zettlemoyer - arXiv preprint arXiv:1909.03683, 2019 - arxiv.org
State-of-the-art models often make use of superficial patterns in the data that do not
generalize well to out-of-domain or adversarial settings. For example, textual entailment …

Rubi: Reducing unimodal biases for visual question answering

R Cadene, C Dancette, M Cord… - Advances in neural …, 2019 - proceedings.neurips.cc
Visual Question Answering (VQA) is the task of answering questions about an
image. Some VQA models often exploit unimodal biases to provide the correct answer …

Habitat-web: Learning embodied object-search strategies from human demonstrations at scale

R Ramrakhya, E Undersander… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a large-scale study of imitating human demonstrations on tasks that require a
virtual robot to search for objects in new environments: (1) ObjectGoal Navigation (e.g., 'find …

Embodied question answering in photorealistic environments with point cloud perception

E Wijmans, S Datta, O Maksymets… - Proceedings of the …, 2019 - openaccess.thecvf.com
To help bridge the gap between internet vision-style problems and the goal of vision for
embodied perception, we instantiate a large-scale navigation task--Embodied Question …

Challenges and prospects in vision and language research

K Kafle, R Shrestha, C Kanan - Frontiers in Artificial Intelligence, 2019 - frontiersin.org
Language grounded image understanding tasks have often been proposed as a method for
evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of …

Do neural dialog systems use the conversation history effectively? an empirical study

C Sankar, S Subramanian, C Pal, S Chandar… - arXiv preprint arXiv …, 2019 - arxiv.org
Neural generative models have become increasingly popular when building
conversational agents. They offer flexibility, can be easily adapted to new domains, and …

Bayesian relational memory for semantic visual navigation

Y Wu, Y Wu, A Tamar, S Russell… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce a new memory architecture, Bayesian Relational Memory (BRM), to improve
the generalization ability for semantic visual navigation agents in unseen environments …

Vision-language navigation: a survey and taxonomy

W Wu, T Chang, X Li, Q Yin, Y Hu - Neural Computing and Applications, 2024 - Springer
Vision-language navigation (VLN) tasks require an agent to follow language instructions
from a human guide to navigate in previously unseen environments using visual …

Multiviz: Towards visualizing and understanding multimodal models

PP Liang, Y Lyu, G Chhablani, N Jain, Z Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
The promise of multimodal models for real-world applications has inspired research in
visualizing and understanding their internal mechanics with the end goal of empowering …