Invariant grounding for video question answering
Abstract Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …
Don't take the easy way out: Ensemble based methods for avoiding known dataset biases
State-of-the-art models often make use of superficial patterns in the data that do not
generalize well to out-of-domain or adversarial settings. For example, textual entailment …
Rubi: Reducing unimodal biases for visual question answering
R Cadene, C Dancette, M Cord… - Advances in neural …, 2019 - proceedings.neurips.cc
Abstract Visual Question Answering (VQA) is the task of answering questions about an
image. Some VQA models often exploit unimodal biases to provide the correct answer …
Habitat-web: Learning embodied object-search strategies from human demonstrations at scale
R Ramrakhya, E Undersander… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a large-scale study of imitating human demonstrations on tasks that require a
virtual robot to search for objects in new environments: (1) ObjectGoal Navigation (e.g., 'find & …
Embodied question answering in photorealistic environments with point cloud perception
To help bridge the gap between internet vision-style problems and the goal of vision for
embodied perception we instantiate a large-scale navigation task--Embodied Question …
Challenges and prospects in vision and language research
Language grounded image understanding tasks have often been proposed as a method for
evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of …
Do neural dialog systems use the conversation history effectively? an empirical study
Neural generative models have become increasingly popular when building
conversational agents. They offer flexibility, can be easily adapted to new domains, and …
Bayesian relational memory for semantic visual navigation
We introduce a new memory architecture, Bayesian Relational Memory (BRM), to improve
the generalization ability for semantic visual navigation agents in unseen environments …
Vision-language navigation: a survey and taxonomy
Vision-language navigation (VLN) tasks require an agent to follow language instructions
from a human guide to navigate in previously unseen environments using visual …
Multiviz: Towards visualizing and understanding multimodal models
The promise of multimodal models for real-world applications has inspired research in
visualizing and understanding their internal mechanics with the end goal of empowering …