Visual instruction tuning with Polite Flamingo
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large
Language Models (LLMs) using an assortment of annotated downstream vision-language …
EQA-MX: Embodied question answering using multimodal expression
MM Islam, A Gladstone, R Islam… - The Twelfth International …, 2023 - openreview.net
Humans predominantly use verbal utterances and nonverbal gestures (e.g., eye gaze and
pointing gestures) in their natural interactions. For instance, pointing gestures and verbal …
CAESAR: An embodied simulator for generating multimodal referring expression datasets
MM Islam, R Mirzaiee, A Gladstone… - Advances in Neural …, 2022 - proceedings.neurips.cc
Humans naturally use verbal utterances and nonverbal gestures to refer to various objects
(known as referring expressions) in different interactional scenarios. As collecting …
PATRON: perspective-aware multitask model for referring expression grounding using embodied multimodal cues
Humans naturally use referring expressions with verbal utterances and nonverbal gestures
to refer to objects and events. As these referring expressions can be interpreted differently …
Harnessing the power of multi-task pretraining for ground-truth level natural language explanations
Natural language explanations promise to offer intuitively understandable explanations of a
neural network's decision process in complex vision-language tasks, as pursued in recent …
Visually Grounded Continual Language Learning with Selective Specialization
A desirable trait of an artificial agent acting in the visual world is to continually learn a
sequence of language-informed tasks while striking a balance between sufficiently …
Neuro-Symbolic Spatio-Temporal Reasoning
Knowledge about space and time is necessary to solve problems in the physical
world. Spatio-temporal knowledge, however, is required beyond interacting with the physical …
Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning
Spatial reasoning poses a particular challenge for intelligent agents and is at the same time
a prerequisite for their successful interaction and communication in the physical world. One …