What is right for me is not yet right for you: A dataset for grounding relative directions...

D Chen, J Liu, W Dai, B Wang - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org

Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large
Language Models (LLMs) using an assortment of annotated downstream vision-language …

Cited by 22 · Related articles · All 3 versions

EQA-MX: Embodied question answering using multimodal expression

MM Islam, A Gladstone, R Islam… - The Twelfth International …, 2023 - openreview.net

Humans predominantly use verbal utterances and nonverbal gestures (e.g., eye gaze and
pointing gestures) in their natural interactions. For instance, pointing gestures and verbal …

Cited by 2 · Related articles

CAESAR: An embodied simulator for generating multimodal referring expression datasets

MM Islam, R Mirzaiee, A Gladstone… - Advances in Neural …, 2022 - proceedings.neurips.cc

Humans naturally use verbal utterances and nonverbal gestures to refer to various objects
(known as referring expressions) in different interactional scenarios. As collecting …

Cited by 10 · Related articles · All 3 versions

PATRON: perspective-aware multitask model for referring expression grounding using embodied multimodal cues

MM Islam, A Gladstone, T Iqbal - … of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org

Humans naturally use referring expressions with verbal utterances and nonverbal gestures
to refer to objects and events. As these referring expressions can be interpreted differently …

Cited by 2 · Related articles · All 2 versions

Harnessing the power of multi-task pretraining for ground-truth level natural language explanations

B Plüster, J Ambsdorf, L Braach, JH Lee… - arXiv preprint arXiv …, 2022 - arxiv.org

Natural language explanations promise to offer intuitively understandable explanations of a
neural network's decision process in complex vision-language tasks, as pursued in recent …

Cited by 4 · Related articles · All 4 versions

Visually Grounded Continual Language Learning with Selective Specialization

K Ahrens, L Bengtson, JH Lee, S Wermter - arXiv preprint arXiv …, 2023 - arxiv.org

A desirable trait of an artificial agent acting in the visual world is to continually learn a
sequence of language-informed tasks while striking a balance between sufficiently …

Neuro-Symbolic Spatio-Temporal Reasoning

JH Lee, M Sioutis, K Ahrens, M Alirezaie… - Compendium of …, 2023 - ebooks.iospress.nl

Knowledge about space and time is necessary to solve problems in the physical
world. Spatio-temporal knowledge, however, is required beyond interacting with the physical …

Cited by 3 · Related articles · All 9 versions

Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning

K Ahrens, M Kerzel, JH Lee, C Weber… - arXiv preprint arXiv …, 2022 - arxiv.org

Spatial reasoning poses a particular challenge for intelligent agents and is at the same time
a prerequisite for their successful interaction and communication in the physical world. One …