LLM-Planner: Few-shot grounded planning for embodied agents with large language models
This study focuses on using large language models (LLMs) as planners for embodied
agents that can follow natural language instructions to complete complex tasks in a visually …
LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action
Goal-conditioned policies for robotic navigation can be trained on large, unannotated
datasets, providing for good generalization to real-world settings. However, particularly in …
Language to rewards for robotic skill synthesis
Large language models (LLMs) have demonstrated exciting progress in acquiring diverse
new capabilities through in-context learning, ranging from logical reasoning to code-writing …
Habitat 2.0: Training home assistants to rearrange their habitat
We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in
interactive 3D environments and complex physics-enabled scenarios. We make …
A survey on multimodal large language models for autonomous driving
With the emergence of Large Language Models (LLMs) and Vision Foundation Models
(VFMs), multimodal AI systems benefiting from large models have the potential to equally …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI
We present the Habitat-Matterport 3D (HM3D) dataset. HM3D is a large-scale dataset of
1,000 building-scale 3D reconstructions from a diverse set of real-world locations. Each …
History-aware multimodal transformer for vision-and-language navigation
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …
Conversational information seeking
Conversational information seeking (CIS) is concerned with a sequence of interactions
between one or more users and an information system. Interactions in CIS are primarily …
Think global, act local: Dual-scale graph transformer for vision-and-language navigation
Following language instructions to navigate in unseen environments is a challenging
problem for autonomous embodied agents. The agent not only needs to ground language …