Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
LLM-Planner: Few-shot grounded planning for embodied agents with large language models
This study focuses on using large language models (LLMs) as a planner for embodied
agents that can follow natural language instructions to complete complex tasks in a visually …
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of grounding high-level …
Navigation with large language models: Semantic guesswork as a heuristic for planning
Navigation in unfamiliar environments presents a major challenge for robots: while mapping
and planning techniques can be used to build up a representation of the world, quickly …
Pre-trained language models for interactive decision-making
Language model (LM) pre-training is useful in many language processing tasks.
But can pre-trained LMs be further leveraged for more general machine learning problems …
Foundation models for decision making: Problems, methods, and opportunities
Foundation models pretrained on diverse data at scale have demonstrated extraordinary
capabilities in a wide range of vision and language tasks. When such models are deployed …
History aware multimodal transformer for vision-and-language navigation
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …
Large-scale adversarial training for vision-and-language representation learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-
language (V+L) representation learning. VILLA consists of two training stages: (i) task …
Think global, act local: Dual-scale graph transformer for vision-and-language navigation
Following language instructions to navigate in unseen environments is a challenging
problem for autonomous embodied agents. The agent not only needs to ground languages …
VLN BERT: A recurrent vision-and-language BERT for navigation
Accuracy of many visiolinguistic tasks has benefited significantly from the application of
vision-and-language (V&L) BERT. However, its application for the task of vision-and …