Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
LLM-Planner: Few-shot grounded planning for embodied agents with large language models
This study focuses on using large language models (LLMs) as a planner for embodied
agents that can follow natural language instructions to complete complex tasks in a visually …
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of grounding high-level …
Navigation with large language models: Semantic guesswork as a heuristic for planning
Navigation in unfamiliar environments presents a major challenge for robots: while mapping
and planning techniques can be used to build up a representation of the world, quickly …
Pre-trained language models for interactive decision-making
Language model (LM) pre-training is useful in many language processing tasks.
But can pre-trained LMs be further leveraged for more general machine learning problems …
Foundation models for decision making: Problems, methods, and opportunities
Foundation models pretrained on diverse data at scale have demonstrated extraordinary
capabilities in a wide range of vision and language tasks. When such models are deployed …
History aware multimodal transformer for vision-and-language navigation
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …
Large-scale adversarial training for vision-and-language representation learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-
language (V+L) representation learning. VILLA consists of two training stages: (i) task …
Think global, act local: Dual-scale graph transformer for vision-and-language navigation
Following language instructions to navigate in unseen environments is a challenging
problem for autonomous embodied agents. The agent not only needs to ground languages …
VLN BERT: A recurrent vision-and-language BERT for navigation
Accuracy of many visiolinguistic tasks has benefited significantly from the application of
vision-and-language (V&L) BERT. However, its application for the task of vision-and …