VoxPoser: Composable 3D value maps for robotic manipulation with language models

W Huang, C Wang, R Zhang, Y Li, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that
can be extracted for robot manipulation in the form of reasoning and planning. Despite the …

Grounded decoding: Guiding text generation with grounded models for robot control

W Huang, F Xia, D Shah, D Driess, A Zeng, Y Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in large language models (LLMs) has demonstrated the ability to learn and
leverage Internet-scale knowledge through pre-training with autoregressive models …

Language-conditioned learning for robotic manipulation: A survey

H Zhou, X Yao, Y Meng, S Sun, Z Bing, K Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Language-conditioned robotic manipulation represents a cutting-edge area of research,
enabling seamless communication and cooperation between humans and robotic agents …

Grounded decoding: Guiding text generation with grounded models for embodied agents

W Huang, F Xia, D Shah, D Driess… - Advances in …, 2024 - proceedings.neurips.cc
Recent progress in large language models (LLMs) has demonstrated the ability to learn and
leverage Internet-scale knowledge through pre-training with autoregressive models …

RoboScript: Code generation for free-form manipulation tasks across real and simulation

J Chen, Y Mu, Q Yu, T Wei, S Wu, Z Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied AI has seen rapid progress in high-level task planning and code generation for
open-world robot manipulation. However, previous studies put much …

PALM: Predicting Actions through Language Models

S Kim, D Huang, Y Xian, O Hilliges, L Van Gool… - … on Computer Vision, 2025 - Springer
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that
focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional …

LALM: Long-term action anticipation with language models

S Kim, D Huang, Y Xian, O Hilliges, L Van Gool… - arXiv preprint arXiv …, 2023 - arxiv.org
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that
focuses on capturing visual perspectives from the camera wearer's viewpoint. While …

User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance

M Verghese, B Chen, H Eghbalzadeh… - arXiv preprint arXiv …, 2024 - arxiv.org
Our research investigates the capability of modern multimodal reasoning models, powered
by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step …

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

H Zhang, Z Wang, Q Lyu, Z Zhang, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we investigate the problem of embodied multi-agent cooperation, where
decentralized agents must cooperate given only partial egocentric views of the world. To …