VoxPoser: Composable 3D value maps for robotic manipulation with language models

W Huang, C Wang, R Zhang, Y Li, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that
can be extracted for robot manipulation in the form of reasoning and planning. Despite the …

Grounded decoding: Guiding text generation with grounded models for robot control

W Huang, F Xia, D Shah, D Driess, A Zeng, Y Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent progress in large language models (LLMs) has demonstrated the ability to learn and
leverage Internet-scale knowledge through pre-training with autoregressive models …

Language-conditioned learning for robotic manipulation: A survey

H Zhou, X Yao, Y Meng, S Sun, Z Bing, K Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Language-conditioned robotic manipulation represents a cutting-edge area of research,
enabling seamless communication and cooperation between humans and robotic agents …

Grounded decoding: Guiding text generation with grounded models for embodied agents

W Huang, F Xia, D Shah, D Driess… - Advances in …, 2024 - proceedings.neurips.cc
Recent progress in large language models (LLMs) has demonstrated the ability to learn and
leverage Internet-scale knowledge through pre-training with autoregressive models …

RoboScript: Code generation for free-form manipulation tasks across real and simulation

J Chen, Y Mu, Q Yu, T Wei, S Wu, Z Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied AI has seen rapid progress in high-level task planning and code generation for
open-world robot manipulation. However, previous studies put much …

PALM: Predicting Actions through Language Models

S Kim, D Huang, Y Xian, O Hilliges, L Van Gool… - … on Computer Vision, 2025 - Springer
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that
focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional …

LALM: Long-term action anticipation with language models

S Kim, D Huang, Y Xian, O Hilliges, L Van Gool… - arXiv preprint arXiv …, 2023 - arxiv.org
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that
focuses on capturing visual perspectives from the camera wearer's viewpoint. While …

User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance

M Verghese, B Chen, H Eghbalzadeh… - arXiv preprint arXiv …, 2024 - arxiv.org
Our research investigates the capability of modern multimodal reasoning models, powered
by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step …

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

H Zhang, Z Wang, Q Lyu, Z Zhang, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we investigate the problem of embodied multi-agent cooperation, where
decentralized agents must cooperate given only partial egocentric views of the world. To …