VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that
can be extracted for robot manipulation in the form of reasoning and planning. Despite the …
Language-conditioned learning for robotic manipulation: A survey
Language-conditioned robotic manipulation represents a cutting-edge area of research,
enabling seamless communication and cooperation between humans and robotic agents …
Grounded decoding: Guiding text generation with grounded models for embodied agents
Recent progress in large language models (LLMs) has demonstrated the ability to learn and
leverage Internet-scale knowledge through pre-training with autoregressive models …
RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation
Embodied AI has witnessed rapid progress in high-level task planning and code generation
for open-world robot manipulation. However, previous studies put much …
PALM: Predicting Actions through Language Models
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that
focuses on capturing visual perspectives from the camera wearer's viewpoint. Traditional …
LALM: Long-Term Action Anticipation with Language Models
Understanding human activity is a crucial yet intricate task in egocentric vision, a field that
focuses on capturing visual perspectives from the camera wearer's viewpoint. While …
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance
Our research investigates the capability of modern multimodal reasoning models, powered
by Large Language Models (LLMs), to facilitate vision-powered assistants for multi-step …
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
In this paper, we investigate the problem of embodied multi-agent cooperation, where
decentralized agents must cooperate given only partial egocentric views of the world. To …