LLaRA: Supercharging robot learning data for vision-language policy
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …
Limited data, unlimited potential: A study on ViTs augmented by masked autoencoders
Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite
their success, ViTs lack inductive biases, which can make it difficult to train them with limited …
Diffusion illusions: Hiding images in plain sight
We explore the problem of computationally generating special images that produce multi-
arrangement optical illusions when physically arranged and viewed in a certain way, which …
Generative image as action models
M Shridhar, YL Lo, S James - arXiv preprint arXiv:2407.07875, 2024 - arxiv.org
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as
image-editing and novel view synthesis. Can we similarly unlock image-generation models …
Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner
Diffusion models have demonstrated their capabilities in modeling trajectories of multi-tasks.
However, existing multi-task planners or policies typically rely on task-specific …
Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies
Diffusion models (DMs) have emerged as a promising approach for behavior cloning (BC).
Diffusion policies (DP) based on DMs have elevated BC performance to new heights …
Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals
M Reuss, ÖE Yağmurlu, F Wenzel… - First Workshop on Vision …, 2024 - openreview.net
This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy
framework that excels at learning versatile behavior from multimodal goal specifications …
Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting
Diffusion policies have demonstrated robust performance in generative modeling, prompting
their application in robotic manipulation controlled via language descriptions. In this paper …
Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning
A common failure mode for policies trained with imitation is compounding execution errors at
test time. When the learned policy encounters states that were not present in the expert …
Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures,
which, despite their effectiveness, encounter a computational bottleneck due to the quadratic …