LLaRA: Supercharging robot learning data for vision-language policy

X Li, C Mata, J Park, K Kahatapitiya, YS Jang… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process
state information as visual-textual prompts and respond with policy decisions in text. We …

Limited data, unlimited potential: A study on ViTs augmented by masked autoencoders

S Das, T Jain, D Reilly, P Balaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite
their success, ViTs lack inductive biases, which can make it difficult to train them with limited …

Diffusion illusions: Hiding images in plain sight

R Burgert, X Li, A Leite, K Ranasinghe… - ACM SIGGRAPH 2024 …, 2024 - dl.acm.org
We explore the problem of computationally generating special images that produce multi-
arrangement optical illusions when physically arranged and viewed in a certain way, which …

Generative image as action models

M Shridhar, YL Lo, S James - arXiv preprint arXiv:2407.07875, 2024 - arxiv.org
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as
image-editing and novel view synthesis. Can we similarly unlock image-generation models …

Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner

C Fan, C Bai, Z Shan, H He, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have demonstrated their capabilities in modeling trajectories of multi-tasks.
However, existing multi-task planners or policies typically rely on task-specific …

Diffusion Policy Attacker: Crafting Adversarial Attacks for Diffusion-based Policies

Y Chen, H Xue, Y Chen - arXiv preprint arXiv:2405.19424, 2024 - arxiv.org
Diffusion models (DMs) have emerged as a promising approach for behavior cloning (BC).
Diffusion policies (DP) based on DMs have elevated BC performance to new heights …

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

M Reuss, ÖE Yağmurlu, F Wenzel… - First Workshop on Vision …, 2024 - openreview.net
This work introduces the Multimodal Diffusion Transformer (MDT), a novel diffusion policy
framework, that excels at learning versatile behavior from multimodal goal specifications …

Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

C Hao, K Lin, S Luo, H Soh - arXiv preprint arXiv:2406.09767, 2024 - arxiv.org
Diffusion policies have demonstrated robust performance in generative modeling, prompting
their application in robotic manipulation controlled via language descriptions. In this paper …

Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning

X Zhang, M Chang, P Kumar, S Gupta - arXiv preprint arXiv:2402.17768, 2024 - arxiv.org
A common failure mode for policies trained with imitation is compounding execution errors at
test time. When the learned policy encounters states that were not present in the expert …

Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

AK Rahimian, MK Govind, S Maity, D Reilly… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures,
which, despite their effectiveness, encounter a computational bottleneck due to the quadratic …