Learning visual representations via language-guided sampling

M El Banani, K Desai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Although an object may appear in numerous contexts, we often describe it in a limited
number of ways. Language allows us to abstract away visual variation to represent and …
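
The snippet and title point to using language similarity to choose positive pairs for visual learning. A minimal sketch under that reading, with an assumed precomputed caption embedding and a simple nearest-neighbor rule (both illustrative, not the paper's exact recipe):

```python
# Hypothetical sketch: pick each image's positive view via its caption's
# nearest neighbor in a text-embedding space, then feed the image pairs
# to any siamese/contrastive visual objective.
import torch
import torch.nn.functional as F

def language_guided_positive_indices(caption_emb: torch.Tensor) -> torch.Tensor:
    """caption_emb: (N, D) embeddings from a frozen text encoder (assumed)."""
    z = F.normalize(caption_emb, dim=-1)
    sim = z @ z.T                       # cosine similarity between captions
    sim.fill_diagonal_(-float("inf"))   # exclude self-matches
    return sim.argmax(dim=-1)           # index of each caption's nearest caption

caption_emb = torch.randn(256, 512)     # stand-in for real caption embeddings
pos_idx = language_guided_positive_indices(caption_emb)
# images[i] and images[pos_idx[i]] now form a language-sampled positive pair.
```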

Perceptual grouping in contrastive vision-language models

K Ranasinghe, B McKinzie, S Ravi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in zero-shot image recognition suggest that vision-language models learn
generic visual representations with a high degree of semantic information that may be …

Theia: Distilling diverse vision foundation models for robot learning

J Shang, K Schmeckpeper, BB May, MV Minniti… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-based robot policy learning, which maps visual inputs to actions, necessitates a
holistic understanding of diverse visual tasks beyond single-task needs like classification or …
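
The title describes distilling several vision foundation models into a single student for robot learning. A minimal multi-teacher feature-distillation sketch under that reading; the teacher names, dimensions, and smooth-L1 choice are assumptions, not Theia's reported configuration:

```python
# Hypothetical sketch: one student backbone, one small "translator" head
# per teacher, regressing student features onto frozen teacher features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dims: dict):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(student_dim, d) for name, d in teacher_dims.items()}
        )

    def forward(self, student_feat, teacher_feats):
        # Sum of per-teacher regression losses; teachers serve as frozen targets.
        return sum(
            F.smooth_l1_loss(self.heads[name](student_feat), target)
            for name, target in teacher_feats.items()
        )

# Dummy usage: a batch of 8 pooled student features distilled toward three teachers.
crit = MultiTeacherDistillLoss(384, {"clip": 768, "dino": 768, "sam": 256})
loss = crit(torch.randn(8, 384),
            {"clip": torch.randn(8, 768),
             "dino": torch.randn(8, 768),
             "sam": torch.randn(8, 256)})
```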

StARformer: Transformer with state-action-reward representations for visual reinforcement learning

J Shang, K Kahatapitiya, X Li, MS Ryoo - European conference on …, 2022 - Springer
Reinforcement Learning (RL) can be considered as a sequence modeling task: given a
sequence of past state-action-reward experiences, an agent predicts a sequence of next …
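
The snippet states the framing directly: treat RL as autoregressive sequence modeling over state-action-reward tokens. A Decision-Transformer-style toy of that framing (StARformer's actual contribution, its learned StAR-representation tokens, is omitted here; all sizes are illustrative):

```python
# Hypothetical sketch: interleave s, a, r tokens and predict the next
# action with a causal Transformer, reading off the state-token positions.
import torch
import torch.nn as nn

class SARSequenceModel(nn.Module):
    def __init__(self, state_dim=64, n_actions=6, d_model=128):
        super().__init__()
        self.embed_s = nn.Linear(state_dim, d_model)
        self.embed_a = nn.Embedding(n_actions, d_model)
        self.embed_r = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, states, actions, rewards):
        # Tokens interleaved as s_1, a_1, r_1, s_2, a_2, r_2, ...
        toks = torch.stack(
            [self.embed_s(states), self.embed_a(actions), self.embed_r(rewards)],
            dim=2,
        ).flatten(1, 2)                                  # (B, 3T, d_model)
        T = toks.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(toks, mask=causal)
        return self.action_head(h[:, 0::3])              # predict a_t from s_t position

model = SARSequenceModel()
logits = model(torch.randn(2, 8, 64),                   # states
               torch.randint(0, 6, (2, 8)),             # past actions
               torch.randn(2, 8, 1))                    # rewards
```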

Limited data, unlimited potential: A study on ViTs augmented by masked autoencoders

S Das, T Jain, D Reilly, P Balaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite
their success, ViTs lack inductive biases, which can make it difficult to train them with limited …
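
The study builds on masked autoencoding as a data-efficient pretext task. A toy mask-and-reconstruct step in that spirit; note this keeps mask tokens in the encoder (closer to SimMIM than to MAE's token-dropping encoder), and all sizes and the 75% ratio are illustrative:

```python
# Hypothetical sketch: hide most patch tokens, encode, and regress the
# original patches, computing the loss only at masked positions.
import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, dim=48, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.decoder = nn.Linear(dim, dim)   # per-token patch regressor

    def forward(self, patches):              # patches: (B, N, dim)
        B, N, D = patches.shape
        mask = torch.rand(B, N) < self.mask_ratio           # True = hidden
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand(B, N, D), patches)
        recon = self.decoder(self.encoder(x))
        return ((recon - patches)[mask] ** 2).mean()        # loss on masked only

mae = TinyMaskedAutoencoder()
loss = mae(torch.randn(4, 16, 48))
```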

Cross-view action recognition understanding from exocentric to egocentric perspective

TD Truong, K Luu - Neurocomputing, 2025 - Elsevier
Understanding action recognition in egocentric videos has emerged as a vital research topic
with numerous practical applications. With the limitation in the scale of egocentric data …

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

B Xu, S Zheng, Q Jin - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
We humans are good at translating third-person observations of hand-object interactions
(HOI) into an egocentric view. However, current methods struggle to replicate this ability of …

Neural neural textures make sim2real consistent

R Burgert, J Shang, X Li, M Ryoo - arXiv preprint arXiv:2206.13500, 2022 - arxiv.org
Unpaired image translation algorithms can be used for sim2real tasks, but many fail to
generate temporally consistent results. We present a new approach that combines …
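
The underlying neural-texture primitive helps explain the temporal-consistency claim: a learned texel grid sampled by the scene's UV coordinates returns the same features for the same surface point in every frame. A minimal sketch of that primitive only (the paper's full method, which generates the textures with a network, is not reproduced here; resolution and channel counts are assumptions):

```python
# Hypothetical sketch: a trainable texel grid sampled with per-frame UVs,
# then decoded to RGB, so outputs stay consistent across frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTexture(nn.Module):
    def __init__(self, channels=8, res=256):
        super().__init__()
        self.texels = nn.Parameter(torch.randn(1, channels, res, res) * 0.01)
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, uv):                   # uv: (B, H, W, 2) in [-1, 1]
        feats = F.grid_sample(self.texels.expand(uv.size(0), -1, -1, -1),
                              uv, align_corners=False)
        return self.to_rgb(feats)            # per-frame RGB from a shared texture

uv = torch.rand(2, 64, 64, 2) * 2 - 1        # stand-in for renderer-provided UVs
rgb = NeuralTexture()(uv)
```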

Seeing the pose in the pixels: learning pose-aware representations in vision transformers

D Reilly, A Chadha, S Das - arXiv preprint arXiv:2306.09331, 2023 - arxiv.org
Human perception of surroundings is often guided by the various poses present within the
environment. Many computer vision tasks, such as human action recognition and robot …

Active vision reinforcement learning under limited visual observability

J Shang, MS Ryoo - Advances in Neural Information …, 2024 - proceedings.neurips.cc
In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where
an embodied agent simultaneously learns action policy for the task while also controlling its …
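
The snippet describes an agent that learns a task policy while also controlling what it observes. A minimal two-head policy sketch of that structure; the shared trunk, the discrete grid of view shifts, and all sizes are assumptions, not the paper's architecture:

```python
# Hypothetical sketch: one head picks the task (motor) action, another
# picks where to look next, both from a shared encoding of the partial view.
import torch
import torch.nn as nn

class ActiveVisionPolicy(nn.Module):
    def __init__(self, obs_dim=128, n_motor=6, n_view=9):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.motor_head = nn.Linear(256, n_motor)   # task action
        self.view_head = nn.Linear(256, n_view)     # e.g., 3x3 grid of camera shifts

    def forward(self, obs):
        h = self.trunk(obs)
        return self.motor_head(h), self.view_head(h)

policy = ActiveVisionPolicy()
motor_logits, view_logits = policy(torch.randn(4, 128))
```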