Learning visual representations via language-guided sampling

M El Banani, K Desai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Although an object may appear in numerous contexts, we often describe it in a limited
number of ways. Language allows us to abstract away visual variation to represent and …
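
The snippet and title point to using language similarity to choose positive pairs for visual learning. A minimal sketch under that reading, with an assumed precomputed caption embedding and a simple nearest-neighbor rule (both illustrative, not the paper's exact recipe):

```python
# Hypothetical sketch: pick each image's positive view via its caption's
# nearest neighbor in a text-embedding space, then feed the image pairs
# to any siamese/contrastive visual objective.
import torch
import torch.nn.functional as F

def language_guided_positive_indices(caption_emb: torch.Tensor) -> torch.Tensor:
    """caption_emb: (N, D) embeddings from a frozen text encoder (assumed)."""
    z = F.normalize(caption_emb, dim=-1)
    sim = z @ z.T                       # cosine similarity between captions
    sim.fill_diagonal_(-float("inf"))   # exclude self-matches
    return sim.argmax(dim=-1)           # index of each caption's nearest caption

caption_emb = torch.randn(256, 512)     # stand-in for real caption embeddings
pos_idx = language_guided_positive_indices(caption_emb)
# images[i] and images[pos_idx[i]] now form a language-sampled positive pair.
```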

Perceptual grouping in contrastive vision-language models

K Ranasinghe, B McKinzie, S Ravi… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in zero-shot image recognition suggest that vision-language models learn
generic visual representations with a high degree of semantic information that may be …

Theia: Distilling diverse vision foundation models for robot learning

J Shang, K Schmeckpeper, BB May, MV Minniti… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-based robot policy learning, which maps visual inputs to actions, necessitates a
holistic understanding of diverse visual tasks beyond single-task needs like classification or …
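
The title describes distilling several vision foundation models into a single student for robot learning. A minimal multi-teacher feature-distillation sketch under that reading; the teacher names, dimensions, and smooth-L1 choice are assumptions, not Theia's reported configuration:

```python
# Hypothetical sketch: one student backbone, one small "translator" head
# per teacher, regressing student features onto frozen teacher features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dims: dict):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(student_dim, d) for name, d in teacher_dims.items()}
        )

    def forward(self, student_feat, teacher_feats):
        # Sum of per-teacher regression losses; teachers serve as frozen targets.
        return sum(
            F.smooth_l1_loss(self.heads[name](student_feat), target)
            for name, target in teacher_feats.items()
        )

# Dummy usage: a batch of 8 pooled student features distilled toward three teachers.
crit = MultiTeacherDistillLoss(384, {"clip": 768, "dino": 768, "sam": 256})
loss = crit(torch.randn(8, 384),
            {"clip": torch.randn(8, 768),
             "dino": torch.randn(8, 768),
             "sam": torch.randn(8, 256)})
```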

StARformer: Transformer with state-action-reward representations for visual reinforcement learning

J Shang, K Kahatapitiya, X Li, MS Ryoo - European conference on …, 2022 - Springer
Reinforcement Learning (RL) can be considered as a sequence modeling task: given a
sequence of past state-action-reward experiences, an agent predicts a sequence of next …
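
The snippet states the framing directly: treat RL as autoregressive sequence modeling over state-action-reward tokens. A Decision-Transformer-style toy of that framing (StARformer's actual contribution, its learned StAR-representation tokens, is omitted here; all sizes are illustrative):

```python
# Hypothetical sketch: interleave s, a, r tokens and predict the next
# action with a causal Transformer, reading off the state-token positions.
import torch
import torch.nn as nn

class SARSequenceModel(nn.Module):
    def __init__(self, state_dim=64, n_actions=6, d_model=128):
        super().__init__()
        self.embed_s = nn.Linear(state_dim, d_model)
        self.embed_a = nn.Embedding(n_actions, d_model)
        self.embed_r = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, states, actions, rewards):
        # Tokens interleaved as s_1, a_1, r_1, s_2, a_2, r_2, ...
        toks = torch.stack(
            [self.embed_s(states), self.embed_a(actions), self.embed_r(rewards)],
            dim=2,
        ).flatten(1, 2)                                  # (B, 3T, d_model)
        T = toks.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(toks, mask=causal)
        return self.action_head(h[:, 0::3])              # predict a_t from s_t position

model = SARSequenceModel()
logits = model(torch.randn(2, 8, 64),                   # states
               torch.randint(0, 6, (2, 8)),             # past actions
               torch.randn(2, 8, 1))                    # rewards
```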

Limited data, unlimited potential: A study on ViTs augmented by masked autoencoders

S Das, T Jain, D Reilly, P Balaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite
their success, ViTs lack inductive biases, which can make it difficult to train them with limited …
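
The study builds on masked autoencoding as a data-efficient pretext task. A toy mask-and-reconstruct step in that spirit; note this keeps mask tokens in the encoder (closer to SimMIM than to MAE's token-dropping encoder), and all sizes and the 75% ratio are illustrative:

```python
# Hypothetical sketch: hide most patch tokens, encode, and regress the
# original patches, computing the loss only at masked positions.
import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, dim=48, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.decoder = nn.Linear(dim, dim)   # per-token patch regressor

    def forward(self, patches):              # patches: (B, N, dim)
        B, N, D = patches.shape
        mask = torch.rand(B, N) < self.mask_ratio           # True = hidden
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand(B, N, D), patches)
        recon = self.decoder(self.encoder(x))
        return ((recon - patches)[mask] ** 2).mean()        # loss on masked only

mae = TinyMaskedAutoencoder()
loss = mae(torch.randn(4, 16, 48))
```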

Cross-view action recognition understanding from exocentric to egocentric perspective

TD Truong, K Luu - Neurocomputing, 2025 - Elsevier
Understanding action recognition in egocentric videos has emerged as a vital research topic
with numerous practical applications. With the limitation in the scale of egocentric data …

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

B Xu, S Zheng, Q Jin - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
We humans are good at translating third-person observations of hand-object interactions
(HOI) into an egocentric view. However, current methods struggle to replicate this ability of …

Neural neural textures make sim2real consistent

R Burgert, J Shang, X Li, M Ryoo - arXiv preprint arXiv:2206.13500, 2022 - arxiv.org
Unpaired image translation algorithms can be used for sim2real tasks, but many fail to
generate temporally consistent results. We present a new approach that combines …
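
The underlying neural-texture primitive helps explain the temporal-consistency claim: a learned texel grid sampled by the scene's UV coordinates returns the same features for the same surface point in every frame. A minimal sketch of that primitive only (the paper's full method, which generates the textures with a network, is not reproduced here; resolution and channel counts are assumptions):

```python
# Hypothetical sketch: a trainable texel grid sampled with per-frame UVs,
# then decoded to RGB, so outputs stay consistent across frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTexture(nn.Module):
    def __init__(self, channels=8, res=256):
        super().__init__()
        self.texels = nn.Parameter(torch.randn(1, channels, res, res) * 0.01)
        self.to_rgb = nn.Conv2d(channels, 3, kernel_size=1)

    def forward(self, uv):                   # uv: (B, H, W, 2) in [-1, 1]
        feats = F.grid_sample(self.texels.expand(uv.size(0), -1, -1, -1),
                              uv, align_corners=False)
        return self.to_rgb(feats)            # per-frame RGB from a shared texture

uv = torch.rand(2, 64, 64, 2) * 2 - 1        # stand-in for renderer-provided UVs
rgb = NeuralTexture()(uv)
```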

Seeing the pose in the pixels: learning pose-aware representations in vision transformers

D Reilly, A Chadha, S Das - arXiv preprint arXiv:2306.09331, 2023 - arxiv.org
Human perception of surroundings is often guided by the various poses present within the
environment. Many computer vision tasks, such as human action recognition and robot …

Active vision reinforcement learning under limited visual observability

J Shang, MS Ryoo - Advances in Neural Information …, 2024 - proceedings.neurips.cc
In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where
an embodied agent simultaneously learns action policy for the task while also controlling its …
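
The snippet describes an agent that learns a task policy while also controlling what it observes. A minimal two-head policy sketch of that structure; the shared trunk, the discrete grid of view shifts, and all sizes are assumptions, not the paper's architecture:

```python
# Hypothetical sketch: one head picks the task (motor) action, another
# picks where to look next, both from a shared encoding of the partial view.
import torch
import torch.nn as nn

class ActiveVisionPolicy(nn.Module):
    def __init__(self, obs_dim=128, n_motor=6, n_view=9):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.motor_head = nn.Linear(256, n_motor)   # task action
        self.view_head = nn.Linear(256, n_view)     # e.g., 3x3 grid of camera shifts

    def forward(self, obs):
        h = self.trunk(obs)
        return self.motor_head(h), self.view_head(h)

policy = ActiveVisionPolicy()
motor_logits, view_logits = policy(torch.randn(4, 128))
```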