Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives
Abstract We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
Aria Digital Twin: A new benchmark dataset for egocentric 3D machine perception
Abstract We introduce the Aria Digital Twin (ADT), an egocentric dataset captured using Aria
glasses with extensive object, environment, and human level ground truth. This ADT release …
An outlook into the future of egocentric vision
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …
Self-supervised object detection from egocentric videos
Understanding the visual world from the perspective of humans (egocentric) has been a
long-standing challenge in computer vision. Egocentric videos exhibit high scene complexity …
Instance tracking in 3D scenes from egocentric videos
Egocentric sensors such as AR/VR devices capture human-object interactions and offer the
potential to provide task-assistance by recalling 3D locations of objects of interest in the …
Object Reprojection Error (ORE): Camera pose benchmarks from lightweight tracking annotations
Abstract 3D spatial understanding is highly valuable in the context of semantic modeling of
environments, agents, and their relationships. Semantic modeling approaches employed on …
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Humans use multiple senses to comprehend the environment. Vision and language are two
of the most vital senses since they allow us to easily communicate our thoughts and …
ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking
The Associating Objects with Transformers (AOT) framework has exhibited exceptional
performance in a wide range of complex scenarios for video object tracking and …