Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
Video ReCap: Recursive Captioning of Hour-Long Videos
Most video captioning models are designed to process short video clips of a few seconds and
output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However …
Learning to Segment Referred Objects from Narrated Egocentric Videos
Egocentric videos provide a first-person perspective of the wearer's activities involving
simultaneous interactions with multiple objects. In this work, we propose the task of weakly …
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Being able to map the activities of others into one's own point of view is one fundamental
human skill even from a very early age. Taking a step toward understanding this human …
VideoLLM-online: Online Video Large Language Model for Streaming Video
Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
KRY Nagasinghe, H Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we explore the capability of an agent to construct a logical sequence of action
steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from …
EgoVideo: Exploring egocentric foundation model and downstream adaptation
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including
five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building …
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …
Human Action Anticipation: A Survey
Predicting future human behavior is an increasingly popular topic in computer vision, driven
by the interest in applications such as autonomous vehicles, digital assistants and human …
VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning
Procedural video representation learning is an active research area where the objective is to
learn an agent which can anticipate and forecast the future given the present video input …