Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

Video ReCap: Recursive Captioning of Hour-Long Videos

MM Islam, N Ho, X Yang, T Nagarajan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Most video captioning models are designed to process short video clips of a few seconds and
output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However …

Learning to Segment Referred Objects from Narrated Egocentric Videos

Y Shen, H Wang, X Yang, M Feiszli… - Proceedings of the …, 2024 - openaccess.thecvf.com
Egocentric videos provide a first-person perspective of the wearer's activities involving
simultaneous interactions with multiple objects. In this work, we propose the task of weakly …

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

Y Huang, G Chen, J Xu, M Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Being able to map the activities of others into one's own point of view is a fundamental
human skill, present even from a very early age. Taking a step toward understanding this human …

VideoLLM-online: Online Video Large Language Model for Streaming Video

J Chen, Z Lv, S Wu, KQ Lin, C Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

KRY Nagasinghe, H Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this paper, we explore the capability of an agent to construct a logical sequence of action
steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from …

EgoVideo: Exploring egocentric foundation model and downstream adaptation

B Pei, G Chen, J Xu, Y He, Y Liu, K Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including
five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building …

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

S Wu, J Chen, KQ Lin, Q Wang, Y Gao, Q Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while
increasing the number of vision tokens generally enhances visual understanding, it also …

Human Action Anticipation: A Survey

B Lai, S Toyer, T Nagarajan, R Girdhar, S Zha… - arXiv preprint arXiv …, 2024 - arxiv.org
Predicting future human behavior is an increasingly popular topic in computer vision, driven
by the interest in applications such as autonomous vehicles, digital assistants and human …

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

H Lin, T Nagarajan, N Ballas, M Assran… - arXiv preprint arXiv …, 2024 - arxiv.org
Procedural video representation learning is an active research area whose objective is to
learn an agent that can anticipate and forecast the future given the present video input …