Text-conditioned resampler for long form video understanding
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-
trained and frozen visual encoder and large language model (LLM) to process long video …
Revisiting feature prediction for learning visual representations from video
This paper explores feature prediction as a stand-alone objective for unsupervised learning
from video and introduces V-JEPA, a collection of vision models trained solely using a …
A survey on deep learning techniques for action anticipation
The ability to anticipate possible future human actions is essential for a wide range of
applications, including autonomous driving and human-robot interaction. Consequently …
Object-Centric Cross-Modal Knowledge Reasoning for Future Event Prediction in Videos
Although multi-modal large language models possess impressive cross-modal reasoning
and prediction capabilities, they lack a unified and rigorous evaluation standard. In this …
Human Action Anticipation: A Survey
Predicting future human behavior is an increasingly popular topic in computer vision, driven
by the interest in applications such as autonomous vehicles, digital assistants and human …
TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection
Despite the great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …