Text-conditioned resampler for long form video understanding

B Korbar, Y Xian, A Tonioni, A Zisserman… - European Conference on …, 2025 - Springer
In this paper we present a text-conditioned video resampler (TCR) module that uses a
pre-trained and frozen visual encoder and large language model (LLM) to process long video …

Revisiting feature prediction for learning visual representations from video

A Bardes, Q Garrido, J Ponce, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper explores feature prediction as a stand-alone objective for unsupervised learning
from video and introduces V-JEPA, a collection of vision models trained solely using a …

A survey on deep learning techniques for action anticipation

Z Zhong, M Martin, M Voit, J Gall, J Beyerer - arXiv preprint arXiv …, 2023 - arxiv.org
The ability to anticipate possible future human actions is essential for a wide range of
applications, including autonomous driving and human-robot interaction. Consequently …

Object-Centric Cross-Modal Knowledge Reasoning for Future Event Prediction in Videos

C Lai, H Wang, W Ge, X Xue - IEEE Transactions on Circuits …, 2024 - ieeexplore.ieee.org
Although multi-modal large language models possess impressive cross-modal reasoning
and prediction capabilities, they lack a unified and rigorous evaluation standard. In this …

Human Action Anticipation: A Survey

B Lai, S Toyer, T Nagarajan, R Girdhar, S Zha… - arXiv preprint arXiv …, 2024 - arxiv.org
Predicting future human behavior is an increasingly popular topic in computer vision, driven
by the interest in applications such as autonomous vehicles, digital assistants and human …

TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection

R Tian, Q Dai, H Hu, Z Wu - openreview.net
Despite the great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …