Multiscale video pretraining for long-term activity forecasting

B Korbar, Y Xian, A Tonioni, A Zisserman… - European Conference on …, 2025 - Springer

In this paper we present a text-conditioned video resampler (TCR) module that uses a
pre-trained and frozen visual encoder and large language model (LLM) to process long video …

Cited by 5

Revisiting feature prediction for learning visual representations from video

A Bardes, Q Garrido, J Ponce, X Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper explores feature prediction as a stand-alone objective for unsupervised learning
from video and introduces V-JEPA, a collection of vision models trained solely using a …

Cited by 40

A survey on deep learning techniques for action anticipation

Z Zhong, M Martin, M Voit, J Gall, J Beyerer - arXiv preprint arXiv …, 2023 - arxiv.org

The ability to anticipate possible future human actions is essential for a wide range of
applications, including autonomous driving and human-robot interaction. Consequently …

Cited by 8

Object-Centric Cross-Modal Knowledge Reasoning for Future Event Prediction in Videos

C Lai, H Wang, W Ge, X Xue - IEEE Transactions on Circuits …, 2024 - ieeexplore.ieee.org

Although multi-modal large language models possess impressive cross-modal reasoning
and prediction capabilities, they lack a unified and rigorous evaluation standard. In this …

Human Action Anticipation: A Survey

B Lai, S Toyer, T Nagarajan, R Girdhar, S Zha… - arXiv preprint arXiv …, 2024 - arxiv.org

Predicting future human behavior is an increasingly popular topic in computer vision, driven
by the interest in applications such as autonomous vehicles, digital assistants and human …

TinyMem: Condensing Multimodal Memory for Long-form Video Action Detection

R Tian, Q Dai, H Hu, Z Wu - openreview.net

Despite the great advances in video understanding with deep neural networks, current
solutions still struggle with input videos that last for minutes, if not hours. To mitigate this …