Deep learning-based action detection in untrimmed videos: A survey

E Vahdani, Y Tian - IEEE Transactions on Pattern Analysis and …, 2022 - ieeexplore.ieee.org
Understanding human behavior and activity facilitates advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …

Causal reasoning meets visual representation learning: A prospective study

Y Liu, YS Wei, H Yan, GB Li, L Lin - Machine Intelligence Research, 2022 - Springer
Visual representation learning is ubiquitous in various real-world applications, including
visual comprehension, video understanding, multi-modal analysis, human-computer …

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Multiview transformers for video recognition

S Yan, X Xiong, A Arnab, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Video understanding requires reasoning at multiple spatiotemporal resolutions--from short
fine-grained motions to events taking place over longer durations. Although transformer …

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …

ActionCLIP: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arXiv preprint arXiv:2109.08472, 2021 - arxiv.org
The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …

MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition

CY Wu, Y Li, K Mangalam, H Fan… - Proceedings of the …, 2022 - openaccess.thecvf.com
While today's video recognition systems parse snapshots or short clips accurately, they
cannot connect the dots and reason across a longer range of time yet. Most existing video …

Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …