Human action recognition: A taxonomy-based survey, updates, and opportunities

MG Morshed, T Sultana, A Alam, YK Lee - Sensors, 2023 - mdpi.com
Human action recognition systems use data collected from a wide range of sensors to
accurately identify and interpret human actions. One of the most challenging issues for …

PETR: Position embedding transformation for multi-view 3D object detection

Y Liu, T Wang, X Zhang, J Sun - European Conference on Computer …, 2022 - Springer
In this paper, we develop position embedding transformation (PETR) for multi-view 3D
object detection. PETR encodes the position information of 3D coordinates into image …

MovieChat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Memory-and-anticipation transformer for online action understanding

J Wang, G Chen, Y Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing forecasting systems are memory-based methods, which attempt to mimic
human forecasting ability by employing various memory mechanisms and have progressed …

Video-mined task graphs for keystep recognition in instructional videos

K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end we explore video-first architectures building on the common …

Selective structured state-spaces for long-form video understanding

J Wang, W Zhu, P Wang, X Yu, L Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Effective modeling of complex spatiotemporal dependencies in long-form videos remains an
open problem. The recently proposed Structured State-Space Sequence (S4) model with its …

HierVL: Learning hierarchical video-language embeddings

K Ashutosh, R Girdhar, L Torresani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language embeddings are a promising avenue for injecting semantics into visual
representations, but existing methods capture only short-term associations between …