Human action recognition: A taxonomy-based survey, updates, and opportunities
Human action recognition systems use data collected from a wide range of sensors to
accurately identify and interpret human actions. One of the most challenging issues for …
PETR: Position embedding transformation for multi-view 3D object detection
In this paper, we develop position embedding transformation (PETR) for multi-view 3D
object detection. PETR encodes the position information of 3D coordinates into image …
MovieChat: From dense token to sparse memory for long video understanding
Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
Learning video representations from large language models
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …
Video transformers: A survey
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …
Memory-and-anticipation transformer for online action understanding
Most existing forecasting systems are memory-based methods, which attempt to mimic
human forecasting ability by employing various memory mechanisms and have progressed …
Video-mined task graphs for keystep recognition in instructional videos
K Ashutosh, SK Ramakrishnan… - Advances in Neural …, 2024 - proceedings.neurips.cc
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …
A simple recipe for contrastively pre-training video-first encoders beyond 16 frames
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end we explore video-first architectures building on the common …
Selective structured state-spaces for long-form video understanding
Effective modeling of complex spatiotemporal dependencies in long-form videos remains an
open problem. The recently proposed Structured State-Space Sequence (S4) model with its …
HierVL: Learning hierarchical video-language embeddings
Video-language embeddings are a promising avenue for injecting semantics into visual
representations, but existing methods capture only short-term associations between …