Deep learning-based action detection in untrimmed videos: A survey
Understanding human behavior and activity facilitates advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …
applications, and is critical for video analysis. Despite the progress of action recognition …
Causal reasoning meets visual representation learning: A prospective study
Visual representation learning is ubiquitous in various real-world applications, including
visual comprehension, video understanding, multi-modal analysis, human-computer …
visual comprehension, video understanding, multi-modal analysis, human-computer …
Moviechat: From dense token to sparse memory for long video understanding
Recently integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
Mvitv2: Improved multiscale vision transformers for classification and detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …
image and video classification, as well as object detection. We present an improved version …
Multiview transformers for video recognition
Video understanding requires reasoning at multiple spatiotemporal resolutions--from short
fine-grained motions to events taking place over longer durations. Although transformer …
fine-grained motions to events taking place over longer durations. Although transformer …
Multiscale vision transformers
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
Vivit: A video vision transformer
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …
success of such models in image classification. Our model extracts spatio-temporal tokens …
Actionclip: A new paradigm for video action recognition
The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …
Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition
While today's video recognition systems parse snapshots or short clips accurately, they
cannot connect the dots and reason across a longer range of time yet. Most existing video …
cannot connect the dots and reason across a longer range of time yet. Most existing video …
Revisiting the" video" in video-language understanding
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …
single image? Building on recent progress in self-supervised image-language models, we …