Attention mechanisms in computer vision: A survey
Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …
this observation, attention mechanisms were introduced into computer vision with the aim of …
Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
each action. It has a wide range of applications, and therefore has been attracting increasing …
St-adapter: Parameter-efficient image-to-video transfer learning
Capitalizing on large pre-trained models for various downstream tasks of interest have
recently emerged with promising performance. Due to the ever-growing model size, the …
recently emerged with promising performance. Due to the ever-growing model size, the …
Frozen clip models are efficient video learners
Video recognition has been dominated by the end-to-end learning paradigm–first initializing
a video recognition model with weights of a pretrained image model and then conducting …
a video recognition model with weights of a pretrained image model and then conducting …
Extracting motion and appearance via inter-frame attention for efficient video frame interpolation
Effectively extracting inter-frame motion and appearance information is important for video
frame interpolation (VFI). Previous works either extract both types of information in a mixed …
frame interpolation (VFI). Previous works either extract both types of information in a mixed …
Actionclip: A new paradigm for video action recognition
The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …
X3d: Expanding architectures for efficient video recognition
C Feichtenhofer - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
This paper presents X3D, a family of efficient video networks that progressively expand a
tiny 2D image classification architecture along multiple network axes, in space, time, width …
tiny 2D image classification architecture along multiple network axes, in space, time, width …
Temporal pyramid network for action recognition
Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling
such visual tempos of different actions facilitates their recognition. Previous works often …
such visual tempos of different actions facilitates their recognition. Previous works often …
Slowfast networks for video recognition
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway,
operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating …
operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating …
Vita-clip: Video and text adaptive clip via multimodal prompting
Adopting contrastive image-text pretrained models like CLIP towards video classification has
gained attention due to its cost-effectiveness and competitive performance. However, recent …
gained attention due to its cost-effectiveness and competitive performance. However, recent …