Spatio-temporal channel correlation networks for action classification

Attention mechanisms in computer vision: A survey

MH Guo, TX Xu, JJ Liu, ZN Liu, PT Jiang, TJ Mu… - Computational visual …, 2022 - Springer

Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …

Cited by 1741 Related articles All 8 versions

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Cited by 594 Related articles All 16 versions

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc

Capitalizing on large pre-trained models for various downstream tasks of interest have
recently emerged with promising performance. Due to the ever-growing model size, the …

Cited by 227 Related articles All 7 versions

Frozen CLIP models are efficient video learners

Z Lin, S Geng, R Zhang, P Gao, G De Melo… - … on Computer Vision, 2022 - Springer

Video recognition has been dominated by the end-to-end learning paradigm–first initializing
a video recognition model with weights of a pretrained image model and then conducting …

Cited by 216 Related articles All 5 versions

Extracting motion and appearance via inter-frame attention for efficient video frame interpolation

G Zhang, Y Zhu, H Wang, Y Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com

Effectively extracting inter-frame motion and appearance information is important for video
frame interpolation (VFI). Previous works either extract both types of information in a mixed …

Cited by 96 Related articles All 6 versions

ActionCLIP: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arXiv preprint arXiv:2109.08472, 2021 - arxiv.org

The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …

Cited by 415 Related articles All 2 versions

X3D: Expanding architectures for efficient video recognition

C Feichtenhofer - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com

This paper presents X3D, a family of efficient video networks that progressively expand a
tiny 2D image classification architecture along multiple network axes, in space, time, width …

Cited by 1215 Related articles All 7 versions

Temporal pyramid network for action recognition

C Yang, Y Xu, J Shi, B Dai… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com

Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling
such visual tempos of different actions facilitates their recognition. Previous works often …

Cited by 473 Related articles All 10 versions

SlowFast networks for video recognition

C Feichtenhofer, H Fan, J Malik… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway,
operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating …

Cited by 4025 Related articles All 11 versions

Vita-CLIP: Video and text adaptive CLIP via multimodal prompting

ST Wasim, M Naseer, S Khan… - Proceedings of the …, 2023 - openaccess.thecvf.com

Adopting contrastive image-text pretrained models like CLIP towards video classification has
gained attention due to its cost-effectiveness and competitive performance. However, recent …

Cited by 76 Related articles All 8 versions