Video classification with channel-separated convolutional networks

N Gupta, SK Gupta, RK Pathak, V Jain… - Artificial intelligence …, 2022 - Springer

Human activity recognition (HAR) has multifaceted applications due to its worldly usage of
acquisition devices such as smartphones, video cameras, and its ability to capture human …

Cited by 205 · Related articles · All 10 versions

[PDF] mdpi.com

Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data?

O Moutik, H Sekkat, S Tigani, A Chehri, R Saadane… - Sensors, 2023 - mdpi.com

Understanding actions in videos remains a significant challenge in computer vision, which
has been the subject of several pieces of research in the last decades. Convolutional neural …

Cited by 57 · Related articles · All 12 versions

[PDF] neurips.cc

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc

Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

Cited by 845 · Related articles · All 6 versions

[PDF] thecvf.com

MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Cited by 674 · Related articles · All 6 versions

[PDF] neurips.cc

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc

Capitalizing on large pre-trained models for various downstream tasks of interest have
recently emerged with promising performance. Due to the ever-growing model size, the …

Cited by 177 · Related articles · All 7 versions

[PDF] arxiv.org

Frozen CLIP models are efficient video learners

Z Lin, S Geng, R Zhang, P Gao, G De Melo… - … on Computer Vision, 2022 - Springer

Video recognition has been dominated by the end-to-end learning paradigm–first initializing
a video recognition model with weights of a pretrained image model and then conducting …

Cited by 172 · Related articles · All 5 versions

[PDF] arxiv.org

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Cited by 296 · Related articles · All 6 versions

[PDF] thecvf.com

Multiview transformers for video recognition

S Yan, X Xiong, A Arnab, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com

Video understanding requires reasoning at multiple spatiotemporal resolutions--from short
fine-grained motions to events taking place over longer durations. Although transformer …

Cited by 269 · Related articles · All 9 versions

[PDF] thecvf.com

Video Swin Transformer

Z Liu, J Ning, Y Cao, Y Wei, Z Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …

Cited by 1634 · Related articles · All 8 versions

[PDF] thecvf.com

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Cited by 1304 · Related articles · All 5 versions