3D CNNs with adaptive temporal feature resolutions

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Cited by 470 · All 16 versions

Dynamic neural networks: A survey

Y Han, G Huang, S Song, L Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Dynamic neural network is an emerging research topic in deep learning. Compared to static
models which have fixed computational graphs and parameters at the inference stage …

Cited by 631 · All 7 versions

Adaptive token sampling for efficient vision transformers

M Fayyaz, SA Koohpayegani, FR Jafari… - … on Computer Vision, 2022 - Springer

While state-of-the-art vision transformer models achieve promising results in image
classification, they are computationally expensive and require many GFLOPs. Although the …

Cited by 129 · All 10 versions

AMS-Net: Modeling adaptive multi-granularity spatio-temporal cues for video action recognition

Q Wang, Q Hu, Z Gao, P Li, Q Hu - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Effective spatio-temporal modeling as a core of video representation learning is challenged
by complex scale variations in spatio-temporal cues in videos, especially different visual …

Cited by 6 · All 3 versions

Efficient video action detection with token dropout and context refinement

L Chen, Z Tong, Y Song, G Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Streaming video clips with large-scale video tokens impede vision transformers (ViTs) for
efficient recognition, especially in video action detection where sufficient spatiotemporal …

Cited by 9 · All 5 versions

ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network

N Gkalelis, D Daskalakis, V Mezaris - IEEE Access, 2022 - ieeexplore.ieee.org

In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object
detector together with a Vision Transformer (ViT) backbone network to derive object and …

Cited by 14 · All 4 versions

Constructing better prototype generators with 3D CNNs for few-shot text classification

X Wang, Y Du, D Chen, X Li, X Chen, Y Lee… - Expert Systems with …, 2023 - Elsevier

Prototypical network is a key algorithm to solve few-shot problems. Previous prototypical
network based methods average sentence embeddings of the same class to obtain …

Cited by 5 · All 2 versions

Uncovering the Unseen: Discover Hidden Intentions by Micro-Behavior Graph Reasoning

Z Zhou, W Liu, D Xu, Z Wang, J Zhao - Proceedings of the 31st ACM …, 2023 - dl.acm.org

This paper introduces a new and challenging Hidden Intention Discovery (HID) task. Unlike
existing intention recognition tasks, which are based on obvious visual representations to …

Cited by 3 · All 4 versions

Identity-aware graph memory network for action detection

J Ni, J Qin, D Huang - Proceedings of the 29th ACM International …, 2021 - dl.acm.org

Action detection plays an important role in high-level video understanding and media
interpretation. Many existing studies fulfill this spatio-temporal localization by modeling the …

Cited by 9 · All 3 versions

EPK-CLIP: External and Priori Knowledge CLIP for action recognition

Z Yang, G An, Z Zheng, S Cao, F Wang - Expert Systems with Applications, 2024 - Elsevier

Contrastive Language-Image Pretraining (CLIP) models have achieved significant
success and have markedly improved the performance of various downstream tasks …