Attention clusters: Purely attention based local feature integration for video classification

J Chen, Q Wang, HH Cheng, W Peng… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

A semantic understanding of road traffic can help people understand road traffic flow
situations and emergencies more accurately and provide a more accurate basis for anomaly …

Cited by 86 Related articles All 3 versions

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Cited by 202 Related articles All 6 versions

Region attention networks for pose and occlusion robust facial expression recognition

K Wang, X Peng, J Yang, D Meng… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

Occlusion and pose variations, which can change facial appearance significantly, are two
major obstacles for automatic Facial Expression Recognition (FER). Though automatic FER …

Cited by 726 Related articles All 7 versions

TSM: Temporal shift module for efficient video understanding

J Lin, C Gan, S Han - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com

The explosive growth in video streaming gives rise to challenges on performing video
understanding at high accuracy and low computation cost. Conventional 2D CNNs are …

Cited by 1993 Related articles All 15 versions

Group-aware label transfer for domain adaptive person re-identification

K Zheng, W Liu, L He, T Mei, J Luo… - Proceedings of the …, 2021 - openaccess.thecvf.com

Unsupervised Domain Adaptive (UDA) person re-identification (ReID) aims at
adapting the model trained on a labeled source-domain dataset to a target-domain dataset …

Cited by 191 Related articles All 6 versions

Video action transformer network

R Girdhar, J Carreira, C Doersch… - Proceedings of the …, 2019 - openaccess.thecvf.com

We introduce the Action Transformer model for recognizing and localizing human
actions in video clips. We repurpose a Transformer-style architecture to aggregate features …

Cited by 828 Related articles All 11 versions

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer

In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Cited by 173 Related articles All 8 versions

EPIC-Fusion: Audio-visual temporal binding for egocentric action recognition

E Kazakos, A Nagrani, A Zisserman… - Proceedings of the …, 2019 - openaccess.thecvf.com

We focus on multi-modal fusion for egocentric action recognition, and propose a novel
architecture for multi-modal temporal-binding, i.e., the combination of modalities within a …

Cited by 375 Related articles All 15 versions

Listen to look: Action recognition by previewing audio

R Gao, TH Oh, K Grauman… - Proceedings of the …, 2020 - openaccess.thecvf.com

In the face of the video data deluge, today's expensive clip-level classifiers are increasingly
impractical. We propose a framework for efficient action recognition in untrimmed video that …

Cited by 262 Related articles All 7 versions

Audiovisual slowfast networks for video recognition

F Xiao, YJ Lee, K Grauman, J Malik… - arXiv preprint arXiv …, 2020 - arxiv.org

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual
perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a …

Cited by 229 Related articles All 2 versions