Uniformer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Video swin transformer

Z Liu, J Ning, Y Cao, Y Wei, Z Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

A comprehensive review of recent deep learning techniques for human activity recognition

VT Le, K Tran-Trung, VT Hoang - Computational Intelligence …, 2022 - Wiley Online Library
Human action recognition is an important field in computer vision that has attracted
remarkable attention from researchers. This survey aims to provide a comprehensive …

Long short-term transformer for online action detection

M Xu, Y Xiong, H Chen, X Li, W Xia… - Advances in Neural …, 2021 - proceedings.neurips.cc
We present Long Short-term TRansformer (LSTR), a temporal modeling algorithm
for online action detection, which employs a long-and short-term memory mechanism to …

Perspectives and prospects on transformer architecture for cross-modal tasks with language and vision

A Shin, M Ishii, T Narihira - International journal of computer vision, 2022 - Springer
Transformer architectures have brought about fundamental changes to the computational
linguistics field, which had been dominated by recurrent neural networks for many years. Its …

Video contrastive learning with global context

H Kuang, Y Zhu, Z Zhang, X Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
Contrastive learning has revolutionized the self-supervised image representation learning
field and recently been adapted to the video domain. One of the greatest advantages of …

Stochastic backpropagation: A memory efficient strategy for training video models

F Cheng, M Xu, Y Xiong, H Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose a memory efficient method, named Stochastic Backpropagation (SBP), for
training deep neural networks on videos. It is based on the finding that gradients from …

A*: Atrous spatial temporal action recognition for real time applications

M Kim, F Spinola, P Benz… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Deep learning has become a popular tool across various fields and is increasingly being
integrated into real-world applications such as autonomous driving cars and surveillance …

Shrinking temporal attention in transformers for video action recognition

B Li, P Xiong, C Han, T Guo - Proceedings of the AAAI Conference on …, 2022 - ojs.aaai.org
Spatiotemporal modeling in a unified architecture is key for video action recognition. This
paper proposes a Shrinking Temporal Attention Transformer (STAT), which efficiently builds …