Uniformer: Unifying convolution and self-attention for visual recognition
It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …
large local redundancy and complex global dependency in these visual data. Convolution …
Video swin transformer
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …
Transformer architectures have attained top accuracy on the major video recognition …
Anticipative video transformer
Abstract We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …
video modeling architecture that attends to the previously observed video in order to …
A comprehensive review of recent deep learning techniques for human activity recognition
VT Le, K Tran-Trung, VT Hoang - Computational Intelligence …, 2022 - Wiley Online Library
Human action recognition is an important field in computer vision that has attracted
remarkable attention from researchers. This survey aims to provide a comprehensive …
remarkable attention from researchers. This survey aims to provide a comprehensive …
Long short-term transformer for online action detection
Abstract We present Long Short-term TRansformer (LSTR), a temporal modeling algorithm
for online action detection, which employs a long-and short-term memory mechanism to …
for online action detection, which employs a long-and short-term memory mechanism to …
Perspectives and prospects on transformer architecture for cross-modal tasks with language and vision
Transformer architectures have brought about fundamental changes to computational
linguistic field, which had been dominated by recurrent neural networks for many years. Its …
linguistic field, which had been dominated by recurrent neural networks for many years. Its …
Video contrastive learning with global context
Contrastive learning has revolutionized the self-supervised image representation learning
field and recently been adapted to the video domain. One of the greatest advantages of …
field and recently been adapted to the video domain. One of the greatest advantages of …
Stochastic backpropagation: A memory efficient strategy for training video models
We propose a memory efficient method, named Stochastic Backpropagation (SBP), for
training deep neural networks on videos. It is based on the finding that gradients from …
training deep neural networks on videos. It is based on the finding that gradients from …
A*: Atrous spatial temporal action recognition for real time applications
Deep learning has become a popular tool across various fields and is increasingly being
integrated into real-world applications such as autonomous driving cars and surveillance …
integrated into real-world applications such as autonomous driving cars and surveillance …
Shrinking temporal attention in transformers for video action recognition
Spatiotemporal modeling in an unified architecture is key for video action recognition. This
paper proposes a Shrinking Temporal Attention Transformer (STAT), which efficiently builts …
paper proposes a Shrinking Temporal Attention Transformer (STAT), which efficiently builts …