Multiview transformers for video recognition

S Yan, X Xiong, A Arnab, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Video understanding requires reasoning at multiple spatiotemporal resolutions--from short
fine-grained motions to events taking place over longer durations. Although transformer …

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …
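
A minimal sketch of the spatio-temporal tokenization this abstract refers to, assuming a tubelet-style 3D patch embedding; the `tubelet` size, embedding dimension, and clip shape below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Sketch of spatio-temporal tokenization: non-overlapping 3D patches
    ("tubelets") of a video clip are linearly projected into token embeddings."""
    def __init__(self, embed_dim=768, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        # A 3D convolution with stride == kernel size splits the clip into
        # non-overlapping tubelets and projects each one to embed_dim.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) token sequence

# Usage: an 8-frame 224x224 clip becomes a sequence of 784 tokens.
tokens = TubeletEmbedding()(torch.randn(1, 3, 8, 224, 224))
print(tokens.shape)  # torch.Size([1, 784, 768])
```
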

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …
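
The quoted recipe (15 epochs, initial learning rate 0.005, divided by 10 at epochs 11 and 14) amounts to a standard step schedule; a minimal PyTorch sketch, where the model and optimizer are placeholder assumptions standing in for the actual video transformer and its training setup:

```python
import torch

model = torch.nn.Linear(768, 400)  # placeholder for the video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# Divide the learning rate by 10 at epochs 11 and 14, as stated in the snippet.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[11, 14], gamma=0.1)

for epoch in range(15):
    # ... one training pass over the data would go here ...
    scheduler.step()
    print(epoch + 1, scheduler.get_last_lr())
```
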

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc
In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

A comparative review of graph convolutional networks for human skeleton-based action recognition

L Feng, Y Zhao, W Zhao, J Tang - Artificial Intelligence Review, 2022 - Springer
Human action recognition is one of the most active research topics, and many relevant
review papers cover the multi-modality of data, the selection of feature vectors …

Adaptive token sampling for efficient vision transformers

M Fayyaz, SA Koohpayegani, FR Jafari… - … on Computer Vision, 2022 - Springer
While state-of-the-art vision transformer models achieve promising results in image
classification, they are computationally expensive and require many GFLOPs. Although the …

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Space-time mixing attention for video transformer

A Bulat, JM Perez Rua, S Sudhakaran… - Advances in neural …, 2021 - proceedings.neurips.cc
This paper is on video recognition using Transformers. Very recent attempts in this area
have demonstrated promising results in terms of recognition accuracy, yet they have been …

MM-ViT: Multi-modal video transformer for compressed video action recognition

J Chen, CM Ho - Proceedings of the IEEE/CVF winter …, 2022 - openaccess.thecvf.com
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video
Transformer (MM-ViT), for video action recognition. Different from other schemes which …