Vision transformers for action recognition: A survey

A Ulhaq, N Akhtar, G Pogrebna, A Mian - arXiv preprint arXiv:2209.05700, 2022 - arxiv.org
Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have also proven the efficacy of transformers beyond the image domain …

AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description

J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generating multi-sentence descriptions for video is considered to be the most complex task
in computer vision and natural language understanding due to the intricate nature of video …

MS-TCT: Multi-scale temporal ConvTransformer for action detection

R Dai, S Das, K Kahatapitiya… - Proceedings of the …, 2022 - openaccess.thecvf.com
Action detection is an essential and challenging task, especially for densely labelled
datasets of untrimmed videos. The temporal relation is complex in those datasets, including …

Spotting temporally precise, fine-grained events in video

J Hong, H Zhang, M Gharbi, M Fisher… - European Conference on …, 2022 - Springer
We introduce the task of spotting temporally precise, fine-grained events in video (detecting
the precise moment in time events occur). Precise spotting requires models to reason …

Token Turing machines

MS Ryoo, K Gopalakrishnan… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose Token Turing Machines (TTM), a sequential, autoregressive
Transformer model with memory for real-world sequential visual understanding. Our model …

Learning an augmented RGB representation with cross-modal knowledge distillation for action detection

R Dai, S Das, F Bremond - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
In video understanding, most cross-modal knowledge distillation (KD) methods are tailored
for classification tasks, focusing on the discriminative representation of the trimmed videos …

LAC - Latent action composition for skeleton-based action segmentation

D Yang, Y Wang, A Dantcheva… - Proceedings of the …, 2023 - openaccess.thecvf.com
Skeleton-based action segmentation requires recognizing composable actions in untrimmed
videos. Current approaches decouple this problem by first extracting local visual features …

PointTAD: Multi-label temporal action detection with learnable query points

J Tan, X Zhao, X Shi, B Kang… - Advances in Neural …, 2022 - proceedings.neurips.cc
Traditional temporal action detection (TAD) usually handles untrimmed videos with a small
number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this …

TemporalMaxer: Maximize temporal context with only max pooling for temporal action localization

TN Tang, K Kim, K Sohn - arXiv preprint arXiv:2303.09055, 2023 - arxiv.org
Temporal Action Localization (TAL) is a challenging task in video understanding that aims to
identify and localize actions within a video sequence. Recent studies have emphasized the …

PAT: Position-aware transformer for dense multi-label action detection

F Sardari, A Mustafa, PJB Jackson… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present PAT, a transformer-based network that learns complex temporal co-occurrence
action dependencies in a video by exploiting multi-scale temporal features. In existing …