Vision transformers for action recognition: A survey
Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have also proven the efficacy of transformers beyond the image domain …
AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description
J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generating multi-sentence descriptions for video is considered to be the most complex task
in computer vision and natural language understanding due to the intricate nature of video …
Ms-tct: Multi-scale temporal convtransformer for action detection
Action detection is an essential and challenging task, especially for densely labelled
datasets of untrimmed videos. The temporal relation is complex in those datasets, including …
Spotting temporally precise, fine-grained events in video
We introduce the task of spotting temporally precise, fine-grained events in video (detecting
the precise moment in time events occur). Precise spotting requires models to reason …
Token turing machines
MS Ryoo, K Gopalakrishnan… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose Token Turing Machines (TTM), a sequential, autoregressive
Transformer model with memory for real-world sequential visual understanding. Our model …
Learning an augmented RGB representation with cross-modal knowledge distillation for action detection
In video understanding, most cross-modal knowledge distillation (KD) methods are tailored
for classification tasks, focusing on the discriminative representation of the trimmed videos …
Lac-latent action composition for skeleton-based action segmentation
Skeleton-based action segmentation requires recognizing composable actions in untrimmed
videos. Current approaches decouple this problem by first extracting local visual features …
Pointtad: Multi-label temporal action detection with learnable query points
J Tan, X Zhao, X Shi, B Kang… - Advances in Neural …, 2022 - proceedings.neurips.cc
Traditional temporal action detection (TAD) usually handles untrimmed videos with small
number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this …
Temporalmaxer: Maximize temporal context with only max pooling for temporal action localization
Temporal Action Localization (TAL) is a challenging task in video understanding that aims to
identify and localize actions within a video sequence. Recent studies have emphasized the …
PAT: Position-aware transformer for dense multi-label action detection
We present PAT, a transformer-based network that learns complex temporal co-occurrence
action dependencies in a video by exploiting multi-scale temporal features. In existing …