PDAN: Pyramid dilated attention network for action detection

A Ulhaq, N Akhtar, G Pogrebna, A Mian - arXiv preprint arXiv:2209.05700, 2022 - arxiv.org

Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have also proven the efficacy of transformers beyond the image domain …

Cited by 40 Related articles All 4 versions

AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description

J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Generating multi-sentence descriptions for video is considered to be the most complex task
in computer vision and natural language understanding due to the intricate nature of video …

Cited by 83 Related articles All 4 versions

MS-TCT: Multi-scale temporal ConvTransformer for action detection

R Dai, S Das, K Kahatapitiya… - Proceedings of the …, 2022 - openaccess.thecvf.com

Action detection is an essential and challenging task, especially for densely labelled
datasets of untrimmed videos. The temporal relation is complex in those datasets, including …

Cited by 65 Related articles All 14 versions

Spotting temporally precise, fine-grained events in video

J Hong, H Zhang, M Gharbi, M Fisher… - European Conference on …, 2022 - Springer

We introduce the task of spotting temporally precise, fine-grained events in video (detecting
the precise moment in time events occur). Precise spotting requires models to reason …

Cited by 27 Related articles All 5 versions

Token Turing Machines

MS Ryoo, K Gopalakrishnan… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose Token Turing Machines (TTM), a sequential, autoregressive
Transformer model with memory for real-world sequential visual understanding. Our model …

Cited by 14 Related articles All 9 versions

Learning an augmented RGB representation with cross-modal knowledge distillation for action detection

R Dai, S Das, F Bremond - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

In video understanding, most cross-modal knowledge distillation (KD) methods are tailored
for classification tasks, focusing on the discriminative representation of the trimmed videos …

Cited by 40 Related articles All 19 versions

LAC - Latent action composition for skeleton-based action segmentation

D Yang, Y Wang, A Dantcheva… - Proceedings of the …, 2023 - openaccess.thecvf.com

Skeleton-based action segmentation requires recognizing composable actions in untrimmed
videos. Current approaches decouple this problem by first extracting local visual features …

Cited by 4 Related articles All 9 versions

PointTAD: Multi-label temporal action detection with learnable query points

J Tan, X Zhao, X Shi, B Kang… - Advances in Neural …, 2022 - proceedings.neurips.cc

Traditional temporal action detection (TAD) usually handles untrimmed videos with small
number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this …

Cited by 16 Related articles All 8 versions

TemporalMaxer: Maximize temporal context with only max pooling for temporal action localization

TN Tang, K Kim, K Sohn - arXiv preprint arXiv:2303.09055, 2023 - arxiv.org

Temporal Action Localization (TAL) is a challenging task in video understanding that aims to
identify and localize actions within a video sequence. Recent studies have emphasized the …

Cited by 21 Related articles All 2 versions

PAT: Position-aware transformer for dense multi-label action detection

F Sardari, A Mustafa, PJB Jackson… - Proceedings of the …, 2023 - openaccess.thecvf.com

We present PAT, a transformer-based network that learns complex temporal co-occurrence
action dependencies in a video by exploiting multi-scale temporal features. In existing …

Cited by 5 Related articles All 7 versions