TokenLearner: Adaptive space-time tokenization for videos

M Ryoo, AJ Piergiovanni, A Arnab… - Advances in neural …, 2021 - proceedings.neurips.cc
In this paper, we introduce a novel visual representation learning which relies on a handful
of adaptively learned tokens, and which is applicable to both image and video …

Rethinking video ViTs: Sparse video tubes for joint image and video learning

AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We present a simple approach which can turn a ViT encoder into an efficient video model,
which can seamlessly work with both image and video inputs. By sparsely sampling the …

A comprehensive study of deep video action recognition

Y Zhu, X Li, C Liu, M Zolfaghari, Y Xiong, C Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
Video action recognition is one of the representative tasks for video understanding. Over the
last decade, we have witnessed great advancements in video action recognition thanks to …

TokenLearner: What can 8 learned tokens do for images and videos?

MS Ryoo, AJ Piergiovanni, A Arnab… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we introduce a novel visual representation learning which relies on a handful
of adaptively learned tokens, and which is applicable to both image and video …

MMNet: A model-based multimodal network for human action recognition in RGB-D videos

XB Bruce, Y Liu, X Zhang, S Zhong… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Human action recognition (HAR) in RGB-D videos has been widely investigated since the
release of affordable depth sensors. Currently, unimodal approaches (e.g., skeleton-based …

Transformers in action recognition: A review on temporal modeling

E Shabaninia, H Nezamabadi-pour… - arXiv preprint arXiv …, 2022 - arxiv.org
In vision-based action recognition, spatio-temporal features from different modalities are
used for recognizing activities. Temporal modeling is a long challenge of action recognition …

Cross-modal representation learning for zero-shot action recognition

CC Lin, K Lin, L Wang, Z Liu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
We present a cross-modal Transformer-based framework, which jointly encodes video data
and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually …

4D-Net for learned multi-modal alignment

AJ Piergiovanni, V Casser, MS Ryoo… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB
sensing information, both in time. We are able to incorporate the 4D information by …

Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition

Z Yu, B Zhou, J Wan, P Wang, H Chen… - … on Image Processing, 2021 - ieeexplore.ieee.org
Gesture recognition has attracted considerable attention owing to its great potential in
applications. Although great progress has been made recently in multi-modal learning …

VPN++: Rethinking video-pose embeddings for understanding activities of daily living

S Das, R Dai, D Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Many attempts have been made towards combining RGB and 3D poses for the recognition
of Activities of Daily Living (ADL). ADL may look very similar and often necessitate to model …