AssembleNet++: Assembling modality representations via attention connections

M Ryoo, AJ Piergiovanni, A Arnab… - Advances in neural …, 2021 - proceedings.neurips.cc

In this paper, we introduce a novel visual representation learning which relies on a handful
of adaptively learned tokens, and which is applicable to both image and video …

Cited by 158 Related articles All 9 versions

[PDF] thecvf.com

Rethinking video ViTs: Sparse video tubes for joint image and video learning

AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

We present a simple approach which can turn a ViT encoder into an efficient video model,
which can seamlessly work with both image and video inputs. By sparsely sampling the …

Cited by 78 Related articles All 7 versions

[PDF] arxiv.org

A comprehensive study of deep video action recognition

Y Zhu, X Li, C Liu, M Zolfaghari, Y Xiong, C Wu… - arXiv preprint arXiv …, 2020 - arxiv.org

Video action recognition is one of the representative tasks for video understanding. Over the
last decade, we have witnessed great advancements in video action recognition thanks to …

Cited by 232 Related articles All 2 versions

[PDF] arxiv.org

TokenLearner: What can 8 learned tokens do for images and videos?

MS Ryoo, AJ Piergiovanni, A Arnab… - arXiv preprint arXiv …, 2021 - arxiv.org

In this paper, we introduce a novel visual representation learning which relies on a handful
of adaptively learned tokens, and which is applicable to both image and video …

Cited by 127 Related articles All 2 versions

[PDF] nsf.gov

MMNet: A model-based multimodal network for human action recognition in RGB-D videos

XB Bruce, Y Liu, X Zhang, S Zhong… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Human action recognition (HAR) in RGB-D videos has been widely investigated since the
release of affordable depth sensors. Currently, unimodal approaches (e.g., skeleton-based …

Cited by 90 Related articles All 5 versions

[PDF] arxiv.org

Transformers in action recognition: A review on temporal modeling

E Shabaninia, H Nezamabadi-pour… - arXiv preprint arXiv …, 2022 - arxiv.org

In vision-based action recognition, spatio-temporal features from different modalities are
used for recognizing activities. Temporal modeling is a long challenge of action recognition …

Cited by 15 Related articles All 2 versions

[PDF] thecvf.com

Cross-modal representation learning for zero-shot action recognition

CC Lin, K Lin, L Wang, Z Liu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

We present a cross-modal Transformer-based framework, which jointly encodes video data
and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually …

Cited by 51 Related articles All 5 versions

[PDF] thecvf.com

4D-Net for learned multi-modal alignment

AJ Piergiovanni, V Casser, MS Ryoo… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB
sensing information, both in time. We are able to incorporate the 4D information by …

Cited by 79 Related articles All 8 versions

[PDF] ieee.org

Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition

Z Yu, B Zhou, J Wan, P Wang, H Chen… - … on Image Processing, 2021 - ieeexplore.ieee.org

Gesture recognition has attracted considerable attention owing to its great potential in
applications. Although the great progress has been made recently in multi-modal learning …

Cited by 112 Related articles All 7 versions

[PDF] arxiv.org

VPN++: Rethinking video-pose embeddings for understanding activities of daily living

S Das, R Dai, D Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Many attempts have been made towards combining RGB and 3D poses for the recognition
of Activities of Daily Living (ADL). ADL may look very similar and often necessitate to model …

Cited by 68 Related articles All 11 versions