Tokenlearner: Adaptive space-time tokenization for videos
In this paper, we introduce a novel visual representation learning which relies on a handful
of adaptively learned tokens, and which is applicable to both image and video …
of adaptively learned tokens, and which is applicable to both image and video …
Rethinking video vits: Sparse video tubes for joint image and video learning
AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We present a simple approach which can turn a ViT encoder into an efficient video model,
which can seamlessly work with both image and video inputs. By sparsely sampling the …
which can seamlessly work with both image and video inputs. By sparsely sampling the …
A comprehensive study of deep video action recognition
Video action recognition is one of the representative tasks for video understanding. Over the
last decade, we have witnessed great advancements in video action recognition thanks to …
last decade, we have witnessed great advancements in video action recognition thanks to …
Tokenlearner: What can 8 learned tokens do for images and videos?
In this paper, we introduce a novel visual representation learning which relies on a handful
of adaptively learned tokens, and which is applicable to both image and video …
of adaptively learned tokens, and which is applicable to both image and video …
Mmnet: A model-based multimodal network for human action recognition in rgb-d videos
Human action recognition (HAR) in RGB-D videos has been widely investigated since the
release of affordable depth sensors. Currently, unimodal approaches (eg, skeleton-based …
release of affordable depth sensors. Currently, unimodal approaches (eg, skeleton-based …
Transformers in action recognition: A review on temporal modeling
E Shabaninia, H Nezamabadi-pour… - arXiv preprint arXiv …, 2022 - arxiv.org
In vision-based action recognition, spatio-temporal features from different modalities are
used for recognizing activities. Temporal modeling is a long challenge of action recognition …
used for recognizing activities. Temporal modeling is a long challenge of action recognition …
Cross-modal representation learning for zero-shot action recognition
We present a cross-modal Transformer-based framework, which jointly encodes video data
and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually …
and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually …
4d-net for learned multi-modal alignment
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB
sensing information, both in time. We are able to incorporate the 4D information by …
sensing information, both in time. We are able to incorporate the 4D information by …
Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition
Gesture recognition has attracted considerable attention owing to its great potential in
applications. Although the great progress has been made recently in multi-modal learning …
applications. Although the great progress has been made recently in multi-modal learning …
Vpn++: Rethinking video-pose embeddings for understanding activities of daily living
Many attempts have been made towards combining RGB and 3D poses for the recognition
of Activities of Daily Living (ADL). ADL may look very similar and often necessitate to model …
of Activities of Daily Living (ADL). ADL may look very similar and often necessitate to model …