Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
each action. It has a wide range of applications, and therefore has been attracting increasing …
A comprehensive survey of vision-based human action recognition methods
Although widely used in many applications, accurate and efficient human action recognition
remains a challenging area of research in the field of computer vision. Most recent surveys …
remains a challenging area of research in the field of computer vision. Most recent surveys …
Mvitv2: Improved multiscale vision transformers for classification and detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …
image and video classification, as well as object detection. We present an improved version …
Ego4d: Around the world in 3,000 hours of egocentric video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
Multiscale vision transformers
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …
Vivit: A video vision transformer
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …
success of such models in image classification. Our model extracts spatio-temporal tokens …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
from multiple modalities such as vision and audio. Machine perception models, in stark …
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Although convolutional neural networks (CNNs) have achieved great success in computer
vision, this work investigates a simpler, convolution-free backbone network useful for many …
vision, this work investigates a simpler, convolution-free backbone network useful for many …
Prompting visual-language models for efficient video understanding
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
visual-textual representations from large-scale web data, revealing remarkable ability for …