Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
Contrastive representation learning: A framework and review
Contrastive Learning has recently received interest due to its success in self-supervised
representation learning in the computer vision domain. However, the origins of Contrastive …
Make-A-Video: Text-to-video generation without text-video data
We propose Make-A-Video, an approach for directly translating the tremendous recent
progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
MineDojo: Building open-ended embodied agents with internet-scale knowledge
Autonomous agents have made great strides in specialist domains like Atari games and Go.
However, they typically learn tabula rasa in isolated environments with limited and manually …
Expanding language-image pretrained models for general video recognition
Contrastive language-image pretraining has shown great success in learning visual-textual
joint representation from web-scale data, demonstrating remarkable “zero-shot” …
MViTv2: Improved multiscale vision transformers for classification and detection
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …
ST-Adapter: Parameter-efficient image-to-video transfer learning
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …
Frozen CLIP models are efficient video learners
Video recognition has been dominated by the end-to-end learning paradigm–first initializing
a video recognition model with weights of a pretrained image model and then conducting …
OmniVL: One foundation model for image-language and video-language tasks
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …