Temporal action segmentation: An analysis of modern techniques
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …
VideoCLIP: Contrastive pre-training for zero-shot video-text understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
ActBERT: Learning global-local video-text representations
In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …
Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-
KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M …
UniVL: A unified video and language pre-training model for multimodal understanding and generation
With the recent success of the pre-training technique for NLP and image-linguistic tasks,
some video-linguistic pre-training works are gradually developed to improve video-text …
TACo: Token-aware cascade contrastive learning for video-text alignment
Contrastive learning has been widely used to train transformer-based vision-language
models for video-text alignment and multi-modal representation learning. This paper …
Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation
This paper introduces a unified framework for video action segmentation via sequence to
sequence (seq2seq) translation in a fully and timestamp supervised setup. In contrast to …
Few-shot video classification via temporal alignment
Difficulty in collecting and annotating large-scale video data raises a growing interest in
learning models which can recognize novel classes with only a few training examples. In …
VLM: Task-agnostic video-language model pre-training for video understanding
We present a simplified, task-agnostic multi-modal pre-training approach that can accept
either video or text input, or both for a variety of end tasks. Existing pre-training are task …
COIN: A large-scale dataset for comprehensive instructional video analysis
There are substantial instruction videos on the Internet, which enables us to acquire
knowledge for completing various tasks. However, most existing datasets for instruction …