Temporal action segmentation: An analysis of modern techniques

G Ding, F Sener, A Yao - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …

VideoCLIP: Contrastive pre-training for zero-shot video-text understanding

H Xu, G Ghosh, PY Huang, D Okhonko… - arXiv preprint arXiv …, 2021 - arxiv.org
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …

ActBERT: Learning global-local video-text representations

L Zhu, Y Yang - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …

Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

D Damen, H Doughty, GM Farinella, A Furnari… - International Journal of …, 2022 - Springer
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M …

UniVL: A unified video and language pre-training model for multimodal understanding and generation

H Luo, L Ji, B Shi, H Huang, N Duan, T Li, J Li… - arXiv preprint arXiv …, 2020 - arxiv.org
With the recent success of pre-training techniques for NLP and image-linguistic tasks, video-linguistic pre-training works have gradually been developed to improve video-text …

TACo: Token-aware cascade contrastive learning for video-text alignment

J Yang, Y Bisk, J Gao - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Contrastive learning has been widely used to train transformer-based vision-language
models for video-text alignment and multi-modal representation learning. This paper …

Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation

N Behrmann, SA Golestaneh, Z Kolter, J Gall… - European conference on …, 2022 - Springer
This paper introduces a unified framework for video action segmentation via sequence to
sequence (seq2seq) translation in a fully and timestamp supervised setup. In contrast to …

Few-shot video classification via temporal alignment

K Cao, J Ji, Z Cao, CY Chang… - Proceedings of the …, 2020 - openaccess.thecvf.com
Difficulty in collecting and annotating large-scale video data raises a growing interest in
learning models which can recognize novel classes with only a few training examples. In …

VLM: Task-agnostic video-language model pre-training for video understanding

H Xu, G Ghosh, PY Huang, P Arora… - arXiv preprint arXiv …, 2021 - arxiv.org
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task …

COIN: A large-scale dataset for comprehensive instructional video analysis

Y Tang, D Ding, Y Rao, Y Zheng… - Proceedings of the …, 2019 - openaccess.thecvf.com
There are abundant instructional videos on the Internet, which enable us to acquire knowledge for completing various tasks. However, most existing datasets for instruction …