Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

VideoCLIP: Contrastive pre-training for zero-shot video-text understanding

H Xu, G Ghosh, PY Huang, D Okhonko… - arXiv preprint arXiv …, 2021 - arxiv.org
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot
video and text understanding, without using any labels on downstream tasks. VideoCLIP …
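
The core technique here is a symmetric contrastive (InfoNCE-style) objective over paired clip
and caption embeddings. The PyTorch sketch below illustrates that general objective only; it is
not VideoCLIP's exact loss (the paper additionally samples temporally overlapped positive pairs
and retrieval-augmented hard negatives), and the random tensors stand in for encoder outputs.

import torch
import torch.nn.functional as F

def contrastive_video_text_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) pooled clip / caption embeddings
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # pairwise cosine similarities
    targets = torch.arange(len(logits), device=logits.device)
    # symmetric InfoNCE: match each clip to its caption and vice versa
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

video = torch.randn(8, 512)  # stand-in for a video encoder's output
text = torch.randn(8, 512)   # stand-in for a text encoder's output
print(contrastive_video_text_loss(video, text).item())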

Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

Assembly101: A large-scale multi-view video dataset for understanding procedural activities

F Sener, D Chatterjee, D Shelepov… - Proceedings of the …, 2022 - openaccess.thecvf.com
Assembly101 is a new procedural activity dataset featuring 4321 videos of people
assembling and disassembling 101 "take-apart" toy vehicles. Participants work without fixed …

Bridging video-text retrieval with multiple choice questions

Y Ge, Y Ge, X Liu, D Li, Y Shan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Pre-training a model to learn transferable video-text representation for retrieval has attracted
a lot of attention in recent years. Previous dominant works mainly adopt two separate …

Everything at once - multi-modal fusion transformer for video retrieval

N Shvetsova, B Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks …
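
A common way to realize such fusion is a single transformer encoder shared across modalities,
with a learned modality embedding added to each token so the encoder knows where a token came
from. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the
paper's exact architecture; dimensions, layer counts, and mean pooling are arbitrary choices.

import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2, n_modalities=3):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        # learned embedding marking which modality each token belongs to
        self.modality_emb = nn.Embedding(n_modalities, dim)

    def forward(self, tokens_per_modality):
        # tokens_per_modality: list of (batch, seq_i, dim) token tensors
        parts = [tok + self.modality_emb.weight[i]
                 for i, tok in enumerate(tokens_per_modality)]
        x = torch.cat(parts, dim=1)         # fuse all modalities in one sequence
        return self.encoder(x).mean(dim=1)  # pooled joint embedding

video = torch.randn(2, 16, 256)  # 16 video tokens per sample
audio = torch.randn(2, 12, 256)
text = torch.randn(2, 8, 256)
print(FusionTransformer()([video, audio, text]).shape)  # torch.Size([2, 256])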

ALFRED: A benchmark for interpreting grounded instructions for everyday tasks

M Shridhar, J Thomason, D Gordon… - Proceedings of the …, 2020 - openaccess.thecvf.com
We present ALFRED (Action Learning From Realistic Environments and Directives),
a benchmark for learning a mapping from natural language instructions and egocentric …

ActBERT: Learning global-local video-text representations

L Zhu, Y Yang - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time consuming to create and …
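
Joint text-video embeddings in this line of work are typically trained with a bidirectional
max-margin ranking loss over aligned clip-caption pairs, treating the other pairs in a batch as
negatives. The PyTorch sketch below shows that objective in minimal form; it omits the paper's
intra-video negative mining, and the random tensors stand in for learned embeddings.

import torch
import torch.nn.functional as F

def max_margin_ranking_loss(video_emb, text_emb, margin=0.1):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.T                  # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)  # similarity of each aligned pair
    # penalize any negative that comes within `margin` of its positive
    cost_v2t = (margin + sim - pos).clamp(min=0)    # video -> text direction
    cost_t2v = (margin + sim - pos.T).clamp(min=0)  # text -> video direction
    mask = 1.0 - torch.eye(len(sim), device=sim.device)  # drop the diagonal
    return ((cost_v2t + cost_t2v) * mask).sum() / mask.sum()

print(max_margin_ranking_loss(torch.randn(8, 256), torch.randn(8, 256)).item())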