Survey: Transformer based video-language pre-training
L Ruan, Q Jin - AI Open, 2022 - Elsevier
Inspired by the success of transformer-based pre-training methods on natural language
tasks and further computer vision tasks, researchers have started to apply transformer to …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …
Deep image captioning: A review of methods, trends and future challenges
Image captioning, also called report generation in medical field, aims to describe visual
content of images in human language, which requires to model semantic relationship …
Internvideo: General video foundation models via generative and discriminative learning
The foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
Less is more: Clipbert for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (eg, video question answering)
dictates a neural model to learn from offline-extracted dense video features from vision …
Videogpt: Video generation using vq-vae and transformers
We present VideoGPT: a conceptually simple architecture for scaling likelihood based
generative modeling to natural videos. VideoGPT uses VQ-VAE that learns downsampled …
Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
Next-qa: Next phase of question-answering to explaining temporal actions
We introduce NExT-QA, a rigorously designed video question answering (VideoQA)
benchmark to advance video understanding from describing to explaining the temporal …
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time consuming to create and …
From recognition to cognition: Visual commonsense reasoning
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …