Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
MusicLM: Generating music from text
We introduce MusicLM, a model generating high-fidelity music from text descriptions such
as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
EgoSchema: A diagnostic benchmark for very long-form video language understanding
K Mangalam, R Akshulakov… - Advances in Neural …, 2024 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset, and
benchmark to evaluate long video understanding capabilities of modern vision and …
Google scanned objects: A high-quality dataset of 3d scanned household items
L Downs, A Francis, N Koenig, B Kinman… - … on Robotics and …, 2022 - ieeexplore.ieee.org
Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but
simulating the broad diversity of environments needed for deep learning requires large …
MAGVIT: Masked generative video transformer
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various
video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video …
MERLOT Reserve: Neural script knowledge through vision and language and sound
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …
Dreamix: Video diffusion models are general video editors
E Molad, E Horwitz, D Valevski, AR Acha… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-driven image and video diffusion models have recently achieved unprecedented
generation realism. While diffusion models have been successfully applied for image …
MERLOT: Multimodal neural script knowledge models
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …