Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

MusicLM: Generating music from text

A Agostinelli, TI Denk, Z Borsos, J Engel… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce MusicLM, a model generating high-fidelity music from text descriptions such
as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

EgoSchema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2024 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset and
benchmark to evaluate long video understanding capabilities of modern vision and …

Google scanned objects: A high-quality dataset of 3d scanned household items

L Downs, A Francis, N Koenig, B Kinman… - … on Robotics and …, 2022 - ieeexplore.ieee.org
Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but
simulating the broad diversity of environments needed for deep learning requires large …

MAGVIT: Masked generative video transformer

L Yu, Y Cheng, K Sohn, J Lezama… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various
video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Dreamix: Video diffusion models are general video editors

E Molad, E Horwitz, D Valevski, AR Acha… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-driven image and video diffusion models have recently achieved unprecedented
generation realism. While diffusion models have been successfully applied for image …

MERLOT: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in neural …, 2021 - proceedings.neurips.cc
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …