Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
State of the art on diffusion models for visual computing
The field of visual computing is rapidly advancing due to the emergence of generative
artificial intelligence (AI), which unlocks unprecedented capabilities for the generation …
Align your latents: High-resolution video synthesis with latent diffusion models
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding
excessive compute demands by training a diffusion model in a compressed lower …
Make-A-Video: Text-to-video generation without text-video data
We propose Make-A-Video, an approach for directly translating the tremendous recent
progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
Any-to-any generation via composable diffusion
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …
ModelScope text-to-video technical report
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a
text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio …
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …
ControlVideo: Training-free controllable text-to-video generation
Text-driven diffusion models have unlocked unprecedented abilities in image generation,
whereas their video counterpart still lags behind due to the excessive training cost of …
SimDA: Simple diffusion adapter for efficient video generation
The recent wave of AI-generated content has witnessed the great development and success
of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of …