Video description: A comprehensive survey of deep learning approaches
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
MovieChat: From dense token to sparse memory for long video understanding
Recently, integrating video foundation models and large language models to build video
understanding systems has overcome the limitations of specific pre-defined vision tasks. Yet …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
End-to-end generative pretraining for multimodal video captioning
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
AI Choreographer: Music-conditioned 3D dance generation with AIST++
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with
FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
AutoAD II: The sequel - who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
VindLU: A recipe for effective video-and-language pretraining
The last several years have witnessed remarkable progress in video-and-language (VidL)
understanding. However, most modern VidL approaches use complex and specialized …