Video description: A comprehensive survey of deep learning approaches
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
MovieChat: From dense token to sparse memory for long video understanding
Recently, integrating video foundation models and large language models to build video
understanding systems has overcome the limitations of specific pre-defined vision tasks. Yet …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives
Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
End-to-end generative pretraining for multimodal video captioning
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
AI Choreographer: Music-conditioned 3D dance generation with AIST++
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with
FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
AutoAD II: The sequel - who, when, and what in movie audio description
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …
VindLU: A recipe for effective video-and-language pretraining
The last several years have witnessed remarkable progress in video-and-language (VidL)
understanding. However, most modern VidL approaches use complex and specialized …