Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

VideoAgent: Long-form video understanding with large language model as agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Language repository for long video understanding

K Kahatapitiya, K Ranasinghe, J Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Language has become a prominent modality in computer vision with the rise of multi-modal
LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term …

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Z Wang, S Yu, E Stengel-Eskin, J Yoon… - arXiv preprint arXiv …, 2024 - arxiv.org
Video-language understanding tasks have focused on short video clips, often struggling with
long-form video understanding tasks. Recently, many long video-language understanding …

DrVideo: Document Retrieval Based Long Video Understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for long video understanding primarily focus on videos only lasting tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …

Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

J Park, K Ranasinghe, K Kahatapitiya, W Ryoo… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form videos that span across wide temporal intervals are highly information redundant
and contain multiple distinct events or entities that are often loosely-related. Therefore, when …

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …