Memory consolidation enables long-context video understanding

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …

Cited by: 6 (2 versions)

VideoAgent: Long-form video understanding with large language model as agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - arXiv preprint arXiv …, 2024 - arxiv.org

Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Cited by: 15 (2 versions)

Language repository for long video understanding

K Kahatapitiya, K Ranasinghe, J Park… - arXiv preprint arXiv …, 2024 - arxiv.org

Language has become a prominent modality in computer vision with the rise of multi-modal
LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term …

Cited by: 8 (3 versions)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Z Wang, S Yu, E Stengel-Eskin, J Yoon… - arXiv preprint arXiv …, 2024 - arxiv.org

Video-language understanding tasks have focused on short video clips, often struggling with
long-form video understanding tasks. Recently, many long video-language understanding …

Cited by: 6 (2 versions)

DrVideo: Document Retrieval Based Long Video Understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org

Existing methods for long video understanding primarily focus on videos only lasting tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …

Cited by: 2

Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

J Park, K Ranasinghe, K Kahatapitiya, W Ryoo… - arXiv preprint arXiv …, 2024 - arxiv.org

Long-form videos that span across wide temporal intervals are highly information redundant
and contain multiple distinct events or entities that are often loosely-related. Therefore, when …

Cited by: 1 (2 versions)

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org

Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …

Cited by: 4 (5 versions)