Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding that capably understands arbitrary-length video with a constant …

VideoAgent: Long-form video understanding with large language model as agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

DrVideo: Document Retrieval Based Long Video Understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for long video understanding primarily focus on videos lasting only tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …

Understanding Long Videos in One Multimodal Language Model Pass

K Ranasinghe, X Li, K Kahatapitiya… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs), known to possess strong world knowledge,
have allowed recent approaches to achieve excellent performance on Long-Video …

Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

J Park, K Ranasinghe, K Kahatapitiya, W Ryoo… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form videos that span wide temporal intervals are highly information-redundant
and contain multiple distinct events or entities that are often loosely related. Therefore, when …

HCQA@Ego4D EgoSchema Challenge 2024

H Zhang, Y Xie, Y Feng, Z Li, M Liu, L Nie - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present our champion solution for the Ego4D EgoSchema Challenge at CVPR
2024. To deeply integrate the powerful egocentric captioning model and question reasoning …