Memory consolidation enables long-context video understanding
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …
Streaming long video understanding with large language models
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …
VideoAgent: Long-form video understanding with large language model as agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
Towards Generalist Robot Learning from Internet Video: A Survey
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …
DrVideo: Document Retrieval Based Long Video Understanding
Existing methods for long video understanding primarily focus on videos only lasting tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …
Understanding Long Videos in One Multimodal Language Model Pass
Large Language Models (LLMs), known to contain a strong awareness of world knowledge,
have allowed recent approaches to achieve excellent performance on Long-Video …
Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA
Long-form videos that span across wide temporal intervals are highly information redundant
and contain multiple distinct events or entities that are often loosely-related. Therefore, when …
HCQA @ Ego4D EgoSchema Challenge 2024
In this report, we present our champion solution for Ego4D EgoSchema Challenge in CVPR
2024. To deeply integrate the powerful egocentric captioning model and question reasoning …