Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

Streaming long video understanding with large language models

R Qian, X Dong, P Zhang, Y Zang, S Ding, D Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding that capably understands arbitrary-length video with a constant …

VideoAgent: Long-form video understanding with large language model as agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

DrVideo: Document Retrieval Based Long Video Understanding

Z Ma, C Gou, H Shi, B Sun, S Li, H Rezatofighi… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing methods for long video understanding primarily focus on videos lasting only tens of
seconds, with limited exploration of techniques for handling longer videos. The increased …

Understanding Long Videos in One Multimodal Language Model Pass

K Ranasinghe, X Li, K Kahatapitiya… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs), known to possess strong world knowledge,
have allowed recent approaches to achieve excellent performance on Long-Video …

Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

J Park, K Ranasinghe, K Kahatapitiya, W Ryoo… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-form videos that span wide temporal intervals are highly information-redundant
and contain multiple distinct events or entities that are often loosely related. Therefore, when …

HCQA@Ego4D EgoSchema Challenge 2024

H Zhang, Y Xie, Y Feng, Z Li, M Liu, L Nie - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present our champion solution for the Ego4D EgoSchema Challenge at CVPR
2024. To deeply integrate the powerful egocentric captioning model and question reasoning …