A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …

Internvideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

Cinepile: A long video question answering dataset and benchmark

R Rawal, K Saifullah, R Basri, D Jacobs… - arXiv preprint arXiv …, 2024 - arxiv.org
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …

Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering

X Chen, Y Lin, Y Zhang, W Huang - arXiv preprint arXiv:2311.14906, 2023 - arxiv.org
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively
evaluate large vision-language models in open-ended video question answering. The …

A Survey of Video Datasets for Grounded Event Understanding

K Sanders, B Van Durme - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
While existing video benchmarks largely consider specialized downstream tasks like
retrieval or question-answering (QA), contemporary multimodal AI systems must be capable …

Needle In A Multimodal Haystack

W Wang, S Zhang, Y Ren, Y Duan, T Li, S Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid advancement of multimodal large language models (MLLMs), their evaluation
has become increasingly comprehensive. However, understanding long multimodal content …

LVBench: An Extreme Long Video Understanding Benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

X He, W Feng, K Zheng, Y Lu, W Zhu, J Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of
"world models": interpreting and reasoning about complex real-world dynamics. To assess …

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs

R Bharadwaj, H Gani, M Naseer, FS Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent developments in Large Multi-modal Video Models (Video-LMMs) have
significantly enhanced our ability to interpret and analyze video data. Despite their …