A simple recipe for contrastively pre-training video-first encoders beyond 16 frames
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end we explore video-first architectures building on the common …
Internvideo2: Scaling video foundation models for multimodal video understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …
Memory consolidation enables long-context video understanding
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …
Cinepile: A long video question answering dataset and benchmark
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …
Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively
evaluate large vision-language models in open-ended video question answering. The …
A Survey of Video Datasets for Grounded Event Understanding
K Sanders, B Van Durme - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
While existing video benchmarks largely consider specialized downstream tasks like
retrieval or question-answering (QA), contemporary multimodal AI systems must be capable …
Needle In A Multimodal Haystack
With the rapid advancement of multimodal large language models (MLLMs), their evaluation
has become increasingly comprehensive. However, understanding long multimodal content …
LVBench: An Extreme Long Video Understanding Benchmark
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world
models"--interpreting and reasoning about complex real-world dynamics. To assess …
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
The recent developments in Large Multi-modal Video Models (Video-LMMs) have
significantly enhanced our ability to interpret and analyze video data. Despite their …