A simple recipe for contrastively pre-training video-first encoders beyond 16 frames
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end we explore video-first architectures building on the common …
Internvideo2: Scaling video foundation models for multimodal video understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …
Memory consolidation enables long-context video understanding
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …
Cinepile: A long video question answering dataset and benchmark
Current datasets for long-form video understanding often fall short of providing genuine long-
form comprehension challenges, as many tasks derived from these datasets can be …
Autoeval-video: An automatic benchmark for assessing large vision language models in open-ended video question answering
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively
evaluate large vision-language models in open-ended video question answering. The …
A Survey of Video Datasets for Grounded Event Understanding
K Sanders, B Van Durme - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
While existing video benchmarks largely consider specialized downstream tasks like
retrieval or question-answering (QA), contemporary multimodal AI systems must be capable …
Needle In A Multimodal Haystack
With the rapid advancement of multimodal large language models (MLLMs), their evaluation
has become increasingly comprehensive. However, understanding long multimodal content …
LVBench: An Extreme Long Video Understanding Benchmark
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world
models"--interpreting and reasoning about complex real-world dynamics. To assess …
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
The recent developments in Large Multi-modal Video Models (Video-LMMs) have
significantly enhanced our ability to interpret and analyze video data. Despite their …