Gemini: a family of highly capable multimodal models

Gemini Team, R Anil, S Borgeaud, JB Alayrac, J Yu… - arXiv preprint, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …

The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art
performance in action recognition, video-text tasks, and video-centric dialogue. Our …

LingoQA: Visual question answering for autonomous driving

AM Marcu, L Chen, J Hünermann, A Karnsund… - European Conference on Computer Vision, 2024 - Springer
We introduce LingoQA, a novel dataset and benchmark for visual question answering in
autonomous driving. The dataset contains 28K unique short video scenarios, and 419K …

Video instruction tuning with synthetic data

Y Zhang, J Wu, W Li, B Li, Z Ma, Z Liu, C Li - arXiv preprint, 2024 - arxiv.org
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling long-range visual dependencies.
To this end, we explore video-first architectures building on the common …

BootsTAP: Bootstrapped training for tracking-any-point

C Doersch, P Luc, Y Yang, D Gokay… - Proceedings of the …, 2024 - openaccess.thecvf.com
To endow models with greater understanding of physics and motion, it is useful to enable
them to perceive how solid surfaces move and deform in real scenes. This can be formalized …

LVBench: An extreme long video understanding benchmark

W Wang, Z He, W Hong, Y Cheng, X Zhang, J Qi… - arXiv preprint, 2024 - arxiv.org
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …