Gemini: a family of highly capable multimodal models
G Team, R Anil, S Borgeaud, JB Alayrac, J Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable
capabilities across image, audio, video, and text understanding. The Gemini family consists …
The Llama 3 herd of models
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …
LLaVA-OneVision: Easy visual task transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
InternVideo2: Scaling video foundation models for multimodal video understanding
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …
LingoQA: Visual question answering for autonomous driving
We introduce LingoQA, a novel dataset and benchmark for visual question answering in
autonomous driving. The dataset contains 28K unique short video scenarios, and 419K …
Video instruction tuning with synthetic data
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …
A simple recipe for contrastively pre-training video-first encoders beyond 16 frames
Understanding long real-world videos requires modeling long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …
BootsTAP: Bootstrapped training for tracking-any-point
To endow models with greater understanding of physics and motion, it is useful to enable
them to perceive how solid surfaces move and deform in real scenes. This can be formalized …
LVBench: An extreme long video understanding benchmark
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several evaluation datasets …
Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …