Foundational models defining a new era in vision: A survey and outlook
Vision systems to see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
fundamental to understanding our world. The complex relations between objects and their …
Survey of vulnerabilities in large language models revealed by adversarial attacks
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …
they integrate more deeply into complex systems, the urgency to scrutinize their security …
A survey of large language models
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …
Improved baselines with visual instruction tuning
Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …
instruction tuning. In this paper we present the first systematic study to investigate the design …
Generative multimodal models are in-context learners
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions which current multimodal systems largely struggle to imitate. In this work …
simple instructions which current multimodal systems largely struggle to imitate. In this work …
Next-gpt: Any-to-any multimodal llm
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
methods to data compression. Recent advances in statistical machine learning have opened …
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However the progress in vision and vision …
possibilities for multi-modal AGI systems. However the progress in vision and vision …
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit
impressive skills in various domains, their ability for mathematical reasoning within visual …
impressive skills in various domains, their ability for mathematical reasoning within visual …
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
We present Unified-IO 2 a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …
novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can …