Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Survey of vulnerabilities in large language models revealed by adversarial attacks
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …
MiniGPT-4: Enhancing vision-language understanding with advanced large language models
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …
mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …
MM-Vet: Evaluating large multimodal models for integrated capabilities
We propose MM-Vet, an evaluation benchmark that examines large multimodal models
(LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …
CogVLM: Visual expert for pretrained language models
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …
Multimodal chain-of-thought reasoning in language models
Large language models (LLMs) have shown impressive performance on complex reasoning
by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains …
Monkey: Image resolution and text label are important things for large multi-modal models
Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …
Multimodal C4: An open, billion-scale corpus of images interleaved with text
In-context vision and language models like Flamingo support arbitrarily interleaved
sequences of images and text as input. This format not only enables few-shot learning via …