Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Survey of vulnerabilities in large language models revealed by adversarial attacks

E Shayegani, MAA Mamun, Y Fu, P Zaree… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as
they integrate more deeply into complex systems, the urgency to scrutinize their security …

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

D Zhu, J Chen, X Shen, X Li, M Elhoseiny - arXiv preprint arXiv …, 2023 - arxiv.org
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …

MM-Vet: Evaluating large multimodal models for integrated capabilities

W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose MM-Vet, an evaluation benchmark that examines large multimodal models
(LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …

Multimodal chain-of-thought reasoning in language models

Z Zhang, A Zhang, M Li, H Zhao, G Karypis… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have shown impressive performance on complex reasoning
by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains …

Monkey: Image resolution and text label are important things for large multi-modal models

Z Li, B Yang, Q Liu, Z Ma, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but
struggle with high-resolution input and detailed scene understanding. Addressing these …

Multimodal C4: An open, billion-scale corpus of images interleaved with text

W Zhu, J Hessel, A Awadalla… - Advances in …, 2024 - proceedings.neurips.cc
In-context vision and language models like Flamingo support arbitrarily interleaved
sequences of images and text as input. This format not only enables few-shot learning via …