Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

A survey of deep learning for mathematical reasoning

P Lu, L Qiu, W Yu, S Welleck, KW Chang - arXiv preprint arXiv:2212.10535, 2022 - arxiv.org
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in
various fields, including science, engineering, finance, and everyday life. The development …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

Llama-adapter: Efficient fine-tuning of language models with zero-init attention

R Zhang, J Han, C Liu, P Gao, A Zhou, X Hu… - arXiv preprint arXiv …, 2023 - arxiv.org
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …

Is ChatGPT a general-purpose natural language processing task solver?

C Qin, A Zhang, Z Zhang, J Chen, M Yasunaga… - arXiv preprint arXiv …, 2023 - arxiv.org
Spurred by advancements in scale, large language models (LLMs) have demonstrated the
ability to perform a variety of natural language processing (NLP) tasks zero-shot--ie, without …

Visual chatgpt: Talking, drawing and editing with visual foundation models

C Wu, S Yin, W Qi, X Wang, Z Tang, N Duan - arXiv preprint arXiv …, 2023 - arxiv.org
ChatGPT is attracting a cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

C Li, C Wong, S Zhang, N Usuyama… - Advances in …, 2024 - proceedings.neurips.cc
Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigations focus on unimodal text. Multimodal …

Chameleon: Plug-and-play compositional reasoning with large language models

P Lu, B Peng, H Cheng, M Galley… - Advances in …, 2024 - proceedings.neurips.cc
Large language models (LLMs) have achieved remarkable progress in solving various
natural language processing tasks due to emergent reasoning abilities. However, LLMs …