Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Palm-e: An embodied multimodal language model

D Driess, F Xia, MSM Sajjadi, C Lynch… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, eg, for robotics problems, raises the challenge of grounding. We …

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

Visual chatgpt: Talking, drawing and editing with visual foundation models

C Wu, S Yin, W Qi, X Wang, Z Tang, N Duan - arXiv preprint arXiv …, 2023 - arxiv.org
ChatGPT is attracting a cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …

Vipergpt: Visual inference via python execution for reasoning

D Surís, S Menon, C Vondrick - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …

Instructpix2pix: Learning to follow image editing instructions

T Brooks, A Holynski, AA Efros - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We propose a method for editing images from human instructions: given an input image and
a written instruction that tells the model what to do, our model follows these instructions to …

A survey on multimodal large language models

S Yin, C Fu, S Zhao, K Li, X Sun, T Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
Multimodal Large Language Model (MLLM) recently has been a new rising research
hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform …

Language models can solve computer tasks

G Kim, P Baldi, S McAleer - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Agents capable of carrying out general tasks on a computer can improve efficiency and
productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally …

Voxposer: Composable 3d value maps for robotic manipulation with language models

W Huang, C Wang, R Zhang, Y Li, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that
can be extracted for robot manipulation in the form of reasoning and planning. Despite the …