Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
PaLM-E: An embodied multimodal language model
Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We …
Open-vocabulary panoptic segmentation with text-to-image diffusion models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …
Visual ChatGPT: Talking, drawing and editing with visual foundation models
ChatGPT is attracting a cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …
ViperGPT: Visual inference via Python execution for reasoning
Answering visual queries is a complex task that requires both visual processing and
reasoning. End-to-end models, the dominant approach for this task, do not explicitly …
InstructPix2Pix: Learning to follow image editing instructions
We propose a method for editing images from human instructions: given an input image and
a written instruction that tells the model what to do, our model follows these instructions to …
A survey on multimodal large language models
Multimodal Large Language Models (MLLMs) have recently emerged as a rising research
hotspot, using powerful Large Language Models (LLMs) as a brain to perform …
Language models can solve computer tasks
Agents capable of carrying out general tasks on a computer can improve efficiency and
productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally …
VoxPoser: Composable 3D value maps for robotic manipulation with language models
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that
can be extracted for robot manipulation in the form of reasoning and planning. Despite the …