A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
The cost of vision-and-language pre-training has become increasingly prohibitive due to
end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and …
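
The core idea the abstract points at is training only a small bridging module between a frozen image encoder and a frozen LLM. Below is a minimal sketch of that pattern in PyTorch; the dimensions, the simple cross-attention bridge, and the Linear stand-in for the vision encoder are illustrative assumptions, not the paper's exact Q-Former design.

import torch
import torch.nn as nn

class QueryBridge(nn.Module):
    """Learnable queries that cross-attend to frozen image features and
    project them into the LLM's embedding space (BLIP-2-style bridging)."""
    def __init__(self, num_queries=32, vision_dim=768, llm_dim=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, image_feats):  # image_feats: (B, num_patches, vision_dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)  # (B, num_queries, llm_dim): "soft visual prompts"

vision_encoder = nn.Linear(768, 768)  # stand-in for a frozen ViT backbone
for p in vision_encoder.parameters():
    p.requires_grad = False           # frozen: only the bridge gets gradients

bridge = QueryBridge()
image_feats = torch.randn(2, 196, 768)  # e.g., 14x14 ViT patch features
visual_prompts = bridge(vision_encoder(image_feats))
print(visual_prompts.shape)             # torch.Size([2, 32, 1024])

Only the bridge's parameters are trainable, which is what keeps the pre-training cost low relative to end-to-end training of both large models.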
Image as a foreign language: BEiT pretraining for vision and vision-language tasks
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …
Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …
VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks
Large language models (LLMs) have notably accelerated progress towards artificial general
intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing …
Visual ChatGPT: Talking, drawing and editing with visual foundation models
ChatGPT is attracting a cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …
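
The pattern described here is a language model acting as a controller that routes requests to visual foundation models. The sketch below shows that dispatch loop; the "ACTION tool: argument" convention, the tool registry, and parse_action are hypothetical simplifications of the paper's Prompt Manager, and the lambdas stand in for real vision models.

from typing import Callable, Dict

# Hypothetical tool registry: names mapped to stand-ins for visual models.
TOOLS: Dict[str, Callable[[str], str]] = {
    "caption_image": lambda path: f"a caption for {path}",
    "edit_image": lambda instr: "path/to/edited.png",
}

def parse_action(llm_output: str):
    # Assumed convention: the controller emits "ACTION tool_name: argument".
    if llm_output.startswith("ACTION "):
        head, _, arg = llm_output[len("ACTION "):].partition(": ")
        return head, arg
    return None, llm_output

def dispatch(llm_output: str) -> str:
    tool, arg = parse_action(llm_output)
    if tool in TOOLS:
        return TOOLS[tool](arg)  # invoke the selected visual foundation model
    return arg                   # plain text reply; no tool needed

print(dispatch("ACTION caption_image: dog.jpg"))  # a caption for dog.jpg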
MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for Expert AGI
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …
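
Scoring a multimodal model on a benchmark like this typically reduces to multiple-choice accuracy over (image, question, options) records. A minimal sketch, assuming generic field names; the exact schema of the released MMMU data may differ, and model_predict is a placeholder for any LVLM.

def model_predict(question, options, image):
    return "A"  # placeholder: a real model would rank or generate an option letter

def accuracy(records):
    correct = 0
    for r in records:
        pred = model_predict(r["question"], r["options"], r["image"])
        correct += (pred == r["answer"])
    return correct / len(records)

sample = [{"question": "Which structure is shown?",
           "options": ["A", "B", "C", "D"],
           "answer": "A", "image": None}]
print(accuracy(sample))  # 1.0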
Qwen-VL: A frontier large vision-language model with versatile abilities
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …
Kosmos-2: Grounding multimodal large language models to the world
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …
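
Grounding text to boxes in this style works by discretizing the image into a grid and serializing each bounding box as location tokens the LLM can read and emit inline with text. A minimal sketch follows; the 32-bin grid matches the paper's general scheme, but the "<loc_k>" and "<phrase>" serialization below is illustrative, not the exact Kosmos-2 token format.

GRID = 32  # number of bins per image side

def box_to_tokens(x0, y0, x1, y1, width, height):
    """Map a pixel-space box to two corner location tokens."""
    def bin_index(x, y):
        col = min(int(x / width * GRID), GRID - 1)
        row = min(int(y / height * GRID), GRID - 1)
        return row * GRID + col
    return f"<loc_{bin_index(x0, y0)}><loc_{bin_index(x1, y1)}>"

# Ground the phrase "a dog" to a box in a 640x480 image:
print(f"<phrase>a dog</phrase>{box_to_tokens(100, 120, 300, 400, 640, 480)}")
# <phrase>a dog</phrase><loc_261><loc_847>

Because boxes become ordinary tokens, the same next-token objective used for language covers spatial grounding, which is what lets a single model both refer to and localize objects.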