Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
Language is not all you need: Aligning perception with language models
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
Scaling autoregressive models for content-rich text-to-image generation
We present the Pathways [1] Autoregressive Text-to-Image (Parti) model, which
generates high-fidelity photorealistic images and supports content-rich synthesis involving …
BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation
Vision-Language Pre-training (VLP) has advanced the performance for many vision-
language tasks. However, most existing pre-trained models only excel in either …
TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …
ClipCap: CLIP prefix for image captioning
Image captioning is a fundamental task in vision-language understanding, where the model
predicts a textual informative caption to a given input image. In this paper, we present a …
Diffsound: Discrete diffusion model for text-to-sound generation
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …
Aligning large multi-modal model with robust instruction tuning
Despite the promising progress in multi-modal tasks, current large multi-modal models
(LMM) are prone to hallucinating inconsistent descriptions with respect to the associated …
Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …