Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
A survey of vision-language pre-trained models
As the Transformer architecture evolves, pre-trained models have advanced at a breakneck pace in recent
years. They have dominated the mainstream techniques in natural language processing …
Chinese CLIP: Contrastive vision-language pretraining in Chinese
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …
Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark
Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …
AltCLIP: Altering the language encoder in CLIP for extended language capabilities
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …
FashionViL: Fashion-focused vision-and-language representation learning
Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …
[Book] Foundation models for natural language processing: Pre-trained language models integrating media
G Paaß, S Giesselbach - 2023 - library.oapen.org
This open access book provides a comprehensive overview of the state of the art in research
and applications of Foundation Models and is intended for readers familiar with basic …
One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code
People perceive the world with multiple senses (e.g., through hearing sounds, reading words
and seeing objects). However, most existing AI systems only process an individual modality …
Multimodal pretraining from monolingual to multilingual
Multimodal pretraining has made convincing achievements in various downstream tasks in
recent years. However, since the majority of the existing works construct models based on …