Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

A survey of vision-language pre-trained models

Y Du, Z Liu, J Li, WX Zhao - arXiv preprint arXiv:2202.10936, 2022 - arxiv.org
As the Transformer evolves, pre-trained models have advanced at a breakneck pace in recent
years. They have dominated the mainstream techniques in natural language processing …

Chinese CLIP: Contrastive vision-language pretraining in Chinese

A Yang, J Pan, J Lin, R Men, Y Zhang, J Zhou… - arXiv preprint arXiv …, 2022 - arxiv.org
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …

Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark

J Gu, X Meng, G Lu, L Hou, N Minzhe… - Advances in …, 2022 - proceedings.neurips.cc
Abstract Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

FashionViL: Fashion-focused vision-and-language representation learning

X Han, L Yu, X Zhu, L Zhang, YZ Song… - European conference on …, 2022 - Springer
Abstract Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …

[Book][B] Foundation models for natural language processing: Pre-trained language models integrating media

G Paaß, S Giesselbach - 2023 - library.oapen.org
This open access book provides a comprehensive overview of the state of the art in research
and applications of Foundation Models and is intended for readers familiar with basic …

One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code

Y Dai, D Tang, L Liu, M Tan, C Zhou, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
People perceive the world with multiple senses (e.g., through hearing sounds, reading words
and seeing objects). However, most existing AI systems only process an individual modality …

A roadmap for big model

S Yuan, H Zhao, S Zhao, J Leng, Y Liang… - arXiv preprint arXiv …, 2022 - arxiv.org
With the rapid development of deep learning, training Big Models (BMs) for multiple
downstream tasks becomes a popular paradigm. Researchers have achieved various …

Multimodal pretraining from monolingual to multilingual

L Zhang, L Ruan, A Hu, Q Jin - Machine Intelligence Research, 2023 - Springer
Multimodal pretraining has made convincing achievements in various downstream tasks in
recent years. However, since the majority of the existing works construct models based on …