Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

A survey of vision-language pre-trained models

Y Du, Z Liu, J Li, WX Zhao - arXiv preprint arXiv:2202.10936, 2022 - arxiv.org
As the Transformer evolves, pre-trained models have advanced at a breakneck pace in recent
years. They have dominated the mainstream techniques in natural language processing …

Chinese CLIP: Contrastive vision-language pretraining in Chinese

A Yang, J Pan, J Lin, R Men, Y Zhang, J Zhou… - arXiv preprint arXiv …, 2022 - arxiv.org
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …

Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark

J Gu, X Meng, G Lu, L Hou, N Minzhe… - Advances in …, 2022 - proceedings.neurips.cc
Abstract Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

FashionViL: Fashion-focused vision-and-language representation learning

X Han, L Yu, X Zhu, L Zhang, YZ Song… - European conference on …, 2022 - Springer
Abstract Large-scale Vision-and-Language (V+L) pre-training for representation learning
has proven to be effective in boosting various downstream V+L tasks. However, when it …

[Book][B] Foundation models for natural language processing: Pre-trained language models integrating media

G Paaß, S Giesselbach - 2023 - library.oapen.org
This open access book provides a comprehensive overview of the state of the art in research
and applications of Foundation Models and is intended for readers familiar with basic …

One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code

Y Dai, D Tang, L Liu, M Tan, C Zhou, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
People perceive the world with multiple senses (e.g., through hearing sounds, reading words
and seeing objects). However, most existing AI systems only process an individual modality …

A roadmap for big model

S Yuan, H Zhao, S Zhao, J Leng, Y Liang… - arXiv preprint arXiv …, 2022 - arxiv.org
With the rapid development of deep learning, training Big Models (BMs) for multiple
downstream tasks becomes a popular paradigm. Researchers have achieved various …

Multimodal pretraining from monolingual to multilingual

L Zhang, L Ruan, A Hu, Q Jin - Machine Intelligence Research, 2023 - Springer
Multimodal pretraining has made convincing achievements in various downstream tasks in
recent years. However, since the majority of the existing works construct models based on …