Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have been
proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

FILIP: Fine-grained interactive language-image pre-training

L Yao, R Huang, L Hou, G Lu, M Niu, H Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …
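
A minimal sketch of the token-wise "late interaction" similarity FILIP is known for, written in PyTorch; the function names, tensor shapes, and the omission of padding masks are illustrative assumptions, not the paper's released code. Each image patch token is matched to its most similar text token (and vice versa), and the averaged maxima feed a standard symmetric contrastive loss.

import torch
import torch.nn.functional as F

def filip_similarity(img_tokens, txt_tokens):
    # img_tokens: (B, Ni, D) patch embeddings; txt_tokens: (B, Nt, D).
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    # Token-level cosine similarities for every image-text pair: (B, B, Ni, Nt).
    sim = torch.einsum("ind,jmd->ijnm", img, txt)
    # Image-to-text: each image token keeps its best-matching text token.
    i2t = sim.max(dim=-1).values.mean(dim=-1)   # (B, B)
    # Text-to-image: each text token keeps its best-matching image token.
    t2i = sim.max(dim=-2).values.mean(dim=-1)   # (B, B)
    return i2t, t2i

def symmetric_contrastive_loss(i2t, t2i, temperature=0.07):
    # Matched pairs sit on the diagonal of the batch similarity matrices.
    labels = torch.arange(i2t.size(0), device=i2t.device)
    loss_i = F.cross_entropy(i2t / temperature, labels)
    loss_t = F.cross_entropy(t2i.t() / temperature, labels)
    return (loss_i + loss_t) / 2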

Which industrial sectors are affected by artificial intelligence? A bibliometric analysis of trends and perspectives

L Espina-Romero, JG Noroño Sánchez… - Sustainability, 2023 - mdpi.com
In recent times, artificial intelligence (AI) has been making a significant impact across various
industry sectors, which means that companies must be ready to adjust to this promising start …

Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark

J Gu, X Meng, G Lu, L Hou, M Niu… - Advances in …, 2022 - proceedings.neurips.cc
Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …

CTP: Towards vision-language continual pretraining via compatible momentum contrast and topology preservation

H Zhu, Y Wei, X Liang, C Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-Language Pretraining (VLP) has shown impressive results on diverse
downstream tasks by offline training on large-scale datasets. Regarding the growing nature …
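
A minimal sketch of the momentum-contrast machinery that "compatible momentum contrast" builds on, following the MoCo-style exponential moving average; the class and method names are assumptions, and CTP's compatibility and topology-preservation terms are not reproduced here.

import copy
import torch

class MomentumEncoderPair(torch.nn.Module):
    def __init__(self, encoder, momentum=0.999):
        super().__init__()
        self.query_encoder = encoder                  # updated by gradients
        self.key_encoder = copy.deepcopy(encoder)     # updated by EMA only
        for p in self.key_encoder.parameters():
            p.requires_grad = False                   # keys get no gradient
        self.m = momentum

    @torch.no_grad()
    def update_key_encoder(self):
        # key = m * key + (1 - m) * query, called once per training step;
        # the slowly moving key encoder keeps old and new features comparable.
        for q, k in zip(self.query_encoder.parameters(),
                        self.key_encoder.parameters()):
            k.data.mul_(self.m).add_(q.data, alpha=1 - self.m)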

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arXiv preprint arXiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

EI-CLIP: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval

H Ma, H Zhao, Z Lin, A Kale, Z Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
… recommendation, and marketing services. Extensive efforts have been made to
conquer the cross-modal retrieval problem in the general domain. When it comes to E …

M5Product: Self-harmonized contrastive learning for e-commercial multi-modal pretraining

X Dong, X Zhan, Y Wu, Y Wei… - Proceedings of the …, 2022 - openaccess.thecvf.com
Despite the potential of multi-modal pre-training to learn highly discriminative feature
representations from complementary data modalities, current progress is being slowed by …

Composed image retrieval using contrastive learning and task-oriented CLIP-based features

A Baldrati, M Bertini, T Uricchio… - ACM Transactions on …, 2023 - dl.acm.org
Given a query composed of a reference image and a relative caption, the goal of Composed Image
Retrieval is to retrieve images that are visually similar to the reference one while integrating the …
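
A minimal sketch of composed image retrieval on top of CLIP features, assuming the openai/clip package; the normalized feature sum below is a naive stand-in for the task-oriented combiner network the paper trains, so treat it as a baseline illustration, not the authors' method.

import clip
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def composed_query(reference_image_path, relative_caption):
    image = preprocess(Image.open(reference_image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([relative_caption]).to(device)
    img_feat = model.encode_image(image).float()
    txt_feat = model.encode_text(text).float()
    # Naive combiner (assumption): sum of L2-normalized modal features.
    query = F.normalize(img_feat, dim=-1) + F.normalize(txt_feat, dim=-1)
    return F.normalize(query, dim=-1)

@torch.no_grad()
def retrieve(query, gallery_feats, k=5):
    # gallery_feats: (N, D) pre-normalized CLIP image features of the corpus.
    scores = gallery_feats @ query.squeeze(0)   # cosine similarity per image
    return scores.topk(k).indices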