Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
FILIP: Fine-grained interactive language-image pre-training
Unsupervised large-scale vision-language pre-training has shown promising advances on
various downstream tasks. Existing methods often model the cross-modal interaction either …
Which industrial sectors are affected by artificial intelligence? A bibliometric analysis of trends and perspectives
L Espina-Romero, JG Noroño Sánchez… - Sustainability, 2023 - mdpi.com
In recent times, artificial intelligence (AI) has been generating a significant impact in various
industry sectors, which implies that companies must be ready to adjust to this promising start …
Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark
Abstract Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of pre-trained cross …
CTP: Towards vision-language continual pretraining via compatible momentum contrast and topology preservation
Abstract Vision-Language Pretraining (VLP) has shown impressive results on diverse
downstream tasks by offline training on large-scale datasets. Regarding the growing nature …
downstream tasks by offline training on large-scale datasets. Regarding the growing nature …
Transformers in speech processing: A survey
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …
EI-CLIP: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval
Abstract … recommendation, and marketing services. Extensive efforts have been made to
conquer the cross-modal retrieval problem in the general domain. When it comes to E …
M5Product: Self-harmonized contrastive learning for e-commercial multi-modal pretraining
Despite the potential of multi-modal pre-training to learn highly discriminative feature
representations from complementary data modalities, current progress is being slowed by …
Composed image retrieval using contrastive learning and task-oriented clip-based features
Given a query composed of a reference image and a relative caption, the Composed Image
Retrieval goal is to retrieve images visually similar to the reference one that integrates the …