Pre-trained language models in biomedical domain: A systematic survey
Pre-trained language models (PLMs) have been the de facto paradigm for most natural
language processing tasks. This also benefits the biomedical domain: researchers from …
Making the most of text semantics to improve biomedical vision–language processing
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …
Contrastive learning of medical visual representations from paired images and text
Learning visual representations of medical images (e.g., X-rays) is core to medical image
understanding, but its progress has been held back by the scarcity of human annotations …
Large-scale domain-specific pretraining for biomedical vision-language processing
Contrastive pretraining on parallel image-text data has attained great success in vision-
language processing (VLP), as exemplified by CLIP and related methods. However, prior …
Clip in medical imaging: A comprehensive survey
Contrastive Language-Image Pre-training (CLIP), a straightforward yet effective pre-training
paradigm, successfully introduces semantic-rich text supervision to vision models and has …
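Several of the entries above build on CLIP-style contrastive pretraining. As context, here is a minimal NumPy sketch of the symmetric InfoNCE objective CLIP optimizes over a batch of matched image-text embedding pairs; the function name and the temperature value are illustrative, not taken from any of the cited works.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired embeddings.

    image_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(l):
        # Log-softmax per row, with the diagonal entry as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With perfectly matched pairs the loss approaches zero, while mismatched pairs drive it up, which is the signal that pulls paired image and text embeddings together during pretraining.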
Multimodal variational auto-encoder based audio-visual segmentation
Abstract We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …
Joint learning of localized representations from medical images and reports
Contrastive learning has proven effective for pre-training image models on unlabeled data
with promising results for tasks such as medical image classification. Using paired text (like …
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Biomedical data is inherently multimodal, comprising physical measurements and natural
language narratives. A generalist biomedical AI model needs to simultaneously process …
A scoping review on multimodal deep learning in biomedical images and texts
Objective Computer-assisted diagnostic and prognostic systems of the future should be
capable of simultaneously processing multimodal data. Multimodal deep learning (MDL) …
S-clip: Semi-supervised vision-language learning using few specialist captions
Vision-language models, such as contrastive language-image pre-training (CLIP), have
demonstrated impressive results in natural image domains. However, these models often …