Learning visual representation from modality-shared contrastive language-image pre-training

H You, L Zhou, B Xiao, N Codella, Y Cheng… - … on Computer Vision, 2022 - Springer
Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn
transferable features for a range of downstream tasks by mapping multiple modalities into a …

Joint learning of localized representations from medical images and reports

P Müller, G Kaissis, C Zou, D Rueckert - European Conference on …, 2022 - Springer
Contrastive learning has proven effective for pre-training image models on unlabeled data
with promising results for tasks such as medical image classification. Using paired text (like …

Position-guided text prompt for vision-language pre-training

J Wang, P Zhou, MZ Shou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language Pre-Training (VLP) has shown promising capabilities to align
image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we …

Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm

W Ma, S Li, JM Zhang, CH Liu, J Kang… - Proceedings of the …, 2023 - openaccess.thecvf.com
The development of vision models for real-world applications is hindered by the challenge of
annotated data scarcity, which has necessitated the adoption of data-efficient visual learning …

Radiological reports improve pre-training for localized imaging tasks on chest x-rays

P Müller, G Kaissis, C Zou, D Rueckert - International Conference on …, 2022 - Springer
Self-supervised pre-training on unlabeled images has shown promising results in the
medical domain. Recently, methods using text-supervision from companion text like …

Answer-Me: Multi-task open-vocabulary visual question answering

AJ Piergiovanni, W Li, W Kuo, M Saffar… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Answer-Me, a task-aware multi-task framework which unifies a variety of
question answering tasks, such as visual question answering, visual entailment, visual …

FindIt: Generalized localization with natural language queries

W Kuo, F Bertsch, W Li, AJ Piergiovanni… - … on Computer Vision, 2022 - Springer
We propose FindIt, a simple and versatile framework that unifies a variety of visual
grounding and localization tasks including referring expression comprehension, text-based …

Free-ATM: Exploring unsupervised learning on diffusion-generated images with free attention masks

DJ Zhang, M Xu, C Xue, W Zhang, X Han, S Bai… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite the rapid advancement of unsupervised learning in visual representation, it requires
training on large-scale datasets that demand costly data collection and pose additional …

Enhancing Visual Grounding in Vision-Language Pre-Training With Position-Guided Text Prompts

AJ Wang, P Zhou, MZ Shou… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning
image and text pairs, paving the way for a wide range of cross-modal learning tasks …

Scaling up Instance Segmentation using Approximately Localized Phrases

K Desai, I Misra, J Johnson, L van der Maaten - BMVC, 2022 - bmvc2022.mpi-inf.mpg.de
Training object detectors to segment large numbers of classes is challenging because they
require training masks for each class. A potential solution is to partially supervise detectors …