Learning visual representation from modality-shared contrastive language-image pre-training

H You, L Zhou, B Xiao, N Codella, Y Cheng… - … on Computer Vision, 2022 - Springer
Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn
transferable features for a range of downstream tasks by mapping multiple modalities into a …

Joint learning of localized representations from medical images and reports

P Müller, G Kaissis, C Zou, D Rueckert - European Conference on …, 2022 - Springer
Contrastive learning has proven effective for pre-training image models on unlabeled data
with promising results for tasks such as medical image classification. Using paired text (like …

Position-guided text prompt for vision-language pre-training

J Wang, P Zhou, MZ Shou… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language Pre-Training (VLP) has shown promising capabilities to align
image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we …

Borrowing knowledge from pre-trained language model: A new data-efficient visual learning paradigm

W Ma, S Li, JM Zhang, CH Liu, J Kang… - Proceedings of the …, 2023 - openaccess.thecvf.com
The development of vision models for real-world applications is hindered by the challenge of
annotated data scarcity, which has necessitated the adoption of data-efficient visual learning …

Radiological reports improve pre-training for localized imaging tasks on chest x-rays

P Müller, G Kaissis, C Zou, D Rueckert - International Conference on …, 2022 - Springer
Self-supervised pre-training on unlabeled images has shown promising results in the
medical domain. Recently, methods using text-supervision from companion text like …

Answer-Me: Multi-task open-vocabulary visual question answering

AJ Piergiovanni, W Li, W Kuo, M Saffar… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Answer-Me, a task-aware multi-task framework which unifies a variety of
question answering tasks, such as visual question answering, visual entailment, visual …

FindIt: Generalized localization with natural language queries

W Kuo, F Bertsch, W Li, AJ Piergiovanni… - … on Computer Vision, 2022 - Springer
We propose FindIt, a simple and versatile framework that unifies a variety of visual
grounding and localization tasks including referring expression comprehension, text-based …

Free-ATM: Exploring unsupervised learning on diffusion-generated images with free attention masks

DJ Zhang, M Xu, C Xue, W Zhang, X Han, S Bai… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite the rapid advancement of unsupervised learning in visual representation, it requires
training on large-scale datasets that demand costly data collection and pose additional …

Enhancing Visual Grounding in Vision-Language Pre-Training With Position-Guided Text Prompts

AJ Wang, P Zhou, MZ Shou… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Vision-Language Pre-Training (VLP) has demonstrated remarkable potential in aligning
image and text pairs, paving the way for a wide range of cross-modal learning tasks …

Scaling up Instance Segmentation using Approximately Localized Phrases

K Desai, I Misra, J Johnson, L van der Maaten - BMVC, 2022 - bmvc2022.mpi-inf.mpg.de
Training object detectors to segment large numbers of classes is challenging because they
require training masks for each class. A potential solution is to partially supervise detectors …