Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions

A Rahate, R Walambe, S Ramanna, K Kotecha - Information Fusion, 2022 - Elsevier
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP

Q Yu, J He, X Deng, X Shen… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing
objects from an open set of categories in diverse environments. One way to address this …

GroupViT: Semantic segmentation emerges from text supervision

J Xu, S De Mello, S Liu, W Byeon… - Proceedings of the …, 2022 - openaccess.thecvf.com
Grouping and recognition are important components of visual scene understanding, e.g., for
object detection and semantic segmentation. With end-to-end deep learning systems …

Scaling open-vocabulary image segmentation with image-level labels

G Ghiasi, X Gu, Y Cui, TY Lin - European Conference on Computer Vision, 2022 - Springer
We design an open-vocabulary image segmentation model to organize an image into
meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite …

Making the most of text semantics to improve biomedical vision–language processing

B Boecking, N Usuyama, S Bannur, DC Castro… - European conference on …, 2022 - Springer
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …

Contrastive learning of medical visual representations from paired images and text

Y Zhang, H Jiang, Y Miura… - Machine Learning …, 2022 - proceedings.mlr.press
Learning visual representations of medical images (e.g., X-rays) is core to medical image
understanding but its progress has been held back by the scarcity of human annotations …
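
A minimal sketch, assuming a PyTorch setup, of the bidirectional image-text contrastive (InfoNCE-style) objective this line of work uses to align paired images and reports; the function name, projection shapes, and temperature value are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def paired_contrastive_loss(image_emb, text_emb, temperature=0.1):
    # image_emb, text_emb: (N, D) projected embeddings of N paired images/reports.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each image to its report and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))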

Open vocabulary semantic segmentation with patch aligned contrastive learning

J Mukhoti, TY Lin, O Poursaeed… - Proceedings of the …, 2023 - openaccess.thecvf.com
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility
function for CLIP's contrastive loss, intending to train an alignment between the patch tokens …
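
A minimal sketch, assuming PyTorch, of how a patch-aligned compatibility score could be computed for a CLIP-style contrastive loss: each caption attends over an image's patch tokens, and the resulting pooled patch embedding is compared back to the caption. Function and tensor names are illustrative assumptions, not PACL's exact formulation.

import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_tokens, text_emb, temperature=0.07):
    # patch_tokens: (N, P, D) vision patch embeddings for N images.
    # text_emb:     (N, D) text embeddings for the paired captions.
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Per-patch similarity to every caption: (N_img, P, N_txt).
    sim = torch.einsum('ipd,td->ipt', patch_tokens, text_emb)
    # Softmax over patches gives text-conditioned attention weights.
    weights = sim.softmax(dim=1)
    # Text-weighted pooling of patches, then similarity to the caption again.
    pooled = F.normalize(torch.einsum('ipt,ipd->itd', weights, patch_tokens), dim=-1)
    # (N_img, N_txt) compatibility matrix usable in a CLIP-style contrastive loss.
    return torch.einsum('itd,td->it', pooled, text_emb) / temperature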

Airbert: In-domain pretraining for vision-and-language navigation

PL Guhur, M Tapaswi, S Chen… - Proceedings of the …, 2021 - openaccess.thecvf.com
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in
realistic environments using natural language instructions. Given the scarcity of domain …