Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …
Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
Open-vocabulary panoptic segmentation with text-to-image diffusion models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …
Convolutions die hard: Open-vocabulary segmentation with a single frozen convolutional CLIP
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing
objects from an open set of categories in diverse environments. One way to address this …
GroupViT: Semantic segmentation emerges from text supervision
Grouping and recognition are important components of visual scene understanding, e.g., for
object detection and semantic segmentation. With end-to-end deep learning systems …
Scaling open-vocabulary image segmentation with image-level labels
We design an open-vocabulary image segmentation model to organize an image into
meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite …
Making the most of text semantics to improve biomedical vision–language processing
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …
Contrastive learning of medical visual representations from paired images and text
Learning visual representations of medical images (e.g., X-rays) is core to medical image
understanding but its progress has been held back by the scarcity of human annotations …
Open vocabulary semantic segmentation with patch aligned contrastive learning
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility
function for CLIP's contrastive loss, intending to train an alignment between the patch tokens …
AirBERT: In-domain pretraining for vision-and-language navigation
Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in
realistic environments using natural language instructions. Given the scarcity of domain …