Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - arXiv preprint arXiv …, 2023 - arxiv.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Segment and caption anything

X Huang, J Wang, Y Tang, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability
to generate regional captions. SAM presents strong generalizability to segment anything …

SCLIP: Rethinking self-attention for dense vision-language inference

F Wang, J Mei, A Yuille - arXiv preprint arXiv:2312.01597, 2023 - arxiv.org
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated
strong capabilities in zero-shot classification by aligning visual representations with target …

Panoptic vision-language feature fields

H Chen, K Blomqvist, F Milano… - IEEE Robotics and …, 2024 - ieeexplore.ieee.org
Recently, methods have been proposed for 3D open-vocabulary semantic segmentation.
Such methods are able to segment scenes into arbitrary classes based on text descriptions …

TMCFN: Text-supervised multidimensional contrastive fusion network for hyperspectral and LiDAR classification

Y Yang, J Qu, W Dong, T Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The joint classification of hyperspectral images (HSIs) and LiDAR data plays a crucial role in
Earth observation missions. Most advanced methods are based on discrete label …

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

JJ Wu, ACH Chang, CY Chuang… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper addresses text-supervised semantic segmentation, aiming to learn a model
capable of segmenting arbitrary visual concepts within images by using only image-text …

Self-guided open-vocabulary semantic segmentation

O Ülger, M Kulicki, Y Asano, MR Oswald - arXiv preprint arXiv:2312.04539, 2023 - arxiv.org
Vision-Language Models (VLMs) have emerged as promising tools for open-ended image
understanding tasks, including open vocabulary segmentation. Yet, direct application of …

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

Y Wang, R Sun, N Luo, Y Pan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary
categories specified by class labels or captions. However, most previous best-performing …

TagAlign: Improving vision-language alignment with multi-tag classification

Q Liu, K Zheng, W Wei, Z Tong, Y Liu, W Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
The crux of learning vision-language models is to extract semantically aligned information
from visual and linguistic data. Existing attempts usually face the problem of coarse …

Multi-modal recursive prompt learning with mixup embedding for generalization recognition

Y Jia, X Ye, Y Liu, S Guo - Knowledge-Based Systems, 2024 - Elsevier
The contrastive language-image pretraining (CLIP) model has shown promise in
generalization recognition by combining visual and textual embeddings. However, the …