Agent attention: On the integration of softmax and linear attention

D Han, T Ye, Y Han, Z Xia, S Pan, P Wan… - European Conference on Computer Vision, 2025 - Springer
The attention module is the key component in Transformers. While the global attention
mechanism offers high expressiveness, its excessive computational cost restricts its …

PSALM: Pixelwise segmentation with large multi-modal model

Z Zhang, Y Ma, E Zhang, X Bai - European Conference on Computer Vision, 2025 - Springer
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the
challenges of segmentation tasks. To overcome the LMM's limitation of being restricted to textual …

The (r)evolution of multimodal large language models: A survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding

T Zhang, X Li, H Fei, H Yuan, S Wu, S Ji… - arXiv preprint arXiv …, 2024 - arxiv.org
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

Efficient diffusion transformer with step-wise dynamic attention mediators

Y Pu, Z Xia, J Guo, D Han, Q Li, D Li, Y Yuan… - European Conference on Computer Vision, 2025 - Springer
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …

Multi-object hallucination in vision-language models

X Chen, Z Ma, X Zhang, S Xu, S Qian, J Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) often suffer from object hallucination, producing
objects not present in the given images. While current benchmarks for object hallucination …

EVF-SAM: Early vision-language fusion for text-prompted Segment Anything Model

Y Zhang, T Cheng, R Hu, L Liu, H Liu, L Ran… - arXiv preprint arXiv …, 2024 - arxiv.org
Segment Anything Model (SAM) has attracted widespread attention for its superior
interactive segmentation capabilities with visual prompts, yet it lacks further exploration of …

3D-GRES: Generalized 3D referring expression segmentation

C Wu, Y Liu, J Ji, Y Ma, H Wang, G Luo… - Proceedings of the …, 2024 - dl.acm.org
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific
instance within a 3D space based on a natural language description. However, current …

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

YC Chen, WH Li, C Sun, YCF Wang… - European Conference on Computer Vision, 2025 - Springer
We introduce SAM4MLLM, an innovative approach which integrates the Segment Anything
Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our …

Pseudo-RIS: Distinctive Pseudo-Supervision Generation for Referring Image Segmentation

S Yu, PH Seo, J Son - European Conference on Computer Vision, 2025 - Springer
We propose a new framework that automatically generates high-quality segmentation masks
with their referring expressions as pseudo supervisions for referring image segmentation …