Agent Attention: On the Integration of Softmax and Linear Attention
The attention module is the key component in Transformers. While the global attention
mechanism offers high expressiveness, its excessive computational cost restricts its …
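The core idea named in the title, combining softmax attention's expressiveness with linear attention's efficiency via a small set of agent tokens, can be illustrated with a minimal PyTorch sketch. The pooling-based agent construction, tensor shapes, and agent count below are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, num_agents=49):
    # q, k, v: (batch, N, dim). A small set of agent tokens mediates
    # between queries and keys, so the cost is O(N * num_agents)
    # rather than the O(N^2) of full softmax attention.
    b, n, d = q.shape
    # Agents as pooled queries (num_agents << N); an assumption here.
    a = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
    # Agent aggregation: agents run softmax attention over keys/values.
    agent_v = F.softmax(a @ k.transpose(1, 2) / d**0.5, dim=-1) @ v
    # Agent broadcast: queries run softmax attention over the agents.
    return F.softmax(q @ a.transpose(1, 2) / d**0.5, dim=-1) @ agent_v
```

Because both stages are softmax attentions of size N x num_agents, the composition behaves like a generalized linear attention while retaining softmax's nonlinearity.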
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the
challenges of segmentation tasks. To overcome the limitation of LMMs being confined to textual …
The (R)Evolution of Multimodal Large Language Models: A Survey
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators
This paper identifies significant redundancy in the query-key interactions within self-attention
mechanisms of diffusion transformer models, particularly during the early stages of …
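In the same spirit, routing the query-key interaction through a few mediator tokens whose count varies across denoising steps can be sketched as below; the pooled mediator construction and the linear step schedule are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def mediator_attention(q, k, v, num_mediators):
    # Mediator tokens compress the query-key interaction, cutting the
    # attention cost from O(N^2) to O(N * num_mediators).
    d = q.shape[-1]
    # Mediators as pooled keys (an assumption; they could be learned).
    m = F.adaptive_avg_pool1d(k.transpose(1, 2), num_mediators).transpose(1, 2)
    kv = F.softmax(m @ k.transpose(1, 2) / d**0.5, dim=-1) @ v
    return F.softmax(q @ m.transpose(1, 2) / d**0.5, dim=-1) @ kv

def mediators_for_step(step, total_steps, n_min=16, n_max=64):
    # Step-wise schedule (illustrative): fewer mediators during early,
    # high-redundancy denoising steps, more as fine details emerge.
    frac = step / max(total_steps - 1, 1)
    return int(n_min + frac * (n_max - n_min))
```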
Multi-Object Hallucination in Vision-Language Models
Large vision language models (LVLMs) often suffer from object hallucination, producing
objects not present in the given images. While current benchmarks for object hallucination …
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Segment Anything Model (SAM) has attracted widespread attention for its superior
interactive segmentation capabilities with visual prompts, while lacking further exploration of …
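The "early fusion" in the title contrasts with prompting SAM late with separately encoded text features: text and image tokens are jointly encoded before any mask decoding. A minimal sketch under that reading, where the encoder depth, dimensions, and the projection into a SAM-style prompt space are all assumptions:

```python
import torch
import torch.nn as nn

class EarlyFusionPrompter(nn.Module):
    # Hedged sketch: one encoder sees text and image tokens together
    # (early fusion), and its output is projected into a prompt
    # embedding that a SAM-style mask decoder could consume.
    def __init__(self, dim=768, prompt_dim=256, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_prompt = nn.Linear(dim, prompt_dim)

    def forward(self, text_tokens, image_tokens):
        # Concatenate modalities first, then encode them jointly.
        fused = self.fusion_encoder(torch.cat([text_tokens, image_tokens], dim=1))
        # Use the first (text) token as the sparse prompt embedding.
        return self.to_prompt(fused[:, 0])
```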
3D-GRES: Generalized 3D Referring Expression Segmentation
3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific
instance within a 3D space based on a natural language description. However, current …
SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation
We introduce SAM4MLLM, an innovative approach that integrates the Segment Anything
Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our …
Pseudo-RIS: Distinctive Pseudo-Supervision Generation for Referring Image Segmentation
We propose a new framework that automatically generates high-quality segmentation masks
with their referring expressions as pseudo supervisions for referring image segmentation …