Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Segment everything everywhere all at once

X Zou, J Yang, H Zhang, F Li, L Li… - Advances in …, 2024 - proceedings.neurips.cc
In this work, we present SEEM, a promptable and interactive model for segmenting
everything everywhere all at once in an image. In SEEM, we propose a novel and versatile …

Segment and Recognize Anything at Any Granularity

F Li, H Zhang, P Sun, X Zou, S Liu, C Li, J Yang… - … on Computer Vision, 2025 - Springer
In this work, we introduce Semantic-SAM, an augmented image segmentation foundation model for
segmenting and recognizing anything at desired granularities. Compared to the …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

Llava-plus: Learning to use tools for creating multimodal agents

S Liu, H Cheng, H Liu, H Zhang, F Li, T Ren… - … on Computer Vision, 2025 - Springer
Abstract This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug
and Learn to Use Skills), a general-purpose multimodal assistant trained using an end-to …

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

OMG-Seg: Is one model good enough for all segmentation?

X Li, H Yuan, W Li, H Ding, S Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this work, we address various segmentation tasks, each traditionally tackled by distinct or
partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently …

Llava-grounding: Grounded visual chat with large multimodal models

H Zhang, H Li, F Li, T Ren, X Zou, S Liu… - … on Computer Vision, 2025 - Springer
With the recent significant advancements in large multimodal models (LMMs), the
importance of their grounding capability in visual chat is increasingly recognized. Despite …

Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

Grounded sam: Assembling open-world models for diverse visual tasks

T Ren, S Liu, A Zeng, J Lin, K Li, H Cao, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector in
combination with the Segment Anything Model (SAM). This integration enables the detection and …
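The detector-then-segmenter composition the Grounded SAM abstract describes can be sketched in a few lines. The function names below (`detect_boxes`, `segment_box`, `grounded_segment`) are hypothetical stand-ins, not the actual Grounded SAM, Grounding DINO, or SAM APIs; this is a minimal sketch of the pipeline shape, assuming a text-conditioned box detector feeding a box-prompted mask predictor.

```python
# Sketch of the Grounded SAM pipeline: an open-set detector (Grounding DINO)
# proposes boxes for a text prompt, and a promptable segmenter (SAM) turns
# each box into a mask. All functions here are illustrative stand-ins.

def detect_boxes(image, text_prompt):
    """Stand-in for Grounding DINO: return (x0, y0, x1, y1) boxes matching the prompt."""
    h, w = len(image), len(image[0])
    return [(0, 0, w // 2, h // 2)]  # placeholder detection covering one quadrant

def segment_box(image, box):
    """Stand-in for SAM: return a binary mask for the region prompted by the box."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0
             for x in range(len(image[0]))] for y in range(len(image))]

def grounded_segment(image, text_prompt):
    """Compose the two stages: text prompt -> boxes -> one mask per detected box."""
    return [segment_box(image, box) for box in detect_boxes(image, text_prompt)]

image = [[0] * 8 for _ in range(8)]  # dummy 8x8 image
masks = grounded_segment(image, "a cat")
```

The point of the composition is that neither model needs retraining: the detector grounds free-form text into boxes, and the segmenter only ever sees geometric prompts.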