Learning to prompt for vision-language models

Y Song, T Wang, P Cai, SK Mondal… - ACM Computing Surveys, 2023 - dl.acm.org

Few-shot learning (FSL) has emerged as an effective learning method and shows great
potential. Despite the recent creative works in tackling FSL tasks, learning valid information …

被引用次数：190 相关文章所有 3 个版本

[HTML] sciencedirect.com

[HTML][HTML] Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier

Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

被引用次数：64 相关文章所有 4 个版本

[PDF] arxiv.org

An image is worth one word: Personalizing text-to-image generation using textual inversion

R Gal, Y Alaluf, Y Atzmon, O Patashnik… - arXiv preprint arXiv …, 2022 - arxiv.org

Text-to-image models offer unprecedented freedom to guide creation through natural
language. Yet, it is unclear how such freedom can be exercised to generate images of …

被引用次数：986 相关文章所有 4 个版本

[PDF] arxiv.org

Llama-adapter: Efficient fine-tuning of language models with zero-init attention

R Zhang, J Han, C Liu, P Gao, A Zhou, X Hu… - arXiv preprint arXiv …, 2023 - arxiv.org

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …

被引用次数：429 相关文章所有 3 个版本

[PDF] thecvf.com

Maple: Multi-modal prompt learning

MU Khattak, H Rasheed, M Maaz… - Proceedings of the …, 2023 - openaccess.thecvf.com

Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization
ability to downstream tasks. However, they are sensitive to the choice of input text prompts …

被引用次数：335 相关文章所有 10 个版本

[PDF] arxiv.org

Llama-adapter v2: Parameter-efficient visual instruction model

P Gao, J Han, R Zhang, Z Lin, S Geng, A Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org

How to efficiently transform large language models (LLMs) into instruction followers is
recently a popular research direction, while training LLM for multi-modal reasoning remains …

被引用次数：336 相关文章所有 3 个版本

[PDF] thecvf.com

Open-vocabulary semantic segmentation with mask-adapted clip

F Liang, B Wu, X Dai, K Li, Y Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com

Open-vocabulary semantic segmentation aims to segment an image into semantic regions
according to text descriptions, which may not have been seen during training. Recent two …

被引用次数：263 相关文章所有 9 个版本

[PDF] arxiv.org

Visual prompt tuning

M Jia, L Tang, BC Chen, C Cardie, S Belongie… - … on Computer Vision, 2022 - Springer

The current modus operandi in adapting pre-trained models involves updating all the
backbone parameters, ie., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) …

被引用次数：1050 相关文章所有 7 个版本

[PDF] thecvf.com

Oneformer: One transformer to rule universal image segmentation

J Jain, J Li, MT Chiu, A Hassani… - Proceedings of the …, 2023 - openaccess.thecvf.com

Abstract Universal Image Segmentation is not a new concept. Past attempts to unify image
segmentation include scene parsing, panoptic segmentation, and, more recently, new …

被引用次数：197 相关文章所有 8 个版本

[PDF] neurips.cc

Generating images with multimodal language models

JY Koh, D Fried… - Advances in Neural …, 2024 - proceedings.neurips.cc

We propose a method to fuse frozen text-only large language models (LLMs) with pre-
trained image encoder and decoder models, by mapping between their embedding spaces …

被引用次数：118 相关文章所有 8 个版本