Object detection in 20 years: A survey
Object detection, as one of the most fundamental and challenging problems in computer
vision, has received great attention in recent years. Over the past two decades, we have …
Maple: Multi-modal prompt learning
Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization
ability to downstream tasks. However, they are sensitive to the choice of input text prompts …
Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Self-regulating prompts: Foundational model adaptation without forgetting
Prompt learning has emerged as an efficient alternative for fine-tuning foundational models,
such as CLIP, for various downstream tasks. Conventionally trained using the task-specific …
Detecting everything in the open world: Towards universal object detection
In this paper, we formally address universal object detection, which aims to detect every
scene and predict every category. The dependence on human annotations, the limited …
RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model
Leveraging the extensive training data from SA-1B, the segment anything model (SAM)
demonstrates remarkable generalization and zero-shot capabilities. However, as a category …
Region-aware pretraining for open-vocabulary object detection with vision transformers
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a
contrastive image-text pretraining recipe to bridge the gap between image-level pretraining …
Codet: Co-occurrence guided region-word alignment for open-vocabulary object detection
Deriving reliable region-word alignment from image-text pairs is critical to learning object-level
vision-language representations for open-vocabulary object detection. Existing methods …
Towards open vocabulary learning: A survey
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …