Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Universal instance perception as object discovery and retrieval
All instance perception tasks aim at finding certain objects specified by some queries such
as category names, language expressions, and target annotations, but this complete field …
GRES: Generalized referring expression segmentation
Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …
Referring multi-object tracking
Existing referring understanding tasks tend to involve the detection of a single text-referred
object. In this paper, we propose a new and general referring understanding task, termed …
TransVG: End-to-end visual grounding with transformers
In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …
VLT: Vision-language transformer and query generation for referring segmentation
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …
SeqTR: A simple yet universal network for visual grounding
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …
PolyFormer: Referring image segmentation as sequential polygon generation
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of
referring image segmentation is formulated as sequential polygon generation, and the …
Improving visual grounding with visual-linguistic verification and iterative reasoning
Visual grounding is a task to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …
TubeDETR: Spatio-temporal video grounding with transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …