Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Universal instance perception as object discovery and retrieval

B Yan, Y Jiang, J Wu, D Wang, P Luo… - Proceedings of the …, 2023 - openaccess.thecvf.com
All instance perception tasks aim to find certain objects specified by queries such
as category names, language expressions, and target annotations, but this complete field …
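
To make the retrieval view of instance perception concrete, the sketch below (not the authors' code; the embedding size, threshold, and the retrieve_instances helper are illustrative assumptions) scores candidate instance embeddings against a single prompt embedding and keeps the instances above a similarity threshold.

```python
# A minimal sketch of "perception as retrieval": candidate instances and a prompt
# (category name, language expression, or target annotation) are embedded into a
# shared space, and the referred instances are retrieved by similarity.
import torch
import torch.nn.functional as F

def retrieve_instances(instance_embs: torch.Tensor,   # (N, D): one embedding per candidate instance
                       prompt_emb: torch.Tensor,      # (D,): embedding of the query prompt
                       threshold: float = 0.5):       # illustrative cut-off, not from the paper
    # Cosine similarity between every instance embedding and the prompt embedding.
    sims = F.cosine_similarity(instance_embs, prompt_emb.unsqueeze(0), dim=-1)  # (N,)
    # Instances whose similarity exceeds the threshold are returned as the referred objects.
    keep = sims > threshold
    return keep.nonzero(as_tuple=True)[0], sims[keep]

# Example: 5 candidate instances with 256-d embeddings and one prompt embedding.
indices, scores = retrieve_instances(torch.randn(5, 256), torch.randn(256))
print(indices, scores)
```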

GRES: Generalized referring expression segmentation

C Liu, H Ding, X Jiang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …
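
Generalized RES allows an expression to refer to several objects or to none, so a prediction must be compared against a possibly empty, possibly multi-instance mask. The sketch below illustrates such a comparison, assuming the no-target case is represented by an empty mask; the function name and the empty-mask convention are illustrative, not the benchmark's official protocol.

```python
# A minimal sketch of scoring a predicted mask in a generalized RES setting, where the
# ground truth may cover several instances (their union) or no pixels at all.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks; both being empty counts as a correct no-target prediction."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not gt.any() and not pred.any():
        return 1.0  # correctly predicted "no target"
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

# Multi-target expression: the ground truth is the union of all referred instance masks.
gt = np.zeros((4, 4), dtype=bool); gt[0, :2] = True; gt[3, 2:] = True
pred = np.zeros((4, 4), dtype=bool); pred[0, :2] = True
print(mask_iou(pred, gt))  # 0.5
```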

Referring multi-object tracking

D Wu, W Han, T Wang, X Dong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing referring understanding tasks tend to involve the detection of a single text-referred
object. In this paper, we propose a new and general referring understanding task, termed …

TransVG: End-to-end visual grounding with transformers

J Deng, Z Yang, T Chen, W Zhou… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …
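
A minimal PyTorch sketch of this kind of formulation is given below: visual tokens, language tokens, and a learnable [REG] token are fused by a transformer encoder, and the box is regressed directly from the [REG] output. The layer sizes, the GroundingHead name, and the MLP head are illustrative assumptions rather than the published configuration.

```python
# A minimal sketch of a TransVG-style grounding head: fuse visual and language tokens
# together with a learnable [REG] token, then regress a normalized box from [REG].
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    def __init__(self, dim=256, heads=8, layers=6):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.box_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, D), text_tokens: (B, Nt, D) from separate encoders.
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([reg, visual_tokens, text_tokens], dim=1))
        # Predict a normalized (cx, cy, w, h) box from the fused [REG] token.
        return self.box_mlp(fused[:, 0]).sigmoid()

head = GroundingHead()
print(head(torch.randn(2, 400, 256), torch.randn(2, 20, 256)).shape)  # torch.Size([2, 4])
```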

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org
We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …
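
The sketch below illustrates the query-generation idea in a heavily simplified form: several query vectors are derived from the word features, each mixing the words with different weights, and can then drive a decoder over the visual features. It conditions only on language, whereas the paper also conditions on vision; all names and sizes here are assumptions.

```python
# A simplified sketch of generating multiple query vectors from word features,
# each emphasizing a different weighting of the words in the expression.
import torch
import torch.nn as nn

class QueryGenerator(nn.Module):
    def __init__(self, dim=256, num_queries=16):
        super().__init__()
        # One learnable mixing weight per query over the language tokens (an assumption).
        self.word_attn = nn.Linear(dim, num_queries)

    def forward(self, lang_feats):                                    # (B, L, D) word features
        weights = self.word_attn(lang_feats).softmax(dim=1)           # (B, L, Q)
        return torch.einsum('blq,bld->bqd', weights, lang_feats)      # (B, Q, D) query vectors

gen = QueryGenerator()
print(gen(torch.randn(2, 20, 256)).shape)  # torch.Size([2, 16, 256])
```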

SeqTR: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …
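
The sequence formulation can be illustrated by how continuous coordinates become discrete tokens and back; the sketch below shows such a quantization round trip, assuming a bin count of 1000 (an illustrative choice, not necessarily the paper's).

```python
# A minimal sketch of expressing boxes or contour points as a sequence of discrete
# coordinate tokens, as in sequence-based grounding formulations.
NUM_BINS = 1000  # illustrative assumption

def coords_to_tokens(points, img_w, img_h, num_bins=NUM_BINS):
    """Quantize (x, y) points, e.g. two box corners or polygon vertices, into token ids."""
    tokens = []
    for x, y in points:
        tokens.append(min(int(x / img_w * num_bins), num_bins - 1))
        tokens.append(min(int(y / img_h * num_bins), num_bins - 1))
    return tokens

def tokens_to_coords(tokens, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization back to continuous image coordinates (bin centers)."""
    xs = [(t + 0.5) / num_bins * img_w for t in tokens[0::2]]
    ys = [(t + 0.5) / num_bins * img_h for t in tokens[1::2]]
    return list(zip(xs, ys))

# A box given by its top-left and bottom-right corners in a 640x480 image.
tokens = coords_to_tokens([(120, 80), (360, 400)], 640, 480)
print(tokens, tokens_to_coords(tokens, 640, 480))
```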

PolyFormer: Referring image segmentation as sequential polygon generation

J Liu, H Ding, Z Cai, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of
referring image segmentation is formulated as sequential polygon generation, and the …
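
The last step of such a pipeline, turning a predicted vertex sequence into a binary mask, can be sketched as below; the vertex order, image size, and use of PIL for rasterization are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of rasterizing a predicted polygon vertex sequence into a
# binary segmentation mask.
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(vertices, height, width):
    """Rasterize an (x, y) vertex sequence into a binary segmentation mask."""
    canvas = Image.new('L', (width, height), 0)
    ImageDraw.Draw(canvas).polygon([tuple(v) for v in vertices], outline=1, fill=1)
    return np.array(canvas, dtype=bool)

mask = polygon_to_mask([(10, 10), (60, 12), (55, 50), (12, 45)], height=64, width=64)
print(mask.sum())  # number of foreground pixels
```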

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding is the task of locating the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …
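
A spatio-temporal tube can be represented as one box per frame over a start/end window; the sketch below scores a predicted tube against a ground-truth tube by averaging per-frame IoU over the union of their frames, a common vIoU-style measure (the exact evaluation protocol here is an assumption, not taken from the paper).

```python
# A minimal sketch of a spatio-temporal tube (frame index -> box) and a vIoU-style score.
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tube_iou(pred, gt):
    """pred, gt: dicts mapping frame index -> box; average IoU over the union of frames."""
    frames = set(pred) | set(gt)
    scores = [box_iou(pred[f], gt[f]) if f in pred and f in gt else 0.0 for f in frames]
    return sum(scores) / len(scores) if scores else 0.0

pred = {3: (10, 10, 50, 50), 4: (12, 10, 52, 50)}
gt   = {3: (10, 10, 50, 50), 4: (12, 10, 52, 50), 5: (14, 10, 54, 50)}
print(tube_iou(pred, gt))  # two matching frames out of three -> ~0.667
```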