A real-time cross-modality correlation filtering method for referring expression comprehension

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 186 · Related articles · All 7 versions

Universal instance perception as object discovery and retrieval

B Yan, Y Jiang, J Wu, D Wang, P Luo… - Proceedings of the …, 2023 - openaccess.thecvf.com

All instance perception tasks aim at finding certain objects specified by some queries such
as category names, language expressions, and target annotations, but this complete field …

Cited by 150 · Related articles · All 5 versions

GRES: Generalized referring expression segmentation

C Liu, H Ding, X Jiang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com

Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …

Cited by 140 · Related articles · All 6 versions

Referring multi-object tracking

D Wu, W Han, T Wang, X Dong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Existing referring understanding tasks tend to involve the detection of a single text-referred
object. In this paper, we propose a new and general referring understanding task, termed …

Cited by 66 · Related articles · All 5 versions

TransVG: End-to-end visual grounding with transformers

J Deng, Z Yang, T Chen, W Zhou… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …

Cited by 357 · Related articles · All 6 versions

VLT: Vision-language transformer and query generation for referring segmentation

H Ding, C Liu, S Wang, X Jiang - IEEE Transactions on Pattern …, 2022 - ieeexplore.ieee.org

We propose a Vision-Language Transformer (VLT) framework for referring segmentation to
facilitate deep interactions among multi-modal information and enhance the holistic …

Cited by 120 · Related articles · All 7 versions

SeqTR: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …

Cited by 146 · Related articles · All 5 versions

PolyFormer: Referring image segmentation as sequential polygon generation

J Liu, H Ding, Z Cai, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of
referring image segmentation is formulated as sequential polygon generation, and the …

Cited by 116 · Related articles · All 9 versions

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com

Visual grounding is a task to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

Cited by 123 · Related articles · All 7 versions

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …

Cited by 104 · Related articles · All 10 versions