Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
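
As described in the full paper, Vid2Seq casts dense event captioning as generating a single token sequence in which event boundaries are discretized into special time tokens interleaved with caption text. The helper below is an illustrative sketch of how such a target sequence could be built; the bin count and token format are assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code): build a Vid2Seq-style target
# sequence where event boundaries become discrete time tokens interleaved
# with caption text. Bin count and token format are assumptions.

def time_token(t_seconds: float, duration: float, num_bins: int = 100) -> str:
    """Quantize a timestamp into one of `num_bins` special time tokens."""
    bin_id = min(int(t_seconds / duration * num_bins), num_bins - 1)
    return f"<time_{bin_id}>"

def build_target_sequence(events, duration: float) -> str:
    """events: list of (start_s, end_s, caption) tuples for one video."""
    parts = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        parts += [time_token(start, duration), time_token(end, duration), caption]
    return " ".join(parts)

print(build_target_sequence(
    [(2.0, 7.5, "a person opens the fridge"), (8.0, 14.0, "they pour a glass of milk")],
    duration=20.0,
))
# <time_10> <time_37> a person opens the fridge <time_40> <time_70> they pour a glass of milk
```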

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …
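
Unified-IO's general recipe is to serialize heterogeneous task outputs into one discrete token sequence, so detection, dense prediction, and text targets share an output space. The sketch below illustrates that idea for a bounding box; the coordinate binning and token names are assumptions, not the model's actual vocabulary.

```python
# Minimal sketch, assuming the general recipe of serializing different task
# outputs into one shared discrete vocabulary. Binning and token names are assumed.

NUM_LOCATION_BINS = 1000  # assumed coordinate quantization granularity

def location_token(value: float, extent: float) -> str:
    """Map a pixel coordinate to a discrete <loc_i> token."""
    bin_id = min(int(value / extent * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
    return f"<loc_{bin_id}>"

def serialize_detection(label: str, box, image_w: int, image_h: int) -> str:
    """Turn a labelled box into a token string in the shared output space."""
    x1, y1, x2, y2 = box
    return " ".join([
        location_token(x1, image_w), location_token(y1, image_h),
        location_token(x2, image_w), location_token(y2, image_h),
        label,
    ])

# A detection target and a captioning target live in the same token space.
print(serialize_detection("dog", (48, 120, 310, 400), image_w=640, image_h=480))
print("a dog running on the beach")
```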

Universal instance perception as object discovery and retrieval

B Yan, Y Jiang, J Wu, D Wang, P Luo… - Proceedings of the …, 2023 - openaccess.thecvf.com
All instance perception tasks aim at finding certain objects specified by some queries such
as category names, language expressions, and target annotations, but this complete field …

X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval

Y Ma, G Xu, X Sun, M Yan, J Zhang, R Ji - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
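
The "multi-grained" contrast in the title refers to scoring video-text pairs at several granularities (video-sentence, video-word, frame-sentence, frame-word). The snippet below is a sketch of that idea on raw features; the pooling and the equal-weight aggregation are assumptions rather than the paper's exact scheme.

```python
# Sketch of multi-grained video-text similarity in the spirit of X-CLIP:
# coarse (video-sentence) and fine-grained (frame-word) scores are combined.
# Pooling and aggregation weights here are assumptions.
import numpy as np

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def multi_grained_similarity(frame_feats, word_feats):
    """frame_feats: (num_frames, d); word_feats: (num_words, d)."""
    frames, words = l2norm(frame_feats), l2norm(word_feats)
    video, sentence = l2norm(frames.mean(0)), l2norm(words.mean(0))

    video_sentence = float(video @ sentence)              # coarsest grain
    video_word = (video @ words.T).max()                  # video vs each word
    frame_sentence = (frames @ sentence).max()            # each frame vs sentence
    frame_word = (frames @ words.T).max(axis=1).mean()    # finest grain

    return np.mean([video_sentence, video_word, frame_sentence, frame_word])

rng = np.random.default_rng(0)
score = multi_grained_similarity(rng.normal(size=(8, 512)), rng.normal(size=(6, 512)))
print(f"retrieval score: {score:.3f}")
```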

Grounded SAM: Assembling open-world models for diverse visual tasks

T Ren, S Liu, A Zeng, J Lin, K Li, H Cao, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Grounded SAM, which combines Grounding DINO as an open-set object detector
with the Segment Anything Model (SAM). This integration enables the detection and …
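
The pipeline described above is a two-stage assembly: an open-set detector turns a text prompt into boxes, and SAM turns each box into a mask. The sketch below shows that control flow only; `detect_boxes` and `segment_box` are hypothetical stand-ins for the real Grounding DINO and SAM calls, not the released API.

```python
# Minimal sketch of the Grounded SAM assembly: text prompt -> boxes -> masks.
# `detect_boxes` and `segment_box` are hypothetical stand-ins for the real models.
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def grounded_segmentation(
    image: np.ndarray,
    text_prompt: str,
    detect_boxes: Callable[[np.ndarray, str], List[Box]],  # stand-in: Grounding DINO
    segment_box: Callable[[np.ndarray, Box], np.ndarray],  # stand-in: SAM box prompt
) -> List[dict]:
    """Return one {'box', 'mask'} record per phrase-matched region."""
    results = []
    for box in detect_boxes(image, text_prompt):  # open-set detection from text
        mask = segment_box(image, box)            # box-prompted segmentation
        results.append({"box": box, "mask": mask})
    return results

# Toy stand-ins so the sketch runs end to end without the real checkpoints.
dummy_detect = lambda img, prompt: [(10.0, 10.0, 50.0, 60.0)]
dummy_segment = lambda img, box: np.zeros(img.shape[:2], dtype=bool)

masks = grounded_segmentation(np.zeros((128, 128, 3), np.uint8), "a red mug",
                              dummy_detect, dummy_segment)
print(len(masks), masks[0]["mask"].shape)
```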

PolyFormer: Referring image segmentation as sequential polygon generation

J Liu, H Ding, Z Cai, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of
referring image segmentation is formulated as sequential polygon generation, and the …
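
Under this formulation, the referred region's polygon vertices become an ordered prediction target that a sequence decoder can emit vertex by vertex. The helper below sketches only the serialization step; the separator and end markers and the vertex ordering are illustrative assumptions, not PolyFormer's exact scheme.

```python
# Sketch of "segmentation as sequential polygon generation": flatten polygon
# vertices into an ordered generation target. Markers and ordering are assumed.
from typing import List, Tuple

Vertex = Tuple[float, float]
SEP, EOS = "<SEP>", "<EOS>"  # assumed markers between polygons / end of sequence

def polygons_to_sequence(polygons: List[List[Vertex]]) -> list:
    """Flatten one or more polygons into a single generation target."""
    seq = []
    for i, poly in enumerate(polygons):
        if i > 0:
            seq.append(SEP)        # multi-part regions are separated
        for x, y in poly:
            seq.extend([x, y])     # vertices emitted in traversal order
    seq.append(EOS)
    return seq

target = polygons_to_sequence([[(12.0, 30.5), (80.0, 28.0), (75.5, 90.0)]])
print(target)  # [12.0, 30.5, 80.0, 28.0, 75.5, 90.0, '<EOS>']
```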

Multi3DRefer: Grounding text description to multiple 3D objects

Y Zhang, ZM Gong, AX Chang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We introduce the task of localizing a flexible number of objects in real-world 3D scenes
using natural language descriptions. Existing 3D visual grounding tasks focus on localizing …

Contextual object detection with multimodal large language models

Y Zang, W Li, J Han, K Zhou, CC Loy - International Journal of Computer …, 2024 - Springer
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language
tasks, such as image captioning and question answering, but lack the essential …

A survey on open-vocabulary detection and segmentation: Past, present, and future

C Zhu, L Chen - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …

Florence-2: Advancing a unified representation for a variety of vision tasks

B Xiao, H Wu, W Xu, X Dai, H Hu, Y Lu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce Florence-2, a novel vision foundation model with a unified prompt-based
representation for various computer vision and vision-language tasks. While existing large …
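
"Prompt-based" here means that one image-conditioned seq2seq model serves many tasks, with the task identity carried entirely by a text prompt. The sketch below illustrates only that interface idea; the prompt strings and the `fake_model` callable are hypothetical stand-ins, not the released Florence-2 API.

```python
# Conceptual sketch of a unified prompt-based interface: one model, many tasks,
# selected by a task prompt. Prompt tokens and the model callable are hypothetical.
from typing import Callable
import numpy as np

def describe_image(image: np.ndarray, task_prompt: str,
                   run_model: Callable[[np.ndarray, str], str]) -> str:
    """Single entry point: the task is specified only by the prompt text."""
    return run_model(image, task_prompt)

# Hypothetical stand-in for the real model; it only echoes the requested task.
fake_model = lambda img, prompt: f"[output for {prompt}]"

image = np.zeros((224, 224, 3), np.uint8)
for prompt in ("<CAPTION>", "<OD>", "<OCR>"):  # assumed task-prompt tokens
    print(prompt, "->", describe_image(image, prompt, fake_model))
```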