Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
Unified-IO: A unified model for vision, language, and multi-modal tasks
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …
Universal instance perception as object discovery and retrieval
All instance perception tasks aim at finding certain objects specified by some queries such
as category names, language expressions, and target annotations, but this complete field …
X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
Grounded SAM: Assembling open-world models for diverse visual tasks
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to
combine with the Segment Anything Model (SAM). This integration enables the detection and …
PolyFormer: Referring image segmentation as sequential polygon generation
In this work, instead of directly predicting the pixel-level segmentation masks, the problem of
referring image segmentation is formulated as sequential polygon generation, and the …
Multi3DRefer: Grounding text description to multiple 3D objects
We introduce the task of localizing a flexible number of objects in real-world 3D scenes
using natural language descriptions. Existing 3D visual grounding tasks focus on localizing …
Contextual object detection with multimodal large language models
Recent Multimodal Large Language Models (MLLMs) are remarkable at vision-language tasks, such as image captioning and question answering, but lack the essential …
A survey on open-vocabulary detection and segmentation: Past, present, and future
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …
Florence-2: Advancing a unified representation for a variety of vision tasks
We introduce Florence-2, a novel vision foundation model with a unified prompt-based
representation for various computer vision and vision-language tasks. While existing large …