Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
TubeDETR: Spatio-temporal video grounding with transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …
WINNER: Weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding
Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a
language query. Existing techniques achieve such alignment by exploiting dense boundary …
Where does it exist: Spatio-temporal video grounding for multi-form sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form
Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence …
End-to-end modeling via information tree for one-shot natural language spatial video grounding
Natural language spatial video grounding aims to detect the relevant objects in video frames
with descriptive sentences as the query. In spite of the great advances, most existing …
Weakly-supervised video object grounding by exploring spatio-temporal contexts
Grounding objects in visual context from natural language queries is a crucial yet
challenging vision-and-language task, which has gained increasing attention in recent …
Visual relation grounding in videos
In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV).
The task aims at spatio-temporally localizing the given relations in the form of subject …
Look at what I'm doing: Self-supervised spatial grounding of narrations in instructional videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our
approach is the ability to learn to spatially localize interactions with self-supervision on a …
Refer-it-in-RGBD: A bottom-up approach for 3D visual grounding in RGBD images
Grounding referring expressions in RGBD images has been an emerging field. We present a
novel task of 3D visual grounding in single-view RGBD image where the referred objects are …
Correspondence matters for video referring expression comprehension
We investigate the problem of video Referring Expression Comprehension (REC), which
aims to localize the referent objects described in the sentence to visual regions in the video …