Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …

WINNER: Weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding

M Li, H Wang, W Zhang, J Miao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a
language query. Existing techniques achieve such alignment by exploiting dense boundary …

Where does it exist: Spatio-temporal video grounding for multi-form sentences

Z Zhang, Z Zhao, Y Zhao, Q Wang… - Proceedings of the …, 2020 - openaccess.thecvf.com
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form
Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence …

End-to-end modeling via information tree for one-shot natural language spatial video grounding

M Li, T Wang, H Zhang, S Zhang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Natural language spatial video grounding aims to detect the relevant objects in video frames
with descriptive sentences as the query. In spite of the great advances, most existing …

Weakly-supervised video object grounding by exploring spatio-temporal contexts

X Yang, X Liu, M Jian, X Gao, M Wang - Proceedings of the 28th ACM …, 2020 - dl.acm.org
Grounding objects in visual context from natural language queries is a crucial yet
challenging vision-and-language task, which has gained increasing attention in recent …

Visual relation grounding in videos

J Xiao, X Shang, X Yang, S Tang, TS Chua - Computer Vision–ECCV …, 2020 - Springer
In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV).
The task aims at spatio-temporally localizing the given relations in the form of subject …

Look at what I'm doing: Self-supervised spatial grounding of narrations in instructional videos

R Tan, B Plummer, K Saenko, H Jin… - Advances in Neural …, 2021 - proceedings.neurips.cc
We introduce the task of spatially localizing narrated interactions in videos. Key to our
approach is the ability to learn to spatially localize interactions with self-supervision on a …

Refer-it-in-RGBD: A bottom-up approach for 3D visual grounding in RGBD images

H Liu, A Lin, X Han, L Yang, Y Yu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Grounding referring expressions in RGBD images has been an emerging field. We present a
novel task of 3D visual grounding in single-view RGBD image where the referred objects are …

Correspondence matters for video referring expression comprehension

M Cao, J Jiang, L Chen, Y Zou - Proceedings of the 30th ACM …, 2022 - dl.acm.org
We investigate the problem of video Referring Expression Comprehension (REC), which
aims to localize the referent objects described in the sentence to visual regions in the video …