Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
TubeDETR: Spatio-temporal video grounding with transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …
WINNER: Weakly-supervised hierarchical decomposition and alignment for spatio-temporal video grounding
Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a
language query. Existing techniques achieve such alignment by exploiting dense boundary …
Where does it exist: Spatio-temporal video grounding for multi-form sentences
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form
Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence …
End-to-end modeling via information tree for one-shot natural language spatial video grounding
Natural language spatial video grounding aims to detect the relevant objects in video frames
with descriptive sentences as the query. In spite of the great advances, most existing …
Weakly-supervised video object grounding by exploring spatio-temporal contexts
Grounding objects in visual context from natural language queries is a crucial yet
challenging vision-and-language task, which has gained increasing attention in recent …
Visual relation grounding in videos
In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV).
The task aims at spatio-temporally localizing the given relations in the form of subject …
Look at what I'm doing: Self-supervised spatial grounding of narrations in instructional videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our
approach is the ability to learn to spatially localize interactions with self-supervision on a …
Refer-it-in-RGBD: A bottom-up approach for 3D visual grounding in RGBD images
Grounding referring expressions in RGBD images has been an emerging field. We present a
novel task of 3D visual grounding in single-view RGBD image where the referred objects are …
Correspondence matters for video referring expression comprehension
We investigate the problem of video Referring Expression Comprehension (REC), which
aims to localize the referent objects described in the sentence to visual regions in the video …