Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Knowing where to focus: Event-aware transformer for video grounding
Recent DETR-based video grounding models directly predict moment
timestamps without any hand-crafted components, such as a pre-defined proposal or non …
You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target
moment semantically according to a sentence query. Although previous respectable works …
Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training
The correlation between vision and text is essential for video moment retrieval (VMR);
however, existing methods heavily rely on separate pre-training feature extractors for visual …
Dual learning with dynamic knowledge distillation for partially relevant video retrieval
J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with
short durations. However, in practice, videos are generally untrimmed containing much …
Skimming, locating, then perusing: A human-like framework for natural language video localization
This paper addresses the problem of natural language video localization (NLVL). Almost all
existing works follow the "only look once" framework that exploits a single model to directly …
Partial annotation-based video moment retrieval via iterative learning
Given a descriptive language query, Video Moment Retrieval (VMR) aims to seek the
corresponding semantic-consistent moment clip in the video, which is represented as a pair …
Hypotheses tree building for one-shot temporal sentence localization
Given an untrimmed video, temporal sentence localization (TSL) aims to localize a specific
segment according to a given sentence query. Though respectable works have made …
Few-shot temporal sentence grounding via memory-guided semantic learning
Temporal sentence grounding (TSG) is an important yet challenging task in video-based
information retrieval. Given an untrimmed video input, it requires the machine to predict the …
Unsupervised domain adaptation for video object grounding with cascaded debiasing learning
This paper addresses Unsupervised Domain Adaptation (UDA) for the dense frame
prediction task of Video Object Grounding (VOG). This investigation springs from the …