A survey on video moment localization
Video moment localization, also known as video moment retrieval, aims to search a target
segment within a video described by a given natural language query. Beyond the task of …
Ego4D: Around the world in 3,000 hours of egocentric video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
UniVTG: Towards unified video-language temporal grounding
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language queries (e.g. …
Egocentric video-language pretraining
Video-Language Pretraining (VLP), which aims to learn transferable representation
to advance a wide range of video-text downstream tasks, has recently received increasing …
MomentDiff: Generative video moment retrieval from random to real
Video moment retrieval pursues an efficient and generalized solution to identify the specific
temporal segments within an untrimmed video that correspond to a given language …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Detecting moments and highlights in videos via natural language queries
Detecting customized moments and highlights from videos given natural language (NL) user
queries is an important but under-studied topic. One of the challenges in pursuing this …
Query-dependent video representation for moment retrieval and highlight detection
Recently, video moment retrieval and highlight detection (MR/HD) are being spotlighted as
the demand for video understanding is drastically increased. The key objective of MR/HD is …
Relaxed transformer decoders for direct action proposal generation
Temporal action proposal generation is an important and challenging task in video
understanding, which aims at detecting all temporal segments containing action instances of …
UnLoc: A unified framework for video localization tasks
While large-scale image-text pretrained models such as CLIP have been used for multiple
video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos …