Weakly supervised temporal sentence grounding with uncertainty-guided self-training
The task of weakly supervised temporal sentence grounding aims at finding the
corresponding temporal moments of a language description in the video, given video …
Training-free video temporal grounding using large-scale pre-trained models
Video temporal grounding aims to identify video segments within untrimmed videos that are
most relevant to a given natural language query. Existing video temporal localization models …
Compositional Substitutivity of Visual Reasoning for Visual Question Answering
Compositional generalization has received much attention in vision-and-language and
visual reasoning recently. Substitutivity, the capability to generalize to novel compositions …
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
Temporal grounding, also known as video moment retrieval, aims at locating video
segments corresponding to a given query sentence. The compositional nature of natural …
Proposal-based Temporal Action Localization with Point-level Supervision
Point-level supervised temporal action localization (PTAL) aims at recognizing and
localizing actions in untrimmed videos where only a single point (frame) within every action …
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-
language model. Designed for deployment on portable devices such as smartphones and …
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an
untrimmed long video given a natural language query. Existing methods often suffer from …
PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding
Compositional temporal grounding (CTG) aims to localize the most relevant segment from
an untrimmed video based on a given natural language sentence, and the test samples for …
Localizing Events in Videos with Multimodal Queries
Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event
nature of videos makes them labor-intensive and computationally demanding to process …
Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild
P Bao, C Kong, Z Shao, BP Ng, MH Er… - arXiv preprint arXiv …, 2024 - arxiv.org
Given a natural language query, video moment retrieval aims to localize the described
temporal moment in an untrimmed video. A major challenge of this task is its heavy …