Weakly supervised temporal sentence grounding with uncertainty-guided self-training

Y Huang, L Yang, Y Sato - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
The task of weakly supervised temporal sentence grounding aims at finding the
corresponding temporal moments of a language description in the video, given video …

Training-free video temporal grounding using large-scale pre-trained models

M Zheng, X Cai, Q Chen, Y Peng, Y Liu - European Conference on …, 2025 - Springer
Video temporal grounding aims to identify video segments within untrimmed videos that are
most relevant to a given natural language query. Existing video temporal localization models …

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

C Li, Z Li, C Jing, Y Wu, M Zhai, Y Jia - European Conference on Computer …, 2025 - Springer
Compositional generalization has received much attention in vision-and-language and
visual reasoning recently. Substitutivity, the capability to generalize to novel compositions …

SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Z Cheng, Y Pu, S Gong, P Kordjamshidi… - European Conference on …, 2025 - Springer
Temporal grounding, also known as video moment retrieval, aims at locating video
segments corresponding to a given query sentence. The compositional nature of natural …

Proposal-based Temporal Action Localization with Point-level Supervision

Y Yin, Y Huang, R Furuta, Y Sato - arXiv preprint arXiv:2310.05511, 2023 - arxiv.org
Point-level supervised temporal action localization (PTAL) aims at recognizing and
localizing actions in untrimmed videos where only a single point (frame) within every action …

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

Y Huang, J Xu, B Pei, Y He, G Chen, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-
language model. Designed for deployment on portable devices such as smartphones and …

MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval

W Cai, J Huang, S Gong, H Jin, Y Liu - arXiv preprint arXiv:2406.17880, 2024 - arxiv.org
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an
untrimmed long video given a natural language query. Existing methods often suffer from …

PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding

Z Wei, X Jiang, Z Wang, F Shen, X Xu - Proceedings of the 2024 …, 2024 - dl.acm.org
Compositional temporal grounding (CTG) aims to localize the most relevant segment from
an untrimmed video based on a given natural language sentence, and the test samples for …

Localizing Events in Videos with Multimodal Queries

G Zhang, MLA Fok, Y Xia, Y Tang, D Cremers… - arXiv preprint arXiv …, 2024 - arxiv.org
Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event
nature of videos makes them labor-intensive and computationally demanding to process …

Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild

P Bao, C Kong, Z Shao, BP Ng, MH Er… - arXiv preprint arXiv …, 2024 - arxiv.org
Given a natural language query, video moment retrieval aims to localize the described
temporal moment in an untrimmed video. A major challenge of this task is its heavy …