Weakly supervised temporal sentence grounding with uncertainty-guided self-training
The task of weakly supervised temporal sentence grounding aims at finding the
corresponding temporal moments of a language description in the video, given video …
Training-free video temporal grounding using large-scale pre-trained models
Video temporal grounding aims to identify video segments within untrimmed videos that are
most relevant to a given natural language query. Existing video temporal localization models …
Compositional Substitutivity of Visual Reasoning for Visual Question Answering
Compositional generalization has received much attention in vision-and-language and
visual reasoning recently. Substitutivity, the capability to generalize to novel compositions …
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding
Temporal grounding, also known as video moment retrieval, aims at locating video
segments corresponding to a given query sentence. The compositional nature of natural …
Proposal-based Temporal Action Localization with Point-level Supervision
Point-level supervised temporal action localization (PTAL) aims at recognizing and
localizing actions in untrimmed videos where only a single point (frame) within every action …
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-
language model. Designed for deployment on portable devices such as smartphones and …
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an
untrimmed long video given a natural language query. Existing methods often suffer from …
PTAN: Principal Token-aware Adjacent Network for Compositional Temporal Grounding
Compositional temporal grounding (CTG) aims to localize the most relevant segment from
an untrimmed video based on a given natural language sentence, and the test samples for …
Localizing Events in Videos with Multimodal Queries
Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event
nature of videos makes them labor-intensive and computationally demanding to process …
Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild
P Bao, C Kong, Z Shao, BP Ng, MH Er… - arXiv preprint arXiv …, 2024 - arxiv.org
Given a natural language query, video moment retrieval aims to localize the described
temporal moment in an untrimmed video. A major challenge of this task is its heavy …