Distilling vision-language pre-training to collaborate with weakly-supervised temporal action...

C Ma, Y Yuhuan, C Ju, F Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc

Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …

Cited by 9 · Related articles · All 4 versions

Audio-Visual Segmentation via Unlabeled Frame Exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com

Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Cited by 2 · Related articles · All 4 versions

Temporal action localization in the deep learning era: A survey

B Wang, Y Zhao, L Yang, T Long… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

The temporal action localization research aims to discover action instances from untrimmed
videos, representing a fundamental step in the field of intelligent video understanding. With …

Cited by 6 · Related articles · All 6 versions

Multi-modal prompting for low-shot temporal action localization

C Ju, Z Li, P Zhao, Y Zhang, X Zhang, Q Tian… - arXiv preprint arXiv …, 2023 - arxiv.org

In this paper, we consider the problem of temporal action localization under low-shot
(zero-shot & few-shot) scenario, with the goal of detecting and classifying the action instances from …

Cited by 13 · Related articles · All 2 versions

Multi-modal prototypes for open-set semantic segmentation

Y Yang, C Ma, C Ju, Y Zhang, Y Wang - arXiv preprint arXiv:2307.02003, 2023 - arxiv.org

In semantic segmentation, adapting a visual system to novel object categories at inference
time has always been both valuable and challenging. To enable such generalization …

Cited by 7 · Related articles · All 2 versions

Constraint and union for partially-supervised temporal sentence grounding

C Ju, H Wang, J Liu, C Ma, Y Zhang, P Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org

Temporal sentence grounding aims to detect the event timestamps described by the natural
language query from given untrimmed videos. The existing fully-supervised setting achieves …

Cited by 12 · Related articles · All 2 versions

Turbo: Informativity-driven acceleration plug-in for vision-language models

C Ju, H Wang, Z Li, X Chen, Z Zhai, W Huang… - arXiv preprint arXiv …, 2023 - arxiv.org

Vision-Language Large Models (VLMs) have become primary backbone of AI, due to the
impressive performance. However, their expensive computation costs, i.e., throughput and …

Cited by 3 · Related articles · All 3 versions

X4d-sceneformer: Enhanced scene understanding on 4d point cloud videos through cross-modal knowledge transfer

L Jing, Y Xue, X Yan, C Zheng, D Wang… - Proceedings of the …, 2024 - ojs.aaai.org

The field of 4D point cloud understanding is rapidly developing with the goal of analyzing
dynamic 3D point cloud sequences. However, it remains a challenging task due to the …

Cited by 1 · Related articles · All 5 versions

Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

C Ju, H Wang, H Cheng, X Chen, Z Zhai… - arXiv preprint arXiv …, 2024 - arxiv.org

Vision-Language Large Models (VLMs) recently become primary backbone of AI, due to the
impressive performance. However, their expensive computation costs, i.e., throughput and …

AttrSeg: open-vocabulary semantic segmentation via attribute decomposition-aggregation

C Ma, Y Yang, C Ju, F Zhang, Y Zhang… - … -seventh Conference on …, 2023 - openreview.net

Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …

Cited by 3 · Related articles · All 2 versions