Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …

Open-vocabulary semantic segmentation via attribute decomposition-aggregation

C Ma, Y Yang, C Ju, F Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …

Divide and conquer for single-frame temporal action localization

C Ju, P Zhao, S Chen, Y Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Single-frame temporal action localization (STAL) aims to localize actions in untrimmed
videos with only one timestamp annotation for each action instance. Existing methods adopt …

Distilling vision-language pre-training to collaborate with weakly-supervised temporal action localization

C Ju, K Zheng, J Liu, P Zhao, Y Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Weakly-supervised temporal action localization (WTAL) learns to detect and classify action
instances with only category labels. Most methods widely adopt the off-the-shelf …

DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization

X Tang, J Fan, C Luo, Z Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task.
Due to large-scale datasets, most existing methods use a network pretrained in other …

Audio-Visual Segmentation via Unlabeled Frame Exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Temporal action localization in the deep learning era: A survey

B Wang, Y Zhao, L Yang, T Long… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The temporal action localization research aims to discover action instances from untrimmed
videos, representing a fundamental step in the field of intelligent video understanding. With …

Multi-modal prompting for low-shot temporal action localization

C Ju, Z Li, P Zhao, Y Zhang, X Zhang, Q Tian… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying the action instances from …

Multi-modal prototypes for open-set semantic segmentation

Y Yang, C Ma, C Ju, Y Zhang, Y Wang - arXiv preprint arXiv:2307.02003, 2023 - arxiv.org
In semantic segmentation, adapting a visual system to novel object categories at inference
time has always been both valuable and challenging. To enable such generalization …

Constraint and union for partially-supervised temporal sentence grounding

C Ju, H Wang, J Liu, C Ma, Y Zhang, P Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Temporal sentence grounding aims to detect the event timestamps described by the natural
language query from given untrimmed videos. The existing fully-supervised setting achieves …