Open-vocabulary semantic segmentation via attribute decomposition-aggregation
Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …
Temporal action localization in the deep learning era: A survey
B Wang, Y Zhao, L Yang, T Long… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The temporal action localization research aims to discover action instances from untrimmed
videos, representing a fundamental step in the field of intelligent video understanding. With …
Multi-modal prompting for low-shot temporal action localization
In this paper, we consider the problem of temporal action localization under low-shot (zero-
shot & few-shot) scenario, with the goal of detecting and classifying the action instances from …
Multi-modal prototypes for open-set semantic segmentation
In semantic segmentation, adapting a visual system to novel object categories at inference
time has always been both valuable and challenging. To enable such generalization …
Constraint and union for partially-supervised temporal sentence grounding
Temporal sentence grounding aims to detect the event timestamps described by the natural
language query from given untrimmed videos. The existing fully-supervised setting achieves …
Turbo: Informativity-driven acceleration plug-in for vision-language models
Vision-Language Large Models (VLMs) have become primary backbone of AI, due to the
impressive performance. However, their expensive computation costs, i.e., throughput and …
X4d-sceneformer: Enhanced scene understanding on 4d point cloud videos through cross-modal knowledge transfer
The field of 4D point cloud understanding is rapidly developing with the goal of analyzing
dynamic 3D point cloud sequences. However, it remains a challenging task due to the …
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
C Ju, H Wang, H Cheng, X Chen, Z Zhai… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Large Models (VLMs) recently become primary backbone of AI, due to the
impressive performance. However, their expensive computation costs, i.e., throughput and …
AttrSeg: open-vocabulary semantic segmentation via attribute decomposition-aggregation
C Ma, Y Yang, C Ju, F Zhang, Y Zhang… - … -seventh Conference on …, 2023 - openreview.net
Open-vocabulary semantic segmentation is a challenging task that requires segmenting
novel object categories at inference time. Recent works explore vision-language pre-training …