Gesture recognition in robotic surgery: a review

B van Amsterdam, MJ Clarkson… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Objective: Surgical activity recognition is a fundamental step in computer-assisted
interventions. This paper reviews the state-of-the-art in methods for automatic recognition of …

Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models

W Wu, X Wang, H Luo, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability on various visual tasks. Transferring knowledge …

Revisiting classifier: Transferring vision-language models for video recognition

W Wu, Z Sun, W Ouyang - Proceedings of the AAAI conference on …, 2023 - ojs.aaai.org
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is
an important topic in computer vision research. Along with the growth of computational …

Open-VCLIP: Transforming CLIP to an open-vocabulary video model via interpolated weight optimization

Z Weng, X Yang, A Li, Z Wu… - … Conference on Machine …, 2023 - proceedings.mlr.press
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive zero-
shot learning abilities for image understanding, yet limited effort has been made to …

Cross-modal representation learning for zero-shot action recognition

CC Lin, K Lin, L Wang, Z Liu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
We present a cross-modal Transformer-based framework, which jointly encodes video data
and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually …

Transferring vision-language models for visual recognition: A classifier perspective

W Wu, Z Sun, Y Song, J Wang, W Ouyang - International Journal of …, 2024 - Springer
Transferring knowledge from pre-trained deep models for downstream tasks, particularly
with limited labeled samples, is a fundamental problem in computer vision research. Recent …

Building an open-vocabulary video CLIP model with better architectures, optimization and data

Z Wu, Z Weng, W Peng, X Yang, A Li… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in
zero-shot image recognition, limited effort has been made exploring its potential for zero …

Multimodal open-vocabulary video classification via pre-trained vision and language models

R Qian, Y Li, Z Xu, MH Yang, S Belongie… - arXiv preprint arXiv …, 2022 - arxiv.org
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is
becoming a promising paradigm for open-vocabulary visual recognition. In this work, we …

Zero-shot action recognition with transformer-based video semantic embedding

K Doshi, Y Yilmaz - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
While video action recognition has been an active area of research for several years, zero-
shot action recognition has only recently started gaining traction. In this work, we propose a …

Alignment-uniformity aware representation learning for zero-shot video classification

S Pu, K Zhao, M Zheng - … of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Most methods tackle zero-shot video classification by aligning visual-semantic
representations within seen classes, which limits generalization to unseen classes. To …