Multimodal adaptation of CLIP for few-shot action recognition

P Bao, Z Shao, W Yang, BP Ng, AC Kot - European Conference on …, 2025 - Springer

Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video
according to the given language query. To eliminate the annotation costs, we make a first …

Cited by 5 · Related articles · All 4 versions

SDSTrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking

X Hou, J Xing, Y Qian, Y Guo, S Xin… - Proceedings of the …, 2024 - openaccess.thecvf.com

Multimodal Visual Object Tracking (VOT) has recently gained significant attention
due to its robustness. Early research focused on fully fine-tuning RGB-based trackers which …

Cited by 23 · Related articles · All 4 versions

Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

C Cao, Y Zhang, Y Yu, Q Lv, L Min… - Proceedings of the 32nd …, 2024 - dl.acm.org

Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and
design sophisticated temporal alignment modules at feature level. However, simply fully fine …

Cited by 2 · Related articles · All 4 versions

Multimodal prototype-enhanced network for few-shot action recognition

X Ni, Y Liu, H Wen, Y Ji, J Xiao, Y Yang - Proceedings of the 2024 …, 2024 - dl.acm.org

Current methods for few-shot action recognition mainly fall into the metric learning
framework following ProtoNet, which demonstrates the importance of prototypes. Although …

Cited by 14 · Related articles · All 4 versions

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

B Li, M Liu, G Wang, Y Yu - arXiv preprint arXiv:2408.12475, 2024 - arxiv.org

In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot
action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre …

Cited by 3 · Related articles · All 3 versions

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

M Wang, J Xing, B Jiang, J Chen, J Mei, X Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction …

Cited by 6 · Related articles · All 2 versions

Efficient Few-Shot Action Recognition via Multi-level Post-reasoning

C Wu, XJ Wu, L Li, T Xu, Z Feng, J Kittler - European Conference on …, 2025 - Springer

The integration with CLIP (Contrastive Vision-Language Pre-training) has significantly
refreshed the accuracy leaderboard of FSAR (Few-Shot Action Recognition). However, the …

A Multimodal, Multi-Task Adapting Framework for Video Action Recognition

M Wang, J Xing, B Jiang, J Chen, J Mei, X Zuo… - Proceedings of the …, 2024 - ojs.aaai.org

Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction …

Cited by 6 · Related articles

Temporal Causal Mechanism Transfer for Few-shot Action Recognition

Y Li, G Chen, B Abramowitz, S Anzellotti, D Wei - openreview.net

The goal of few-shot action recognition is to recognize actions in video sequences for which
there exists only a few training samples. The challenge is to adapt a base model effectively …