E3m: Zero-shot spatio-temporal video grounding with expectation-maximization multimodal modulation

P Bao, Z Shao, W Yang, BP Ng, AC Kot - European Conference on …, 2025 - Springer
Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video
according to a given language query. To eliminate annotation costs, we make a first …

Sdstrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking

X Hou, J Xing, Y Qian, Y Guo, S Xin… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal Visual Object Tracking (VOT) has recently gained significant attention
due to its robustness. Early research focused on fully fine-tuning RGB-based trackers which …

Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

C Cao, Y Zhang, Y Yu, Q Lv, L Min… - Proceedings of the 32nd …, 2024 - dl.acm.org
Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and
design sophisticated temporal alignment modules at the feature level. However, simply fully fine …

Multimodal prototype-enhanced network for few-shot action recognition

X Ni, Y Liu, H Wen, Y Ji, J Xiao, Y Yang - Proceedings of the 2024 …, 2024 - dl.acm.org
Current methods for few-shot action recognition mainly fall into the metric learning
framework following ProtoNet, which demonstrates the importance of prototypes. Although …

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

B Li, M Liu, G Wang, Y Yu - arXiv preprint arXiv:2408.12475, 2024 - arxiv.org
In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot
action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre …

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

M Wang, J Xing, B Jiang, J Chen, J Mei, X Zuo… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient Fine-Tuning (PEFT), has captured substantial attention …

Efficient Few-Shot Action Recognition via Multi-level Post-reasoning

C Wu, XJ Wu, L Li, T Xu, Z Feng, J Kittler - European Conference on …, 2025 - Springer
The integration with CLIP (Contrastive Language-Image Pre-training) has significantly
refreshed the accuracy leaderboard of FSAR (Few-Shot Action Recognition). However, the …

Temporal Causal Mechanism Transfer for Few-shot Action Recognition

Y Li, G Chen, B Abramowitz, S Anzellotti, D Wei - openreview.net
The goal of few-shot action recognition is to recognize actions in video sequences for which
there exist only a few training samples. The challenge is to adapt a base model effectively …