E3M: Zero-shot spatio-temporal video grounding with expectation-maximization multimodal modulation
Spatio-temporal video grounding aims to localize the spatio-temporal tube in a video
according to the given language query. To eliminate the annotation costs, we make a first …
SDSTrack: Self-distillation symmetric adapter learning for multi-modal visual object tracking
Multimodal Visual Object Tracking (VOT) has recently gained significant attention
due to its robustness. Early research focused on fully fine-tuning RGB-based trackers which …
Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition
Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and
design sophisticated temporal alignment modules at feature level. However, simply fully fine …
Multimodal prototype-enhanced network for few-shot action recognition
Current methods for few-shot action recognition mainly fall into the metric learning
framework following ProtoNet, which demonstrates the importance of prototypes. Although …
Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition
In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot
action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre …
M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction …
Efficient Few-Shot Action Recognition via Multi-level Post-reasoning
The integration with CLIP (Contrastive Vision-Language Pre-training) has significantly
refreshed the accuracy leaderboard of FSAR (Few-Shot Action Recognition). However, the …
A Multimodal, Multi-Task Adapting Framework for Video Action Recognition
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with
the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction …
Temporal Causal Mechanism Transfer for Few-shot Action Recognition
The goal of few-shot action recognition is to recognize actions in video sequences for which
there exist only a few training samples. The challenge is to adapt a base model effectively …