VL-PET: Vision-and-language parameter-efficient tuning via granularity control
As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning
becomes prohibitively expensive for model training and storage. In vision-and-language …
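The training-and-storage-cost argument above is the standard motivation for parameter-efficient tuning: freeze the large pre-trained backbone and train only a small inserted module, so each downstream task adds a few million parameters instead of a full model copy. The PyTorch sketch below illustrates that generic idea with a hypothetical bottleneck adapter and placeholder dimensions; it is not VL-PET's granularity-control method.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic down-project / up-project adapter with a residual connection.

    Illustrative only: hidden/bottleneck sizes are placeholders, not VL-PET's design.
    """
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# A stand-in "pre-trained" backbone; its weights are frozen and never updated.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
for p in backbone.parameters():
    p.requires_grad = False

# One adapter per layer would be trained; wiring them into each layer's forward
# pass is omitted here, since the point is only the trainable-parameter ratio.
adapters = nn.ModuleList([BottleneckAdapter() for _ in range(12)])

trainable = sum(p.numel() for p in adapters.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable / 1e6:.2f}M of {total / 1e6:.2f}M total parameters")
```

Per task, only the adapter weights need to be stored, which is the storage saving the abstract alludes to.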
When Geoscience Meets Foundation Models: Toward a general geoscience artificial intelligence system
Artificial intelligence (AI) has significantly advanced Earth sciences, yet its full potential in
comprehensively modeling Earth's complex dynamics remains unrealized. Geoscience …
Parameter-efficient transfer learning for remote sensing image-text retrieval
Vision-and-language pretraining (VLP) models have experienced a surge in popularity
recently. By fine-tuning them on specific datasets, significant performance improvements …
Rethinking vision transformer and masked autoencoder in multimodal face anti-spoofing
Recently, vision transformer (ViT) based multimodal learning methods have been proposed
to improve the robustness of face anti-spoofing (FAS) systems. However, there are still no …
KVQ: Kwai video quality assessment for short-form videos
Short-form UGC video platforms like Kwai and TikTok have become an emerging and
irreplaceable mainstream media form thriving on user-friendly engagement and …
Cross-modal adapter for text-video retrieval
Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve
the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show …
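For context on the retrieval task described above, a frozen dual encoder such as CLIP typically reduces text-video retrieval to ranking candidate videos by the similarity between a text embedding and frame-pooled video embeddings; adapter-style methods then add small trainable layers on top of the frozen encoders. The sketch below shows only that generic ranking step with random stand-in embeddings, not the cross-modal adapter proposed in the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in embeddings: in practice these would come from frozen CLIP encoders
# (one text query, N candidate videos whose per-frame features are pooled).
text_emb = torch.randn(1, 512)           # [1, D] query embedding
video_frames = torch.randn(100, 8, 512)  # [N, frames, D] per-frame features
video_emb = video_frames.mean(dim=1)     # [N, D] simple mean pooling over frames

# Cosine similarity between the query and every candidate video.
text_emb = F.normalize(text_emb, dim=-1)
video_emb = F.normalize(video_emb, dim=-1)
scores = text_emb @ video_emb.t()        # [1, N]

# Retrieve the top-5 most relevant videos for the text query.
topk = scores.topk(k=5, dim=-1)
print("top-5 video indices:", topk.indices.tolist()[0])
```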
Parameter-efficient is not sufficient: Exploring parameter, memory, and time efficient adapter tuning for dense predictions
Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently,
parameter-efficient transfer learning (PETL) methods have shown promising performance in …
End-to-end temporal action detection with 1b parameters across 1000 frames
Recently, temporal action detection (TAD) has seen significant performance improvement
with end-to-end training. However, due to the memory bottleneck, only models with limited …
DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration
The visual models pretrained on large-scale benchmarks encode general knowledge and
prove effective in building more powerful representations for downstream tasks. Most …