St-adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com
Our objective in this work is video-text retrieval; in particular, a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …

Video transformer network

D Neimark, O Bar, M Zohar… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents VTN, a transformer-based framework for video recognition. Inspired by
recent developments in vision transformers, we depart from the standard approach in video action …

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Revisiting temporal modeling for clip-based image-to-video knowledge transferring

R Liu, J Huang, G Li, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …

Smart frame selection for action recognition

SN Gowda, M Rohrbach, L Sevilla-Lara - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Video classification is computationally expensive. In this paper, we address the problem of
frame selection to reduce the computational cost of video classification. Recent work has …

Learning de-biased representations with biased representations

H Bahng, S Chun, S Yun, J Choo… - … on Machine Learning, 2020 - proceedings.mlr.press
Many machine learning algorithms are trained and evaluated by splitting data from a single
source into training and test sets. While such focus on in-distribution learning scenarios has …

Ego4d goal-step: Toward hierarchical understanding of procedural activities

Y Song, E Byrne, T Nagarajan… - Advances in …, 2024 - proceedings.neurips.cc
Human activities are goal-oriented and hierarchical, comprising primary goals at the top
level, sequences of steps and substeps in the middle, and atomic actions at the lowest level …