ST-Adapter: Parameter-efficient image-to-video transfer learning
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
Is space-time attention all you need for video understanding?
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …
Video transformer network
D Neimark, O Bar, M Zohar… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents VTN, a transformer-based framework for video recognition. Inspired by
recent developments in vision transformers, we ditch the standard approach in video action …
Verbs in action: Improving verb understanding in video-language models
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …
Video transformers: A survey
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …
Revisiting temporal modeling for CLIP-based image-to-video knowledge transferring
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …
Smart frame selection for action recognition
Video classification is computationally expensive. In this paper, we address the problem of
frame selection to reduce the computational cost of video classification. Recent work has …
Learning de-biased representations with biased representations
Many machine learning algorithms are trained and evaluated by splitting data from a single
source into training and test sets. While such focus on in-distribution learning scenarios has …
Ego4D Goal-Step: Toward hierarchical understanding of procedural activities
Human activities are goal-oriented and hierarchical, comprising primary goals at the top
level, sequences of steps and substeps in the middle, and atomic actions at the lowest level …