St-adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com
Our objective in this work is video-text retrieval; in particular, a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …

Video transformer network

D Neimark, O Bar, M Zohar… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents VTN, a transformer-based framework for video recognition. Inspired by
recent developments in vision transformers, we depart from the standard approach in video action …

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Revisiting temporal modeling for clip-based image-to-video knowledge transferring

R Liu, J Huang, G Li, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal
knowledge learned from large-scale image-text data pairs, thus attracting increasing …

Smart frame selection for action recognition

SN Gowda, M Rohrbach, L Sevilla-Lara - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Video classification is computationally expensive. In this paper, we address the problem of
frame selection to reduce the computational cost of video classification. Recent work has …

Learning de-biased representations with biased representations

H Bahng, S Chun, S Yun, J Choo… - … on Machine Learning, 2020 - proceedings.mlr.press
Many machine learning algorithms are trained and evaluated by splitting data from a single
source into training and test sets. While such focus on in-distribution learning scenarios has …

Ego4d goal-step: Toward hierarchical understanding of procedural activities

Y Song, E Byrne, T Nagarajan… - Advances in …, 2024 - proceedings.neurips.cc
Human activities are goal-oriented and hierarchical, comprising primary goals at the top
level, sequences of steps and substeps in the middle, and atomic actions at the lowest level …