Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

A comprehensive survey of vision-based human action recognition methods

HB Zhang, YX Zhang, B Zhong, Q Lei, L Yang, JX Du… - Sensors, 2019 - mdpi.com
Although widely used in many applications, accurate and efficient human action recognition
remains a challenging area of research in the field of computer vision. Most recent surveys …

MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Ego4D: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

W Wang, E Xie, X Li, DP Fan, K Song… - Proceedings of the …, 2021 - openaccess.thecvf.com
Although convolutional neural networks (CNNs) have achieved great success in computer
vision, this work investigates a simpler, convolution-free backbone network useful for many …

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …