Temporal segment networks for action recognition in videos

Y Wang, W Song, W Tao, A Liotta, D Yang, X Li, S Gao… - Information …, 2022 - Elsevier

Affective computing conjoins the research topics of emotion recognition and sentiment
analysis, and can be realized with unimodal or multimodal data, consisting primarily of …

Cited by 279 · Related articles · All 5 versions

A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions

SK Yadav, K Tiwari, HM Pandey, SA Akbar - Knowledge-Based Systems, 2021 - Elsevier

Human activity recognition (HAR) is one of the most important and challenging problems in
the computer vision. It has critical application in wide variety of tasks including gaming …

Cited by 194 · Related articles · All 3 versions

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc

Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

Cited by 760 · Related articles · All 6 versions

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc

Capitalizing on large pre-trained models for various downstream tasks of interest have
recently emerged with promising performance. Due to the ever-growing model size, the …

Cited by 155 · Related articles · All 7 versions

SimAM: A simple, parameter-free attention module for convolutional neural networks

L Yang, RY Zhang, L Li, X Xie - International conference on …, 2021 - proceedings.mlr.press

In this paper, we propose a conceptually simple but very effective attention module for
Convolutional Neural Networks (ConvNets). In contrast to existing channel-wise and spatial …

Cited by 867 · Related articles · All 5 versions

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Cited by 39 · Related articles · All 3 versions

All in one: Exploring unified video-language pre-training

J Wang, Y Ge, R Yan, Y Ge, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com

Mainstream Video-Language Pre-training models consist of three parts, a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …

Cited by 177 · Related articles · All 4 versions

SwinBERT: End-to-end transformers with sparse attention for video captioning

K Lin, L Li, CC Lin, F Ahmed, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com

The canonical approach to video captioning dictates a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …

Cited by 223 · Related articles · All 5 versions

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com

Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

Cited by 868 · Related articles · All 12 versions

ActionCLIP: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arXiv preprint arXiv:2109.08472, 2021 - arxiv.org

The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …

Cited by 315 · Related articles · All 2 versions