Hollywood in homes: Crowdsourcing data collection for activity understanding

N Gupta, SK Gupta, RK Pathak, V Jain… - Artificial intelligence …, 2022 - Springer

Human activity recognition (HAR) has multifaceted applications due to its worldly usage of
acquisition devices such as smartphones, video cameras, and its ability to capture human …

被引用次数：210 相关文章所有 10 个版本

[PDF] arxiv.org

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org

Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

被引用次数：2497 相关文章所有 8 个版本

[PDF] thecvf.com

Diffusiondet: Diffusion model for object detection

S Chen, P Sun, Y Song, P Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

We propose DiffusionDet, a new framework that formulates object detection as a denoising
diffusion process from noisy boxes to object boxes. During the training stage, object boxes …

被引用次数：383 相关文章所有 5 个版本

[PDF] thecvf.com

Sequential modeling enables scalable learning for large vision models

Y Bai, X Geng, K Mangalam, A Bar… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce a novel sequential modeling approach which enables learning a Large Vision
Model (LVM) without making use of any linguistic data. To do this we define a common …

被引用次数：74 相关文章所有 3 个版本

[PDF] neurips.cc

S4nd: Modeling images and videos as multidimensional signals with state spaces

E Nguyen, K Goel, A Gu, G Downs… - Advances in neural …, 2022 - proceedings.neurips.cc

Visual data such as images and videos are typically modeled as discretizations of inherently
continuous, multidimensional signals. Existing continuous-signal models attempt to exploit …

被引用次数：140 相关文章所有 6 个版本

[PDF] thecvf.com

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com

Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

被引用次数：1351 相关文章所有 5 个版本

[PDF] thecvf.com

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com

Our objective in this work is video-text retrieval-in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

被引用次数：951 相关文章所有 12 个版本

[PDF] arxiv.org

Actionclip: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arXiv preprint arXiv:2109.08472, 2021 - arxiv.org

The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …

被引用次数：364 相关文章所有 2 个版本

[PDF] arxiv.org

A survey of natural language generation

C Dong, Y Li, H Gong, M Chen, J Li, Y Shen… - ACM Computing …, 2022 - dl.acm.org

This article offers a comprehensive review of the research on Natural Language Generation
(NLG) over the past two decades, especially in relation to data-to-text generation and text-to …

被引用次数：192 相关文章所有 4 个版本

[PDF] arxiv.org

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

被引用次数：520 相关文章所有 16 个版本