Videos as space-time region graphs

S Zhang, H Tong, J Xu, R Maciejewski - Computational Social Networks, 2019 - Springer

Graphs naturally appear in numerous application domains, ranging from social analysis,
bioinformatics to computer vision. The unique capability of graphs enables capturing the …

Cited by 1342 Related articles All 16 versions

[PDF] arxiv.org

A comprehensive survey of scene graphs: Generation and application

X Chang, P Ren, P Xu, Z Li, X Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Scene graph is a structured representation of a scene that can clearly express the objects,
attributes, and relationships between objects in the scene. As computer vision technology …

Cited by 316 Related articles All 15 versions

[PDF] thecvf.com

VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

Cited by 274 Related articles All 7 versions

[PDF] github.io

Graph neural networks: foundation, frontiers and applications

L Wu, P Cui, J Pei, L Zhao, X Guo - … of the 28th ACM SIGKDD Conference …, 2022 - dl.acm.org

The field of graph neural networks (GNNs) has seen rapid and incredible strides over the
recent years. Graph neural networks, also known as deep learning on graphs, graph …

Cited by 382 Related articles All 11 versions

[PDF] thecvf.com

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Cited by 1374 Related articles All 5 versions

[PDF] arxiv.org

ActionCLIP: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arXiv preprint arXiv:2109.08472, 2021 - arxiv.org

The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …

Cited by 374 Related articles All 2 versions

[PDF] thecvf.com

TDN: Temporal difference networks for efficient action recognition

L Wang, Z Tong, B Ji, G Wu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

Temporal modeling still remains challenging for action recognition in videos. To mitigate this
issue, this paper presents a new video architecture, termed as Temporal Difference Network …

Cited by 459 Related articles All 8 versions

Video pivoting unsupervised multi-modal machine translation

M Li, PY Huang, X Chang, J Hu, Y Yang… - … on Pattern Analysis …, 2022 - ieeexplore.ieee.org

The main challenge in the field of unsupervised machine translation (UMT) is to associate
source-target sentences in the latent space. As people who speak different languages share …

Cited by 114 Related articles All 7 versions

[PDF] thecvf.com

X3D: Expanding architectures for efficient video recognition

C Feichtenhofer - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com

This paper presents X3D, a family of efficient video networks that progressively expand a
tiny 2D image classification architecture along multiple network axes, in space, time, width …

Cited by 1147 Related articles All 7 versions

[PDF] thecvf.com

Disentangling and unifying graph convolutions for skeleton-based action recognition

Z Liu, H Zhang, Z Chen, Z Wang… - Proceedings of the …, 2020 - openaccess.thecvf.com

Spatial-temporal graphs have been widely used by skeleton-based action recognition
algorithms to model human action dynamics. To capture robust movement patterns from …

Cited by 1083 Related articles All 8 versions