Graph convolutional networks: a comprehensive review
Graphs naturally appear in numerous application domains, ranging from social analysis and
bioinformatics to computer vision. The unique capability of graphs enables capturing the …
A comprehensive survey of scene graphs: Generation and application
A scene graph is a structured representation of a scene that can clearly express the objects,
attributes, and relationships between objects in the scene. As computer vision technology …
VideoMAE V2: Scaling video masked autoencoders with dual masking
Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …
Graph neural networks: foundation, frontiers and applications
The field of graph neural networks (GNNs) has seen rapid and incredible strides over
recent years. Graph neural networks, also known as deep learning on graphs, graph …
Multiscale vision transformers
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
ActionCLIP: A new paradigm for video action recognition
The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …
TDN: Temporal difference networks for efficient action recognition
Temporal modeling remains challenging for action recognition in videos. To mitigate this
issue, this paper presents a new video architecture, termed Temporal Difference Network …
Video pivoting unsupervised multi-modal machine translation
The main challenge in the field of unsupervised machine translation (UMT) is to associate
source-target sentences in the latent space. As people who speak different languages share …
X3D: Expanding architectures for efficient video recognition
C Feichtenhofer - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com
This paper presents X3D, a family of efficient video networks that progressively expand a
tiny 2D image classification architecture along multiple network axes, in space, time, width …
Disentangling and unifying graph convolutions for skeleton-based action recognition
Spatial-temporal graphs have been widely used by skeleton-based action recognition
algorithms to model human action dynamics. To capture robust movement patterns from …