Describing videos by exploiting temporal structure

Convolutional neural network: a review of models, methodologies and applications to object detection

A Dhillon, GK Verma - Progress in Artificial Intelligence, 2020 - Springer

Deep learning has developed as an effective machine learning method that takes in
numerous layers of features or representation of the data and provides state-of-the-art …

Cited by 1055 - Related articles - All 3 versions

Neural machine translation: A review

F Stahlberg - Journal of Artificial Intelligence Research, 2020 - jair.org

The field of machine translation (MT), the automatic translation of written text from one
natural language into another, has experienced a major paradigm shift in recent years …

Cited by 439 - Related articles - All 7 versions

AI Choreographer: Music conditioned 3D dance generation with AIST++

R Li, S Yang, DA Ross… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with
FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion …

Cited by 480 - Related articles - All 6 versions

X-Pool: Cross-modal language-video attention for text-video retrieval

SK Gorti, N Vouitsis, J Ma, K Golestan… - Proceedings of the …, 2022 - openaccess.thecvf.com

In text-video retrieval, the objective is to learn a cross-modal similarity function between a
text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However …

Cited by 174 - Related articles - All 7 versions

End-to-end dense video captioning with parallel decoding

T Wang, R Zhang, Z Lu, F Zheng… - Proceedings of the …, 2021 - openaccess.thecvf.com

Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …

Cited by 209 - Related articles - All 6 versions

TEA: Temporal excitation and aggregation for action recognition

Y Li, B Ji, X Shi, J Zhang, B Kang… - Proceedings of the …, 2020 - openaccess.thecvf.com

Temporal modeling is key for action recognition in videos. It normally considers both
short-range motions and long-range aggregations. In this paper, we propose a Temporal …

Cited by 596 - Related articles - All 11 versions

ActBERT: Learning global-local video-text representations

L Zhu, Y Yang - Proceedings of the IEEE/CVF conference …, 2020 - openaccess.thecvf.com

In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …

Cited by 504 - Related articles - All 10 versions

An attentive survey of attention models

S Chaudhari, V Mithal, G Polatkan… - ACM Transactions on …, 2021 - dl.acm.org

Attention Model has now become an important concept in neural networks that has been
researched within diverse application domains. This survey provides a structured and …

Cited by 876 - Related articles - All 6 versions

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer

In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Cited by 216 - Related articles - All 8 versions

Object relational graph with teacher-recommended learning for video captioning

Z Zhang, Y Shi, C Yuan, B Li, P Wang… - Proceedings of the …, 2020 - openaccess.thecvf.com

Taking full advantage of the information from both vision and language is critical for the
video captioning task. Existing models lack adequate visual representation due to the …

Cited by 364 - Related articles - All 8 versions