A review on methods and applications in multimodal deep learning

S Jabeen, X Li, MS Amin, O Bourahla, S Li… - ACM Transactions on …, 2023 - dl.acm.org
Deep learning has enabled a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …

Video description: A survey of methods, datasets, and evaluation metrics

N Aafaq, A Mian, W Liu, SZ Gilani, M Shah - ACM Computing Surveys …, 2019 - dl.acm.org
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

SwinBERT: End-to-end transformers with sparse attention for video captioning

K Lin, L Li, CC Lin, F Ahmed, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
The canonical approach to video captioning requires a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …

Attention on attention for image captioning

L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …
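The weighted average over encoded vectors that this snippet describes is standard dot-product attention; a minimal sketch of one decoding step, with illustrative names not taken from the paper, might look like:

```python
import numpy as np

def attention_step(query, encoded):
    """One decoding time step of basic attention: score each encoded
    vector against the query, softmax the scores, and return the
    resulting weighted average (the context vector).

    query:   (d,) decoder state at this step
    encoded: (T, d) encoder output vectors
    """
    scores = encoded @ query                 # (T,) similarity scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoded              # weighted average, shape (d,)
    return context, weights

# toy example: 4 encoded vectors of dimension 3
rng = np.random.default_rng(0)
enc = rng.standard_normal((4, 3))
ctx, w = attention_step(enc[0], enc)
```

The "attention on attention" idea in the paper adds a further gating step on top of this context vector; the sketch above covers only the baseline weighted-average mechanism it builds on.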

Object relational graph with teacher-recommended learning for video captioning

Z Zhang, Y Shi, C Yuan, B Li, P Wang… - Proceedings of the …, 2020 - openaccess.thecvf.com
Taking full advantage of the information from both vision and language is critical for the
video captioning task. Existing models lack adequate visual representation due to the …

Spatio-temporal graph for video captioning with knowledge distillation

B Pan, H Cai, DA Huang, KH Lee… - Proceedings of the …, 2020 - openaccess.thecvf.com
Video captioning is a challenging task that requires a deep understanding of visual scenes.
State-of-the-art methods generate captions using either scene-level or object-level …

Recurrent fusion network for image captioning

W Jiang, L Ma, YG Jiang, W Liu… - Proceedings of the …, 2018 - openaccess.thecvf.com
Recently, much progress has been made in image captioning, and an encoder-decoder
framework has been adopted by all the state-of-the-art models. Under this framework, an …

Semantic grouping network for video captioning

H Ryu, S Kang, H Kang, CD Yoo - … of the AAAI Conference on Artificial …, 2021 - ojs.aaai.org
This paper considers a video caption-generating network, referred to as the Semantic Grouping
Network (SGN), that attempts (1) to group video frames with discriminating word phrases of …

Memory-attended recurrent network for video captioning

W Pei, J Zhang, X Wang, L Ke… - Proceedings of the …, 2019 - openaccess.thecvf.com
Typical techniques for video captioning follow the encoder-decoder framework, which can
only focus on one source video being processed. A potential disadvantage of such design is …