A review on methods and applications in multimodal deep learning
Deep learning has enabled a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …
Video description: A survey of methods, datasets, and evaluation metrics
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …
GIT: A generative image-to-text transformer for vision and language
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …
SwinBERT: End-to-end transformers with sparse attention for video captioning
The canonical approach to video captioning requires a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …
Attention on attention for image captioning
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …
Object relational graph with teacher-recommended learning for video captioning
Taking full advantage of the information from both vision and language is critical for the
video captioning task. Existing models lack adequate visual representation due to the …
Spatio-temporal graph for video captioning with knowledge distillation
Video captioning is a challenging task that requires a deep understanding of visual scenes.
State-of-the-art methods generate captions using either scene-level or object-level …
Recurrent fusion network for image captioning
Recently, much progress has been made in image captioning, and an encoder-decoder
framework has been adopted by all the state-of-the-art models. Under this framework, an …
Semantic grouping network for video captioning
This paper considers a video caption generating network referred to as Semantic Grouping
Network (SGN) that attempts (1) to group video frames with discriminating word phrases of …
Memory-attended recurrent network for video captioning
Typical techniques for video captioning follow the encoder-decoder framework, which can
only focus on one source video being processed. A potential disadvantage of such design is …