A review on methods and applications in multimodal deep learning
Deep learning has enabled a wide range of applications and has become increasingly
popular in recent years. The goal of multimodal deep learning (MMDL) is to create models …
Video description: A survey of methods, datasets, and evaluation metrics
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …
GIT: A generative image-to-text transformer for vision and language
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …
SwinBERT: End-to-end transformers with sparse attention for video captioning
The canonical approach to video captioning requires a caption generation model to learn
from offline-extracted dense video features. These feature extractors usually operate on …
Attention on attention for image captioning
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …
Object relational graph with teacher-recommended learning for video captioning
Taking full advantage of the information from both vision and language is critical for the
video captioning task. Existing models lack adequate visual representation due to the …
Spatio-temporal graph for video captioning with knowledge distillation
Video captioning is a challenging task that requires a deep understanding of visual scenes.
State-of-the-art methods generate captions using either scene-level or object-level …
Recurrent fusion network for image captioning
Recently, much progress has been made in image captioning, and an encoder-decoder
framework has been adopted by all the state-of-the-art models. Under this framework, an …
Semantic grouping network for video captioning
This paper considers a video caption generating network referred to as Semantic Grouping
Network (SGN) that attempts (1) to group video frames with discriminating word phrases of …
Memory-attended recurrent network for video captioning
Typical techniques for video captioning follow the encoder-decoder framework, which can
only focus on one source video being processed. A potential disadvantage of such design is …