A survey of natural language generation

C Dong, Y Li, H Gong, M Chen, J Li, Y Shen… - ACM Computing …, 2022 - dl.acm.org
This article offers a comprehensive review of the research on Natural Language Generation
(NLG) over the past two decades, especially in relation to data-to-text generation and text-to …

Automatic image and video caption generation with deep learning: A concise review and algorithmic overlap

S Amirian, K Rasheed, TR Taha, HR Arabnia - IEEE Access, 2020 - ieeexplore.ieee.org
Methodologies that utilize Deep Learning offer great potential for applications that
automatically attempt to generate captions or descriptions about images and video frames …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

Generating diverse and natural 3D human motions from text

C Guo, S Zou, X Zuo, S Wang, W Ji… - Proceedings of the …, 2022 - openaccess.thecvf.com
Automated generation of 3D human motions from text is a challenging problem. The
generated motions are expected to be sufficiently diverse to explore the text-grounded …

TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts

C Guo, X Zuo, S Wang, L Cheng - European Conference on Computer …, 2022 - Springer
Inspired by the strong ties between vision and language, the two intimate human sensing
and communication modalities, our paper aims to explore the generation of 3D human full …

End-to-end dense video captioning with parallel decoding

T Wang, R Zhang, Z Lu, F Zheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …

Evaluation of text generation: A survey

A Celikyilmaz, E Clark, J Gao - arXiv preprint arXiv:2006.14799, 2020 - arxiv.org
The paper surveys evaluation methods of natural language generation (NLG) systems that
have been developed in the last few years. We group NLG evaluation methods into three …

AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description

J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generating multi-sentence descriptions for video is considered one of the most complex tasks
in computer vision and natural language understanding due to the intricate nature of video …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

MART: Memory-augmented recurrent transformer for coherent video paragraph captioning

J Lei, L Wang, Y Shen, D Yu, TL Berg… - arXiv preprint arXiv …, 2020 - arxiv.org
Generating multi-sentence descriptions for videos is one of the most challenging captioning
tasks due to its high requirements for not only visual relevance but also discourse-based …