Object counts! bringing explicit detections back into image captioning

Y Zhong, L Wang, J Chen, D Yu, Y Li - … , Glasgow, UK, August 23–28, 2020 …, 2020 - Springer

We address the challenging problem of image captioning by revisiting the representation of
image scene graph. At the core of our method lies the decomposition of a scene graph into a …

Cited by 135 · Related articles · All 8 versions

Show, control and tell: A framework for generating controllable and grounded captions

M Cornia, L Baraldi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

Current captioning approaches can describe images using black-box architectures whose
behavior is hardly controllable and explainable from the exterior. As an image can be …

Cited by 217 · Related articles · All 14 versions

Trends in integration of vision and language research: A survey of tasks, datasets, and methods

A Mogadala, M Kalimuthu, D Klakow - Journal of Artificial Intelligence …, 2021 - jair.org

Abstract Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …

Cited by 152 · Related articles · All 8 versions

Image captioning model using attention and object features to mimic human image understanding

MA Al-Malla, A Jafar, N Ghneim - Journal of Big Data, 2022 - Springer

Image captioning spans the fields of computer vision and natural language processing. The
image captioning task generalizes object detection where the descriptions are a single …

Cited by 57 · Related articles · All 10 versions

Fast, diverse and accurate image captioning guided by part-of-speech

A Deshpande, J Aneja, L Wang… - Proceedings of the …, 2019 - openaccess.thecvf.com

Image captioning is an ambiguous problem, with many suitable captions for an image. To
address ambiguity, beam search is the de facto method for sampling multiple captions …

Cited by 167 · Related articles · All 8 versions

Like hiking? you probably enjoy nature: Persona-grounded dialog with commonsense expansions

BP Majumder, H Jhamtani, T Berg-Kirkpatrick… - arXiv preprint arXiv …, 2020 - arxiv.org

Existing persona-grounded dialog models often fail to capture simple implications of given
persona descriptions, something which humans are able to do seamlessly. For example …

Cited by 83 · Related articles · All 7 versions

Distilling translations with visual awareness

J Ive, P Madhyastha, L Specia - arXiv preprint arXiv:1906.07701, 2019 - arxiv.org

Previous work on multimodal machine translation has shown that visual information is only
needed in very specific cases, for example in the presence of ambiguous words where the …

Cited by 96 · Related articles · All 5 versions

MSCTD: A multimodal sentiment chat translation dataset

Y Liang, F Meng, J Xu, Y Chen, J Zhou - arXiv preprint arXiv:2202.13645, 2022 - arxiv.org

Multimodal machine translation and textual chat translation have received considerable
attention in recent years. Although the conversation in its natural form is usually multimodal …

Cited by 22 · Related articles · All 4 versions

Image captioning based on scene graphs: A survey

J Jia, X Ding, S Pang, X Gao, X Xin, R Hu… - Expert Systems with …, 2023 - Elsevier

Although recent developments in deep learning have brought several tasks closer to human
performance, there is still a significant gap between human and machine performance in …

Cited by 11 · Related articles · All 2 versions

ShapeCaptioner: Generative caption network for 3D shapes by learning a mapping from parts detected in multiple views to sentences

Z Han, C Chen, YS Liu, M Zwicker - Proceedings of the 28th ACM …, 2020 - dl.acm.org

3D shape captioning is a challenging application in 3D shape understanding. Captions from
recent multi-view based methods reveal that they cannot capture part-level characteristics of …

Cited by 42 · Related articles · All 5 versions