Comprehensive image captioning via scene graph decomposition
We address the challenging problem of image captioning by revisiting the representation of
image scene graph. At the core of our method lies the decomposition of a scene graph into a …
Show, control and tell: A framework for generating controllable and grounded captions
Current captioning approaches can describe images using black-box architectures whose
behavior is hardly controllable and explainable from the exterior. As an image can be …
Trends in integration of vision and language research: A survey of tasks, datasets, and methods
Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …
Image captioning model using attention and object features to mimic human image understanding
Image captioning spans the fields of computer vision and natural language processing. The
image captioning task generalizes object detection where the descriptions are a single …
Fast, diverse and accurate image captioning guided by part-of-speech
Image captioning is an ambiguous problem, with many suitable captions for an image. To
address ambiguity, beam search is the de facto method for sampling multiple captions …
Like hiking? you probably enjoy nature: Persona-grounded dialog with commonsense expansions
Existing persona-grounded dialog models often fail to capture simple implications of given
persona descriptions, something which humans are able to do seamlessly. For example …
Distilling translations with visual awareness
Previous work on multimodal machine translation has shown that visual information is only
needed in very specific cases, for example in the presence of ambiguous words where the …
MSCTD: A multimodal sentiment chat translation dataset
Multimodal machine translation and textual chat translation have received considerable
attention in recent years. Although the conversation in its natural form is usually multimodal …
Image captioning based on scene graphs: A survey
J Jia, X Ding, S Pang, X Gao, X Xin, R Hu… - Expert Systems with …, 2023 - Elsevier
Although recent developments in deep learning have brought several tasks closer to human
performance, there is still a significant gap between human and machine performance in …
ShapeCaptioner: Generative caption network for 3D shapes by learning a mapping from parts detected in multiple views to sentences
3D shape captioning is a challenging application in 3D shape understanding. Captions from
recent multi-view based methods reveal that they cannot capture part-level characteristics of …