A comprehensive survey of scene graphs: Generation and application

X Chang, P Ren, P Xu, Z Li, X Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Scene graph is a structured representation of a scene that can clearly express the objects,
attributes, and relationships between objects in the scene. As computer vision technology …

Deep learning approaches on image captioning: A review

T Ghandi, H Pourreza, H Mahyar - ACM Computing Surveys, 2023 - dl.acm.org
Image captioning is a research area of immense importance, aiming to generate natural
language descriptions for visual content in the form of still images. The advent of deep …

RelTR: Relation transformer for scene graph generation

Y Cong, MY Yang, B Rosenhahn - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Different objects in the same scene are more or less related to each other, but only a limited
number of these relationships are noteworthy. Inspired by Detection Transformer, which …

Context-aware attention network for image-text retrieval

Q Zhang, Z Lei, Z Zhang, SZ Li - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
As a typical cross-modal problem, image-text bi-directional retrieval relies heavily on the
joint embedding learning and similarity measure for each image-text pair. It remains …

Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders

N Messina, G Amato, A Esuli, F Falchi… - ACM Transactions on …, 2021 - dl.acm.org
Despite the evolution of deep-learning-based visual-textual processing systems, precise
multi-modal matching remains a challenging task. In this work, we tackle the task of cross …

Cross-modal graph matching network for image-text retrieval

Y Cheng, X Zhu, J Qian, F Wen, P Liu - ACM Transactions on Multimedia …, 2022 - dl.acm.org
Image-text retrieval is a fundamental cross-modal task whose main idea is to learn
image-text matching. Generally, according to whether there exist interactions during the retrieval …

LightningDOT: Pre-training visual-semantic embeddings for real-time image-text retrieval

S Sun, YC Chen, L Li, S Wang, Y Fang… - Proceedings of the 2021 …, 2021 - aclanthology.org
Multimodal pre-training has propelled great advancement in vision-and-language research.
These large-scale pre-trained models, although successful, fatefully suffer from slow …

Parts2Words: Learning joint embedding of point clouds and texts by bidirectional matching between parts and words

C Tang, X Yang, B Wu, Z Han… - Proceedings of the …, 2023 - openaccess.thecvf.com
Shape-Text matching is an important task of high-level shape understanding. Current
methods mainly represent a 3D shape as multiple 2D rendered views, which obviously can …

Transformer reasoning network for image-text matching and retrieval

N Messina, F Falchi, A Esuli… - 2020 25th International …, 2021 - ieeexplore.ieee.org
Image-text matching is an interesting and fascinating task in modern AI research. Despite
the evolution of deep-learning-based image and text processing systems, multimodal …

Plug-and-play regulators for image-text matching

H Diao, Y Zhang, W Liu, X Ruan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Exploiting fine-grained correspondence and visual-semantic alignments has shown great
potential in image-text matching. Generally, recent approaches first employ a cross-modal …