Learning visual relation priors for image-text matching and image captioning with neural...

X Chang, P Ren, P Xu, Z Li, X Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Scene graph is a structured representation of a scene that can clearly express the objects,
attributes, and relationships between objects in the scene. As computer vision technology …

Cited by 301 · Related articles · All 15 versions

Deep learning approaches on image captioning: A review

T Ghandi, H Pourreza, H Mahyar - ACM Computing Surveys, 2023 - dl.acm.org

Image captioning is a research area of immense importance, aiming to generate natural
language descriptions for visual content in the form of still images. The advent of deep …

Cited by 56 · Related articles · All 5 versions

RelTR: Relation transformer for scene graph generation

Y Cong, MY Yang, B Rosenhahn - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Different objects in the same scene are more or less related to each other, but only a limited
number of these relationships are noteworthy. Inspired by Detection Transformer, which …

Cited by 115 · Related articles · All 10 versions

Context-aware attention network for image-text retrieval

Q Zhang, Z Lei, Z Zhang, SZ Li - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com

As a typical cross-modal problem, image-text bi-directional retrieval relies heavily on the
joint embedding learning and similarity measure for each image-text pair. It remains …

Cited by 254 · Related articles · All 7 versions

Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders

N Messina, G Amato, A Esuli, F Falchi… - ACM Transactions on …, 2021 - dl.acm.org

Despite the evolution of deep-learning-based visual-textual processing systems, precise
multi-modal matching remains a challenging task. In this work, we tackle the task of cross …

Cited by 145 · Related articles · All 9 versions

Cross-modal graph matching network for image-text retrieval

Y Cheng, X Zhu, J Qian, F Wen, P Liu - ACM Transactions on Multimedia …, 2022 - dl.acm.org

Image-text retrieval is a fundamental cross-modal task whose main idea is to learn image-
text matching. Generally, according to whether there exist interactions during the retrieval …

Cited by 72 · Related articles · All 2 versions

LightningDOT: Pre-training visual-semantic embeddings for real-time image-text retrieval

S Sun, YC Chen, L Li, S Wang, Y Fang… - Proceedings of the 2021 …, 2021 - aclanthology.org

Multimodal pre-training has propelled great advancement in vision-and-language research.
These large-scale pre-trained models, although successful, fatefully suffer from slow …

Cited by 85 · Related articles · All 3 versions

Parts2Words: Learning joint embedding of point clouds and texts by bidirectional matching between parts and words

C Tang, X Yang, B Wu, Z Han… - Proceedings of the …, 2023 - openaccess.thecvf.com

Shape-Text matching is an important task of high-level shape understanding. Current
methods mainly represent a 3D shape as multiple 2D rendered views, which obviously can …

Cited by 10 · Related articles · All 3 versions

Transformer reasoning network for image-text matching and retrieval

N Messina, F Falchi, A Esuli… - 2020 25th International …, 2021 - ieeexplore.ieee.org

Image-text matching is an interesting and fascinating task in modern AI research. Despite
the evolution of deep-learning-based image and text processing systems, multimodal …

Cited by 74 · Related articles · All 12 versions

Plug-and-play regulators for image-text matching

H Diao, Y Zhang, W Liu, X Ruan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Exploiting fine-grained correspondence and visual-semantic alignments has shown great
potential in image-text matching. Generally, recent approaches first employ a cross-modal …

Cited by 13 · Related articles · All 6 versions