Cross-modal graph matching network for image-text retrieval

Y Cheng, X Zhu, J Qian, F Wen, P Liu - ACM Transactions on Multimedia …, 2022 - dl.acm.org
Image-text retrieval is a fundamental cross-modal task whose main idea is to learn image-text matching. Depending on whether the two modalities interact during retrieval, existing image-text retrieval methods can be classified into independent representation matching methods and cross-interaction matching methods. Independent representation matching methods generate the embeddings of images and sentences independently and are therefore convenient for retrieval with hand-crafted matching measures (e.g., cosine or Euclidean distance). Cross-interaction matching methods improve accuracy by introducing interaction-based networks for inter-relation reasoning, but suffer from low retrieval efficiency. This article aims to develop a method that retains the cross-modal inter-relation reasoning of cross-interaction methods while being as efficient as independent methods. To this end, we propose the Cross-modal Graph Matching Network (CGMN), which explores both intra- and inter-relations without introducing network interaction. In CGMN, graphs are used for both visual and textual representation to achieve intra-relation reasoning across regions and words, respectively. Furthermore, we propose a novel graph node matching loss to learn fine-grained cross-modal correspondence and to achieve inter-relation reasoning. Experiments on the benchmark datasets MS-COCO, Flickr8K, and Flickr30K show that CGMN outperforms state-of-the-art methods in image retrieval. Moreover, CGMN is much more efficient than state-of-the-art methods that use interactive matching. The code is available at https://github.com/cyh-sj/CGMN.
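
The efficiency argument for independent representation matching can be illustrated with a minimal sketch. Because image and sentence embeddings are computed separately, image embeddings can be precomputed once and a query is ranked with a single cosine-similarity pass. The function names (`l2_normalize`, `rank_images`) and the toy 3-dimensional vectors below are assumptions made for illustration; this is not the CGMN implementation and does not include its graph encoders or graph node matching loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length (eps avoids division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def rank_images(text_emb, image_embs):
    """Rank precomputed image embeddings against one sentence embedding.

    Returns the image indices sorted by descending cosine similarity,
    together with the similarity score of every image.
    """
    # Dot product of unit vectors equals cosine similarity.
    scores = l2_normalize(image_embs) @ l2_normalize(text_emb)
    order = np.argsort(-scores)
    return order, scores


# Hypothetical toy data: three image embeddings and one sentence embedding.
image_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
text_emb = np.array([0.9, 0.1, 0.0])

order, scores = rank_images(text_emb, image_embs)
print(order.tolist())  # image indices, best match first
```

A cross-interaction method, by contrast, would have to run its interaction network once per image-sentence pair at query time, which is the source of the low retrieval efficiency described above.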