Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation
Proceedings of the 47th International ACM SIGIR Conference on Research and …, 2024•dl.acm.org
Geolocating precise locations from images presents a challenging problem in computer vision and information retrieval. Traditional methods typically employ either classification, which divides the Earth's surface into grid cells and classifies images accordingly, or retrieval, which identifies locations by matching images against a database of image-location pairs. However, classification-based approaches are limited by the cell size and cannot yield precise predictions, while retrieval-based systems usually suffer from poor search quality and inadequate coverage of the global landscape at varied scales and aggregation levels. To overcome these drawbacks, we present Img2Loc, a novel system that redefines image geolocalization as a text generation task. This is achieved using cutting-edge large multi-modality models (LMMs) such as GPT-4V or LLaVA with retrieval-augmented generation. Img2Loc first employs CLIP-based representations to generate an image-based coordinate query database. It then combines the query results with the image itself, forming elaborate prompts customized for LMMs. When tested on benchmark datasets such as Im2GPS3k and YFCC4k, Img2Loc not only surpasses the performance of previous state-of-the-art models but does so without any model training. A video demonstration of the system is available at https://drive.google.com/file/d/16A6A-mc7AyUoKHRH3_WBRToRC13sn7tU/view?usp=sharing
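
The retrieve-then-prompt flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors stand in for CLIP embeddings, and the function names (`cosine`, `retrieve`, `build_prompt`), the toy database, and the prompt wording are all hypothetical choices made for illustration. The sketch only shows the general idea of looking up the coordinates of visually similar images and placing them in a text prompt that would accompany the query image sent to an LMM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, database, k=2):
    """Return the (lat, lon) pairs of the k entries most similar to the query."""
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [coords for _, coords in ranked[:k]]

def build_prompt(neighbors):
    """Assemble a text prompt listing retrieved coordinates (hypothetical wording)."""
    lines = [f"- ({lat:.4f}, {lon:.4f})" for lat, lon in neighbors]
    return (
        "Estimate the latitude and longitude of the attached image.\n"
        "For reference, visually similar images were taken at:\n"
        + "\n".join(lines)
        + "\nAnswer as: (latitude, longitude)"
    )

# Toy database of (embedding, (lat, lon)) pairs; real embeddings would come
# from a CLIP image encoder, which is assumed here and not computed.
database = [
    ([0.9, 0.1, 0.0], (48.8584, 2.2945)),
    ([0.8, 0.2, 0.1], (48.8606, 2.3376)),
    ([0.0, 0.1, 0.9], (-33.8568, 151.2153)),
]
query = [0.85, 0.15, 0.05]  # stand-in for the query image's embedding

neighbors = retrieve(query, database, k=2)
prompt = build_prompt(neighbors)
print(prompt)
```

In a full system the linear scan would likely be replaced by an approximate nearest-neighbor index, and the prompt would be sent together with the image to the chosen LMM; both steps are outside the scope of this sketch.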