The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision

X Li, C Wen, Y Hu, N Zhou - … Journal of Applied Earth Observation and …, 2023 - Elsevier
Zero-shot remote sensing scene classification aims to solve the scene classification problem
on unseen categories and has attracted considerable research attention in the remote sensing …

GeoChat: Grounded large vision-language model for remote sensing

K Kuckreja, MS Danish, M Naseer… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in Large Vision-Language Models (VLMs) have shown great
promise in natural image domains allowing users to hold a dialogue about given visual …

RSGPT: A remote sensing vision language model and benchmark

Y Hu, J Yuan, C Wen, X Lu, X Li - arXiv preprint arXiv:2307.15266, 2023 - arxiv.org
The emergence of large-scale large language models, with GPT-4 as a prominent example,
has significantly propelled the rapid advancement of artificial general intelligence and …

Vision-language models in remote sensing: Current progress and future trends

X Li, C Wen, Y Hu, Z Yuan… - IEEE Geoscience and …, 2024 - ieeexplore.ieee.org
The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4
(GPT-4) have sparked a wave of interest and research in the field of large language models …

A spatial hierarchical reasoning network for remote sensing visual question answering

Z Zhang, L Jiao, L Li, X Liu, P Chen… - … on Geoscience and …, 2023 - ieeexplore.ieee.org
For visual question answering on remote sensing (RSVQA), current methods scarcely
consider geospatial objects typically with large-scale differences and positional sensitive …

Self-supervised pretraining via multimodality images with transformer for change detection

Y Zhang, Y Zhao, Y Dong, B Du - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Self-supervised learning (SSL) has shown remarkable success in image representation
learning. Among these methods, masked image modeling and contrastive learning are the …

Machine-to-machine visual dialoguing with ChatGPT for enriched textual image description

R Ricci, Y Bazi, F Melgani - Remote Sensing, 2024 - mdpi.com
Image captioning is a technique that enables the automatic extraction of natural language
descriptions about the contents of an image. On the one hand, information in the form of …

RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery

Y Bazi, L Bashmal, MM Al Rahhal, R Ricci, F Melgani - Remote Sensing, 2024 - mdpi.com
In this paper, we delve into the innovative application of large language models (LLMs) and
their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) …

Large language models for captioning and retrieving remote sensing images

JD Silva, J Magalhães, D Tuia, B Martins - arXiv preprint arXiv:2402.06475, 2024 - arxiv.org
Image captioning and cross-modal retrieval are examples of tasks that involve the joint
analysis of visual and linguistic information. In connection to remote sensing imagery, these …