The multi-modal fusion in visual question answering: a review of attention mechanisms
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …
RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision
Zero-shot remote sensing scene classification aims to solve the scene classification problem
on unseen categories and has attracted numerous research attention in the remote sensing …
GeoChat: Grounded large vision-language model for remote sensing
Recent advancements in Large Vision-Language Models (VLMs) have shown great
promise in natural image domains allowing users to hold a dialogue about given visual …
RSGPT: A remote sensing vision language model and benchmark
The emergence of large-scale large language models, with GPT-4 as a prominent example,
has significantly propelled the rapid advancement of artificial general intelligence and …
Vision-language models in remote sensing: Current progress and future trends
The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4)
have sparked a wave of interest and research in the field of large language models …
A spatial hierarchical reasoning network for remote sensing visual question answering
For visual question answering on remote sensing (RSVQA), current methods scarcely
consider geospatial objects typically with large-scale differences and positional sensitive …
Self-supervised pretraining via multimodality images with transformer for change detection
Self-supervised learning (SSL) has shown remarkable success in image representation
learning. Among these methods, masked image modeling and contrastive learning are the …
Machine-to-machine visual dialoguing with ChatGPT for enriched textual image description
Image captioning is a technique that enables the automatic extraction of natural language
descriptions about the contents of an image. On the one hand, information in the form of …
RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery
In this paper, we delve into the innovative application of large language models (LLMs) and
their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) …
Large language models for captioning and retrieving remote sensing images
Image captioning and cross-modal retrieval are examples of tasks that involve the joint
analysis of visual and linguistic information. In connection to remote sensing imagery, these …