MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning
Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified …
Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs
Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However, the …
Probing the 3D awareness of visual foundation models
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …
The Neglected Tails in Vision-Language Models
Vision-language models (VLMs) excel in zero-shot recognition, but their performance varies greatly across different visual concepts. For example, although CLIP achieves impressive …
BioCLIP: A vision foundation model for the tree of life
Images of the natural world, collected by a variety of cameras from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion …
Grounding everything: Emerging localization properties in vision-language transformers
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far those models …
SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic
text-image pairs, significantly departing from previous methods relying on real data …
Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …
MoDE: CLIP Data Experts via Clustering
The success of contrastive language-image pretraining (CLIP) relies on the supervision from
the pairing between images and captions, which tends to be noisy in web-crawled data. We …
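As background to the contrastive pairing supervision this entry refers to (a sketch of standard CLIP-style training, not of the MoDE method itself), the snippet below shows the symmetric contrastive objective over matched image-caption embeddings; the function name, batch shapes, and temperature value are illustrative assumptions.

# Minimal sketch: symmetric contrastive (InfoNCE) loss used in CLIP-style
# pretraining, where row i of the image batch is paired with row i of the text batch.
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) arrays; matched pairs share a row index."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; the diagonal holds the true (paired) scores.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                 # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))                  # targets = diagonal

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Example: 4 matched image/text pairs with 8-dimensional embeddings.
rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(clip_contrastive_loss(img, txt))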
Low-Resource Vision Challenges for Foundation Models
Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems …