MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning

J Chen, D Zhu, X Shen, X Li, Z Liu, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models have shown remarkable capabilities as a general interface for
various language-related applications. Motivated by this, we aim to build a unified …

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

S Tong, Z Liu, Y Zhai, Y Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
Is vision good enough for language? Recent advancements in multimodal models primarily
stem from the powerful reasoning abilities of large language models (LLMs). However, the …

Probing the 3D Awareness of Visual Foundation Models

M El Banani, A Raj, KK Maninis, A Kar… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …

The Neglected Tails in Vision-Language Models

S Parashar, Z Lin, T Liu, X Dong, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) excel in zero-shot recognition, but their performance varies
greatly across different visual concepts. For example, although CLIP achieves impressive …

BioCLIP: A Vision Foundation Model for the Tree of Life

S Stevens, J Wu, MJ Thompson… - Proceedings of the …, 2024 - openaccess.thecvf.com
Images of the natural world, collected by a variety of cameras from drones to individual
phones, are increasingly abundant sources of biological information. There is an explosion …

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

W Bousselham, F Petersen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language foundation models have shown remarkable performance in various zero-
shot settings such as image retrieval, classification, or captioning. But so far, those models …

SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

HAAK Hammoud, H Itani, F Pizzati, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic
text-image pairs, significantly departing from previous methods relying on real data …

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

MoDE: CLIP Data Experts via Clustering

J Ma, PY Huang, S Xie, SW Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
The success of contrastive language-image pretraining (CLIP) relies on supervision from
the pairing between images and captions, which tends to be noisy in web-crawled data. We …

Low-Resource Vision Challenges for Foundation Models

Y Zhang, H Doughty… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Low-resource settings are well established in natural language processing, where many
languages lack sufficient data for deep learning at scale. However, low-resource problems …