Unibench: Visual reasoning requires rethinking vision-language beyond scaling

H Al-Tahan, Q Garrido, R Balestriero… - arXiv preprint arXiv …, 2024 - arxiv.org
Significant research efforts have been made to scale and improve vision-language model
(VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers …

AI safety in generative AI large language models: A survey

J Chua, Y Li, S Yang, C Wang, L Yao - arXiv preprint arXiv:2407.18369, 2024 - arxiv.org
Large Language Models (LLMs) such as ChatGPT that exhibit generative AI capabilities are
facing accelerated adoption and innovation. The increased presence of Generative AI (GAI) …

Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces

Z Chen, H Chen, M Imani, R Chen, F Imani - Expert Systems with …, 2024 - Elsevier
Workplace accidents due to personal protective equipment (PPE) non-compliance raise
serious safety concerns and lead to legal liabilities, financial penalties, and reputational …

Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?

A Wüst, T Tobiasch, L Helff, DS Dhami… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, new Vision-Language Models (VLMs) such as OpenAI's GPT-4o have emerged,
seemingly demonstrating advanced reasoning capabilities across text and …

Evaluation and comparison of visual language models for transportation engineering problems

S Prajapati, T Singh, C Hegde… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in vision-language models (VLMs) have shown great potential for
diverse applications related to image understanding. In this study, we have explored state-of …

OmnixR: Evaluating omni-modality language models on reasoning across modalities

L Chen, H Hu, M Zhang, Y Chen, Z Wang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality
Language Models, such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple …

TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models

R Adhikari, S Thapaliya, M Dhakal… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) have shown impressive performance in vision
tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt …

Egocentric perception of walking environments using an interactive vision-language system

H Tan, A Mihailidis, B Laschowski - bioRxiv, 2024 - biorxiv.org
Large language models can provide a more detailed contextual understanding of a scene
than computer vision alone, which has implications for robotics and …

INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

E Vendrow, O Pantazis, A Shepard, G Brostow… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge
multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist …

Enabling Data-Driven and Empathetic Interactions: A Context-Aware 3D Virtual Agent in Mixed Reality for Enhanced Financial Customer Experience

C Xu, M Chen, P Deshpande, E Azanli… - … on Mixed and …, 2024 - ieeexplore.ieee.org
In this paper, we introduce a novel system designed to enhance customer service in the
financial and retail sectors through a context-aware 3D virtual agent, utilizing Mixed Reality …