Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

Gemini Pro defeated by GPT-4V: Evidence from education

GG Lee, E Latif, L Shi, X Zhai - arXiv preprint arXiv:2401.08660, 2023 - arxiv.org
This study compared the classification performance of Gemini Pro and GPT-4V in
educational settings. Employing visual question answering (VQA) techniques, the study …

GPT4Vis: what can GPT-4 do for zero-shot visual recognition?

W Wu, H Yao, M Zhang, Y Song, W Ouyang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper does not present a novel method. Instead, it delves into an essential, yet
must-know baseline in light of the latest advancements in Generative Artificial Intelligence …

CoCoT: Contrastive chain-of-thought prompting for large multimodal models with multiple image inputs

D Zhang, J Yang, H Lyu, Z Jin, Y Yao, M Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
When exploring the development of Artificial General Intelligence (AGI), a critical task for
these models involves interpreting and processing information from multiple image inputs …

AdaShield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting

Y Wang, X Liu, Y Li, M Chen, C Xiao - arXiv preprint arXiv:2403.09513, 2024 - arxiv.org
With the advent and widespread deployment of Multimodal Large Language Models
(MLLMs), the imperative to ensure their safety has become increasingly pronounced …

ElectionSim: Massive population election simulation powered by large language model driven agents

X Zhang, J Lin, L Sun, W Qi, Y Yang, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The massive population election simulation aims to model the preferences of specific groups
in particular election scenarios. It has garnered significant attention for its potential to …

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

Y Li, X Chen, B Hu, M Zhang - arXiv preprint arXiv:2402.13546, 2024 - researchgate.net
Long video understanding is a significant and ongoing challenge in the intersection of
multimedia and artificial intelligence. Employing large language models (LLMs) for …

GPT4Ego: Unleashing the potential of pre-trained models for zero-shot egocentric action recognition

G Dai, X Shu, W Wu, R Yan, J Zhang - arXiv preprint arXiv:2401.10039, 2024 - arxiv.org
Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown
impressive performance in various visual recognition tasks. This advancement paves the …

Representation bias in political sample simulations with large language models

W Qi, H Lyu, J Luo - arXiv preprint arXiv:2407.11409, 2024 - arxiv.org
This study seeks to identify and quantify biases in simulating political samples with Large
Language Models, specifically focusing on vote choice and public opinion. Using the GPT …

AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue

Y Tang, D Shimada, J Bi, C Xu - arXiv preprint arXiv:2403.16276, 2024 - arxiv.org
In everyday communication, humans frequently use speech and gestures to refer to specific
areas or objects, a process known as Referential Dialogue (RD). While prior studies have …