A comprehensive survey of hallucination in large language, image, video and audio foundation models

P Sahoo, P Meharia, A Ghosh, S Saha… - Findings of the …, 2024 - aclanthology.org
The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …

PPLLaVA: Varied video sequence understanding with prompt guidance

R Liu, H Tang, H Liu, Y Ge, Y Shan, C Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The past year has witnessed the significant advancement of video-based large language
models. However, the challenge of developing a unified model for both short and long video …

VideoLLaMB: Long-context video understanding with recurrent memory bridges

Y Wang, C Xie, Y Liu, Z Zheng - arXiv preprint arXiv:2409.01071, 2024 - arxiv.org
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

E Song, W Chai, T Ye, JN Hwang, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs

R Liao, M Erler, H Wang, G Zhai, G Zhang, Y Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
In the video-language domain, recent works in leveraging zero-shot Large Language Model-
based reasoning for video understanding have become competitive challengers to previous …

VideoLLM-online: Online Video Large Language Model for Streaming Video

J Chen, Z Lv, S Wu, KQ Lin, C Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …

K-Sort Arena: Efficient and reliable benchmarking for generative models via k-wise human preferences

Z Li, X Liu, D Fu, J Li, Q Gu, K Keutzer… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of visual generative models necessitates efficient and reliable
evaluation methods. Arena platform, which gathers user votes on model comparisons, can …

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

Y Gao, L Fischer, A Lintner, S Ebling - arXiv preprint arXiv:2410.08860, 2024 - arxiv.org
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind
persons and persons with visual impairments in accessing digital media content on …

MatchTime: Towards automatic soccer game commentary generation

J Rao, H Wu, C Liu, Y Wang, W Xie - arXiv preprint arXiv:2406.18530, 2024 - arxiv.org
Soccer is a globally popular sport with a vast audience. In this paper, we consider
constructing an automatic soccer game commentary model to improve the audiences' …

Artificial intelligence for biomedical video generation

L Li, J Qiu, A Saha, L Li, P Li, M He, Z Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent subfield of Artificial Intelligence Generated Content (AIGC), video generation
has achieved notable advancements in recent years. The introduction of Sora-alike models …