A comprehensive survey of hallucination in large language, image, video and audio foundation models
The rapid advancement of foundation models (FMs) across language, image, audio, and
video domains has shown remarkable capabilities in diverse tasks. However, the …
Ppllava: Varied video sequence understanding with prompt guidance
The past year has witnessed the significant advancement of video-based large language
models. However, the challenge of developing a unified model for both short and long video …
Videollamb: Long-context video understanding with recurrent memory bridges
Recent advancements in large-scale video-language models have shown significant
potential for real-time planning and detailed interactions. However, their high computational …
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs
In the video-language domain, recent works in leveraging zero-shot Large Language Model-
based reasoning for video understanding have become competitive challengers to previous …
VideoLLM-online: Online Video Large Language Model for Streaming Video
Large Language Models (LLMs) have been enhanced with vision capabilities,
enabling them to comprehend images, videos, and interleaved vision-language content …
K-sort arena: Efficient and reliable benchmarking for generative models via k-wise human preferences
The rapid advancement of visual generative models necessitates efficient and reliable
evaluation methods. The Arena platform, which gathers user votes on model comparisons, can …
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind
persons and persons with visual impairments in accessing digital media content on …
Matchtime: Towards automatic soccer game commentary generation
Soccer is a globally popular sport with a vast audience. In this paper, we consider
constructing an automatic soccer game commentary model to improve the audiences' …
Artificial intelligence for biomedical video generation
As a prominent subfield of Artificial Intelligence Generated Content (AIGC), video generation
has achieved notable advancements in recent years. The introduction of Sora-alike models …