MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data …

LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …

Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

EMOVA: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Baichuan-Omni technical report

Y Li, H Sun, M Lin, T Li, G Dong, T Zhang… - arXiv preprint arXiv …, 2024 - researchgate.net
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …

Freeze-Omni: A smart and low-latency speech-to-speech dialogue model with a frozen LLM

X Wang, Y Li, C Fu, L Xie, K Li, X Sun, L Ma - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models has brought many new smart applications;
in particular, the excellent multimodal human-computer interaction of GPT-4o has brought …

TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

S Chen, X Lan, Y Yuan, Z Jie, L Ma - arXiv preprint arXiv:2411.18211, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (MLLMs), particularly in vision-language tasks. However, existing …