MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data …

LongVILA: Scaling long-context visual language models for long videos

F Xue, Y Chen, D Li, Q Hu, L Zhu, X Li, Y Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …

Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

EMOVA: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Baichuan-Omni technical report

Y Li, H Sun, M Lin, T Li, G Dong, T Zhang… - arXiv preprint arXiv …, 2024 - researchgate.net
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …

Freeze-Omni: A smart and low-latency speech-to-speech dialogue model with a frozen LLM

X Wang, Y Li, C Fu, L Xie, K Li, X Sun, L Ma - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models has brought many new smart applications;
in particular, the excellent multimodal human-computer interaction of GPT-4o has brought …

TimeMarker: A versatile video-LLM for long and short video understanding with superior temporal localization ability

S Chen, X Lan, Y Yuan, Z Jie, L Ma - arXiv preprint arXiv:2411.18211, 2024 - arxiv.org
The rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (MLLMs), particularly in vision-language tasks. However, existing …