Mme-survey: A comprehensive survey on evaluation of multimodal llms
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …
Models (MLLMs) have garnered increased attention from both industry and academia …
Kangaroo: A powerful video-language model supporting long-context video input
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …
Longvila: Scaling long-context visual language models for long videos
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …
Recent advances in speech language models: A survey
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …
their capabilities in text-based interactions. However, natural human interaction often relies …
Emova: Empowering language models to see, hear and speak with vivid emotions
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …
tones, marks a milestone for omni-modal foundation models. However, empowering Large …
WavChat: A Survey of Spoken Dialogue Models
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …
have captured significant attention in the speech domain. Compared to traditional three-tier …
[PDF][PDF] Baichuan-omni technical report
Y Li, H Sun, M Lin, T Li, G Dong, T Zhang… - arXiv preprint arXiv …, 2024 - researchgate.net
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …
Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm
The rapid development of large language models has brought many new smart applications,
especially the excellent multimodal human-computer interaction in GPT-4o has brought …
especially the excellent multimodal human-computer interaction in GPT-4o has brought …
Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability
Rapid development of large language models (LLMs) has significantly advanced multimodal
large language models (LMMs), particularly in vision-language tasks. However, existing …
large language models (LMMs), particularly in vision-language tasks. However, existing …