Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Large language models (LLMs) have dramatically enhanced the field of language
intelligence, as demonstrably evidenced by their formidable empirical performance across a …
HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face
Solving complicated AI tasks with different domains and modalities is a key step toward
artificial general intelligence. While there are numerous AI models available for various …
Video-LLaMA: An instruction-tuned audio-visual language model for video understanding
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …
mPLUG-Owl2: Revolutionizing multi-modal large language models with modality collaboration
Abstract Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …
Kosmos-2: Grounding multimodal large language models to the world
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …
NExT-GPT: Any-to-any multimodal LLM
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …
Multimodal C4: An open, billion-scale corpus of images interleaved with text
In-context vision and language models like Flamingo support arbitrarily interleaved
sequences of images and text as input. This format not only enables few-shot learning via …
MVBench: A comprehensive multi-modal video understanding benchmark
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …
GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation
Despite recent advances in text-to-3D generative methods, there is a notable absence of
reliable evaluation metrics. Existing metrics usually focus on a single criterion each, such as …
LongNet: Scaling transformers to 1,000,000,000 tokens
Scaling sequence length has become a critical demand in the era of large language models.
However, existing methods struggle with either computational complexity or model …