Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents

Z Zhang, Y Yao, A Zhang, X Tang, X Ma, Z He… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have dramatically enhanced the field of language
intelligence, as demonstrably evidenced by their formidable empirical performance across a …

HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face

Y Shen, K Song, X Tan, D Li, W Lu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Solving complicated AI tasks with different domains and modalities is a key step toward
artificial general intelligence. While there are numerous AI models available for various …

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …

Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

NExT-GPT: Any-to-any multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Multimodal C4: An open, billion-scale corpus of images interleaved with text

W Zhu, J Hessel, A Awadalla… - Advances in …, 2024 - proceedings.neurips.cc
In-context vision and language models like Flamingo support arbitrarily interleaved
sequences of images and text as input. This format not only enables few-shot learning via …

MVBench: A comprehensive multi-modal video understanding benchmark

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of
diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities …

GPT-4V(ision) is a human-aligned evaluator for text-to-3D generation

T Wu, G Yang, Z Li, K Zhang, Z Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite recent advances in text-to-3D generative methods, there is a notable absence of
reliable evaluation metrics. Existing metrics usually focus on a single criterion each, such as …

LongNet: Scaling transformers to 1,000,000,000 tokens

J Ding, S Ma, L Dong, X Zhang, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling sequence length has become a critical demand in the era of large language models.
However, existing methods struggle with either computational complexity or model …