MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, C Li, J Dong, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

4D-fy: Text-to-4D generation using hybrid score distillation sampling

S Bahmani, I Skorokhodov, V Rong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-
video models to generate dynamic 3D scenes. However, current text-to-4D methods face a …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

DreamLLM: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …

CapsFusion: Rethinking image-text data at scale

Q Yu, Q Sun, X Zhang, Y Cui, F Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large multimodal models demonstrate remarkable generalist ability to perform diverse
multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute …

Emu: Enhancing image generation models using photogenic needles in a haystack

X Dai, J Hou, CY Ma, S Tsai, J Wang, R Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Training text-to-image models with web scale image-text pairs enables the generation of a
wide range of visual concepts from text. However, these pre-trained models often face …

SmartEdit: Exploring complex instruction-based image editing with multimodal large language models

Y Huang, L Xie, X Wang, Z Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current instruction-based image editing methods such as InstructPix2Pix often fail to
produce satisfactory results in complex scenarios due to their dependence on the simple …