Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
Scalable 3d captioning with pretrained models
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects.
This approach utilizes pretrained models from image captioning, image-text alignment, and …
Ulip-2: Towards scalable multimodal pre-training for 3d understanding
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …
Honeybee: Locality-enhanced projector for multimodal llm
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …
Onellm: One framework to align all modalities with language
Multimodal large language models (MLLMs) have gained significant attention due to their
strong multimodal understanding capability. However, existing works rely heavily on modality …
Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …
LL3DA: Visual interactive instruction tuning for omni-3D understanding, reasoning, and planning
Recent progress in Large Multimodal Models (LMMs) has opened up great
possibilities for various applications in the field of human-machine interactions. However …
Gpt4point: A unified framework for point-language understanding and generation
Multimodal Large Language Models (MLLMs) have excelled in 2D image-text
comprehension and image generation, but their understanding of the 3D world is notably …
Lidar-llm: Exploring the potential of large language models for 3d lidar understanding
Recently, Large Language Models (LLMs) and Multimodal Large Language Models
(MLLMs) have shown promise in instruction following and 2D image understanding. While …
An embodied generalist agent in 3d world
Leveraging massive knowledge and learning schemes from large language models (LLMs),
recent machine learning models show notable successes in building generalist agents that …