A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends

J Gui, T Chen, J Zhang, Q Cao, Z Sun… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Deep supervised learning algorithms typically require a large volume of labeled data to
achieve satisfactory performance. However, the process of collecting and labeling such data …

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …

OpenShape: Scaling up 3D shape representation towards open-world understanding

M Liu, R Shi, K Kuang, Y Zhu, X Li… - Advances in neural …, 2024 - proceedings.neurips.cc
We introduce OpenShape, a method for learning multi-modal joint representations of text,
image, and point clouds. We adopt the commonly used multi-modal contrastive learning …
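The multi-modal contrastive objective mentioned in this snippet is typically a CLIP-style symmetric InfoNCE loss that pulls matched point-cloud/text (or point-cloud/image) embeddings together in a shared space. A minimal hedged sketch in PyTorch, assuming pre-computed embeddings and a hypothetical `contrastive_loss` helper (not OpenShape's actual code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pc_emb, txt_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    pc = F.normalize(pc_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    # Pairwise similarity logits, scaled by temperature.
    logits = pc @ txt.t() / temperature
    # Matched pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(pc.size(0))
    # Symmetric InfoNCE: point-cloud -> text and text -> point-cloud.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of 8 paired 512-d embeddings.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

In practice the two embeddings come from separate encoders (a point-cloud backbone and a frozen CLIP text/image encoder), and the temperature is often a learnable parameter.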

DreamLLM: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …

PointGPT: Auto-regressively generative pre-training from point clouds

G Chen, M Wang, Y Yang, K Yu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) based on the generative pre-training transformer (GPT)
have demonstrated remarkable effectiveness across a diverse range of downstream tasks …
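GPT-style pre-training on point clouds, as described here, orders the cloud into patches and trains a causal model to predict each next patch from the preceding ones. The sketch below is an illustrative stand-in, not PointGPT's architecture: it uses a GRU in place of a transformer decoder, and `NextPatchPredictor` is a hypothetical name.

```python
import torch
import torch.nn as nn

class NextPatchPredictor(nn.Module):
    """Toy auto-regressive model over ordered point-patch embeddings."""
    def __init__(self, dim=64):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for a causal decoder
        self.head = nn.Linear(dim, dim)

    def forward(self, patch_emb):            # (B, N, dim) ordered patch embeddings
        h, _ = self.rnn(patch_emb[:, :-1])   # state at step i sees patches <= i
        return self.head(h)                  # predicted embedding of patch i+1

model = NextPatchPredictor()
patches = torch.randn(2, 16, 64)             # 2 clouds, 16 ordered patches each
pred = model(patches)                        # (2, 15, 64) next-patch predictions
# Regression pre-training objective: predict each next patch embedding.
loss = nn.functional.mse_loss(pred, patches[:, 1:])
```

The real method additionally needs a geometry-aware patch ordering (e.g. a space-filling curve), since raw point sets have no natural sequence.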

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding

X Ning, Z Yu, L Li, W Li, P Tiwari - Information Fusion, 2024 - Elsevier
Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not
present in training data. Recently, Contrastive Language–Image Pre-training (CLIP) has …

CLIP-FO3D: Learning free open-world 3D scene representations from 2D dense CLIP

J Zhang, R Dong, K Ma - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Training a 3D scene understanding model requires complicated human annotations, which
are laborious to collect and result in a model only encoding close-set object semantics. In …

PointMamba: A simple state space model for point cloud analysis

D Liang, X Zhou, X Wang, X Zhu, W Xu, Z Zou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have become one of the foundational architectures in point cloud analysis
tasks due to their excellent global modeling ability. However, the attention mechanism has …
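The complexity concern raised in this snippet is that self-attention builds an N×N similarity matrix, i.e. O(N²) in sequence length, while a state space model processes the sequence with a linear-time recurrence. A simplified sketch under stated assumptions (a fixed scalar decay; not the actual selective-scan Mamba layer):

```python
import torch

def ssm_scan(x, decay=0.9):
    """Toy diagonal linear state space scan: h_t = decay * h_{t-1} + x_t.

    x: (N, D) sequence of point-patch features. Each step touches only the
    previous state, so the whole pass costs O(N) rather than attention's O(N^2).
    """
    h = torch.zeros(x.size(1))
    out = []
    for t in range(x.size(0)):
        h = decay * h + x[t]   # recurrent state update, no pairwise matrix
        out.append(h)
    return torch.stack(out)    # (N, D): one output state per position

y = ssm_scan(torch.randn(128, 32))
```

Mamba replaces the fixed `decay` with input-dependent parameters and uses a parallel scan, but the asymptotic advantage over attention is the same.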

Uni3D: Exploring unified 3D representation at scale

J Zhou, J Wang, B Ma, YS Liu, T Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling up representations for images or text has been extensively investigated in the past
few years and has led to revolutions in learning vision and language. However, scalable …