Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners

R Zhang, X Hu, B Li, S Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …

DreamLLM: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …

PiMAE: Point cloud and image interactive masked autoencoders for 3D object detection

A Chen, K Zhang, R Zhang, Z Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked Autoencoders learn strong visual representations and achieve state-of-the-art
results in several independent modalities, yet very few works have addressed their …

PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning

X Zhu, R Zhang, B He, Z Guo, Z Zeng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained models have shown promising open-world performance for both
vision and language tasks. However, their transferred capacity on 3D point clouds is still …

Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement

X Zhu, R Zhang, B He, A Zhou… - Proceedings of the …, 2023 - openaccess.thecvf.com
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …

CALIP: Zero-shot enhancement of CLIP with parameter-free attention

Z Guo, R Zhang, L Qiu, X Ma, X Miao, X He… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with promising zero-shot performance. To further improve its downstream …

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

A survey on deep learning based segmentation, detection and classification for 3D point clouds

PK Vinodkumar, D Karabulut, E Avots, C Ozcinar… - Entropy, 2023 - mdpi.com
The computer vision, graphics, and machine learning research communities have devoted
significant attention to 3D object recognition (segmentation, detection, and …

EDA: Explicit text-decoupling and dense alignment for 3D visual grounding

Y Wu, X Cheng, R Zhang, Z Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D visual grounding aims to locate, within point clouds, the object mentioned by free-form
natural language descriptions with rich semantic cues. However, existing methods …