Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners
Visual recognition in low-data regimes requires deep neural networks to learn generalized
representations from limited training samples. Recently, CLIP-based methods have shown …
ULIP-2: Towards scalable multimodal pre-training for 3D understanding
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …
DreamLLM: Synergistic multimodal comprehension and creation
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …
PiMAE: Point cloud and image interactive masked autoencoders for 3D object detection
Masked Autoencoders learn strong visual representations and achieve state-of-the-art
results in several independent modalities, yet very few works have addressed their …
PointCLIP V2: Prompting CLIP and GPT for powerful 3D open-world learning
Large-scale pre-trained models have shown promising open-world performance for both
vision and language tasks. However, their transferred capacity on 3D point clouds is still …
Not all features matter: Enhancing few-shot CLIP with adaptive prior refinement
The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its
application to diverse downstream vision tasks. To improve its capacity on downstream …
CALIP: Zero-shot enhancement of CLIP with parameter-free attention
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with promising zero-shot performance. To further improve its downstream …
Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …
A survey on deep learning based segmentation, detection and classification for 3D point clouds
PK Vinodkumar, D Karabulut, E Avots, C Ozcinar… - Entropy, 2023
The computer vision, graphics, and machine learning research communities have devoted
significant attention to 3D object recognition (segmentation, detection, and …
EDA: Explicit text-decoupling and dense alignment for 3D visual grounding
3D visual grounding aims to find the object within point clouds mentioned by free-form
natural language descriptions with rich semantic cues. However, existing methods …