Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. R Zhang, X Hu, B Li, S Huang, H Deng, Y Qiao, P Gao, H Li. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2023. Cited by 104.
LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models. P Xu, W Shao, K Zhang, P Gao, S Liu, M Lei, F Meng, S Huang, Y Qiao, ... arXiv preprint arXiv:2306.09265, 2023. Cited by 98.
Multi-modal sensor fusion for auto driving perception: A survey. K Huang, B Shi, X Li, X Li, S Huang, Y Li. arXiv preprint arXiv:2202.02703, 2022. Cited by 93.
SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. Z Lin, C Liu, R Zhang, P Gao, L Qiu, H Xiao, H Qiu, C Lin, W Shao, ... arXiv preprint arXiv:2311.07575, 2023. Cited by 90.
Instruct2Act: Mapping multi-modality instructions to robotic actions with large language model. S Huang, Z Jiang, H Dong, Y Qiao, P Gao, H Li. arXiv preprint arXiv:2305.11176, 2023. Cited by 66.
SPHINX-X: Scaling data and parameters for a family of multi-modal large language models. P Gao, R Zhang, C Liu, L Qiu, S Huang, W Lin, S Zhao, S Geng, Z Lin, ... arXiv preprint arXiv:2402.05935, 2024. Cited by 30.
Tiny LVLM-eHub: Early multimodal experiments with Bard. W Shao, Y Hu, P Gao, M Lei, K Zhang, F Meng, P Xu, S Huang, H Li, ... arXiv preprint arXiv:2308.03729, 2023. Cited by 21.
Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. W Cai, S Huang, G Cheng, Y Long, P Gao, C Sun, H Dong. arXiv preprint arXiv:2309.10309, 2023. Cited by 8.
SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification. S Huang, B Zhang, B Shi, H Li, Y Li, P Gao. Proceedings of the 31st ACM International Conference on Multimedia, 8644-8652, 2023. Cited by 5.
ADAS: A simple active-and-adaptive baseline for cross-domain 3D semantic segmentation. B Fei, S Huang, J Yuan, B Shi, B Zhang, T Chen, M Dou, Y Qiao. arXiv preprint arXiv:2212.10390, 2022. Cited by 5.
ManipVQA: Injecting robotic affordance and physically grounded information into multi-modal large language models. S Huang, I Ponomarenko, Z Jiang, X Li, X Hu, P Gao, H Li, H Dong. arXiv preprint arXiv:2403.11289, 2024. Cited by 2.
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models. X Lu, Q Liu, Y Xu, A Zhou, S Huang, B Zhang, J Yan, H Li. arXiv preprint arXiv:2402.14800, 2024. Cited by 2.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want. W Lin, X Wei, R An, P Gao, B Zou, Y Luo, S Huang, S Zhang, H Li. arXiv preprint arXiv:2403.20271, 2024. Cited by 1.
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices. Q Lu, W Shao, Z Liu, F Meng, B Li, B Chen, S Huang, K Zhang, Y Qiao, ... arXiv preprint arXiv:2406.08451, 2024.
A3VLM: Actionable Articulation-Aware Vision Language Model. S Huang, H Chang, Y Liu, Y Zhu, H Dong, P Gao, A Boularias, H Li. arXiv preprint arXiv:2406.07549, 2024.