Binding touch to everything: Learning unified multimodal tactile representations

Y Liu, W Chen, Y Bai, X Liang, G Li, W Gao… - arXiv preprint arXiv …, 2024 - arxiv.org

Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

Cited by 17 · Related articles · All 3 versions

[PDF] arxiv.org

Foundation models in robotics: Applications, challenges, and the future

R Firoozi, J Tucker, S Tian… - … Journal of Robotics …, 2023 - journals.sagepub.com

We survey applications of pretrained foundation models in robotics. Traditional deep
learning models in robotics are trained on small datasets tailored for specific tasks, which …

Cited by 108 · Related articles · All 2 versions

[PDF] thecvf.com

Tactile-augmented radiance fields

Y Dou, F Yang, Y Liu, A Loquercio… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present a scene representation that brings vision and touch into a shared 3D space
which we call a tactile-augmented radiance field. This representation capitalizes on two key …

Cited by 16 · Related articles · All 3 versions

[PDF] thecvf.com

Iterated learning improves compositionality in large vision-language models

C Zheng, J Zhang, A Kembhavi… - Proceedings of the …, 2024 - openaccess.thecvf.com

A fundamental characteristic common to both human vision and natural language is their
compositional nature. Yet despite the performance gains contributed by large vision and …

Cited by 12 · Related articles · All 3 versions

[PDF] ecva.net

AugUndo: Scaling up augmentations for monocular depth completion and estimation

Y Wu, TY Liu, H Park, S Soatto, D Lao… - European Conference on …, 2025 - Springer

Unsupervised depth completion and estimation methods are trained by minimizing
reconstruction error. Block artifacts from resampling, intensity saturation, and occlusions are …

Cited by 4 · Related articles · All 4 versions

[PDF] thecvf.com

WorDepth: Variational language prior for monocular depth estimation

Z Zeng, D Wang, F Yang, H Park… - Proceedings of the …, 2024 - openaccess.thecvf.com

Three-dimensional (3D) reconstruction from a single image is an ill-posed problem
with inherent ambiguities, i.e. scale. Predicting a 3D scene from text description(s) is similarly …

Cited by 20 · Related articles · All 3 versions

[PDF] thecvf.com

Test-Time Adaptation for Depth Completion

H Park, A Gupta, A Wong - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

It is common to observe performance degradation when transferring models trained on
some (source) datasets to target testing data due to a domain gap between them. Existing …

Cited by 9 · Related articles · All 4 versions

[PDF] arxiv.org

NeuroBind: Towards unified multimodal representations for neural signals

F Yang, C Feng, D Wang, T Wang, Z Zeng, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

Understanding neural activity and information representation is crucial for advancing
knowledge of brain function and cognition. Neural activity, measured through techniques …

Cited by 6 · Related articles · All 3 versions

[PDF] arxiv.org

A touch, vision, and language dataset for multimodal alignment

L Fu, G Datta, H Huang, WCH Panitch, J Drake… - arXiv preprint arXiv …, 2024 - arxiv.org

Touch is an important sensing modality for humans, but it has not yet been incorporated into
a multimodal generative language model. This is partially due to the difficulty of obtaining …

Cited by 14 · Related articles · All 4 versions

[PDF] acm.org

Gradient-less federated gradient boosting tree with learnable learning rates

C Ma, X Qiu, D Beutel, N Lane - Proceedings of the 3rd Workshop on …, 2023 - dl.acm.org

The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme
Gradient Boosting (XGBoost) on tabular data raise the need to train XGBoost in the context …

Cited by 13 · Related articles · All 4 versions