Vision transformers for action recognition: A survey
Vision transformers are emerging as a powerful tool to solve computer vision problems.
Recent techniques have also proven the efficacy of transformers beyond the image domain …
ImageBind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Beyond supervised learning for pervasive healthcare
The integration of machine/deep learning and sensing technologies is transforming
healthcare and medical practice. However, inherent limitations in healthcare data, namely …
Learning video representations from large language models
We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …
ST-Adapter: Parameter-efficient image-to-video transfer learning
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …
Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning
We present modality gap, an intriguing geometric phenomenon of the representation space
of multi-modal models. Specifically, we show that different data modalities (e.g., images and …
Frozen CLIP models are efficient video learners
Video recognition has been dominated by the end-to-end learning paradigm: first initializing
a video recognition model with weights of a pretrained image model and then conducting …
MultiMAE: Multi-modal multi-task masked autoencoders
We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders
(MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can …
All in one: Exploring unified video-language pre-training
Mainstream Video-Language Pre-training models consist of three parts, a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …