Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
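Many of the methods such a survey groups share one training signal: a symmetric image-text contrastive (InfoNCE) objective in the CLIP family. A minimal PyTorch sketch of that objective, with all names, dimensions, and the temperature chosen for illustration rather than taken from the monograph:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)        # unit-norm image features
    txt_emb = F.normalize(txt_emb, dim=-1)        # unit-norm text features
    logits = img_emb @ txt_emb.t() / temperature  # [B, B] cosine similarities
    targets = torch.arange(logits.size(0))        # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: a batch of 8 paired 512-d embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```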
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
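As one concrete, runnable instance of the captioning systems such surveys cover, the open-source BLIP model generates a caption in a few lines through the Hugging Face transformers library; the checkpoint and image URL below are illustrative choices, not prescribed by the survey:

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Any RGB image works; this COCO validation image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```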
Imagebind: One embedding space to bind them all
We present ImageBind, an approach to learn a joint embedding across six different
modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
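The key mechanism is that every non-image modality is trained only against paired image embeddings, so images act as the anchor that binds the whole space. A minimal sketch of that anchoring, with stand-in encoders rather than ImageBind's actual architectures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Stand-in encoder: projects raw modality features into the shared space."""
    def __init__(self, in_dim, emb_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.GELU(),
                                  nn.Linear(emb_dim, emb_dim))
    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

image_enc = ModalityEncoder(in_dim=768)   # anchor modality
audio_enc = ModalityEncoder(in_dim=128)   # trained only against image pairs
depth_enc = ModalityEncoder(in_dim=256)   # likewise

def align_to_images(img, other, t=0.07):
    logits = img @ other.t() / t
    y = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)) / 2

# Each (image, X) pair pulls X toward the image anchor; alignment between
# audio and depth then emerges without any audio-depth training pairs.
img = image_enc(torch.randn(8, 768))
loss = align_to_images(img, audio_enc(torch.randn(8, 128)))
```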
Simple open-vocabulary object detection
Combining simple architectures with large-scale pre-training has led to massive
improvements in image classification. For object detection, pre-training and scaling …
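In open-vocabulary detectors of this kind, class names are plain text: each candidate box embedding is scored against text embeddings of whatever vocabulary the user supplies. A sketch of that scoring step, using random placeholders for what the image and text towers would actually produce:

```python
import torch
import torch.nn.functional as F

# Placeholders for real model outputs: 100 candidate boxes, 3 query classes.
box_emb = F.normalize(torch.randn(100, 512), dim=-1)   # per-region features
queries = ["a cat", "a remote control", "a sofa"]      # open vocabulary
txt_emb = F.normalize(torch.randn(len(queries), 512), dim=-1)

scores = box_emb @ txt_emb.t()                 # [100, 3] similarity logits
best_class = scores.argmax(dim=-1)             # per-box predicted query
keep = scores.max(dim=-1).values > 0.3         # threshold is illustrative
print(f"{int(keep.sum())} boxes kept; e.g. box 0 -> {queries[best_class[0].item()]}")
```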
Conditional prompt learning for vision-language models
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential
to investigate ways to adapt these models to downstream datasets. A recently proposed …
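The conditional variant (CoCoOp) keeps the vision-language model frozen and learns a few context vectors, plus a lightweight meta-network that shifts them per image so the prompt adapts to each input. A sketch of that conditioning step, with illustrative dimensions and an assumed two-layer meta-net:

```python
import torch
import torch.nn as nn

class ConditionalPrompt(nn.Module):
    def __init__(self, n_ctx=4, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learned context tokens
        self.meta_net = nn.Sequential(                            # maps image feature -> shift
            nn.Linear(dim, dim // 16), nn.ReLU(), nn.Linear(dim // 16, dim))

    def forward(self, img_feat):
        # One shared shift per image, broadcast over all context tokens.
        shift = self.meta_net(img_feat).unsqueeze(1)   # [B, 1, dim]
        return self.ctx.unsqueeze(0) + shift           # [B, n_ctx, dim]

# Per-image prompt tokens; these would be concatenated with class-name token
# embeddings and passed through the frozen text encoder.
prompts = ConditionalPrompt()(torch.randn(8, 512))
```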
Decomposing nerf for editing via feature field distillation
S Kobayashi, E Matsumoto… - Advances in Neural …, 2022 - proceedings.neurips.cc
Emerging neural radiance fields (NeRF) are a promising scene representation for computer
graphics, enabling high-quality 3D reconstruction and novel view synthesis from image …
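Feature-field distillation gives the radiance field a second output head that is rendered with the same volume-rendering weights as color and regressed onto a frozen 2D teacher (e.g., CLIP or DINO features). A minimal sketch of that step for a single ray, with placeholder shapes and a random stand-in for the teacher:

```python
import torch
import torch.nn.functional as F

def render_along_ray(weights, per_sample_vals):
    """Volume-rendering reduction: weights [N_samples], vals [N_samples, C]."""
    return (weights.unsqueeze(-1) * per_sample_vals).sum(dim=0)

# Placeholders for one ray: 64 samples, RGB plus a 384-d feature per sample.
weights = torch.softmax(torch.randn(64), dim=0)   # stand-in for compositing weights
rgb = torch.rand(64, 3)                           # color head output
feat = torch.randn(64, 384)                       # feature head output

rendered_rgb = render_along_ray(weights, rgb)
rendered_feat = render_along_ray(weights, feat)

teacher_feat = torch.randn(384)                   # frozen 2D teacher at this pixel
distill_loss = F.mse_loss(rendered_feat, teacher_feat)  # added to the photometric loss
```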
Expanding language-image pretrained models for general video recognition
Contrastive language-image pretraining has shown great success in learning visual-textual
joint representation from web-scale data, demonstrating remarkable “zero-shot” …
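The simplest form of this expansion encodes frames with the pretrained image tower, pools them over time, and scores the result against text embeddings of class prompts; the paper's cross-frame attention is reduced here to mean pooling. A sketch with placeholder embeddings:

```python
import torch
import torch.nn.functional as F

frames = F.normalize(torch.randn(16, 512), dim=-1)   # 16 per-frame CLIP embeddings
video = F.normalize(frames.mean(dim=0), dim=-1)      # temporal pooling -> one vector

classes = ["playing guitar", "cooking", "swimming"]
txt = F.normalize(torch.randn(len(classes), 512), dim=-1)  # prompt embeddings

probs = (100.0 * video @ txt.t()).softmax(dim=-1)    # zero-shot class scores
print(classes[probs.argmax().item()], probs.max().item())
```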
LiT: Zero-shot transfer with locked-image text tuning
This paper presents contrastive-tuning, a simple method employing contrastive training to
align image and text models while still taking advantage of their pre-training. In our empirical …
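The "locked-image" recipe is ordinary contrastive tuning with the image tower frozen, so only the text tower learns to align to it. A minimal sketch with stand-in linear towers in place of real pretrained encoders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_tower = nn.Linear(768, 512)   # stand-in for a pretrained image encoder
text_tower = nn.Linear(300, 512)    # stand-in for the text encoder being tuned

image_tower.requires_grad_(False)   # "locked": image weights never update
optimizer = torch.optim.AdamW(text_tower.parameters(), lr=1e-4)

img = F.normalize(image_tower(torch.randn(8, 768)), dim=-1)
txt = F.normalize(text_tower(torch.randn(8, 300)), dim=-1)
logits = img @ txt.t() / 0.07
y = torch.arange(8)
loss = (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)) / 2
loss.backward()                     # gradients flow only into the text tower
optimizer.step()
```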
Learning video representations from large language models
We introduce LaViLa, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …
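At a high level, the recipe repurposes an LLM as an automatic narrator that densely pseudo-captions clips, and the resulting pairs feed a standard dual-encoder contrastive update. In the sketch below, `narrate` is a hypothetical stand-in for that narrator, not LaViLa's API, and all embeddings are random placeholders:

```python
import torch
import torch.nn.functional as F

def narrate(clip_emb):
    """Hypothetical stand-in for the LLM narrator: in LaViLa a repurposed LLM
    generates dense text for each clip; here we fake a caption embedding."""
    return F.normalize(torch.randn(512), dim=-1)

def contrastive_step(video_emb, text_emb, t=0.07):
    logits = video_emb @ text_emb.t() / t
    y = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y)) / 2

clips = F.normalize(torch.randn(8, 512), dim=-1)        # stand-in clip embeddings
pseudo_caps = torch.stack([narrate(c) for c in clips])  # auto-generated supervision
loss = contrastive_step(clips, pseudo_caps)             # usual dual-encoder update
```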
Multimodal foundation models: From specialists to general-purpose assistants
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation
models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist …