VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

W Huang, P Abbeel, D Pathak… - … conference on machine …, 2022 - proceedings.mlr.press
Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of grounding high-level …

LAVT: Language-aware vision transformer for referring image segmentation

Z Yang, J Wang, Y Tang, K Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Referring image segmentation is a fundamental vision-language task that aims to segment
out an object referred to by a natural language expression from an image. One of the key …

Interactive language: Talking to robots in real time

C Lynch, A Wahid, J Tompson, T Ding… - IEEE Robotics and …, 2023 - ieeexplore.ieee.org
We present a framework for building interactive, real-time, natural language-instructable
robots in the real world, and we open source related assets (dataset, environment …

How much can CLIP benefit vision-and-language tasks?

S Shen, LH Li, H Tan, M Bansal, A Rohrbach… - arXiv preprint arXiv …, 2021 - arxiv.org
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …

Large language models as commonsense knowledge for large-scale task planning

Z Zhao, WS Lee, D Hsu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Large-scale task planning is a major challenge. Recent work exploits large language
models (LLMs) directly as a policy and shows surprisingly interesting results. This paper …

History aware multimodal transformer for vision-and-language navigation

S Chen, PL Guhur, C Schmid… - Advances in neural …, 2021 - proceedings.neurips.cc
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer
The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …

Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …