VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

W Huang, P Abbeel, D Pathak… - … conference on machine …, 2022 - proceedings.mlr.press
Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of grounding high-level …

LAVT: Language-aware vision transformer for referring image segmentation

Z Yang, J Wang, Y Tang, K Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Referring image segmentation is a fundamental vision-language task that aims to segment
out an object referred to by a natural language expression from an image. One of the key …

Interactive language: Talking to robots in real time

C Lynch, A Wahid, J Tompson, T Ding… - IEEE Robotics and …, 2023 - ieeexplore.ieee.org
We present a framework for building interactive, real-time, natural language-instructable
robots in the real world, and we open source related assets (dataset, environment …

How much can CLIP benefit vision-and-language tasks?

S Shen, LH Li, H Tan, M Bansal, A Rohrbach… - arXiv preprint arXiv …, 2021 - arxiv.org
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …

Large language models as commonsense knowledge for large-scale task planning

Z Zhao, WS Lee, D Hsu - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Large-scale task planning is a major challenge. Recent work exploits large language
models (LLMs) directly as a policy and shows surprisingly interesting results. This paper …

History aware multimodal transformer for vision-and-language navigation

S Chen, PL Guhur, C Schmid… - Advances in neural …, 2021 - proceedings.neurips.cc
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

K Bayoudh, R Knani, F Hamdaoui, A Mtibaa - The Visual Computer, 2022 - Springer
The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …

Invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Video Question Answering (VideoQA) is the task of answering questions about a
video. At its core is understanding the alignments between visual scenes in video and …