Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Unified-IO: A unified model for vision, language, and multi-modal tasks

J Lu, C Clark, R Zellers, R Mottaghi… - The Eleventh …, 2022 - openreview.net
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …

A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

Reasoning with language model prompting: A survey

S Qiao, Y Ou, N Zhang, X Chen, Y Yao, S Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Reasoning, as an essential ability for complex problem-solving, can provide back-end
support for various real-world applications, such as medical diagnosis, negotiation, etc. This …

Large language models are visual reasoning coordinators

L Chen, B Li, S Shen, J Yang, C Li… - Advances in …, 2024 - proceedings.neurips.cc
Visual reasoning requires multimodal perception and commonsense cognition of the world.
Recently, multiple vision-language models (VLMs) have been proposed with excellent …

CLIP-Event: Connecting text and images with event structures

M Li, R Xu, S Wang, L Zhou, X Lin… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-language (V+L) pretraining models have achieved great success in
supporting multimedia applications by understanding the alignments between images and …

Large language models and knowledge graphs: Opportunities and challenges

JZ Pan, S Razniewski, JC Kalo, S Singhania… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have taken Knowledge Representation--and the world--by
storm. This inflection point marks a shift from explicit knowledge representation to a renewed …

Document understanding dataset and evaluation (DUDE)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

Trends in integration of vision and language research: A survey of tasks, datasets, and methods

A Mogadala, M Kalimuthu, D Klakow - Journal of Artificial Intelligence …, 2021 - jair.org
Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …