Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Unified-io: A unified model for vision, language, and multi-modal tasks
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical
computer vision tasks, including pose estimation, object detection, depth estimation and …
computer vision tasks, including pose estimation, object detection, depth estimation and …
A-okvqa: A benchmark for visual question answering using world knowledge
Abstract The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …
for the development of AI models that can jointly reason over visual and natural language …
Reasoning with language model prompting: A survey
Reasoning, as an essential ability for complex problem-solving, can provide back-end
support for various real-world applications, such as medical diagnosis, negotiation, etc. This …
support for various real-world applications, such as medical diagnosis, negotiation, etc. This …
Large language models are visual reasoning coordinators
Visual reasoning requires multimodal perception and commonsense cognition of the world.
Recently, multiple vision-language models (VLMs) have been proposed with excellent …
Recently, multiple vision-language models (VLMs) have been proposed with excellent …
Clip-event: Connecting text and images with event structures
Abstract Vision-language (V+ L) pretraining models have achieved great success in
supporting multimedia applications by understanding the alignments between images and …
supporting multimedia applications by understanding the alignments between images and …
Large language models and knowledge graphs: Opportunities and challenges
Large Language Models (LLMs) have taken Knowledge Representation--and the world--by
storm. This inflection point marks a shift from explicit knowledge representation to a renewed …
storm. This inflection point marks a shift from explicit knowledge representation to a renewed …
Document understanding dataset and evaluation (dude)
J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …
embrace the challenge of creating more practically-oriented benchmarks. Document …
Trends in integration of vision and language research: A survey of tasks, datasets, and methods
Abstract Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …
growth in the last few years. This success can be partly attributed to the advancements made …