MUTAN: Multimodal Tucker Fusion for Visual Question Answering

Authors

Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, Nicolas Thome

Publication date

2017/10/1

Conference

Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017

Description

Bilinear models provide an appealing framework for mixing and merging information in Visual Question Answering (VQA) tasks. They help to learn high-level associations between question meaning and visual concepts in the image, but they suffer from huge dimensionality issues. We introduce MUTAN, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations. In addition to the Tucker framework, we design a low-rank matrix-based decomposition to explicitly constrain the interaction rank. With MUTAN, we control the complexity of the merging scheme while keeping interpretable fusion relations. We show how the Tucker decomposition framework generalizes some of the latest VQA architectures, providing state-of-the-art results.
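The abstract does not spell out the factorization, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a full bilinear tensor T of shape d_q x d_v x d_y that is replaced by three projection matrices and a small core tensor, which is the standard Tucker form. All names (`Wq`, `Wv`, `Wo`, `core`, the dimension symbols) are hypothetical. The nonlinearities, the training procedure, and the additional low-rank constraint mentioned in the abstract are left out. The sketch checks that the factorized computation matches the unfactorized bilinear map and compares parameter counts.

```python
import random


def project(x, W):
    """Return x @ W for a vector x (length n) and a matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def tucker_fusion(q, v, Wq, Wv, core, Wo):
    """Fuse q and v through a Tucker-factorized bilinear map (no nonlinearities)."""
    q_t = project(q, Wq)  # question features projected to t_q dimensions
    v_t = project(v, Wv)  # image features projected to t_v dimensions
    t_o = len(core[0][0])
    # Bilinear interaction of the two projections through the small core tensor.
    z = [sum(q_t[a] * v_t[b] * core[a][b][c]
             for a in range(len(q_t)) for b in range(len(v_t)))
         for c in range(t_o)]
    return project(z, Wo)  # map the fused vector to the output space


def full_tensor(Wq, Wv, core, Wo):
    """Materialize T[i][j][k] = sum_abc core[a][b][c] * Wq[i][a] * Wv[j][b] * Wo[c][k]."""
    d_q, d_v, d_y = len(Wq), len(Wv), len(Wo[0])
    t_q, t_v, t_o = len(core), len(core[0]), len(core[0][0])
    return [[[sum(core[a][b][c] * Wq[i][a] * Wv[j][b] * Wo[c][k]
                  for a in range(t_q) for b in range(t_v) for c in range(t_o))
              for k in range(d_y)] for j in range(d_v)] for i in range(d_q)]


def full_bilinear(q, v, T):
    """Unfactorized bilinear map: y[k] = sum_ij T[i][j][k] * q[i] * v[j]."""
    return [sum(T[i][j][k] * q[i] * v[j]
                for i in range(len(q)) for j in range(len(v)))
            for k in range(len(T[0][0]))]


def param_counts(d_q, d_v, d_y, t_q, t_v, t_o):
    """Return (size of the full tensor, size of the Tucker-factorized form)."""
    full = d_q * d_v * d_y
    tucker = d_q * t_q + d_v * t_v + t_q * t_v * t_o + t_o * d_y
    return full, tucker


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def demo(seed=0):
    """Compare the factorized and unfactorized outputs on small random inputs."""
    rng = random.Random(seed)
    d_q, d_v, d_y, t_q, t_v, t_o = 4, 5, 3, 2, 2, 2  # toy sizes, chosen arbitrarily
    Wq = random_matrix(d_q, t_q, rng)
    Wv = random_matrix(d_v, t_v, rng)
    Wo = random_matrix(t_o, d_y, rng)
    core = [random_matrix(t_v, t_o, rng) for _ in range(t_q)]  # core[a][b][c]
    q = [rng.uniform(-1, 1) for _ in range(d_q)]
    v = [rng.uniform(-1, 1) for _ in range(d_v)]
    y_fact = tucker_fusion(q, v, Wq, Wv, core, Wo)
    y_full = full_bilinear(q, v, full_tensor(Wq, Wv, core, Wo))
    return y_fact, y_full


if __name__ == "__main__":
    y_fact, y_full = demo()
    print(max(abs(a - b) for a, b in zip(y_fact, y_full)))  # difference is rounding error only
    print(param_counts(4, 5, 3, 2, 2, 2))
```

The factorized path never builds the d_q x d_v x d_y tensor, so its cost depends on the projection sizes t_q, t_v and t_o rather than on the product of the input and output dimensions. This is one way to read the abstract's claim about controlling the complexity of the merging scheme.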

Total citations

Cited by: 730

Citations per year: 2017: 7, 2018: 66, 2019: 99, 2020: 115, 2021: 137, 2022: 125, 2023: 113, 2024: 67
