VATEX: A large-scale, high-quality multilingual dataset for video-and-language research

X Wang, J Wu, J Chen, L Li… - Proceedings of the …, 2019 - openaccess.thecvf.com
We present a new large-scale multilingual video description dataset, VATEX, which contains
over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions …

Findings of the second shared task on multimodal machine translation and multilingual image description

D Elliott, S Frank, L Barrault, F Bougares… - arXiv preprint arXiv …, 2017 - arxiv.org
We present the results from the second shared task on multimodal machine translation and
multilingual image description. Nine teams submitted 19 systems to two tasks. The …

A novel graph-based multi-modal fusion encoder for neural machine translation

Y Yin, F Meng, J Su, C Zhou, Z Yang, J Zhou… - arXiv preprint arXiv …, 2020 - arxiv.org
Multi-modal neural machine translation (NMT) aims to translate source sentences into a
target language paired with images. However, dominant multi-modal NMT models do not …

Trends in integration of vision and language research: A survey of tasks, datasets, and methods

A Mogadala, M Kalimuthu, D Klakow - Journal of Artificial Intelligence …, 2021 - jair.org
Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …

Probing the need for visual context in multimodal machine translation

O Caglayan, P Madhyastha, L Specia… - arXiv preprint arXiv …, 2019 - arxiv.org
Current work on multimodal machine translation (MMT) has suggested that the visual
modality is either unnecessary or only marginally beneficial. We posit that this is a …

Neural machine translation with universal visual representation

Z Zhang, K Chen, R Wang, M Utiyama… - International …, 2020 - openreview.net
Though visual information has been introduced for enhancing neural machine translation
(NMT), its effectiveness strongly relies on the availability of large amounts of bilingual …

Dynamic context-guided capsule network for multimodal machine translation

H Lin, F Meng, J Su, Y Yin, Z Yang, Y Ge… - Proceedings of the 28th …, 2020 - dl.acm.org
Multimodal machine translation (MMT), which mainly focuses on enhancing text-only
translation with visual features, has attracted considerable attention from both computer …

Neural machine translation with phrase-level universal visual representations

Q Fang, Y Feng - arXiv preprint arXiv:2203.10299, 2022 - arxiv.org
Multimodal machine translation (MMT) aims to improve neural machine translation (NMT)
with additional visual information, but most existing MMT methods require paired input of …

Distilling translations with visual awareness

J Ive, P Madhyastha, L Specia - arXiv preprint arXiv:1906.07701, 2019 - arxiv.org
Previous work on multimodal machine translation has shown that visual information is only
needed in very specific cases, for example in the presence of ambiguous words where the …

Multimodal machine translation through visuals and speech

U Sulubacak, O Caglayan, SA Grönroos, A Rouhe… - Machine …, 2020 - Springer
Multimodal machine translation involves drawing information from more than one modality,
based on the assumption that the additional modalities will contain useful alternative views …