Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Fast convergence of DETR with spatially modulated co-attention

P Gao, M Zheng, X Wang, J Dai… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
The recently proposed Detection Transformer (DETR) model successfully applies
Transformer to object detection and achieves comparable performance with two-stage …

LXMERT: Learning cross-modality encoder representations from transformers

H Tan, M Bansal - arXiv preprint arXiv:1908.07490, 2019 - arxiv.org
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …

End-to-end object detection with adaptive clustering transformer

M Zheng, P Gao, R Zhang, K Li, X Wang, H Li… - arXiv preprint arXiv …, 2020 - arxiv.org
End-to-end Object Detection with Transformer (DETR) proposes to perform object detection
with Transformer and achieves comparable performance with two-stage object detection like …

Normalized and geometry-aware self-attention network for image captioning

L Guo, J Liu, X Zhu, P Yao, S Lu… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Self-attention (SA) networks have shown profound value in image captioning. In this paper, we
improve SA from two aspects to promote the performance of image captioning. First, we …

Greedy gradient ensemble for robust visual question answering

X Han, S Wang, C Su, Q Huang… - Proceedings of the …, 2021 - openaccess.thecvf.com
Language bias is a critical issue in Visual Question Answering (VQA), where
models often exploit dataset biases for the final decision without considering the image …

Container: Context aggregation network

P Gao, J Lu, H Li, R Mottaghi, A Kembhavi - arXiv preprint arXiv …, 2021 - arxiv.org
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of
effective and efficient variations. Recently, Transformers--originally introduced in natural …

HAMLET: A hierarchical multimodal attention-based human activity recognition algorithm

MM Islam, T Iqbal - 2020 IEEE/RSJ International Conference on …, 2020 - ieeexplore.ieee.org
To fluently collaborate with people, robots need the ability to recognize human activities
accurately. Although modern robots are equipped with various sensors, robust human …

Unshuffling data for improved generalization in visual question answering

D Teney, E Abbasnejad… - Proceedings of the …, 2021 - openaccess.thecvf.com
Generalization beyond the training distribution is a core challenge in machine learning. The
common practice of mixing and shuffling examples when training neural networks may not …

Re-attention for visual question answering

W Guo, Y Zhang, J Yang, X Yuan - IEEE Transactions on Image …, 2021 - ieeexplore.ieee.org
A simultaneous understanding of questions and images is crucial in Visual Question
Answering (VQA). While the existing models have achieved satisfactory performance by …