Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
with a diverse range of modalities present in the real-world data. More recently, this has …
Fast convergence of detr with spatially modulated co-attention
Abstract The recently proposed Detection Transformer (DETR) model successfully applies
Transformer to objects detection and achieves comparable performance with two-stage …
Transformer to objects detection and achieves comparable performance with two-stage …
Lxmert: Learning cross-modality encoder representations from transformers
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …
semantics, and, most importantly, the alignment and relationships between these two …
End-to-end object detection with adaptive clustering transformer
End-to-end Object Detection with Transformer (DETR) proposes to perform object detection
with Transformer and achieve comparable performance with two-stage object detection like …
with Transformer and achieve comparable performance with two-stage object detection like …
Normalized and geometry-aware self-attention network for image captioning
Self-attention (SA) network has shown profound value in image captioning. In this paper, we
improve SA from two aspects to promote the performance of image captioning. First, we …
improve SA from two aspects to promote the performance of image captioning. First, we …
Greedy gradient ensemble for robust visual question answering
Abstract Language bias is a critical issue in Visual Question Answering (VQA), where
models often exploit dataset biases for the final decision without considering the image …
models often exploit dataset biases for the final decision without considering the image …
Container: Context aggregation network
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of
effective and efficient variations. Recently, Transformers--originally introduced in natural …
effective and efficient variations. Recently, Transformers--originally introduced in natural …
Hamlet: A hierarchical multimodal attention-based human activity recognition algorithm
To fluently collaborate with people, robots need the ability to recognize human activities
accurately. Although modern robots are equipped with various sensors, robust human …
accurately. Although modern robots are equipped with various sensors, robust human …
Unshuffling data for improved generalization in visual question answering
D Teney, E Abbasnejad… - Proceedings of the …, 2021 - openaccess.thecvf.com
Generalization beyond the training distribution is a core challenge in machine learning. The
common practice of mixing and shuffling examples when training neural networks may not …
common practice of mixing and shuffling examples when training neural networks may not …
Re-attention for visual question answering
A simultaneous understanding of questions and images is crucial in Visual Question
Answering (VQA). While the existing models have achieved satisfactory performance by …
Answering (VQA). While the existing models have achieved satisfactory performance by …