Deep learning modelling techniques: current progress, applications, advantages, and challenges

SF Ahmed, MSB Alam, M Hassan, MR Rozbu… - Artificial Intelligence …, 2023 - Springer
Deep learning (DL) is revolutionizing evidence-based decision-making techniques that can
be applied across various sectors. Specifically, it possesses the ability to utilize two or more …

Domain generalization: A survey

K Zhou, Z Liu, Y Qiao, T Xiang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generalization to out-of-distribution (OOD) data is a capability natural to humans yet
challenging for machines to reproduce. This is because most learning algorithms strongly …

Fast inference from transformers via speculative decoding

Y Leviathan, M Kalman… - … Conference on Machine …, 2023 - proceedings.mlr.press
Inference from large autoregressive models like Transformers is slow: decoding K tokens
takes K serial runs of the model. In this work we introduce speculative decoding, an …
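The snippet above states the core cost argument: naive autoregressive decoding spends one serial run of the large model per token, while speculative decoding drafts several tokens cheaply and verifies them in a single parallel pass. Below is a minimal Python sketch of that idea only, not the paper's algorithm: `target_next`, `draft_next`, and `gamma` are illustrative toy names, the models are deterministic stand-ins, and acceptance is exact greedy matching, so both decoders produce identical output while the speculative one uses fewer target calls.

```python
def target_next(prefix):
    """Toy stand-in for one expensive serial run of the large model."""
    return (sum(prefix) * 31 + len(prefix)) % 7

def draft_next(prefix):
    """Cheap draft model: mostly agrees with the target, sometimes not."""
    t = target_next(prefix)
    return (t + 1) % 7 if sum(prefix) % 5 == 0 else t

def naive_decode(prefix, k):
    """Decoding k tokens costs k serial runs of the target model."""
    out, calls = list(prefix), 0
    for _ in range(k):
        out.append(target_next(out))
        calls += 1
    return out, calls

def speculative_decode(prefix, k, gamma=4):
    """Draft gamma tokens cheaply, then verify them with one target pass.

    In a real system the target scores all gamma positions in a single
    parallel forward pass; here that pass is simulated and counted as
    one call. With greedy decoding the output matches naive_decode
    token for token.
    """
    out, calls = list(prefix), 0
    while len(out) - len(prefix) < k:
        draft = list(out)
        for _ in range(gamma):            # cheap speculative tokens
            draft.append(draft_next(draft))
        calls += 1                        # one parallel verification pass
        cur = list(out)
        for i in range(gamma):
            t = target_next(cur)          # target's true next token
            cur.append(t)                 # accepted match, or a free correction
            if t != draft[len(out) + i]:  # first mismatch ends the run
                break
        out = cur
    return out[: len(prefix) + k], calls
```

Because the draft usually agrees with the target here, each verification pass accepts several tokens at once, so the speculative call count stays well below K.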

Flatten transformer: Vision transformer using focused linear attention

D Han, X Pan, Y Han, S Song… - Proceedings of the …, 2023 - openaccess.thecvf.com
The quadratic computation complexity of self-attention has been a persistent challenge
when applying Transformer models to vision tasks. Linear attention, on the other hand, offers …
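The snippet above contrasts quadratic softmax attention with linear attention. The complexity drop comes from a reassociation: with a positive feature map φ, the N×N matrix φ(Q)φ(K)ᵀ never needs to be materialized, since φ(Q)(φ(K)ᵀV) gives the same result in O(N·d²). A small NumPy sketch of that identity (the ReLU-based φ is an illustrative choice, not the focused attention of the paper above):

```python
import numpy as np

def phi(x):
    """An illustrative positive feature map (kept strictly > 0)."""
    return np.maximum(x, 0.0) + 1e-6

def quadratic_attention(Q, K, V):
    """Materializes the N x N attention matrix: O(N^2 * d)."""
    A = phi(Q) @ phi(K).T                 # N x N similarity scores
    A = A / A.sum(axis=1, keepdims=True)  # row-normalize
    return A @ V

def linear_attention(Q, K, V):
    """Reassociates to phi(Q) @ (phi(K)^T V): O(N * d^2), no N x N matrix."""
    qf, kf = phi(Q), phi(K)
    kv = kf.T @ V                # d x d summary of keys and values
    z = qf @ kf.sum(axis=0)      # per-query normalizer, length N
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.standard_normal((3, N, d))
```

Both routes compute the same output; only the order of matrix products, and hence the asymptotic cost in sequence length N, differs.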

Simmim: A simple framework for masked image modeling

Z Xie, Z Zhang, Y Cao, Y Lin, J Bao… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper presents SimMIM, a simple framework for masked image modeling. We have
simplified recently proposed relevant approaches, without the need for special designs …

A-vit: Adaptive tokens for efficient vision transformer

H Yin, A Vahdat, JM Alvarez, A Mallya… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce A-ViT, a method that adaptively adjusts the inference cost of the vision
transformer (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the …

Not all patches are what you need: Expediting vision transformers via token reorganizations

Y Liang, C Ge, Z Tong, Y Song, J Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head
self-attention (MHSA) among them. Complete leverage of these image tokens brings …

Are multimodal transformers robust to missing modality?

M Ma, J Ren, L Zhao, D Testuggine… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multimodal data collected from the real world are often imperfect due to missing modalities.
Therefore, multimodal models that are robust against modal-incomplete data are highly …

Adaptive rotated convolution for rotated object detection

Y Pu, Y Wang, Z Xia, Y Han, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Rotated object detection aims to identify and locate objects in images with arbitrary
orientation. In this scenario, the oriented directions of objects vary considerably across …

A dynamic multi-scale voxel flow network for video prediction

X Hu, Z Huang, A Huang, J Xu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
The performance of video prediction has been greatly boosted by advanced deep neural
networks. However, most of the current methods suffer from large model sizes and require …