Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?

H Hajimolahoseini, W Ahmed, A Wen, Y Liu - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present a comprehensive study and propose several novel techniques for
implementing 3D convolutional blocks using 2D and/or 1D convolutions with only 4D and/or …
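The snippet describes factorizing 3D convolutions into 2D and/or 1D ones so that no 5D tensor is needed. Below is a minimal sketch of one common such factorization, a "(2+1)D"-style spatial-then-temporal split; the specific decompositions the paper proposes are not shown in the snippet, and the kernel sizes and folding scheme here are illustrative assumptions.

```python
# Sketch: replace Conv3d(k x k x k) with Conv2d(k x k) over space followed
# by Conv1d(k) over time, so intermediate tensors stay 4D/3D, never 5D.
import torch
import torch.nn as nn


class Factorized3DConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.temporal = nn.Conv1d(out_ch, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) -- a 5D video tensor
        b, c, t, h, w = x.shape
        # Fold time into the batch dim so the spatial conv sees 4D input.
        y = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))
        _, c2, h2, w2 = y.shape
        # Fold space into the batch dim so the temporal conv sees 3D input.
        y = y.reshape(b, t, c2, h2 * w2).permute(0, 3, 2, 1)  # (b, hw, c2, t)
        y = self.temporal(y.reshape(b * h2 * w2, c2, t))
        return y.reshape(b, h2, w2, c2, t).permute(0, 3, 4, 1, 2)


x = torch.randn(2, 3, 8, 32, 32)            # (batch, ch, time, H, W)
print(Factorized3DConv(3, 16)(x).shape)     # torch.Size([2, 16, 8, 32, 32])
```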

SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection

F Ataiefard, W Ahmed, H Hajimolahoseini… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision transformers are known to be more compute- and data-intensive than CNN
models. Transformer models such as ViT require all the input image tokens to learn …
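The snippet points at a token-level skip connection that lets unimportant tokens bypass heavy computation. Below is a minimal sketch of that general idea; the importance score (mean absolute token activation), the keep ratio, and where in the network the skip is applied are all illustrative assumptions, not SkipViT's actual design.

```python
# Sketch: less important tokens bypass attention via a skip connection;
# only the top-k tokens by a (hypothetical) importance score are processed.
import torch
import torch.nn as nn


class TokenSkipBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        k = max(1, int(n * self.keep_ratio))
        scores = x.abs().mean(dim=-1)                   # (b, n) importance
        idx = scores.topk(k, dim=1).indices             # (b, k) kept tokens
        gather = idx.unsqueeze(-1).expand(-1, -1, d)
        kept = x.gather(1, gather)                      # (b, k, d)
        # Attention runs only on the kept tokens.
        h = self.norm(kept)
        attn_out, _ = self.attn(h, h, h)
        # Skipped tokens pass through unchanged (the skip connection);
        # processed tokens are scattered back into place.
        out = x.clone()
        out.scatter_(1, gather, kept + attn_out)
        return out


x = torch.randn(2, 16, 64)
print(TokenSkipBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```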

Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

Z Khan, M Khaquan, O Tafveez, AA Raza - arXiv preprint arXiv …, 2024 - arxiv.org
The Transformer architecture has revolutionized deep learning through its Self-Attention
mechanism, which effectively captures contextual information. However, the memory …
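For context on what this paper departs from, below is a minimal sketch of standard grouped-query attention (GQA) with a uniform query-to-group assignment; the key-driven grouping the paper proposes is not shown in the snippet, and the head counts here are arbitrary.

```python
# Sketch: uniform GQA, where several query heads share one key/value head,
# shrinking the KV cache by a factor of n_q_heads / n_groups.
import torch
import torch.nn.functional as F


def grouped_query_attention(q, k, v, n_groups: int):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_groups, seq, d)."""
    b, hq, s, d = q.shape
    heads_per_group = hq // n_groups
    # Repeat each shared K/V head for every query head in its group.
    k = k.repeat_interleave(heads_per_group, dim=1)   # (b, hq, s, d)
    v = v.repeat_interleave(heads_per_group, dim=1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5       # (b, hq, s, s)
    return F.softmax(scores, dim=-1) @ v              # (b, hq, s, d)


q = torch.randn(2, 8, 10, 16)   # 8 query heads
k = torch.randn(2, 2, 10, 16)   # 2 shared KV heads -> groups of 4
v = torch.randn(2, 2, 10, 16)
print(grouped_query_attention(q, k, v, n_groups=2).shape)
```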