Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?
H Hajimolahoseini, W Ahmed, A Wen, Y Liu - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present a comprehensive study and propose several novel techniques for
implementing 3D convolutional blocks using 2D and/or 1D convolutions with only 4D and/or …
SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection
Vision transformers are known to be more computationally and data-intensive than CNN
models. Transformer models such as ViT require all the input image tokens to learn …
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
Z Khan, M Khaquan, O Tafveez, AA Raza - arXiv preprint arXiv …, 2024 - arxiv.org
The Transformer architecture has revolutionized deep learning through its Self-Attention
mechanism, which effectively captures contextual information. However, the memory …