Position information in transformers: An overview

P Dufter, M Schmitt, H Schütze - Computational Linguistics, 2022 - direct.mit.edu
Transformers are arguably the main workhorse in recent natural language processing
research. By definition, a Transformer is invariant with respect to reordering of the input …
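
The reordering-invariance claim can be checked directly: without positional encodings, scaled dot-product self-attention is permutation-equivariant, so any order-insensitive pooling of its outputs does not change when the input tokens are reshuffled. A minimal NumPy sketch (illustrative only, not from the survey):

    # Minimal sketch: single-head scaled dot-product attention over a toy
    # sequence, with no positional encoding. Permuting the input permutes
    # the output the same way, so mean pooling is order-invariant.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                                  # toy model dimension
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    def self_attention(X):
        """Single-head scaled dot-product attention (no positional encoding)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    X = rng.standard_normal((5, d))        # 5 "tokens"
    perm = rng.permutation(5)              # reorder the tokens

    out = self_attention(X)
    out_perm = self_attention(X[perm])

    assert np.allclose(out[perm], out_perm)                       # equivariance
    assert np.allclose(out.mean(axis=0), out_perm.mean(axis=0))   # pooled invariance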

The impact of positional encoding on length generalization in transformers

A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …

Theoretical limitations of self-attention in neural sequence models

M Hahn - Transactions of the Association for Computational …, 2020 - direct.mit.edu
Transformers are emerging as the new workhorse of NLP, showing great success across
tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention …

On the ability and limitations of transformers to recognize formal languages

S Bhattamishra, K Ahuja, N Goyal - arXiv preprint arXiv:2009.11264, 2020 - arxiv.org
Transformers have supplanted recurrent models in a large number of NLP tasks. However,
the differences in their abilities to model different syntactic properties remain largely …

Self-attention networks can process bounded hierarchical languages

S Yao, B Peng, C Papadimitriou… - arXiv preprint arXiv …, 2021 - arxiv.org
Despite their impressive performance in NLP, self-attention networks were recently proved
to be limited for processing formal languages with hierarchical structure, such as $\mathsf …
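
The hierarchical languages in question are nested-bracket (Dyck-style) languages with a bound on nesting depth. As an assumed illustration of the language class itself, not of the paper's construction, a stack-based recognizer looks like this:

    # Minimal sketch (assumed example): well-nested bracket strings with a
    # bounded nesting depth, the kind of bounded hierarchical language the
    # snippet refers to. Two bracket types and max_depth = 3 are assumptions.
    PAIRS = {")": "(", "]": "["}

    def bounded_dyck(s: str, max_depth: int = 3) -> bool:
        """Return True iff s is well nested and never exceeds max_depth."""
        stack = []
        for ch in s:
            if ch in PAIRS.values():       # opening bracket
                stack.append(ch)
                if len(stack) > max_depth:
                    return False
            elif ch in PAIRS:              # closing bracket must match the top
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
        return not stack                   # everything opened must be closed

    assert bounded_dyck("([()])")          # depth 3, well nested
    assert not bounded_dyck("(((())))")    # exceeds the depth bound
    assert not bounded_dyck("(]")          # mismatched pair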

Uncertainty-aware curriculum learning for neural machine translation

Y Zhou, B Yang, DF Wong, Y Wan… - Proceedings of the 58th …, 2020 - aclanthology.org
Neural machine translation (NMT) has been shown to benefit from curriculum learning, which
presents examples in an easy-to-hard order at different training stages. The keys lie in the …
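
As a rough illustration of the easy-to-hard idea only (the paper's own criterion is uncertainty-based, not the length proxy used here), one can sort a parallel corpus by a simple difficulty score and expose more of it at each training stage:

    # Minimal sketch: order a toy parallel corpus from "easy" to "hard"
    # using source length as a stand-in difficulty score, then reveal a
    # larger prefix of the curriculum at each stage.
    corpus = [
        ("a long and rather involved source sentence", "..."),
        ("hello", "..."),
        ("short sentence", "..."),
        ("another fairly long training example here", "..."),
    ]

    def difficulty(pair):
        return len(pair[0].split())        # proxy score: source length

    curriculum = sorted(corpus, key=difficulty)

    num_stages = 2
    for stage in range(1, num_stages + 1):
        # stage 1 sees only the easiest half, stage 2 sees everything
        cutoff = round(len(curriculum) * stage / num_stages)
        visible = curriculum[:cutoff]
        print(f"stage {stage}: training on {len(visible)} easiest examples")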

Self-attention with cross-lingual position representation

L Ding, L Wang, D Tao - arXiv preprint arXiv:2004.13310, 2020 - arxiv.org
Position encoding (PE), an essential part of self-attention networks (SANs), is used to
preserve the word order information for natural language processing tasks, generating fixed …
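
For reference, the fixed (non-learnable) variant used in the original Transformer is the sinusoidal encoding, where for position pos and dimension index i:

    PE_{(pos,\,2i)}   = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big)
    PE_{(pos,\,2i+1)} = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big)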

How effective is BERT without word ordering? Implications for language understanding and data privacy

J Hessel, A Schofield - Proceedings of the 59th Annual Meeting of …, 2021 - aclanthology.org
Ordered word sequences contain the rich structures that define language. However, it's often
not clear if or how modern pretrained language models utilize these structures. We show …
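
The kind of probe this points to can be sketched as follows (a hypothetical setup, not the paper's experiments): shuffle the word order of each input before classification and measure how often the prediction changes; classify below stands in for any sentence classifier and is an assumption, not a real API.

    # Hypothetical word-order probe: how often does a prediction change
    # when the input words are shuffled?
    import random

    def shuffle_words(sentence, seed=0):
        words = sentence.split()
        random.Random(seed).shuffle(words)
        return " ".join(words)

    def order_sensitivity(classify, sentences):
        """Fraction of sentences whose predicted label changes after shuffling."""
        changed = sum(classify(s) != classify(shuffle_words(s)) for s in sentences)
        return changed / len(sentences)

    def toy_classify(sentence):
        # order-insensitive stand-in: "positive" iff longer than 4 words
        return len(sentence.split()) > 4

    print(order_sensitivity(toy_classify, ["the cat sat on the mat", "hello there again"]))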

On the computational power of transformers and its implications in sequence modeling

S Bhattamishra, A Patel, N Goyal - arXiv preprint arXiv:2006.09286, 2020 - arxiv.org
Transformers are being used extensively across several sequence modeling tasks.
Significant research effort has been devoted to experimentally probing the inner workings of …

Rethinking the value of transformer components

W Wang, Z Tu - arXiv preprint arXiv:2011.03803, 2020 - arxiv.org
The Transformer has become the state-of-the-art translation model, yet it is not well studied how
each intermediate component contributes to model performance, which poses significant …