Position information in transformers: An overview

P Dufter, M Schmitt, H Schütze - Computational Linguistics, 2022 - direct.mit.edu
Transformers are arguably the main workhorse in recent natural language processing
research. By definition, a Transformer is invariant with respect to reordering of the input …
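
The reordering-invariance claim can be checked directly: without positional encodings, scaled dot-product self-attention is permutation-equivariant, so any order-insensitive pooling of its outputs does not change when the input tokens are reshuffled. A minimal NumPy sketch (illustrative only, not from the survey):

    # Minimal sketch: single-head scaled dot-product attention over a toy
    # sequence, with no positional encoding. Permuting the input permutes
    # the output the same way, so mean pooling is order-invariant.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                                  # toy model dimension
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    def self_attention(X):
        """Single-head scaled dot-product attention (no positional encoding)."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    X = rng.standard_normal((5, d))        # 5 "tokens"
    perm = rng.permutation(5)              # reorder the tokens

    out = self_attention(X)
    out_perm = self_attention(X[perm])

    assert np.allclose(out[perm], out_perm)                       # equivariance
    assert np.allclose(out.mean(axis=0), out_perm.mean(axis=0))   # pooled invariance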

The impact of positional encoding on length generalization in transformers

A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …

Theoretical limitations of self-attention in neural sequence models

M Hahn - Transactions of the Association for Computational …, 2020 - direct.mit.edu
Transformers are emerging as the new workhorse of NLP, showing great success across
tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention …

On the ability and limitations of transformers to recognize formal languages

S Bhattamishra, K Ahuja, N Goyal - arXiv preprint arXiv:2009.11264, 2020 - arxiv.org
Transformers have supplanted recurrent models in a large number of NLP tasks. However,
the differences in their abilities to model different syntactic properties remain largely …

Self-attention networks can process bounded hierarchical languages

S Yao, B Peng, C Papadimitriou… - arXiv preprint arXiv …, 2021 - arxiv.org
Despite their impressive performance in NLP, self-attention networks were recently proved
to be limited for processing formal languages with hierarchical structure, such as $\mathsf …
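
The hierarchical languages in question are nested-bracket (Dyck-style) languages with a bound on nesting depth. As an assumed illustration of the language class itself, not of the paper's construction, a stack-based recognizer looks like this:

    # Minimal sketch (assumed example): well-nested bracket strings with a
    # bounded nesting depth, the kind of bounded hierarchical language the
    # snippet refers to. Two bracket types and max_depth = 3 are assumptions.
    PAIRS = {")": "(", "]": "["}

    def bounded_dyck(s: str, max_depth: int = 3) -> bool:
        """Return True iff s is well nested and never exceeds max_depth."""
        stack = []
        for ch in s:
            if ch in PAIRS.values():       # opening bracket
                stack.append(ch)
                if len(stack) > max_depth:
                    return False
            elif ch in PAIRS:              # closing bracket must match the top
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
        return not stack                   # everything opened must be closed

    assert bounded_dyck("([()])")          # depth 3, well nested
    assert not bounded_dyck("(((())))")    # exceeds the depth bound
    assert not bounded_dyck("(]")          # mismatched pair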

Uncertainty-aware curriculum learning for neural machine translation

Y Zhou, B Yang, DF Wong, Y Wan… - Proceedings of the 58th …, 2020 - aclanthology.org
Neural machine translation (NMT) has been shown to benefit from curriculum learning, which
presents examples in an easy-to-hard order at different training stages. The keys lie in the …
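
As a rough illustration of the easy-to-hard idea only (the paper's own criterion is uncertainty-based, not the length proxy used here), one can sort a parallel corpus by a simple difficulty score and expose more of it at each training stage:

    # Minimal sketch: order a toy parallel corpus from "easy" to "hard"
    # using source length as a stand-in difficulty score, then reveal a
    # larger prefix of the curriculum at each stage.
    corpus = [
        ("a long and rather involved source sentence", "..."),
        ("hello", "..."),
        ("short sentence", "..."),
        ("another fairly long training example here", "..."),
    ]

    def difficulty(pair):
        return len(pair[0].split())        # proxy score: source length

    curriculum = sorted(corpus, key=difficulty)

    num_stages = 2
    for stage in range(1, num_stages + 1):
        # stage 1 sees only the easiest half, stage 2 sees everything
        cutoff = round(len(curriculum) * stage / num_stages)
        visible = curriculum[:cutoff]
        print(f"stage {stage}: training on {len(visible)} easiest examples")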

Self-attention with cross-lingual position representation

L Ding, L Wang, D Tao - arXiv preprint arXiv:2004.13310, 2020 - arxiv.org
Position encoding (PE), an essential part of self-attention networks (SANs), is used to
preserve the word order information for natural language processing tasks, generating fixed …
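
For reference, the fixed (non-learnable) variant used in the original Transformer is the sinusoidal encoding, where for position pos and dimension index i:

    PE_{(pos,\,2i)}   = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big)
    PE_{(pos,\,2i+1)} = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big)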

How effective is BERT without word ordering? Implications for language understanding and data privacy

J Hessel, A Schofield - Proceedings of the 59th Annual Meeting of …, 2021 - aclanthology.org
Ordered word sequences contain the rich structures that define language. However, it's often
not clear if or how modern pretrained language models utilize these structures. We show …
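
The kind of probe this points to can be sketched as follows (a hypothetical setup, not the paper's experiments): shuffle the word order of each input before classification and measure how often the prediction changes; classify below stands in for any sentence classifier and is an assumption, not a real API.

    # Hypothetical word-order probe: how often does a prediction change
    # when the input words are shuffled?
    import random

    def shuffle_words(sentence, seed=0):
        words = sentence.split()
        random.Random(seed).shuffle(words)
        return " ".join(words)

    def order_sensitivity(classify, sentences):
        """Fraction of sentences whose predicted label changes after shuffling."""
        changed = sum(classify(s) != classify(shuffle_words(s)) for s in sentences)
        return changed / len(sentences)

    def toy_classify(sentence):
        # order-insensitive stand-in: "positive" iff longer than 4 words
        return len(sentence.split()) > 4

    print(order_sensitivity(toy_classify, ["the cat sat on the mat", "hello there again"]))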

On the computational power of transformers and its implications in sequence modeling

S Bhattamishra, A Patel, N Goyal - arXiv preprint arXiv:2006.09286, 2020 - arxiv.org
Transformers are being used extensively across several sequence modeling tasks.
Significant research effort has been devoted to experimentally probing the inner workings of …

Rethinking the value of transformer components

W Wang, Z Tu - arXiv preprint arXiv:2011.03803, 2020 - arxiv.org
The Transformer has become the state-of-the-art translation model, yet it is not well studied how
each intermediate component contributes to model performance, which poses significant …