Embers of autoregression: Understanding large language models through the problem they are trained to solve

RT McCoy, S Yao, D Friedman, M Hardy… - arXiv preprint arXiv …, 2023 - arxiv.org
The widespread adoption of large language models (LLMs) makes it important to recognize
their strengths and limitations. We argue that in order to develop a holistic understanding of …

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org
In recent years, several machine learning models have been proposed. They are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

Hierarchical Text Classification: a review of current research

A Zangari, M Marcuzzo, M Schiavinato… - EXPERT SYSTEMS …, 2023 - iris.unive.it
It is often the case that collections of documents are annotated with hierarchically-structured
concepts. However, the benefits of this structure are rarely taken into account by …

BERT-defense: A probabilistic model based on BERT to combat cognitively inspired orthographic adversarial attacks

Y Keller, J Mackensen, S Eger - arXiv preprint arXiv:2106.01452, 2021 - arxiv.org
Adversarial attacks expose important blind spots of deep learning systems. While word- and
sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input …

Tokenization impacts multilingual language modeling: Assessing vocabulary allocation and overlap across languages

T Limisiewicz, J Balhar, D Mareček - arXiv preprint arXiv:2305.17179, 2023 - arxiv.org
Multilingual language models have recently gained attention as a promising solution for
representing multiple languages in a single model. In this paper, we propose new criteria to …

Interpreting the robustness of neural NLP models to textual perturbations

Y Zhang, L Pan, S Tan, MY Kan - arXiv preprint arXiv:2110.07159, 2021 - arxiv.org
Modern Natural Language Processing (NLP) models are known to be sensitive to input
perturbations and their performance can decrease when applied to real-world, noisy data …

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

S Zhang, V Chaudhary, N Goyal, J Cross… - arXiv preprint arXiv …, 2022 - arxiv.org
A multilingual tokenizer is a fundamental component of multilingual neural machine
translation. It is trained from a multilingual corpus. Since a skewed data distribution is …

Incorporating context into subword vocabularies

S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …