Embers of autoregression: Understanding large language models through the problem they are trained to solve
The widespread adoption of large language models (LLMs) makes it important to recognize
their strengths and limitations. We argue that in order to develop a holistic understanding of …
Clippo: Image-and-language understanding from pixels only
M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …
Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …
Visually Grounded Language Learning: a review of language games, datasets, tasks, and models
In recent years, several machine learning models have been proposed. They are trained
with a language modelling objective on large-scale text-only data. With such pretraining …
[PDF][PDF] Hierarchical Text Classification: a review of current research
A Zangari, M Marcuzzo, M Schiavinato… - EXPERT SYSTEMS …, 2023 - iris.unive.it
It is often the case that collections of documents are annotated with hierarchically-structured
concepts. However, the benefits of this structure are rarely taken into account by …
BERT-defense: A probabilistic model based on BERT to combat cognitively inspired orthographic adversarial attacks
Adversarial attacks expose important blind spots of deep learning systems. While word-and
sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input …
Tokenization impacts multilingual language modeling: Assessing vocabulary allocation and overlap across languages
T Limisiewicz, J Balhar, D Mareček - arXiv preprint arXiv:2305.17179, 2023 - arxiv.org
Multilingual language models have recently gained attention as a promising solution for
representing multiple languages in a single model. In this paper, we propose new criteria to …
Interpreting the robustness of neural NLP models to textual perturbations
Modern Natural Language Processing (NLP) models are known to be sensitive to input
perturbations and their performance can decrease when applied to real-world, noisy data …
How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
A multilingual tokenizer is a fundamental component of multilingual neural machine
translation. It is trained from a multilingual corpus. Since a skewed data distribution is …
Incorporating context into subword vocabularies
S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …