Embers of autoregression: Understanding large language models through the problem they are trained to solve

RT McCoy, S Yao, D Friedman, M Hardy… - arXiv preprint arXiv …, 2023 - arxiv.org
The widespread adoption of large language models (LLMs) makes it important to recognize
their strengths and limitations. We argue that in order to develop a holistic understanding of …

CLIPPO: Image-and-language understanding from pixels only

M Tschannen, B Mustafa… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Multimodal models are becoming increasingly effective, in part due to unified components,
such as the Transformer architecture. However, multimodal models still often consist of many …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org
In recent years, several machine learning models have been proposed. They are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

Hierarchical Text Classification: a review of current research

A Zangari, M Marcuzzo, M Schiavinato… - EXPERT SYSTEMS …, 2023 - iris.unive.it
It is often the case that collections of documents are annotated with hierarchically-structured
concepts. However, the benefits of this structure are rarely taken into account by …

BERT-defense: A probabilistic model based on BERT to combat cognitively inspired orthographic adversarial attacks

Y Keller, J Mackensen, S Eger - arXiv preprint arXiv:2106.01452, 2021 - arxiv.org
Adversarial attacks expose important blind spots of deep learning systems. While word- and
sentence-level attack scenarios mostly deal with finding semantic paraphrases of the input …

Tokenization impacts multilingual language modeling: Assessing vocabulary allocation and overlap across languages

T Limisiewicz, J Balhar, D Mareček - arXiv preprint arXiv:2305.17179, 2023 - arxiv.org
Multilingual language models have recently gained attention as a promising solution for
representing multiple languages in a single model. In this paper, we propose new criteria to …

Interpreting the robustness of neural NLP models to textual perturbations

Y Zhang, L Pan, S Tan, MY Kan - arXiv preprint arXiv:2110.07159, 2021 - arxiv.org
Modern Natural Language Processing (NLP) models are known to be sensitive to input
perturbations and their performance can decrease when applied to real-world, noisy data …

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

S Zhang, V Chaudhary, N Goyal, J Cross… - arXiv preprint arXiv …, 2022 - arxiv.org
A multilingual tokenizer is a fundamental component of multilingual neural machine
translation. It is trained from a multilingual corpus. Since a skewed data distribution is …

Incorporating context into subword vocabularies

S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …