No language left behind: Scaling human-centered machine translation

MR Costa-jussà, J Cross, O Çelebi, M Elbayad… - arXiv preprint arXiv …, 2022 - arxiv.org
Driven by the goal of eradicating language barriers on a global scale, machine translation
has solidified itself as a key focus of artificial intelligence research today. However, such …

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

Language varieties of Italy: Technology challenges and opportunities

A Ramponi - Transactions of the Association for Computational …, 2024 - direct.mit.edu
Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which
implicitly encodes local knowledge, cultural traditions, artistic expressions, and history of its …

Scaling neural machine translation to 200 languages

NLLB Team - Nature, 2024 - pmc.ncbi.nlm.nih.gov
The development of neural techniques has opened up new avenues for research in
machine translation. Today, neural machine translation (NMT) systems can leverage highly …

CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages

T Nguyen, C Van Nguyen, VD Lai, H Man… - arXiv preprint arXiv …, 2023 - arxiv.org
The driving factors behind the development of large language models (LLMs) with
impressive learning capabilities are their colossal model sizes and extensive training …

ChatGPT MT: Competitive for high- (but not low-) resource languages

NR Robinson, P Ogayo, DR Mortensen… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) implicitly learn to perform a range of language tasks,
including machine translation (MT). Previous studies explore aspects of LLMs' MT …

Advancing neural encoding of Portuguese with Transformer Albertina PT

J Rodrigues, L Gomes, J Silva, A Branco… - EPIA Conference on …, 2023 - Springer
To advance the neural encoding of Portuguese (PT), and a fortiori the technological
preparation of this language for the digital age, we developed a Transformer-based …

CroissantLLM: A truly bilingual French-English language model

M Faysse, P Fernandes, NM Guerreiro… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and
French tokens, to bring to the research and industrial community a high-performance, fully …

TurkishBERTweet: Fast and reliable large language model for social media analysis

A Najafi, O Varol - Expert Systems with Applications, 2024 - Elsevier
Turkish is one of the most spoken languages in the world; however, it is still among the
low-resource languages. Wide use of this language on social media platforms such as Twitter …