No language left behind: Scaling human-centered machine translation
Driven by the goal of eradicating language barriers on a global scale, machine translation
has solidified itself as a key focus of artificial intelligence research today. However, such …
MADLAD-400: A multilingual and document-level large audited dataset
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …
Datasets for large language models: A comprehensive survey
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …
Language varieties of Italy: Technology challenges and opportunities
A Ramponi - Transactions of the Association for Computational …, 2024 - direct.mit.edu
Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which
implicitly encodes local knowledge, cultural traditions, artistic expressions, and history of its …
Scaling neural machine translation to 200 languages
NLLB Team - Nature, 2024 - pmc.ncbi.nlm.nih.gov
The development of neural techniques has opened up new avenues for research in
machine translation. Today, neural machine translation (NMT) systems can leverage highly …
CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages
The driving factors behind the development of large language models (LLMs) with
impressive learning capabilities are their colossal model sizes and extensive training …
ChatGPT MT: Competitive for high-(but not low-) resource languages
Large language models (LLMs) implicitly learn to perform a range of language tasks,
including machine translation (MT). Previous studies explore aspects of LLMs' MT …
Advancing neural encoding of Portuguese with Transformer Albertina PT
To advance the neural encoding of Portuguese (PT), and a fortiori the technological
preparation of this language for the digital age, we developed a Transformer-based …
CroissantLLM: A truly bilingual French-English language model
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and
French tokens, to bring to the research and industrial community a high-performance, fully …
TurkishBERTweet: Fast and reliable large language model for social media analysis
Turkish is one of the most spoken languages in the world; however, it is still among the low-
resource languages. Wide use of this language on social media platforms such as Twitter …