Towards a cleaner document-oriented multilingual crawled corpus

MR Costa-jussà, J Cross, O Çelebi, M Elbayad… - arXiv preprint arXiv …, 2022 - arxiv.org

Driven by the goal of eradicating language barriers on a global scale, machine translation
has solidified itself as a key focus of artificial intelligence research today. However, such …

Cited by 686 · Related articles · All 2 versions

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc

We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

Cited by 75 · Related articles · All 6 versions

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org

This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

Cited by 33 · Related articles · All 4 versions

Language varieties of Italy: Technology challenges and opportunities

A Ramponi - Transactions of the Association for Computational …, 2024 - direct.mit.edu

Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which
implicitly encodes local knowledge, cultural traditions, artistic expressions, and history of its …

Cited by 8 · Related articles · All 6 versions

Scaling neural machine translation to 200 languages

NLLB Team - Nature, 2024 - pmc.ncbi.nlm.nih.gov

The development of neural techniques has opened up new avenues for research in
machine translation. Today, neural machine translation (NMT) systems can leverage highly …

Cited by 14 · Related articles

CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages

T Nguyen, C Van Nguyen, VD Lai, H Man… - arXiv preprint arXiv …, 2023 - arxiv.org

The driving factors behind the development of large language models (LLMs) with
impressive learning capabilities are their colossal model sizes and extensive training …

Cited by 62 · Related articles · All 3 versions

ChatGPT MT: Competitive for high-(but not low-) resource languages

NR Robinson, P Ogayo, DR Mortensen… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models (LLMs) implicitly learn to perform a range of language tasks,
including machine translation (MT). Previous studies explore aspects of LLMs' MT …

Cited by 56 · Related articles · All 4 versions

Advancing neural encoding of Portuguese with Transformer Albertina PT

J Rodrigues, L Gomes, J Silva, A Branco… - EPIA Conference on …, 2023 - Springer

To advance the neural encoding of Portuguese (PT), and a fortiori the technological
preparation of this language for the digital age, we developed a Transformer-based …

Cited by 44 · Related articles · All 5 versions

CroissantLLM: A truly bilingual French-English language model

M Faysse, P Fernandes, NM Guerreiro… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and
French tokens, to bring to the research and industrial community a high-performance, fully …

Cited by 20 · Related articles · All 8 versions

TurkishBERTweet: Fast and reliable large language model for social media analysis

A Najafi, O Varol - Expert Systems with Applications, 2024 - Elsevier

Turkish is one of the most spoken languages in the world; however, it is still among the
low-resource languages. Wide use of this language on social media platforms such as Twitter …

Cited by 8 · Related articles · All 3 versions