Recent advances in natural language processing via large pre-trained language models: A survey

B Min, H Ross, E Sulem, APB Veyseh… - ACM Computing …, 2023 - dl.acm.org
Large, pre-trained language models (PLMs) such as BERT and GPT have drastically
changed the Natural Language Processing (NLP) field. For numerous NLP tasks …

Neural machine translation for low-resource languages: A survey

S Ranathunga, ESA Lee, M Prifti Skenduli… - ACM Computing …, 2023 - dl.acm.org
Neural Machine Translation (NMT) has seen a tremendous spurt of growth in less than ten
years and has already entered a mature phase. While considered the most widely …

PaLM 2 technical report

R Anil, AM Dai, O Firat, M Johnson, D Lepikhin… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and
reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is …

Auditing large language models: a three-layered approach

J Mökander, J Schuett, HR Kirk, L Floridi - AI and Ethics, 2023 - Springer
Large language models (LLMs) represent a major advance in artificial intelligence (AI)
research. However, the widespread use of LLMs is also coupled with significant ethical and …

The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in …, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1 …

In-context examples selection for machine translation

S Agrawal, C Zhou, M Lewis, L Zettlemoyer… - arXiv preprint arXiv …, 2022 - arxiv.org
Large-scale generative models show an impressive ability to perform a wide range of
Natural Language Processing (NLP) tasks using in-context learning, where a few examples …

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in …, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only

G Penedo, Q Malartic, D Hesslow… - Advances in …, 2023 - proceedings.neurips.cc
Large language models are commonly trained on a mixture of filtered web data and
curated "high-quality" corpora, such as social media conversations, books, or technical …

A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity

S Longpre, G Yauney, E Reif, K Lee, A Roberts… - arXiv preprint arXiv …, 2023 - arxiv.org
Pretraining is the preliminary and fundamental step in developing capable language models
(LMs). Despite this, pretraining data design is critically under-documented and often guided …

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B van Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …