Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

Recent advances in natural language processing via large pre-trained language models: A survey

B Min, H Ross, E Sulem, APB Veyseh… - ACM Computing Surveys, 2023 - dl.acm.org
Large, pre-trained language models (PLMs) such as BERT and GPT have drastically
changed the Natural Language Processing (NLP) field. For numerous NLP tasks …

On the opportunities and risks of foundation models

R Bommasani, DA Hudson, E Adeli, R Altman… - arXiv preprint arXiv …, 2021 - arxiv.org
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are
trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …

Foundation models and fair use

P Henderson, X Li, D Jurafsky, T Hashimoto… - Journal of Machine Learning Research, 2023 - jmlr.org
Existing foundation models are trained on copyrighted material. Deploying these models
can pose both legal and ethical risks when data creators fail to receive appropriate …

The BigScience ROOTS corpus: A 1.6 TB composite multilingual dataset

H Laurençon, L Saulnier, T Wang… - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
As language models grow ever larger, the need for large-scale high-quality text datasets has
never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year …

MADLAD-400: A multilingual and document-level large audited dataset

S Kudugunta, I Caswell, B Zhang… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
We introduce MADLAD-400, a manually audited, general domain 3T token monolingual
dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations …

A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity

S Longpre, G Yauney, E Reif, K Lee, A Roberts… - arXiv preprint arXiv …, 2023 - arxiv.org
Pretraining is the preliminary and fundamental step in developing capable language models
(LMs). Despite this, pretraining data design is critically under-documented and often guided …

A comprehensive study of ChatGPT: advancements, limitations, and ethical considerations in natural language processing and cybersecurity

M Alawida, S Mejri, A Mehmood, B Chikhaoui… - Information, 2023 - mdpi.com
This paper presents an in-depth study of ChatGPT, a state-of-the-art language model that is
revolutionizing generative text. We provide a comprehensive analysis of its architecture …

Sparks: Inspiration for science writing using language models

KI Gero, V Liu, L Chilton - Proceedings of the 2022 ACM Designing Interactive Systems Conference, 2022 - dl.acm.org
Large-scale language models are rapidly improving, performing well on a wide variety of
tasks with little to no customization. In this work we investigate how language models can …

Cramming: Training a language model on a single GPU in one day

J Geiping, T Goldstein - International Conference on Machine Learning, 2023 - proceedings.mlr.press
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …