Recent advances in natural language processing via large pre-trained language models: A survey

B Min, H Ross, E Sulem, APB Veyseh… - ACM Computing …, 2023 - dl.acm.org
Large, pre-trained language models (PLMs) such as BERT and GPT have drastically
changed the Natural Language Processing (NLP) field. For numerous NLP tasks …

Biases in large language models: origins, inventory, and discussion

R Navigli, S Conia, B Ross - ACM Journal of Data and Information …, 2023 - dl.acm.org
In this article, we introduce and discuss the pervasive issue of bias in the large language
models that are currently at the core of mainstream approaches to Natural Language …

Data selection for language models via importance resampling

SM Xie, S Santurkar, T Ma… - Advances in Neural …, 2023 - proceedings.neurips.cc
Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and
domain-specific (e.g., Codex) language models (LMs). We formalize this problem as selecting …
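
The importance-resampling idea can be illustrated with a small sketch: estimate how likely each candidate document is under a target-domain distribution versus the raw pool, weight by the ratio, and resample. The unigram density estimates and Gumbel-top-k sampling below are simplifying assumptions for illustration, not necessarily the paper's exact estimator or features.

```python
# Illustrative sketch of importance resampling for pretraining data selection.
# The add-one-smoothed unigram models and Gumbel-top-k trick are assumptions
# made for brevity, not the estimator used in the cited paper.
import numpy as np

def fit_unigram(texts, vocab):
    counts = np.ones(len(vocab))                # add-one smoothing
    index = {w: i for i, w in enumerate(vocab)}
    for t in texts:
        for w in t.split():
            if w in index:
                counts[index[w]] += 1
    return counts / counts.sum(), index

def log_prob(text, probs, index):
    return sum(np.log(probs[index[w]]) for w in text.split() if w in index)

def select(raw_pool, target_texts, vocab, k, seed=0):
    """Resample k documents from raw_pool with probability proportional to
    exp(log p_target(x) - log p_raw(x))."""
    p_tgt, idx = fit_unigram(target_texts, vocab)
    p_raw, _ = fit_unigram(raw_pool, vocab)
    log_w = np.array([log_prob(x, p_tgt, idx) - log_prob(x, p_raw, idx)
                      for x in raw_pool])
    rng = np.random.default_rng(seed)
    # Gumbel-top-k: a sample without replacement proportional to softmax(log_w).
    keys = log_w + rng.gumbel(size=len(raw_pool))
    chosen = np.argsort(-keys)[:k]
    return [raw_pool[i] for i in chosen]
```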

Towards trustworthy LLMs: a review on debiasing and dehallucinating in large language models

Z Lin, S Guan, W Zhang, H Zhang, Y Li… - Artificial Intelligence …, 2024 - Springer
Recently, large language models (LLMs) have attracted considerable attention due to their
remarkable capabilities. However, LLMs' generation of biased or hallucinatory content …

Monarch Mixer: A simple sub-quadratic GEMM-based architecture

D Fu, S Arora, J Grogan, I Johnson… - Advances in …, 2024 - proceedings.neurips.cc
Machine learning models are increasingly being scaled in both sequence length
and model dimension to reach longer contexts and better performance. However, existing …
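
The sub-quadratic building block can be sketched as two batches of small block matmuls interleaved with a fixed reshape/transpose permutation, which is the spirit of Monarch-structured layers. The block shapes, the specific permutation, and the parameter initialization below are assumptions for illustration, not the paper's full M2 layer.

```python
# Illustrative sketch of a Monarch-style sub-quadratic linear operator built
# from block-diagonal matmuls and a fixed permutation. Details are simplified
# assumptions; see the cited paper for the actual Monarch Mixer layer.
import numpy as np

def monarch_matvec(x, B1, B2):
    """x: (n,) with n = m*m; B1, B2: (m, m, m) stacks of m blocks of size m x m.
    Cost is O(n * sqrt(n)) multiply-adds instead of O(n^2) for a dense matmul."""
    m = B1.shape[0]
    assert x.shape[0] == m * m
    z = x.reshape(m, m)                 # split the input into m groups of size m
    z = np.einsum('bij,bj->bi', B1, z)  # block-diagonal multiply (first factor)
    z = z.T                             # fixed permutation (transpose of the grid)
    z = np.einsum('bij,bj->bi', B2, z)  # block-diagonal multiply (second factor)
    return z.reshape(-1)

# Usage: a 4096-dim "linear layer" with only 2 * 64 * 64 * 64 parameters.
m = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(m * m)
B1 = rng.standard_normal((m, m, m)) / np.sqrt(m)
B2 = rng.standard_normal((m, m, m)) / np.sqrt(m)
y = monarch_matvec(x, B1, B2)
```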

Sophia: A scalable stochastic second-order optimizer for language model pre-training

H Liu, Z Li, D Hall, P Liang, T Ma - arXiv preprint arXiv:2305.14342, 2023 - arxiv.org
Given the massive cost of language model pre-training, a non-trivial improvement of the
optimization algorithm would lead to a material reduction in the time and cost of training …
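
The flavor of the optimizer can be sketched as a clipped, diagonally preconditioned update: keep moving averages of the gradient and of a diagonal Hessian estimate, divide one by the other, and clip element-wise. The hyperparameter names and the placeholder curvature estimate below are illustrative assumptions; the paper's Hessian estimators and update schedule differ in detail.

```python
# Minimal sketch of a Sophia-style clipped second-order step for one parameter
# tensor. `hess_diag` is assumed to come from an external diagonal-Hessian
# estimator refreshed every few steps; hyperparameter names are illustrative.
import numpy as np

def sophia_step(theta, grad, hess_diag, state, lr=1e-4,
                beta1=0.9, beta2=0.99, rho=0.04, eps=1e-12):
    # Exponential moving averages of the gradient and the Hessian diagonal.
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['h'] = beta2 * state['h'] + (1 - beta2) * hess_diag
    # Preconditioned update, clipped element-wise so steps stay bounded where
    # the curvature estimate is small or unreliable.
    update = np.clip(state['m'] / np.maximum(rho * state['h'], eps), -1.0, 1.0)
    return theta - lr * update

# Usage with a dummy parameter vector and placeholder estimates.
theta = np.zeros(8)
state = {'m': np.zeros(8), 'h': np.zeros(8)}
grad = np.random.default_rng(0).standard_normal(8)
hess = np.abs(grad)                       # placeholder curvature estimate
theta = sophia_step(theta, grad, hess, state)
```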

Should you mask 15% in masked language modeling?

A Wettig, T Gao, Z Zhong, D Chen - arXiv preprint arXiv:2202.08005, 2022 - arxiv.org
Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that
more masking would leave insufficient context to learn good representations; this masking …
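
The convention being questioned is easy to state concretely: select a fraction of token positions (15% in BERT) and corrupt them, conventionally with an 80/10/10 split between [MASK], a random token, and the unchanged original. The sketch below makes the masking rate an explicit parameter; it follows the standard BERT recipe rather than any specific variant studied in the paper.

```python
# Sketch of masked-language-model input corruption with a configurable masking
# rate. The 80/10/10 split among [MASK] / random token / unchanged follows the
# standard BERT recipe; the paper studies what happens as `mask_rate` varies.
import numpy as np

def mask_tokens(token_ids, vocab_size, mask_id, mask_rate=0.15, seed=0):
    rng = np.random.default_rng(seed)
    ids = np.array(token_ids)
    labels = np.full_like(ids, -100)          # -100 marks positions not predicted
    selected = rng.random(ids.shape) < mask_rate
    labels[selected] = ids[selected]          # predict original tokens here
    action = rng.random(ids.shape)
    ids[selected & (action < 0.8)] = mask_id                      # 80%: [MASK]
    rand = rng.integers(0, vocab_size, size=ids.shape)
    swap = selected & (action >= 0.8) & (action < 0.9)
    ids[swap] = rand[swap]                                        # 10%: random token
    # remaining 10%: keep the original token but still predict it
    return ids, labels

masked, labels = mask_tokens([5, 17, 42, 8, 99], vocab_size=30000,
                             mask_id=103, mask_rate=0.4)
```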

Cramming: Training a Language Model on a single GPU in one day

J Geiping, T Goldstein - International Conference on …, 2023 - proceedings.mlr.press
Recent trends in language modeling have focused on increasing performance through
scaling, and have resulted in an environment where training language models is out of …

No train no gain: Revisiting efficient training algorithms for transformer-based language models

J Kaddour, O Key, P Nawrot… - Advances in Neural …, 2024 - proceedings.neurips.cc
The computation necessary for training Transformer-based language models has
skyrocketed in recent years. This trend has motivated research on efficient training …

M-FLAG: Medical vision-language pre-training with frozen language models and latent space geometry optimization

C Liu, S Cheng, C Chen, M Qiao, W Zhang… - … Conference on Medical …, 2023 - Springer
Medical vision-language models enable co-learning and integrating features from medical
imaging and clinical text. However, these models are not easy to train and the latent …
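
The "frozen language model" part of the recipe can be sketched directly: keep the text encoder's weights fixed and train only the image encoder and projection heads against it, e.g. with a contrastive objective. The module names, dimensions, and the simple symmetric InfoNCE-style loss below are assumptions for illustration; the paper's latent space geometry regularization is not shown.

```python
# Sketch of vision-language pretraining with a frozen text encoder: only the
# image encoder and projection heads receive gradients. Names, dimensions, and
# the contrastive loss are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenLMVisionPretrainer(nn.Module):
    def __init__(self, image_encoder, text_encoder, img_dim, txt_dim, latent_dim=256):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():   # freeze the language model
            p.requires_grad_(False)
        self.img_proj = nn.Linear(img_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)

    def forward(self, images, text_inputs):
        img_z = F.normalize(self.img_proj(self.image_encoder(images)), dim=-1)
        with torch.no_grad():                      # no gradients through the LM
            txt_feat = self.text_encoder(text_inputs)
        txt_z = F.normalize(self.txt_proj(txt_feat), dim=-1)
        logits = img_z @ txt_z.t() / 0.07          # image-text similarity matrix
        targets = torch.arange(images.size(0), device=logits.device)
        # symmetric contrastive loss over the batch (image-to-text and text-to-image)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
```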