MosaicBERT: A bidirectional encoder optimized for fast pretraining

J Portes, A Trott, S Havens, D King… - Advances in …, 2023 - proceedings.neurips.cc
Although BERT-style encoder models are heavily used in NLP research, many researchers
do not pretrain their own BERTs from scratch due to the high cost of training. In the past half …

u-μP: The Unit-Scaled Maximal Update Parametrization

C Blake, C Eichenberg, J Dean, L Balles… - arXiv preprint arXiv …, 2024 - arxiv.org
The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters
(HPs) of a model independent of its size, allowing them to be swept using a cheap proxy …
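The snippet describes the core $\mu$P workflow: sweep hyperparameters on a small, cheap proxy model, then reuse them on the full-size model. Below is a minimal sketch of that idea, assuming the commonly cited $\mu$P rule for Adam of scaling hidden-layer learning rates by base_width/width; the model, widths, and learning rate are hypothetical placeholders, and the paper's unit-scaled variant (u-$\mu$P) refines this scheme further.

```python
import torch
import torch.nn as nn

# Hypothetical illustration of the basic muP idea from the abstract: tune a
# learning rate on a narrow proxy model, then transfer it to a wider model by
# scaling per-layer learning rates with width. Assumes the commonly cited muP
# rule for Adam (hidden-layer LR scaled by base_width / width); the paper's
# unit-scaled variant (u-muP) goes beyond this simple sketch.

BASE_WIDTH = 256      # width of the cheap proxy model used for the HP sweep
TARGET_WIDTH = 4096   # width of the large model we actually want to train
BASE_LR = 3e-4        # best LR found on the proxy sweep (assumed value)

def build_mlp(width: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(768, width),    # input projection
        nn.ReLU(),
        nn.Linear(width, width),  # hidden layer: LR rescaled with width
        nn.ReLU(),
        nn.Linear(width, 10),     # readout layer
    )

def mup_param_groups(model: nn.Sequential, width: int, base_lr: float):
    """Build Adam parameter groups: hidden weights get base_lr * BASE_WIDTH / width,
    all other parameters keep the base learning rate found on the proxy."""
    hidden = model[2]
    others = [p for m in model if m is not hidden for p in m.parameters()]
    return [
        {"params": list(hidden.parameters()), "lr": base_lr * BASE_WIDTH / width},
        {"params": others, "lr": base_lr},
    ]

# The proxy sweep is run at BASE_WIDTH; the chosen BASE_LR then transfers to
# the wide model without re-tuning, which is the cost saving muP targets.
large_model = build_mlp(TARGET_WIDTH)
optimizer = torch.optim.Adam(mup_param_groups(large_model, TARGET_WIDTH, BASE_LR))
```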

Inside the Cerebras wafer-scale cluster

S Lie - IEEE Micro, 2024 - ieeexplore.ieee.org
The compute and memory demands of machine learning have driven the industry to use
clusters of thousands of GPUs to train state-of-the-art models. However, scaling performance …

Does Transformer Interpretability Transfer to RNNs?

G Paulo, T Marshall, N Belrose - arXiv preprint arXiv:2404.05971, 2024 - arxiv.org
Recent advances in recurrent neural network architectures, such as Mamba and RWKV,
have enabled RNNs to match or exceed the performance of equal-size transformers in terms …