MosaicBERT: A bidirectional encoder optimized for fast pretraining
Although BERT-style encoder models are heavily used in NLP research, many researchers
do not pretrain their own BERTs from scratch due to the high cost of training. In the past half …
u-μP: The Unit-Scaled Maximal Update Parametrization
The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters
(HPs) of a model independent of its size, allowing them to be swept using a cheap proxy …
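As a rough illustration of the idea this snippet describes (not the paper's own code), the sketch below shows the commonly cited μP rule for Adam: learning rates for weight matrices whose fan-in grows with width shrink as base_width/width, so a base learning rate swept on a narrow proxy model remains near-optimal at larger width. BASE_WIDTH, BASE_LR, and mup_adam are illustrative names; μP's accompanying initialization and output-multiplier rescalings are omitted here.

import torch
import torch.nn as nn

BASE_WIDTH = 256   # width of the cheap proxy model where hyperparameters are swept
BASE_LR = 1e-3     # learning rate found near-optimal on the proxy

def mup_adam(layers, width):
    """Adam param groups with muP-style LR scaling: weight matrices whose
    fan-in grows with width get BASE_LR * (BASE_WIDTH / width); vector-like
    parameters (biases) keep BASE_LR. The init and output-multiplier
    rescalings muP also prescribes are omitted for brevity."""
    width_mult = width / BASE_WIDTH
    groups = []
    for module, width_scaled in layers:
        w_lr = BASE_LR / width_mult if width_scaled else BASE_LR
        groups.append({"params": [module.weight], "lr": w_lr})
        if module.bias is not None:
            groups.append({"params": [module.bias], "lr": BASE_LR})
    return torch.optim.Adam(groups)

width = 1024  # larger target model; hyperparameters were swept at BASE_WIDTH
inp, hidden, out = nn.Linear(32, width), nn.Linear(width, width), nn.Linear(width, 10)
model = nn.Sequential(inp, nn.ReLU(), hidden, nn.ReLU(), out)
# Only layers whose fan-in scales with width get the 1/width LR rule.
optimizer = mup_adam([(inp, False), (hidden, True), (out, True)], width)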
Inside the cerebras wafer-scale cluster
S Lie - IEEE Micro, 2024 - ieeexplore.ieee.org
The compute and memory demands of machine learning have driven the industry to use
clusters of thousands of GPUs to train state-of-the-art models. However, scaling performance …
Does Transformer Interpretability Transfer to RNNs?
Recent advances in recurrent neural network architectures, such as Mamba and RWKV,
have enabled RNNs to match or exceed the performance of equal-size transformers in terms …