MosaicBERT: A bidirectional encoder optimized for fast pretraining

J Portes, A Trott, S Havens, D King… - Advances in …, 2023 - proceedings.neurips.cc
Although BERT-style encoder models are heavily used in NLP research, many researchers
do not pretrain their own BERTs from scratch due to the high cost of training. In the past half …

u-μP: The Unit-Scaled Maximal Update Parametrization

C Blake, C Eichenberg, J Dean, L Balles… - arXiv preprint arXiv …, 2024 - arxiv.org
The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters
(HPs) of a model independent of its size, allowing them to be swept using a cheap proxy …
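The snippet describes the core $\mu$P workflow: sweep hyperparameters on a small, cheap proxy model, then reuse them on the full-size model. Below is a minimal sketch of that idea, assuming the commonly cited $\mu$P rule for Adam of scaling hidden-layer learning rates by base_width/width; the model, widths, and learning rate are hypothetical placeholders, and the paper's unit-scaled variant (u-$\mu$P) refines this scheme further.

```python
import torch
import torch.nn as nn

# Hypothetical illustration of the basic muP idea from the abstract: tune a
# learning rate on a narrow proxy model, then transfer it to a wider model by
# scaling per-layer learning rates with width. Assumes the commonly cited muP
# rule for Adam (hidden-layer LR scaled by base_width / width); the paper's
# unit-scaled variant (u-muP) goes beyond this simple sketch.

BASE_WIDTH = 256      # width of the cheap proxy model used for the HP sweep
TARGET_WIDTH = 4096   # width of the large model we actually want to train
BASE_LR = 3e-4        # best LR found on the proxy sweep (assumed value)

def build_mlp(width: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(768, width),    # input projection
        nn.ReLU(),
        nn.Linear(width, width),  # hidden layer: LR rescaled with width
        nn.ReLU(),
        nn.Linear(width, 10),     # readout layer
    )

def mup_param_groups(model: nn.Sequential, width: int, base_lr: float):
    """Build Adam parameter groups: hidden weights get base_lr * BASE_WIDTH / width,
    all other parameters keep the base learning rate found on the proxy."""
    hidden = model[2]
    others = [p for m in model if m is not hidden for p in m.parameters()]
    return [
        {"params": list(hidden.parameters()), "lr": base_lr * BASE_WIDTH / width},
        {"params": others, "lr": base_lr},
    ]

# The proxy sweep is run at BASE_WIDTH; the chosen BASE_LR then transfers to
# the wide model without re-tuning, which is the cost saving muP targets.
large_model = build_mlp(TARGET_WIDTH)
optimizer = torch.optim.Adam(mup_param_groups(large_model, TARGET_WIDTH, BASE_LR))
```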

Inside the Cerebras wafer-scale cluster

S Lie - IEEE Micro, 2024 - ieeexplore.ieee.org
The compute and memory demands of machine learning have driven the industry to use
clusters of thousands of GPUs to train state-of-the-art models. However, scaling performance …

Does Transformer Interpretability Transfer to RNNs?

G Paulo, T Marshall, N Belrose - arXiv preprint arXiv:2404.05971, 2024 - arxiv.org
Recent advances in recurrent neural network architectures, such as Mamba and RWKV,
have enabled RNNs to match or exceed the performance of equal-size transformers in terms …