A survey on stability of learning with limited labelled data and its sensitivity to the effects of randomness

B Pecher, I Srba, M Bielikova - ACM Computing Surveys, 2024 - dl.acm.org
Learning with limited labelled data, such as prompting, in-context learning, fine-tuning, meta-
learning, or few-shot learning, aims to effectively train a model using only a small amount of …

A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …

Open problems in technical AI governance

A Reuel, B Bucknall, S Casper, T Fist, L Soder… - arXiv preprint arXiv …, 2024 - arxiv.org
AI progress is creating a growing range of risks and opportunities, but it is often unclear how
they should be navigated. In many cases, the barriers and uncertainties faced are at least …

CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian

G Attanasio, P Basile, F Borazio, D Croce… - Proceedings of the …, 2024 - clic2024.ilc.cnr.it
The rapid development of Large Language Models (LLMs) has called for robust benchmarks
to assess their abilities, track progress, and compare iterations. While existing benchmarks …

LAB-Bench: Measuring capabilities of language models for biology research

JM Laurent, JD Janizek, M Ruzo, MM Hinks… - arXiv preprint arXiv …, 2024 - arxiv.org
There is widespread optimism that frontier Large Language Models (LLMs) and LLM-
augmented systems have the potential to rapidly accelerate scientific discovery across …

Composable interventions for language models

A Kolbeinsson, K O'Brien, T Huang, S Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Test-time interventions for language models can enhance factual accuracy, mitigate harmful
outputs, and improve model efficiency without costly retraining. But despite a flood of new …

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

LLM Stability: A detailed analysis with some surprises

B Atil, A Chittams, L Fu, F Ture, L Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
LLM (large language model) practitioners commonly notice that outputs can vary for the
same inputs, but we have been unable to find work that evaluates LLM stability as the main …

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

A Reuel, A Hardy, C Smith, M Lamparth… - arXiv preprint arXiv …, 2024 - arxiv.org
AI models are increasingly prevalent in high-stakes environments, necessitating thorough
assessment of their capabilities and risks. Benchmarks are popular for measuring these …