How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

J Huang, EJ Li, MH Lam, T Liang, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Decision-making, a complicated task requiring various types of abilities, presents an
excellent framework for assessing Large Language Models (LLMs). Our research …

Near to Mid-Term Risks and Opportunities of Open Source Generative AI

F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati… - arXiv preprint arXiv …, 2024 - arxiv.org
In the next few years, applications of Generative AI are expected to revolutionize a number
of different areas, ranging from science & medicine to education. The potential for these …

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

A Beyer, K Chalamalasetti, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
It has been established in recent work that Large Language Models (LLMs) can be
prompted to "self-play" conversational games that probe certain capabilities (general …

Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games

N Herr, F Acero, R Raileanu, M Pérez-Ortiz… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their
strategic decision-making abilities remain largely unexplored. To fully benefit from the …

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

N Bhavsar, J Jordan, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
What makes a good Large Language Model (LLM)? That it performs well on the relevant
benchmarks--which hopefully measure, with some validity, the presence of capabilities that …

Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform

M Cheng, H Zhang, J Yang, Q Liu, L Li… - … Proceedings of the …, 2024 - dl.acm.org
Large language model evaluation plays a pivotal role in enhancing model capabilities.
Previously, numerous methods for evaluating large language models have been proposed …

Large Language Models are Bad Game Theoretic Reasoners: Evaluating Performance and Bias in Two-Player Non-Zero-Sum Games

N Herr, F Acero, R Raileanu, M Perez-Ortiz… - ICML 2024 Workshop on … - openreview.net
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their
strategic abilities remain largely unexplored. Game theory provides a good framework for …