GameEval: Evaluating LLMs on Conversational Games

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

J Huang, EJ Li, MH Lam, T Liang, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Decision-making, a complicated task requiring various types of abilities, presents an
excellent framework for assessing Large Language Models (LLMs). Our research …

Cited by 14 · Related articles · All 3 versions

Beyond static datasets: A deep interaction approach to LLM evaluation

J Li, R Li, Q Liu - arXiv preprint arXiv:2309.04369, 2023 - arxiv.org

Large Language Models (LLMs) have made progress in various real-world tasks, which
stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are …

Cited by 7 · Related articles · All 2 versions

A survey on large language model-based game agents

S Hu, T Huang, F Ilhan, S Tekin, G Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

The development of game agents holds a critical role in advancing towards Artificial General
Intelligence (AGI). The progress of LLMs and their multimodal counterparts (MLLMs) offers …

Cited by 11 · Related articles · All 2 versions

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

Y Zhang, S Mao, T Ge, X Wang, A de Wynter… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper presents a comprehensive survey of the current status and opportunities for
Large Language Models (LLMs) in strategic reasoning, a sophisticated form of reasoning …

Cited by 18 · Related articles · All 2 versions

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

A Beyer, K Chalamalasetti, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org

It has been established in recent work that Large Language Models (LLMs) can be
prompted to "self-play" conversational games that probe certain capabilities (general …

Cited by 2 · Related articles · All 3 versions

InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context

Z Liu, A Anand, P Zhou, J Huang, J Zhao - arXiv preprint arXiv:2406.12203, 2024 - arxiv.org

Large language models (LLMs) have demonstrated the potential to mimic human social
intelligence. However, most studies focus on simplistic and static self-report or performance …

Cited by 1 · Related articles

BERALL: Towards Generating Retrieval-augmented State-based Interactive Fiction Games

R Chambers, N Tack, E Pearson… - The 4th Wordplay …, 2024 - wordplay-workshop.github.io

Interactive fiction (IF) games are a genre of games where the player interacts with the
fictional world via text-based commands, solving puzzles primarily by exploring the world …

Cited by 1 · Related articles

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

O Topsakal, JB Harper - Electronics, 2024 - mdpi.com

This study investigates the strategic decision-making abilities of large language models
(LLMs) via the game of Tic-Tac-Toe, renowned for its straightforward rules and definitive …

Cited by 1 · Related articles · All 3 versions

Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models

S Hakimov, Y Abdullayeva, K Koshti, A Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org

While the situation has improved for text-only models, it again seems to be the case currently
that multimodal (text and image) models develop faster than ways to evaluate them. In this …

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

N Bhavsar, J Jordan, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org

What makes a good Large Language Model (LLM)? That it performs well on the relevant
benchmarks--which hopefully measure, with some validity, the presence of capabilities that …