How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

J Huang, EJ Li, MH Lam, T Liang, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Decision-making, a complicated task requiring various types of abilities, presents an
excellent framework for assessing Large Language Models (LLMs). Our research …

Beyond static datasets: A deep interaction approach to llm evaluation

J Li, R Li, Q Liu - arXiv preprint arXiv:2309.04369, 2023 - arxiv.org
Large Language Models (LLMs) have made progress in various real-world tasks, which
stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are …

A survey on large language model-based game agents

S Hu, T Huang, F Ilhan, S Tekin, G Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of game agents holds a critical role in advancing towards Artificial General
Intelligence (AGI). The progress of LLMs and their multimodal counterparts (MLLMs) offers …

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

Y Zhang, S Mao, T Ge, X Wang, A de Wynter… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents a comprehensive survey of the current status and opportunities for
Large Language Models (LLMs) in strategic reasoning, a sophisticated form of reasoning …

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

A Beyer, K Chalamalasetti, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
It has been established in recent work that Large Language Models (LLMs) can be
prompted to "self-play" conversational games that probe certain capabilities (general …

InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context

Z Liu, A Anand, P Zhou, J Huang, J Zhao - arXiv preprint arXiv:2406.12203, 2024 - arxiv.org
Large language models (LLMs) have demonstrated the potential to mimic human social
intelligence. However, most studies focus on simplistic and static self-report or performance …

BERALL: Towards Generating Retrieval-augmented State-based Interactive Fiction Games

R Chambers, N Tack, E Pearson… - The 4th Wordplay …, 2024 - wordplay-workshop.github.io
Interactive fiction (IF) games are a genre of games where the player interacts with the
fictional world via text-based commands, solving puzzles primarily by exploring the world …

Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe

O Topsakal, JB Harper - Electronics, 2024 - mdpi.com
This study investigates the strategic decision-making abilities of large language models
(LLMs) via the game of Tic-Tac-Toe, renowned for its straightforward rules and definitive …

Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models

S Hakimov, Y Abdullayeva, K Koshti, A Schmidt… - arXiv preprint arXiv …, 2024 - arxiv.org
While the situation has improved for text-only models, it again seems to be the case currently
that multimodal (text and image) models develop faster than ways to evaluate them. In this …

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

N Bhavsar, J Jordan, S Hakimov… - arXiv preprint arXiv …, 2024 - arxiv.org
What makes a good Large Language Model (LLM)? That it performs well on the relevant
benchmarks--which hopefully measure, with some validity, the presence of capabilities that …