A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

MTR Laskar, S Alqahtani, MS Bari, M Rahman… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently gained significant attention due to their
remarkable capabilities in performing diverse tasks across various domains. However, a …

Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation

Z Kasner, O Dušek - Proceedings of the 62nd Annual Meeting of …, 2024 - aclanthology.org
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text
(D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid …

On scalable oversight with weak LLMs judging strong LLMs

Z Kenton, NY Siegel, J Kramár, J Brown-Cohen… - arXiv preprint arXiv …, 2024 - arxiv.org
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI.
In this paper we study debate, where two AIs compete to convince a judge; consultancy …

Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey

B Jiang, Y Xie, X Wang, WJ Su, CJ Taylor… - arXiv preprint arXiv …, 2024 - arxiv.org
Rationality is the quality of being guided by reason, characterized by logical thinking and
decision-making that align with evidence and logical rules. This quality is essential for …

Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

A Sinha, C Migozzi, A Rey, C Zhang - arXiv preprint arXiv:2408.09269, 2024 - arxiv.org
Research on multi-modal contrastive learning strategies for audio and text has rapidly
gained interest. Contrastively trained Audio-Language Models (ALMs), such as CLAP, which …

Traditional Methods Outperform Generative LLMs at Forecasting Credit Ratings

F Drinkall, JB Pierrehumbert, S Zohren - arXiv preprint arXiv:2407.17624, 2024 - arxiv.org
Large Language Models (LLMs) have been shown to perform well for many downstream
tasks. Transfer learning can enable LLMs to acquire skills that were not targeted during pre …

Are Large Language Models Actually Good at Text Style Transfer?

S Mukherjee, AK Ojha, O Dušek - arXiv preprint arXiv:2406.05885, 2024 - arxiv.org
We analyze the performance of large language models (LLMs) on Text Style Transfer (TST),
specifically focusing on sentiment transfer and text detoxification across three languages …

Can LLM be a Personalized Judge?

YR Dong, T Hu, N Collier - arXiv preprint arXiv:2406.11657, 2024 - arxiv.org
Ensuring that large language models (LLMs) reflect diverse user values and preferences is
crucial as their user bases expand globally. It is therefore encouraging to see the growing …

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

B Murugadoss, C Poelitz, I Drosos, V Le… - arXiv preprint arXiv …, 2024 - arxiv.org
LLMs-as-a-judge is a recently popularized method which replaces human judgements in
task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to …

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

H Wei, S He, T Xia, A Wong, J Lin, M Han - arXiv preprint arXiv …, 2024 - arxiv.org
Alignment approaches such as RLHF and DPO are actively investigated to align large
language models (LLMs) with human preferences. Commercial large language models …