A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Large Language Models (LLMs) have recently gained significant attention due to their
remarkable capabilities in performing diverse tasks across various domains. However, a …
Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text
(D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid …
On scalable oversight with weak LLMs judging strong LLMs
Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI.
In this paper we study debate, where two AIs compete to convince a judge; consultancy …
Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey
Rationality is the quality of being guided by reason, characterized by logical thinking and
decision-making that align with evidence and logical rules. This quality is essential for …
Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs
Research on multi-modal contrastive learning strategies for audio and text has rapidly
gained interest. Contrastively trained Audio-Language Models (ALMs), such as CLAP, which …
Traditional Methods Outperform Generative LLMs at Forecasting Credit Ratings
F Drinkall, JB Pierrehumbert, S Zohren - arXiv preprint arXiv:2407.17624, 2024 - arxiv.org
Large Language Models (LLMs) have been shown to perform well for many downstream
tasks. Transfer learning can enable LLMs to acquire skills that were not targeted during pre …
Are Large Language Models Actually Good at Text Style Transfer?
We analyze the performance of large language models (LLMs) on Text Style Transfer (TST),
specifically focusing on sentiment transfer and text detoxification across three languages …
Can LLM be a Personalized Judge?
Ensuring that large language models (LLMs) reflect diverse user values and preferences is
crucial as their user bases expand globally. It is therefore encouraging to see the growing …
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
LLMs-as-a-judge is a recently popularized method which replaces human judgements in
task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to …
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Alignment approaches such as RLHF and DPO are actively investigated to align large
language models (LLMs) with human preferences. Commercial large language models …