From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI

M Nauta, J Trienes, S Pathak, E Nguyen… - ACM Computing …, 2023 - dl.acm.org
The rising popularity of explainable artificial intelligence (XAI) to understand high-performing
black boxes raised the question of how to evaluate explanations of machine learning (ML) …

Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

L Huang, W Yu, W Ma, W Zhong, Z Feng… - ACM Transactions on …, 2023 - dl.acm.org
The emergence of large language models (LLMs) has marked a significant breakthrough in
natural language processing (NLP), fueling a paradigm shift in information acquisition …

G-Eval: NLG evaluation using GPT-4 with better human alignment

Y Liu, D Iter, Y Xu, S Wang, R Xu, C Zhu - arXiv preprint arXiv:2303.16634, 2023 - arxiv.org
The quality of texts generated by natural language generation (NLG) systems is hard to
measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE …

Benchmarking large language models for news summarization

T Zhang, F Ladhak, E Durmus, P Liang… - Transactions of the …, 2024 - direct.mit.edu
Large language models (LLMs) have shown promise for automatic summarization but the
reasons behind their successes are poorly understood. By conducting a human evaluation …

Holistic evaluation of language models

P Liang, R Bommasani, T Lee, D Tsipras… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …

GPTScore: Evaluate as you desire

J Fu, SK Ng, Z Jiang, P Liu - arXiv preprint arXiv:2302.04166, 2023 - arxiv.org
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models
that are capable of producing high-caliber text, images, and other outputs through the …

Is ChatGPT a good NLG evaluator? A preliminary study

J Wang, Y Liang, F Meng, Z Sun, H Shi, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, the emergence of ChatGPT has attracted wide attention from the computational
linguistics community. Many prior studies have shown that ChatGPT achieves remarkable …

MTEB: Massive text embedding benchmark

N Muennighoff, N Tazi, L Magne, N Reimers - arXiv preprint arXiv …, 2022 - arxiv.org
Text embeddings are commonly evaluated on a small set of datasets from a single task not
covering their possible applications to other tasks. It is unclear whether state-of-the-art …