From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai
The rising popularity of explainable artificial intelligence (XAI) to understand high-performing
black boxes raised the question of how to evaluate explanations of machine learning (ML) …
black boxes raised the question of how to evaluate explanations of machine learning (ML) …
Pre-trained language models for text generation: A survey
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …
A survey of large language models
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …
A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions
The emergence of large language models (LLMs) has marked a significant breakthrough in
natural language processing (NLP), fueling a paradigm shift in information acquisition …
natural language processing (NLP), fueling a paradigm shift in information acquisition …
G-eval: Nlg evaluation using gpt-4 with better human alignment
The quality of texts generated by natural language generation (NLG) systems is hard to
measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE …
measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE …
Benchmarking large language models for news summarization
Large language models (LLMs) have shown promise for automatic summarization but the
reasons behind their successes are poorly understood. By conducting a human evaluation …
reasons behind their successes are poorly understood. By conducting a human evaluation …
Holistic evaluation of language models
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …
technologies, but their capabilities, limitations, and risks are not well understood. We present …
Gptscore: Evaluate as you desire
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models
that are capable of producing high-caliber text, images, and other outputs through the …
that are capable of producing high-caliber text, images, and other outputs through the …
Is chatgpt a good nlg evaluator? a preliminary study
Recently, the emergence of ChatGPT has attracted wide attention from the computational
linguistics community. Many prior studies have shown that ChatGPT achieves remarkable …
linguistics community. Many prior studies have shown that ChatGPT achieves remarkable …
MTEB: Massive text embedding benchmark
Text embeddings are commonly evaluated on a small set of datasets from a single task not
covering their possible applications to other tasks. It is unclear whether state-of-the-art …
covering their possible applications to other tasks. It is unclear whether state-of-the-art …