Leveraging large language models for NLG evaluation: Advances and challenges

Z Li, X Xu, T Shen, C Xu, JC Gu, Y Lai… - Proceedings of the …, 2024 - aclanthology.org
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
the introduction of Large Language Models (LLMs) has opened new avenues for assessing …

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
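
A minimal sketch of the pointwise LLM-as-a-judge protocol this survey examines: a judge model is prompted with a criterion and asked for a numeric rating, which is then parsed from its completion. The `call_llm` helper, the `JUDGE_PROMPT` wording, and the 1-5 scale are illustrative assumptions, not the paper's method.

```python
import re

JUDGE_PROMPT = """\
You are an impartial judge. Rate the response below for the given
criterion on a 1-5 scale. Reply with only the number.

Criterion: {criterion}
Question: {question}
Response: {response}
Rating:"""

def judge_response(call_llm, question, response, criterion="overall quality"):
    # `call_llm` is any callable mapping a prompt string to a completion
    # string -- a hypothetical stand-in for a real model API client.
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question,
                                 response=response)
    completion = call_llm(prompt)
    match = re.search(r"[1-5]", completion)  # tolerate extra judge chatter
    return int(match.group()) if match else None
```

A pairwise variant would instead present two candidate responses and ask which is better; both pointwise and pairwise protocols recur across the works listed here.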

Holistic evaluation for interleaved text-and-image generation

M Liu, Z Xu, Z Lin, T Ashby, J Rimchala, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Interleaved text-and-image generation has been an intriguing research direction, where
models are required to generate both images and text pieces in an arbitrary order. Despite …

X-ACE: Explainable and multi-factor audio captioning evaluation

Q Wang, JC Gu, ZH Ling - Findings of the Association for …, 2024 - aclanthology.org
Automated audio captioning (AAC) aims to generate descriptions based on audio input,
attracting exploration of emerging audio language models (ALMs). However, current …

Are LLM-based Evaluators Confusing NLG Quality Criteria?

X Hu, M Gao, S Hu, Y Zhang, Y Chen, T Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks.
However, we discover that LLMs seem to confuse different evaluation criteria, which reduces …
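
One way to probe the criterion confusion this abstract describes, sketched under the assumption of a generic `call_llm` callable (hypothetical, standing in for any model API): score the same outputs under each criterion in isolated prompts, then check whether supposedly distinct criteria correlate almost perfectly.

```python
from itertools import combinations
from statistics import correlation  # requires Python 3.10+

def criterion_scores(call_llm, texts, criteria):
    # Score each text once per criterion, each in its own prompt, so the
    # judge's rating for one criterion cannot leak into another.
    scores = {c: [] for c in criteria}
    for text in texts:
        for c in criteria:
            prompt = (f"Rate the following text for {c} on a 1-5 scale. "
                      f"Reply with only the number.\n\nText: {text}\nRating:")
            scores[c].append(int(call_llm(prompt).strip()[0]))
    return scores

def confusion_report(scores):
    # Near-perfect correlation between two supposedly distinct criteria
    # suggests the judge is not actually distinguishing them.
    for a, b in combinations(scores, 2):
        print(f"{a} vs {b}: r = {correlation(scores[a], scores[b]):.2f}")
```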

CheckEval: Robust Evaluation Framework using Large Language Model via Checklist

Y Lee, J Kim, J Kim, H Cho, P Kang - arXiv preprint arXiv:2403.18771, 2024 - arxiv.org
We introduce CheckEval, a novel evaluation framework using Large Language Models,
addressing the challenges of ambiguity and inconsistency in current evaluation methods …
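
A minimal sketch of checklist-style evaluation in the spirit of this abstract: a quality criterion is decomposed into Boolean sub-questions whose yes/no answers are aggregated into a score. The `call_llm` helper and the example `coherence_checklist` items are hypothetical illustrations, not CheckEval's released checklist.

```python
def checklist_score(call_llm, text, checklist):
    # `checklist` decomposes one quality criterion into Boolean questions.
    # `call_llm` is any callable mapping a prompt to a completion string.
    passed = 0
    for question in checklist:
        prompt = (f"Text: {text}\n"
                  f"Question: {question}\n"
                  "Answer strictly Yes or No.")
        if call_llm(prompt).strip().lower().startswith("yes"):
            passed += 1
    return passed / len(checklist)  # fraction of checks passed, in [0, 1]

# An illustrative decomposition of "coherence" into yes/no items:
coherence_checklist = [
    "Does the text stay on a single topic?",
    "Do the sentences follow a logical order?",
    "Are pronouns and references unambiguous?",
]
```

Binary sub-questions trade granularity for consistency: each answer is easier for the judge to decide than a direct scalar rating, and the aggregate is more reproducible across runs.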

FormalAlign: Automated Alignment Evaluation for Autoformalization

J Lu, Y Wan, Y Huang, J Xiong, Z Liu, Z Guo - arXiv preprint arXiv …, 2024 - arxiv.org
Autoformalization aims to convert informal mathematical proofs into machine-verifiable
formats, bridging the gap between natural and formal languages. However, ensuring …

CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses

J Yao, X Yi, X Xie - arXiv preprint arXiv:2407.10725, 2024 - arxiv.org
The rapid progress in Large Language Models (LLMs) poses potential risks such as
generating unethical content. Assessing LLMs' values can help expose their misalignment …

SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

R Ghosh, T Yao, L Chen, S Hasan, T Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Model (LLM) integrations into applications like the Microsoft 365 suite and
Google Workspace for creating/processing documents, emails, presentations, etc. have led to …