Leveraging Large Language Models for NLG Evaluation: Advances and Challenges
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
Leveraging Large Language Models for NLG Evaluation: A Survey
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …
Holistic evaluation for interleaved text-and-image generation
Interleaved text-and-image generation has been an intriguing research direction, where the
models are required to generate both images and text pieces in an arbitrary order. Despite …
X-ace: Explainable and multi-factor audio captioning evaluation
Automated audio captioning (AAC) aims to generate descriptions based on audio input,
attracting exploration of emerging audio language models (ALMs). However, current …
Are LLM-based Evaluators Confusing NLG Quality Criteria?
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks.
However, we discover that LLMs seem to confuse different evaluation criteria, which reduces …
CheckEval: Robust Evaluation Framework using Large Language Model via Checklist
We introduce CheckEval, a novel evaluation framework using Large Language Models,
addressing the challenges of ambiguity and inconsistency in current evaluation methods …
FormalAlign: Automated Alignment Evaluation for Autoformalization
Autoformalization aims to convert informal mathematical proofs into machine-verifiable
formats, bridging the gap between natural and formal languages. However, ensuring …
CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses
The rapid progress in Large Language Models (LLMs) poses potential risks such as
generating unethical content. Assessing LLMs' values can help expose their misalignment …
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Large Language Model (LLM) integrations into applications like the Microsoft 365 suite and
Google Workspace for creating/processing documents, emails, presentations, etc. have led to …