Explainable Generative AI (GenXAI): a survey, conceptualization, and research agenda

J Schneider - Artificial Intelligence Review, 2024 - Springer
Generative AI (GenAI) represents a shift from AI's ability to “recognize” to its ability to
“generate” solutions for a wide range of tasks. As generated solutions and applications grow …

Leveraging large language models for NLG evaluation: A survey

Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao - arXiv preprint arXiv:2401.07103, 2024 - arxiv.org
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation

Z Kasner, O Dušek - Proceedings of the 62nd Annual Meeting of …, 2024 - aclanthology.org
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text
(D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid …

CopyBench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation

T Chen, A Asai, N Mireshghallah, S Min… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating the degree of reproduction of copyright-protected content by language models
(LMs) is of significant interest to the AI and legal communities. Although both literal and non …

COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models

C Qu, S Dai, X Wei, H Cai, S Wang, D Yin, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the integration of external tools with Large Language Models (LLMs) has emerged
as a promising approach to overcome the inherent constraints of their pre-training data …

EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings

S Kweon, J Kim, H Kwak, D Cha, H Yoon, K Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
This study introduces EHRNoteQA, a novel patient-specific question answering benchmark
tailored for evaluating Large Language Models (LLMs) in clinical environments. Based on …

Are LLM-based Evaluators Confusing NLG Quality Criteria?

X Hu, M Gao, S Hu, Y Zhang, Y Chen, T Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks.
However, we discover that LLMs seem to confuse different evaluation criteria, which reduces …

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Y Li, S Qin, J Ye, S Ma, Y Li, L Qin, X Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, Large Language Models (LLMs) have been widely studied by researchers for their
roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese …

Wisdom of Instruction-Tuned Language Model Crowds. Exploring Model Label Variation

FMP Del Arco, D Nozza, D Hovy - Proceedings of the 3rd …, 2024 - aclanthology.org
Large Language Models (LLMs) exhibit remarkable text classification capabilities,
excelling in zero- and few-shot learning (ZSL and FSL) scenarios. However, since they are …

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

I Ziegler, A Köksal, D Elliott, H Schütze - arXiv preprint arXiv:2409.02098, 2024 - arxiv.org
Building high-quality datasets for specialized tasks is a time-consuming and resource-
intensive process that often requires specialized domain knowledge. We propose Corpus …