Explainable Generative AI (GenXAI): a survey, conceptualization, and research agenda

J Schneider - Artificial Intelligence Review, 2024 - Springer
Generative AI (GenAI) represents a shift from AI's ability to “recognize” to its ability to
“generate” solutions for a wide range of tasks. As generated solutions and applications grow …

Leveraging large language models for NLG evaluation: A survey

Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao - arXiv preprint arXiv:2401.07103, 2024 - arxiv.org
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation

Z Kasner, O Dušek - Proceedings of the 62nd Annual Meeting of …, 2024 - aclanthology.org
We analyze the behaviors of open large language models (LLMs) on the task of data-to-text
(D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid …

CopyBench: Measuring literal and non-literal reproduction of copyright-protected text in language model generation

T Chen, A Asai, N Mireshghallah, S Min… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating the degree of reproduction of copyright-protected content by language models
(LMs) is of significant interest to the AI and legal communities. Although both literal and non …

COLT: Towards Completeness-Oriented Tool Retrieval for Large Language Models

C Qu, S Dai, X Wei, H Cai, S Wang, D Yin, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, the integration of external tools with Large Language Models (LLMs) has emerged
as a promising approach to overcome the inherent constraints of their pre-training data …

EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings

S Kweon, J Kim, H Kwak, D Cha, H Yoon, K Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
This study introduces EHRNoteQA, a novel patient-specific question answering benchmark
tailored for evaluating Large Language Models (LLMs) in clinical environments. Based on …

Are LLM-based Evaluators Confusing NLG Quality Criteria?

X Hu, M Gao, S Hu, Y Zhang, Y Chen, T Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks.
However, we discover that LLMs seem to confuse different evaluation criteria, which reduces …

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Y Li, S Qin, J Ye, S Ma, Y Li, L Qin, X Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, Large Language Models (LLMs) have been widely studied by researchers for their
roles in various downstream NLP tasks. As a fundamental task in the NLP field, Chinese …

Wisdom of Instruction-Tuned Language Model Crowds. Exploring Model Label Variation

FMP Del Arco, D Nozza, D Hovy - Proceedings of the 3rd …, 2024 - aclanthology.org
Large Language Models (LLMs) exhibit remarkable text classification capabilities,
excelling in zero- and few-shot learning (ZSL and FSL) scenarios. However, since they are …

CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

I Ziegler, A Köksal, D Elliott, H Schütze - arXiv preprint arXiv:2409.02098, 2024 - arxiv.org
Building high-quality datasets for specialized tasks is a time-consuming and resource-
intensive process that often requires specialized domain knowledge. We propose Corpus …