LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

S Huang, S Ma, Y Li, M Huang, W Zou, W Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the continuous evolution and refinement of LLMs, they are endowed with impressive
logical reasoning or vertical thinking capabilities. But can they think out of the box? Do they …

Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark

Z Xu, Y Li, R Ding, X Wang, B Chen, Y Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
How to better evaluate the capabilities of Large Language Models (LLMs) is the focal point
and hot topic in current LLMs research. Previous work has noted that due to the extremely …

EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction

J Ye, S Qin, Y Li, X Cheng, L Qin, HT Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited
scenario, where they ignore the interaction between corrections and explanations. To bridge …

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

X Wu, J Yang, L Chai, G Zhang, J Liu, X Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Large Language Models (LLMs) have markedly enhanced the
interpretation and processing of tabular data, introducing previously unimaginable …

CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction

J Ye, Z Xu, Y Li, X Cheng, L Song, Q Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The paper focuses on improving the interpretability of Grammatical Error Correction (GEC)
metrics, which receives little attention in previous studies. To bridge the gap, we propose …

TiNID: A Transfer and Interpretable LLM-Enhanced Framework for New Intent Discovery

S Zhang, C Yan, J Yang, W Zhang, C Ren, T Li… - … Conference on Machine …, 2024 - Springer
New Intent Discovery (NID) is an essential task in open-world learning, tasked with
the identification and classification of both known and novel intents using a combination of …

Research Progress on Evaluation Techniques for Large Language Models.

赵睿卓, 曲紫畅, 陈国英, 王坤龙… - … Ju Cai Ji Yu Chu Li, 2024 - search.ebscohost.com
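With the widespread application of large language models, evaluating them has become crucial. Beyond
their performance on downstream tasks, their potential risks are in even greater need of evaluation; for example, large language models may violate …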

Can a Hallucinating Model help in Reducing Human "Hallucination"?

SS Sundaram, B Alwar - arXiv preprint arXiv:2405.00843, 2024 - arxiv.org
The prevalence of unwarranted beliefs, spanning pseudoscience, logical fallacies, and
conspiracy theories, presents substantial societal hurdles and the risk of disseminating …

Sentiment Analysis in E-Commerce: Causal Reasoning Through LLMs and CausalBench

A Bibi, W Burgard - researchgate.net
In the evolving landscape of e-commerce, understanding customer sentiment is critical for
tailoring business strategies and improving customer experiences. Traditional sentiment …

Evaluating Causal Reasoning in LLMs for Customer Feedback Analysis with CausalBench

M Asghar, W Burgard - researchgate.net
In the realm of customer feedback analysis, accurately understanding the causal
relationships between feedback elements and customer behavior is essential for making …