When LLMs meet cunning questions: A fallacy understanding benchmark for large language models

LatEval: An interactive LLMs evaluation benchmark with incomplete information from lateral thinking puzzles

S Huang, S Ma, Y Li, M Huang, W Zou, W Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org

With the continuous evolution and refinement of LLMs, they are endowed with impressive
logical reasoning or vertical thinking capabilities. But can they think out of the box? Do they …

Cited by 13

Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark

Z Xu, Y Li, R Ding, X Wang, B Chen, Y Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org

How to better evaluate the capabilities of Large Language Models (LLMs) is the focal point
and hot topic in current LLMs research. Previous work has noted that due to the extremely …

Cited by 2

EXCGEC: A Benchmark of Edit-wise Explainable Chinese Grammatical Error Correction

J Ye, S Qin, Y Li, X Cheng, L Qin, HT Zheng… - arXiv preprint arXiv …, 2024 - arxiv.org

Existing studies explore the explainability of Grammatical Error Correction (GEC) in a limited
scenario, where they ignore the interaction between corrections and explanations. To bridge …

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

X Wu, J Yang, L Chai, G Zhang, J Liu, X Du… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements in Large Language Models (LLMs) have markedly enhanced the
interpretation and processing of tabular data, introducing previously unimaginable …

CLEME2.0: Towards More Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction

J Ye, Z Xu, Y Li, X Cheng, L Song, Q Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org

The paper focuses on improving the interpretability of Grammatical Error Correction (GEC)
metrics, which receives little attention in previous studies. To bridge the gap, we propose …

TiNID: A Transfer and Interpretable LLM-Enhanced Framework for New Intent Discovery

S Zhang, C Yan, J Yang, W Zhang, C Ren, T Li… - … Conference on Machine …, 2024 - Springer

New Intent Discovery (NID) is an essential task in open-world learning, tasked with
the identification and classification of both known and novel intents using a combination of …

Research Progress on Evaluation Techniques for Large Language Models.

赵睿卓，曲紫畅，陈国英，王坤龙… - … Ju Cai Ji Yu Chu Li, 2024 - search.ebscohost.com

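With the widespread application of large language models, evaluating them has become crucial. Beyond assessing
their performance on downstream tasks, their potential risks need evaluation even more; for example, large language models may violate …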

Can a Hallucinating Model help in Reducing Human "Hallucination"?

SS Sundaram, B Alwar - arXiv preprint arXiv:2405.00843, 2024 - arxiv.org

The prevalence of unwarranted beliefs, spanning pseudoscience, logical fallacies, and
conspiracy theories, presents substantial societal hurdles and the risk of disseminating …

Sentiment Analysis in E-Commerce: Causal Reasoning Through LLMs and CausalBench

A Bibi, W Burgard - researchgate.net

In the evolving landscape of e-commerce, understanding customer sentiment is critical for
tailoring business strategies and improving customer experiences. Traditional sentiment …

Evaluating Causal Reasoning in LLMs for Customer Feedback Analysis with CausalBench

M Asghar, W Burgard - researchgate.net

In the realm of customer feedback analysis, accurately understanding the causal
relationships between feedback elements and customer behavior is essential for making …