KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models

J Seo, J Lee, C Park, ST Hong, S Lee… - Findings of the …, 2024 - aclanthology.org
The evolution of large language models (LLMs) has culminated in a multitask model
paradigm where prompts drive the generation of user-specific outputs. However, this …

EmoBench: Evaluating the Emotional Intelligence of Large Language Models

S Sabour, S Liu, Z Zhang, JM Liu, J Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Large Language Models (LLMs) have highlighted the need for robust,
comprehensive, and challenging benchmarks. Yet, research on evaluating their Emotional …

Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

HT Su, YC Hsu, X Lin, XQ Shi, Y Niu, HY Hsu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) equipped with chain-of-thought (CoT) prompting have
shown significant multi-step reasoning capabilities in factual content like mathematics …

CogLM: Tracking Cognitive Development of Large Language Models

X Wang, P Yuan, S Feng, Y Li, B Pan, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Piaget's Theory of Cognitive Development (PTC) posits that the development of cognitive
levels forms the foundation for human learning across various abilities. As Large Language …

Testing Memory Capabilities in Large Language Models with the Sequence Order Recall Task

M Pink, VA Vo, Q Wu, J Mu, JS Turek, U Hasson… - Latinx in AI @ NeurIPS … - openreview.net
Many benchmarks focus on evaluating Large Language Models (LLMs) on facts and
semantic relations, primarily assessing their semantic memory. However, some memories in …