A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Datasets for large language models: A comprehensive survey

Y Liu, J Cao, C Liu, K Ding, L Jin - arXiv preprint arXiv:2402.18041, 2024 - arxiv.org
This paper embarks on an exploration into the Large Language Model (LLM) datasets,
which play a crucial role in the remarkable advancements of LLMs. The datasets serve as …

Inadequacies of large language model benchmarks in the era of generative artificial intelligence

TR McIntosh, T Susnjak, T Liu, P Watters… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities
has spurred public curiosity to evaluate and compare different LLMs, leading many …

Task contamination: Language models may not be few-shot anymore

C Li, J Flanigan - Proceedings of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Large language models (LLMs) offer impressive performance in various zero-shot and few-
shot tasks. However, their success in zero-shot or few-shot settings may be affected by task …

How much are LLMs contaminated? A comprehensive survey and the LLMSanitize library

M Ravaut, B Ding, F Jiao, H Chen, X Li, R Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of Large Language Models (LLMs) in recent years, new opportunities are
emerging, but also new challenges, and contamination is quickly becoming critical …

SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling

D Kim, C Park, S Kim, W Lee, W Song, Y Kim… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce SOLAR 10.7B, a large language model (LLM) with 10.7 billion parameters,
demonstrating superior performance in various natural language processing (NLP) tasks …

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation

J Wang, Y Wang, G Xu, J Zhang, Y Gu, H Jia… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite making significant progress in multi-modal tasks, current Multi-modal Large
Language Models (MLLMs) encounter the significant challenge of hallucination, which may …

IDEAL: Influence-driven selective annotations empower in-context learners in large language models

S Zhang, X Xia, Z Wang, LH Chen, J Liu, Q Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
In-context learning is a promising paradigm that utilizes in-context examples as prompts for
the predictions of large language models. These prompts are crucial for achieving strong …

Promptbench: A unified library for evaluation of large language models

K Zhu, Q Zhao, H Chen, J Wang, X Xie - arXiv preprint arXiv:2312.07910, 2023 - arxiv.org
The evaluation of large language models (LLMs) is crucial to assess their performance and
mitigate potential security risks. In this paper, we introduce PromptBench, a unified library to …