Holistic evaluation of language models

P Liang, R Bommasani, T Lee, D Tsipras… - arXiv preprint arXiv …, 2022 - arxiv.org
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - arXiv preprint arXiv …, 2023 - arxiv.org
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …

TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu, Q Zhang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

Grounding and evaluation for large language models: Practical challenges and lessons learned (survey)

K Kenthapadi, M Sameki, A Taly - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes
domains, ensuring the trustworthiness, safety, and observability of these systems has …

State of what art? A call for multi-prompt LLM evaluation

M Mizrahi, G Kaplan, D Malkin, R Dror… - Transactions of the …, 2024 - direct.mit.edu
Recent advances in LLMs have led to an abundance of evaluation benchmarks, which
typically rely on a single instruction template per task. We create a large-scale collection of …

Position: TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …

Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations

L Yuan, Y Chen, G Cui, H Gao, F Zou… - Advances in …, 2023 - proceedings.neurips.cc
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …

GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective

L Yang, S Zhang, L Qin, Y Li, Y Wang, H Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Pre-trained language models (PLMs) are known to improve the generalization performance
of natural language understanding models by leveraging large amounts of data during the …

Robust recommender system: a survey and future directions

K Zhang, Q Cao, F Sun, Y Wu, S Tao, H Shen… - arXiv preprint arXiv …, 2023 - arxiv.org
With the rapid growth of information, recommender systems have become integral for
providing personalized suggestions and overcoming information overload. However, their …

SemEval-2024 Task 2: Safe biomedical natural language inference for clinical trials

M Jullien, M Valentino, A Freitas - arXiv preprint arXiv:2404.04963, 2024 - arxiv.org
Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in
dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs …