Holistic evaluation of language models
Language models (LMs) are becoming the foundation for almost all major language
technologies, but their capabilities, limitations, and risks are not well understood. We present …
The rise and potential of large language model based agents: A survey
For a long time, humanity has pursued artificial intelligence (AI) equivalent to or surpassing
the human level, with AI agents considered a promising vehicle for this pursuit. AI agents are …
TrustLLM: Trustworthiness in large language models
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …
Grounding and evaluation for large language models: Practical challenges and lessons learned (survey)
With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes
domains, ensuring the trustworthiness, safety, and observability of these systems has …
State of what art? A call for multi-prompt LLM evaluation
Recent advances in LLMs have led to an abundance of evaluation benchmarks, which
typically rely on a single instruction template per task. We create a large-scale collection of …
Position: TrustLLM: Trustworthiness in large language models
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …
Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of
NLP. We find that the distribution shift settings in previous studies commonly lack adequate …
GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective
Pre-trained language models (PLMs) are known to improve the generalization performance
of natural language understanding models by leveraging large amounts of data during the …
Robust recommender system: a survey and future directions
With the rapid growth of information, recommender systems have become integral for
providing personalized suggestions and overcoming information overload. However, their …
SemEval-2024 task 2: Safe biomedical natural language inference for clinical trials
Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in
dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs …