Symbols and grounding in large language models

E Pavlick - … Transactions of the Royal Society A, 2023 - royalsocietypublishing.org
Large language models (LLMs) are one of the most impressive achievements of artificial
intelligence in recent years. However, their relevance to the study of language more broadly …

A survey on fairness in large language models

Y Li, M Du, R Song, X Wang, Y Wang - arXiv preprint arXiv:2308.10149, 2023 - arxiv.org
Large language models (LLMs) have shown powerful performance and promising development prospects and are widely deployed in the real world. However, LLMs can capture social …

Towards understanding and mitigating social biases in language models

PP Liang, C Wu, LP Morency… - … on Machine Learning, 2021 - proceedings.mlr.press
As machine learning methods are deployed in real-world settings such as healthcare, legal
systems, and social science, it is crucial to recognize how they shape social biases and …

Auto-debias: Debiasing masked language models with automated biased prompts

Y Guo, Y Yang, A Abbasi - … of the 60th Annual Meeting of the …, 2022 - aclanthology.org
Human-like biases and undesired social stereotypes exist in large pretrained language
models. Given the wide adoption of these models in real-world applications, mitigating such …

LEACE: Perfect linear concept erasure in closed form

N Belrose, D Schneider-Joseph… - Advances in …, 2024 - proceedings.neurips.cc
Concept erasure aims to remove specified features from a representation. It can
improve fairness (e.g. preventing a classifier from using gender or race) and interpretability …

Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI

A Jacovi, A Marasović, T Miller… - Proceedings of the 2021 …, 2021 - dl.acm.org
Trust is a central component of the interaction between people and AI, in that 'incorrect' levels
of trust may cause misuse, abuse or disuse of the technology. But what, precisely, is the …

Causal inference in natural language processing: Estimation, prediction, interpretation and beyond

A Feder, KA Keith, E Manzoor, R Pryzant… - Transactions of the …, 2022 - direct.mit.edu
A fundamental goal of scientific research is to learn about causal relationships. However,
despite its critical role in the life and social sciences, causality has not had the same …

Language (technology) is power: A critical survey of "bias" in NLP

SL Blodgett, S Barocas, H Daumé III… - arXiv preprint arXiv …, 2020 - arxiv.org
We survey 146 papers analyzing "bias" in NLP systems, finding that their motivations are
often vague, inconsistent, and lacking in normative reasoning, despite the fact that …

Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP

T Schick, S Udupa, H Schütze - Transactions of the Association for …, 2021 - direct.mit.edu
⚠ This paper contains prompts and model outputs that are offensive in nature. When
trained on large, unfiltered crawls from the Internet, language models pick up and reproduce …

Interpretability at scale: Identifying causal mechanisms in alpaca

Z Wu, A Geiger, T Icard, C Potts… - Advances in Neural …, 2024 - proceedings.neurips.cc
Obtaining human-interpretable explanations of large, general-purpose language models is
an urgent goal for AI safety. However, it is just as important that our interpretability methods …