Paul Röttger
Postdoctoral Researcher, Bocconi University
Verified email at unibocconi.it - Homepage
Title · Cited by · Year
HateCheck: Functional Tests for Hate Speech Detection Models
P Röttger, B Vidgen, D Nguyen, Z Waseem, H Margetts, J Pierrehumbert
ACL 2021 (Main) - 🏆 Stanford HAI AI Audit Challenge, 2021
207 · 2021
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
P Röttger, B Vidgen, D Hovy, JB Pierrehumbert
NAACL 2022 (Main), 2022
109 · 2022
SemEval-2023 Task 10: Explainable Detection of Online Sexism
HR Kirk, W Yin, B Vidgen, P Röttger
ACL 2023 (Main) - 🏆 Best Task Paper, 2023
85 · 2023
The benefits, risks and bounds of personalizing the alignment of large language models to individuals
HR Kirk, B Vidgen, P Röttger, SA Hale
Nature Machine Intelligence, 2024
79* · 2024
Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media
P Röttger, JB Pierrehumbert
EMNLP 2021 (Findings), 2021
56 · 2021
Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate
HR Kirk, B Vidgen, P Röttger, T Thrush, SA Hale
NAACL 2022 (Main), 2021
51 · 2021
Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions
F Bianchi, M Suzgun, G Attanasio, P Röttger, D Jurafsky, T Hashimoto, ...
ICLR 2024 (Poster), 2023
45 · 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
P Röttger, HR Kirk, B Vidgen, G Attanasio, F Bianchi, D Hovy
NAACL 2024 (Main), 2023
42 · 2023
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models
P Röttger, H Seelawi, D Nozza, Z Talat, B Vidgen
WOAH at NAACL 2022, 2022
40 · 2022
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
HR Kirk, AM Bean, B Vidgen, P Röttger, SA Hale
EMNLP 2023 (Main), 2023
17 · 2023
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
P Röttger, V Hofmann, V Pyatkin, M Hinck, HR Kirk, H Schütze, D Hovy
ACL 2024 (Main), 2024
11 · 2024
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
X Wang, B Ma, C Hu, L Weber-Genzel, P Röttger, F Kreuter, D Hovy, ...
ACL 2024 (Findings), 2024
11 · 2024
The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics
M Orlikowski, P Röttger, P Cimiano, D Hovy
ACL 2023 (Main), 2023
11 · 2023
Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages
P Röttger, D Nozza, F Bianchi, D Hovy
EMNLP 2022 (Main), 2022
11 · 2022
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
HR Kirk, A Whitefield, P Röttger, A Bean, K Margatina, J Ciro, R Mosquera, ...
arXiv preprint arXiv:2404.16019, 2024
8 · 2024
SimpleSafetyTests: A Test Suite for Identifying Critical Safety Risks in Large Language Models
B Vidgen, HR Kirk, R Qian, N Scherrer, A Kannappan, SA Hale, P Röttger
arXiv preprint arXiv:2311.08370, 2023
8 · 2023
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising “Alignment” in Large Language Models
HR Kirk, B Vidgen, P Röttger, SA Hale
NeurIPS 2023 (SoLaR Workshop), 2023
5 · 2023
Introducing v0.5 of the AI Safety Benchmark from MLCommons
B Vidgen, A Agrawal, AM Ahmed, V Akinwande, N Al-Nuaimi, N Alfaraj, ...
arXiv preprint arXiv:2404.12241, 2024
3 · 2024
SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
P Röttger, F Pernisi, B Vidgen, D Hovy
arXiv preprint arXiv:2404.05399, 2024
3 · 2024
Near to Mid-term Risks and Opportunities of Open Source Generative AI
F Eiras, A Petrov, B Vidgen, CS de Witt, F Pizzati, K Elkins, ...
ICML 2024 (Oral), 2024
2 · 2024
Articles 1–20