Towards generalisable hate speech detection: a review on obstacles and solutions

W Yin, A Zubiaga - PeerJ Computer Science, 2021 - peerj.com
Hate speech is one type of harmful online content which directly attacks or promotes hate
towards a group or an individual member based on their actual or perceived aspects of …

Directions in abusive language training data, a systematic review: Garbage in, garbage out

B Vidgen, L Derczynski - PLOS ONE, 2020 - journals.plos.org
Data-driven and machine learning based approaches for detecting, categorising and
measuring abusive content such as hate speech and harassment have gained traction due …

A holistic approach to undesired content detection in the real world

T Markov, C Zhang, S Agarwal, FE Nekoul… - Proceedings of the …, 2023 - ojs.aaai.org
We present a holistic approach to building a robust and useful natural language
classification system for real-world content moderation. The success of such a system relies …

Dynabench: Rethinking benchmarking in NLP

D Kiela, M Bartolo, Y Nie, D Kaushik, A Geiger… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce Dynabench, an open-source platform for dynamic dataset creation and model
benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the …

Process for adapting language models to society (PALMS) with values-targeted datasets

I Solaiman, C Dennison - Advances in Neural Information …, 2021 - proceedings.neurips.cc
Language models can generate harmful and biased outputs and exhibit
undesirable behavior according to a given cultural context. We propose a Process for …

HateCheck: Functional tests for hate speech detection models

P Röttger, B Vidgen, D Nguyen, Z Waseem… - arXiv preprint arXiv …, 2020 - arxiv.org
Detecting online hate is a difficult task that even state-of-the-art models struggle with.
Typically, hate speech detection models are evaluated by measuring their performance on …

Learning from the worst: Dynamically generated datasets to improve online hate detection

B Vidgen, T Thrush, Z Waseem, D Kiela - arXiv preprint arXiv:2012.15761, 2020 - arxiv.org
We present a human-and-model-in-the-loop process for dynamically generating datasets
and training better performing and more robust hate detection models. We provide a new …

XSTest: A test suite for identifying exaggerated safety behaviours in large language models

P Röttger, HR Kirk, B Vidgen, G Attanasio… - arXiv preprint arXiv …, 2023 - arxiv.org
Without proper safeguards, large language models will readily follow malicious instructions
and generate toxic content. This risk motivates safety efforts such as red-teaming and large …

A deep neural network based multi-task learning approach to hate speech detection

P Kapil, A Ekbal - Knowledge-Based Systems, 2020 - Elsevier
With the advent of the internet and numerous social media platforms, citizens now have
enormous opportunities to express and share their opinions on various societal and political …

Offensive language and hate speech detection for Danish

GI Sigurbergsson, L Derczynski - arXiv preprint arXiv:1908.04531, 2019 - arxiv.org
The presence of offensive language on social media platforms and the implications this
poses is becoming a major concern in modern society. Given the enormous amount of …