HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

To Appear at USENIX Security 2025
CISPA Helmholtz Center for Information Security, TU Delft

Overview

We introduce HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs.

We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by 13−21× through model stealing attacks with acceptable attack performance.
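
Here, attack success rate (ASR) follows its usual definition; in our paraphrase (not a quote from the paper):

ASR = (# attacked hate samples that evade the detector) / (# attacked hate samples)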

Disclaimer. This website contains examples of hateful and abusive language. Reader discretion is recommended.

LLM-Generated Hate Speech

  • Human-written samples are more scattered than LLM-generated ones and partially overlap with them.
  • GPT-4-generated samples are notably more distant from human-written samples than those of other LLMs.

Click to see the samples generated by LLMs: LLM-Generated Samples

t-SNE visualization of human-written and LLM-generated text
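
For readers who want to reproduce a similar projection, below is a minimal sketch assuming sentence-transformers embeddings; the embedding model and the sample lists are placeholders, not the paper's exact setup.

import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

human_texts = ["..."]   # human-written samples (placeholders)
llm_texts = ["..."]     # LLM-generated samples (placeholders)

# Embed all texts with an assumed sentence encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = np.asarray(model.encode(human_texts + llm_texts))

# Project to 2-D; note t-SNE requires perplexity < number of samples.
coords = TSNE(n_components=2, random_state=0).fit_transform(emb)

n = len(human_texts)
plt.scatter(coords[:n, 0], coords[:n, 1], s=8, label="human-written")
plt.scatter(coords[n:, 0], coords[n:, 1], s=8, label="LLM-generated")
plt.legend()
plt.show()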

HateBench

HateBench operates in three stages: 1) dataset construction, 2) hate speech detector selection, and 3) assessment.

Overview of HateBench

So far, HateBench covers:

  • 7,838 LLM-generated samples: 3,641 hate and 4,197 non-hate
  • 34 identity groups: Black or African American, Latino or Non-White Hispanic, Atheists, Buddhists, Immigrants, Transgender Women, Bisexual, People With Physical Disabilities, etc. Check Identity Groups for the complete list.
  • 8 hate speech detectors: Perspective, Moderation, Detoxify (Original), Detoxify (Unbiased), LFTW, TweetHate, BERT-HateXplain
  • 2 LLM statuses: original and jailbroken
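
As a concrete example, Detoxify, one of the benchmarked detectors, can be queried locally via its pip package. This is a minimal sketch; the 0.5 decision threshold is our assumption for illustration, not necessarily the paper's setting.

from detoxify import Detoxify

detector = Detoxify("original")            # or Detoxify("unbiased")
scores = detector.predict("example input text")

# Detoxify returns per-category scores in [0, 1]; treating
# toxicity >= 0.5 as a positive (hate) prediction is this sketch's choice.
print(scores["toxicity"], scores["toxicity"] >= 0.5)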

Leaderboard

  • Existing top-performing hate speech detectors typically perform well on LLM-generated content.
  • However, their performance degrades with newer versions of LLMs such as GPT-4.
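
As an illustration of how such leaderboard numbers are derived, the sketch below computes accuracy and F1 from ground-truth labels and detector predictions; the labels are toy values, not real results.

from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration: 1 = hate, 0 = non-hate.
y_true = [1, 0, 1, 1, 0, 0]   # ground-truth annotations
y_pred = [1, 0, 0, 1, 0, 1]   # one detector's predictions on the same samples

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))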

LLM-Driven Hate Campaign

A hate campaign, also known as a coordinated hate attack or raid, is a series of coordinated actions that aim to spread harmful or derogatory content, often targeting specific identity groups to incite discrimination, hostility, or violence.

  • Traditional way: Manually craft and tweak hate speech to bypass detection.
  • LLM-driven way: Use LLMs to automatically generate hate speech and evade detectors via advanced techniques such as adversarial attacks and model stealing attacks.

Threat scenario of an LLM-driven hate campaign

Adversarial Hate Campaign

Detectors demonstrate weak robustness against adversarial attacks. The most potent adversarial attack can achieve an ASR of over 0.966 on Perspective, Moderation, and TweetHate.
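
For intuition, here is a highly simplified black-box perturbation loop. query_detector is a hypothetical placeholder for a detector API, and homoglyph substitution is just one perturbation style from the adversarial NLP literature, not the paper's exact method.

import random

# Hypothetical Latin -> Cyrillic homoglyph map (one perturbation style among many).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}

def query_detector(text: str) -> float:
    """Placeholder for a remote detector API returning a hate score in [0, 1]."""
    raise NotImplementedError

def perturb_until_evasion(text, budget=50, threshold=0.5):
    """Randomly swap characters for homoglyphs until the detector score drops."""
    chars = list(text)
    for _ in range(budget):
        idx = random.randrange(len(chars))
        if chars[idx] in HOMOGLYPHS:
            chars[idx] = HOMOGLYPHS[chars[idx]]
            candidate = "".join(chars)
            if query_detector(candidate) < threshold:
                return candidate      # evasion succeeded
    return None                       # attack failed within the query budget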


Stealthy Hate Campaign

By establishing a local copy of the target detector, an adversary can increase the efficiency of generating hate speech by 13−21× while still retaining a high ASR.
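
The sketch below illustrates the model-stealing idea under assumed choices (a TF-IDF plus logistic regression surrogate and a placeholder query corpus): the adversary labels a corpus with the target detector's decisions, trains a local surrogate, and then filters perturbation candidates locally, so only promising candidates are sent to the real detector.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def query_detector(text: str) -> float:
    """Placeholder for the remote detector API (as in the sketch above)."""
    raise NotImplementedError

# 1) Label a query corpus with the target detector's decisions (placeholder corpus).
corpus = ["..."]
labels = [int(query_detector(t) >= 0.5) for t in corpus]

# 2) Train a cheap local surrogate that mimics those decisions.
surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
surrogate.fit(corpus, labels)

# 3) Score perturbation candidates locally; only candidates the surrogate
#    predicts as non-hate need to be verified against the real detector.
def local_score(text: str) -> float:
    return surrogate.predict_proba([text])[0, 1]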


Ethics and Disclosures

Our work relies on LLMs to generate samples, and all manual annotations were performed by the authors of this study. Our study is therefore not considered human-subjects research by our Institutional Review Board (IRB). By doing the annotations ourselves, we also ensure that no human subjects were exposed to harmful content during the study. Since our work assesses LLM-driven hate campaigns, it inevitably discloses how attackers can evade a hate speech detector. We have taken great care to share our findings responsibly: we disclosed the paper and the labeled dataset to OpenAI, Google Jigsaw, and the developers of the open-source detectors, and our disclosure letter explicitly highlighted the high attack success rates of the LLM-driven hate campaigns. OpenAI and Google Jigsaw have acknowledged our disclosure.


BibTeX

If you find this useful in your research, please consider citing:

@inproceedings{SWQBZZ25,
  author = {Xinyue Shen and Yixin Wu and Yiting Qu and Michael Backes and Savvas Zannettou and Yang Zhang},
  title = {{HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns}},
  booktitle = {{USENIX Security Symposium (USENIX Security)}},
  publisher = {USENIX},
  year = {2025}
}