HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

To Appear at USENIX Security 2025
CISPA Helmholtz Center for Information Security, TU Delft

Overview

We introduce HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs.

We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by 13−21× through model stealing attacks with acceptable attack performance.
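
Here, attack success rate (ASR) follows its usual definition; in our paraphrase (not a quote from the paper):

ASR = (# attacked hate samples that evade the detector) / (# attacked hate samples)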

Disclaimer. This website contains examples of hateful and abusive language. Reader discretion is recommended.

LLM-Generated Hate Speech

  • Human-written samples are more scattered than LLM-generated ones and partially overlap with them.
  • GPT-4-generated samples are notably more distant from human-written samples than those of other LLMs.

Click to see the samples generated by LLMs: LLM-Generated Samples

t-SNE visualization of human-written and LLM-generated text
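
For readers who want to reproduce a similar projection, below is a minimal sketch assuming sentence-transformers embeddings; the embedding model and the sample lists are placeholders, not the paper's exact setup.

import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

human_texts = ["..."]   # human-written samples (placeholders)
llm_texts = ["..."]     # LLM-generated samples (placeholders)

# Embed all texts with an assumed sentence encoder.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = np.asarray(model.encode(human_texts + llm_texts))

# Project to 2-D; note t-SNE requires perplexity < number of samples.
coords = TSNE(n_components=2, random_state=0).fit_transform(emb)

n = len(human_texts)
plt.scatter(coords[:n, 0], coords[:n, 1], s=8, label="human-written")
plt.scatter(coords[n:, 0], coords[n:, 1], s=8, label="LLM-generated")
plt.legend()
plt.show()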

HateBench

HateBench operates in three stages: 1) dataset construction, 2) hate speech detector selection, and 3) assessment.

Overview of HateBench

So far, HateBench covers:

  • 7,838 LLM-generated samples: 3,641 hate and 4,197 non-hate
  • 34 identity groups: Black or African American, Latino or Non-White Hispanic, Atheists, Buddhists, Immigrants, Transgender Women, Bisexual, People With Physical Disabilities, etc. Check Identity Groups for the complete list.
  • 8 hate speech detectors: Perspective, Moderation, Detoxify (Original), Detoxify (Unbiased), LFTW, TweetHate, BERT-HateXplain
  • 2 LLM statuses: original and jailbroken
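
As a concrete example, Detoxify, one of the benchmarked detectors, can be queried locally via its pip package. This is a minimal sketch; the 0.5 decision threshold is our assumption for illustration, not necessarily the paper's setting.

from detoxify import Detoxify

detector = Detoxify("original")            # or Detoxify("unbiased")
scores = detector.predict("example input text")

# Detoxify returns per-category scores in [0, 1]; treating
# toxicity >= 0.5 as a positive (hate) prediction is this sketch's choice.
print(scores["toxicity"], scores["toxicity"] >= 0.5)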

Leaderboard

  • Existing top-performing hate speech detectors typically perform well on LLM-generated content.
  • However, their performance degrades with newer versions of LLMs such as GPT-4.
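
As an illustration of how such leaderboard numbers are derived, the sketch below computes accuracy and F1 from ground-truth labels and detector predictions; the labels are toy values, not real results.

from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration: 1 = hate, 0 = non-hate.
y_true = [1, 0, 1, 1, 0, 0]   # ground-truth annotations
y_pred = [1, 0, 0, 1, 0, 1]   # one detector's predictions on the same samples

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))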

LLM-Driven Hate Campaign

A hate campaign, also known as a coordinated hate attack or raid, is a series of coordinated actions that aim to spread harmful or derogatory content, often targeting specific identity groups to incite discrimination, hostility, or violence.

  • Traditional way: Manually craft and tweak hate speech to bypass detection.
  • LLM-driven way: Use LLMs to automatically generate hate speech and evade detectors via advanced techniques such as adversarial attacks and model stealing attacks.

Threat scenario of an LLM-driven hate campaign

Adversarial Hate Campaign

Detectors demonstrate weak robustness against adversarial attacks. The most potent adversarial attack can achieve an ASR of over 0.966 on Perspective, Moderation, and TweetHate.
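
For intuition, here is a highly simplified black-box perturbation loop. query_detector is a hypothetical placeholder for a detector API, and homoglyph substitution is just one perturbation style from the adversarial NLP literature, not the paper's exact method.

import random

# Hypothetical Latin -> Cyrillic homoglyph map (one perturbation style among many).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "i": "\u0456"}

def query_detector(text: str) -> float:
    """Placeholder for a remote detector API returning a hate score in [0, 1]."""
    raise NotImplementedError

def perturb_until_evasion(text, budget=50, threshold=0.5):
    """Randomly swap characters for homoglyphs until the detector score drops."""
    chars = list(text)
    for _ in range(budget):
        idx = random.randrange(len(chars))
        if chars[idx] in HOMOGLYPHS:
            chars[idx] = HOMOGLYPHS[chars[idx]]
            candidate = "".join(chars)
            if query_detector(candidate) < threshold:
                return candidate      # evasion succeeded
    return None                       # attack failed within the query budget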


Stealthy Hate Campaign

By establishing a local copy of the target detector, an adversary can increase the efficiency of generating hate speech by 13−21× while still retaining a high ASR.
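
The sketch below illustrates the model-stealing idea under assumed choices (a TF-IDF plus logistic regression surrogate and a placeholder query corpus): the adversary labels a corpus with the target detector's decisions, trains a local surrogate, and then filters perturbation candidates locally, so only promising candidates are sent to the real detector.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def query_detector(text: str) -> float:
    """Placeholder for the remote detector API (as in the sketch above)."""
    raise NotImplementedError

# 1) Label a query corpus with the target detector's decisions (placeholder corpus).
corpus = ["..."]
labels = [int(query_detector(t) >= 0.5) for t in corpus]

# 2) Train a cheap local surrogate that mimics those decisions.
surrogate = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
surrogate.fit(corpus, labels)

# 3) Score perturbation candidates locally; only candidates the surrogate
#    predicts as non-hate need to be verified against the real detector.
def local_score(text: str) -> float:
    return surrogate.predict_proba([text])[0, 1]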


Ethics and Disclosures

Our work relies on LLMs to generate samples, and all manual annotations were performed by the authors of this study. Our study is therefore not considered human-subjects research by our Institutional Review Board (IRB). By doing the annotations ourselves, we also ensure that no human subjects were exposed to harmful content during the study. Since our work assesses LLM-driven hate campaigns, it inevitably discloses how attackers can evade a hate speech detector. We have taken great care to share our findings responsibly: we disclosed the paper and the labeled dataset to OpenAI, Google Jigsaw, and the developers of the open-source detectors, and our disclosure letter explicitly highlighted the high attack success rates of the LLM-driven hate campaigns. OpenAI and Google Jigsaw have acknowledged our disclosure.


BibTeX

If you find this useful in your research, please consider citing:

@inproceedings{SWQBZZ25,
  author = {Xinyue Shen and Yixin Wu and Yiting Qu and Michael Backes and Savvas Zannettou and Yang Zhang},
  title = {{HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns}},
  booktitle = {{USENIX Security Symposium (USENIX Security)}},
  publisher = {USENIX},
  year = {2025}
}