HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns

To appear in USENIX Security 2025
1CISPA Helmholtz Center for Information Security, 2TU Delft

LLM-Generated Samples

The full dataset is available at HateBenchSet on Hugging Face.

Disclaimer. This website contains examples of hateful and abusive language. Reader discretion is recommended.


Ethics and Disclosures

Our work relies on LLMs to generate samples, and all manual annotations were performed by the authors of this study. Our study is therefore not considered human subjects research by our Institutional Review Board (IRB). By performing the annotations ourselves, we also ensured that no human subjects were exposed to harmful content during the study. Since our work involves assessing LLM-driven hate campaigns, disclosing how attackers can evade a hate speech detector is unavoidable. We have taken great care to share our findings responsibly. We disclosed the paper and the labeled dataset to OpenAI, Google Jigsaw, and the developers of the open-source detectors. In our disclosure letter, we explicitly highlighted the high attack success rates of LLM-driven hate campaigns. We have received acknowledgments from OpenAI and Google Jigsaw.

Website adapted from Nerfies.


BibTeX

If you find this work useful in your research, please consider citing:

@inproceedings{SWQBZZ25,
  author = {Xinyue Shen and Yixin Wu and Yiting Qu and Michael Backes and Savvas Zannettou and Yang Zhang},
  title = {{HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns}},
  booktitle = {{USENIX Security Symposium (USENIX Security)}},
  publisher = {USENIX},
  year = {2025}
}