We introduce HateBench, a framework for benchmarking hate speech
detectors on LLM-generated hate speech.
We first construct a hate speech dataset of 7,838 samples generated by six widely used LLMs, covering 34 identity groups.
We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset.
Our results show that while detectors are generally effective in identifying LLM-generated hate speech,
their performance degrades with newer versions of LLMs.
We also reveal the potential of LLM-driven hate campaigns, a new
threat that LLMs bring to the field of hate speech detection.
By leveraging advanced techniques such as adversarial attacks and model stealing attacks, an adversary can intentionally evade the detector and automate hate campaigns online.
The most potent adversarial attack achieves an attack success rate of 0.966, and model stealing attacks can further improve its attack efficiency by 13-21× while maintaining acceptable attack performance.
Disclaimer. This website contains examples of hateful and abusive language. Reader
discretion is recommended.