A project of the ARC Centre of Excellence for Automated Decision-Making and Society. Developed by Nicolas Suzor and Lucinda Nelson.
This experimental tool uses large multimodal machine learning models to understand implicit and covert hateful speech.
OK, this isn't really ChatGPT versus the Oversight Board. We're not trying to recreate the Board's important deliberation and norm-setting role. Instead, we're trying to learn from the Board's decisions and see how we can improve the application of its standards to new contexts.
We use the Oversight Board decisions on hate speech to calibrate against some of the toughest challenges in content moderation. On this initial small sample, we have seen surprisingly good results -- we have been able to approximate Oversight Board results in 13 of the 15 hate speech decisions released up to late 2023.
| accuracy - our prompts             | Claude 3.5 Sonnet | Gemini 1.5 pro | GPT 4o | Llama 3.1 405bn |
|:-----------------------------------|------------------:|---------------:|-------:|----------------:|
| Drag Queens (zero-shot CoT)        |              0.93 |           0.78 |   0.86 |            0.87 |
| Drag Queens (mix of 8 experts)     |              0.88 |           0.86 |   0.88 |            0.89 |
| Oversight Board (zero-shot CoT)    |              0.88 |           0.72 |   0.78 |            0.75 |
| Oversight Board (mix of 8 experts) |              0.78 |           0.80 |   0.82 |            0.79 |
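To give a concrete sense of how figures like these can be produced, here is a minimal evaluation sketch that compares model labels against the Board's published outcomes and reports simple accuracy. The `decisions` data format and the `classify` callable are assumptions for illustration, not our actual harness.

```python
# Minimal accuracy check against a set of labelled decisions.
# `decisions` and `classify` are illustrative placeholders, not our real data
# or pipeline: each decision pairs the content at issue with the outcome the
# Oversight Board reached (e.g. "VIOLATING" / "NON-VIOLATING").

def accuracy(decisions, classify):
    """Fraction of decisions where the model's label matches the Board's."""
    matches = sum(
        1 for content, board_label in decisions
        if classify(content) == board_label
    )
    return matches / len(decisions)

# Example: matching 13 of 15 decisions gives roughly 0.87.
```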
Most attempts to build machine learning models that can detect 'toxicity', hate, and abuse don't work very well. Usually, they end up detecting strongly-worded statements -- wrongly classifying counterspeech, in-group jokes, and reappropriated slurs as 'toxic', while missing whole ranges of hateful messages that are politely expressed, use coded language, or are disguised as humour. Predictably, these systems tend to disproportionately impact already marginalised groups.
This is bad news for societies that are increasingly reliant on machine classification and content generation. Safety standards and AI 'guardrails' are urgently needed, but so far we have few examples of tools that detect implicit or covert hate without over-policing the speech of marginalised groups.
We are working to build guardrails that do not tone police. We are chaining together large multimodal models with a range of information retrieval and classification techniques to better take context into account. So far, our focus is on gender and sex -- areas where we have extensive subject-matter expertise. Our goal is to develop evaluation criteria based on media standards that have been developed by community groups and experts. We hope to make these criteria available in the form of tools and test sets that can be used by developers to evaluate their own systems.
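As a rough illustration of the kind of chaining described above, the sketch below retrieves contextual material first and only then asks a model to classify the post with that context attached. The function names, prompt wording, and `call_model` callable are assumptions for illustration; this is not our production pipeline.

```python
# A minimal sketch of chaining retrieval with model-based classification so
# that context is taken into account. Everything here is illustrative.

from dataclasses import dataclass


@dataclass
class Post:
    text: str
    image_description: str = ""  # caption or OCR output for multimodal posts


def retrieve_context(post: Post) -> str:
    """Hypothetical retrieval step: look up policy passages, reclaimed-term
    glossaries, and community media standards relevant to this post."""
    # In a real system this would query a vector store or curated knowledge base.
    return "Relevant policy excerpts and community media standards."


def classify_with_context(post: Post, call_model) -> str:
    """Chain retrieval and classification: the model sees the post together
    with retrieved context before deciding whether it is hateful."""
    context = retrieve_context(post)
    prompt = (
        "You are assessing a social media post against hate speech standards.\n"
        f"Context:\n{context}\n\n"
        f"Post text: {post.text}\n"
        f"Attached image (described): {post.image_description}\n\n"
        "Consider the likely audience, the poster's apparent intent, and "
        "whether the language is reclaimed, counterspeech, or coded hate. "
        "Answer with one label: VIOLATING or NON-VIOLATING."
    )
    return call_model(prompt).strip()
```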
In the next step of this work-in-progress, we are exploring a variety of techniques to iteratively refine the quality of the analysis -- focusing particularly on the difficult challenge of interpreting ambiguous content, which typically requires contextual knowledge and human judgement about the poster's intent and the audience's likely interpretation.
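The 'zero-shot CoT' and 'mix of 8 experts' rows in the table above hint at one such refinement strategy: sample several reasoning passes under different expert personas and aggregate their labels by majority vote. Below is a minimal sketch, assuming a generic `call_model` callable and made-up personas; it is not our published method.

```python
# Illustrative "mix of experts" aggregation: query the model once per persona
# and take the majority label. Personas and parsing rules are assumptions.

from collections import Counter

# Illustrative personas only; the table above reports a configuration with 8.
EXPERT_PERSONAS = [
    "a content moderator applying the platform's hate speech policy",
    "a linguist who studies coded and implicit language",
    "a community advocate familiar with reclaimed in-group terms",
]


def mixture_of_experts_label(post_text: str, call_model,
                             personas=EXPERT_PERSONAS) -> str:
    """Ask the model once per persona, then return the majority label."""
    votes = []
    for persona in personas:
        prompt = (
            f"You are {persona}. Reason step by step about the post below, "
            "then finish with a single label: VIOLATING or NON-VIOLATING.\n\n"
            f"Post: {post_text}"
        )
        reply = call_model(prompt).upper()
        # Naive parsing: any mention of NON-VIOLATING counts as that vote.
        votes.append("NON-VIOLATING" if "NON-VIOLATING" in reply else "VIOLATING")
    label, _count = Counter(votes).most_common(1)[0]
    return label
```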
Here you can find some examples of our experiments so far and see where we're going. You might also be able to test our tools on your own examples and see how they perform. We're still in the early stages of this work, so please be patient with us!