The Moderation API flags content based on the models you’ve added to your project. Flagging happens when any non-neutral label of a model has a score above the flagging threshold.

The default threshold is 0.5 (50% confidence), but you can configure the threshold for each label in your project.

Flagging thresholds

Changing the thresholds

By changing the flagging threshold, you can control how strict the model is. For example, if you want to accept more content, you can increase the threshold to 0.75. This means that the model will only flag content if it is at least 75% confident about a non-neutral label.

On the other hand, if you want to be more strict, you can lower the threshold to 0.25. This means that the model will flag content even if it’s only 25% confident about a non-neutral label.

{
  // Threshold set to 0.5
  "flagged": false, // πŸ‘ˆ not flagged
  "original": "Only hobbits live in the shire",
  "toxicity": {
    "label": "NEUTRAL",
    "score": 0.72952238,
    "label_scores": {
        // ...
        "TOXICITY": 0.27047762, // πŸ‘ˆ score not above 0.5
        "NEUTRAL": 0.72952238
    }
  }
}
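
To make this concrete, here is a minimal sketch (TypeScript, not tied to any SDK) of how the TOXICITY score from the response above compares against two different thresholds. Whether the comparison is inclusive or strict is an assumption here; both give the same result for this score.

// Hypothetical helper: does a label score meet a given flagging threshold?
function meetsThreshold(score: number, threshold: number): boolean {
  // Assumed to be inclusive; the exact boundary behavior may differ.
  return score >= threshold;
}

// The TOXICITY score from the example response above.
const toxicityScore = 0.27047762;

console.log(meetsThreshold(toxicityScore, 0.5));  // false -> "flagged": false
console.log(meetsThreshold(toxicityScore, 0.25)); // true  -> "flagged": true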

Please note that changing the flagging threshold does not affect the score or output of the individual models. It only affects the flagged field of the API response.

Thresholds can be configured independently for each model and label, so you can be stricter with some types of content and more lenient with others.

If any model’s non-neutral label score exceeds its respective threshold, the flagged field will be set to true.
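
As an illustration, here is a minimal sketch of how the top-level flagged field could be derived from per-model, per-label thresholds. The response shape is simplified from the example above, and the model names and threshold values (toxicity, nsfw, 0.75) are hypothetical.

// Simplified response shape based on the example above; not the exact API types.
interface ModelResult {
  label: string;                        // highest-scoring label, e.g. "NEUTRAL"
  score: number;
  label_scores: Record<string, number>; // score per label
}

// Hypothetical per-model, per-label thresholds configured for the project.
const projectThresholds: Record<string, Record<string, number>> = {
  toxicity: { TOXICITY: 0.5 },
  nsfw: { UNSAFE: 0.75 },
};

// Content is flagged if any non-neutral label of any model meets its threshold.
function deriveFlagged(results: Record<string, ModelResult>): boolean {
  return Object.entries(results).some(([model, result]) =>
    Object.entries(result.label_scores).some(
      ([label, score]) =>
        label !== "NEUTRAL" && score >= (projectThresholds[model]?.[label] ?? 0.5)
    )
  );
}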

What is the flagged field used for?

When using review queues, you can filter content based on the flagged field. By default, only flagged content enters the queue.

We also recommend using the flagged field in your application code to make decisions about content. That way, you can change the flagging threshold without having to change your application code.
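
For example, here is a sketch of gating content on the flagged field in application code. The endpoint URL, request body, and headers below are placeholders rather than the documented request format; check the API reference for the exact call.

// Placeholder endpoint and request shape -- see the API reference for the real ones.
const API_URL = "https://example.com/api/v1/moderate/text";

async function isContentFlagged(text: string): Promise<boolean> {
  const response = await fetch(API_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer YOUR_API_KEY",
    },
    body: JSON.stringify({ value: text }),
  });
  const result = await response.json();

  // Decide based on "flagged" rather than raw scores, so threshold changes
  // in the project settings don't require code changes here.
  return result.flagged === true;
}

// Usage: hold flagged content for review instead of publishing it directly.
isContentFlagged("Only hobbits live in the shire").then((flagged) => {
  if (flagged) {
    // e.g. send it to a review queue or hide it from other users
  }
});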

Flagging for recognizer models

Some models do not return scores but a list of matches. For example, the profanity model returns a list of swear words found in the text. In this case, the item is flagged if the list contains at least one match.
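
A minimal sketch of that rule, with illustrative field names (the actual field holding the matches may be named differently):

// Illustrative shape for a recognizer model result; field names are assumptions.
interface RecognizerResult {
  matches: string[]; // e.g. the swear words found in the text
}

// A recognizer result is flagged as soon as it contains at least one match.
function isRecognizerFlagged(result: RecognizerResult): boolean {
  return result.matches.length > 0;
}

console.log(isRecognizerFlagged({ matches: [] }));       // false
console.log(isRecognizerFlagged({ matches: ["d*mn"] })); // true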
