Learn how to configure flagging thresholds for your models.
Moderation API flags content based on the models you’ve added to your project. A piece of content is flagged when any of the non-neutral labels of a model has a score above the configured flagging threshold. By default, the threshold is set to 0.5 (i.e., 50% confidence). However, you can configure a separate threshold for each label in your project to fit your specific needs. Setting a higher threshold makes the model less likely to flag content, while lowering it makes the model more sensitive to potential issues.
Choosing different flagging thresholds allows you to control how strict the moderation should be for various types of content.

• If you’d like to be more lenient, you can increase the threshold (e.g., to 0.75). The model will only flag content if it is at least 75% confident that a non-neutral label applies.
• To be more cautious, you can lower the threshold (e.g., to 0.25). Even a 25% confidence level for a non-neutral label would be enough for the content to be flagged.
```json
{
  // Threshold set to 0.5
  "flagged": false, // 👈 not flagged
  "original": "Only hobbits live in the shire",
  "toxicity": {
    "label": "NEUTRAL",
    "score": 0.72952238,
    "label_scores": {
      // ...
      "TOXICITY": 0.27047762, // 👈 score not above 0.5
      "NEUTRAL": 0.72952238
    }
  }
}
```
Adjusting the flagging threshold won’t change the actual confidence scores
given by the model—they remain the same. Instead, altering the threshold only
impacts whether the final flagged field in the API response is set to true or
false.
Thresholds can be independently configured for each model and label. This flexibility
allows you to be more cautious with certain subject matter while remaining lenient
with others. If any model involved in the analysis has a non-neutral
label that exceeds its respective threshold, the flagged field will be set to true.
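To make the relationship between scores, thresholds, and the flagged field concrete, here is a minimal sketch that recomputes the decision locally from a response shaped like the JSON example above. The per-label thresholds dictionary and the `>=` comparison are illustrative assumptions for this sketch, not the API’s internal implementation:

```python
# Minimal sketch: recompute the flagged decision from a response shaped like
# the JSON example above. The thresholds dict and the >= comparison are
# assumptions made for illustration only.

NEUTRAL_LABEL = "NEUTRAL"

def is_flagged(response: dict, thresholds: dict, default: float = 0.5) -> bool:
    """Return True if any non-neutral label of any model exceeds its threshold."""
    for model_name, result in response.items():
        # Skip top-level fields such as "flagged" and "original".
        if not isinstance(result, dict) or "label_scores" not in result:
            continue
        for label, score in result["label_scores"].items():
            if label == NEUTRAL_LABEL:
                continue
            if score >= thresholds.get(label, default):
                return True
    return False

response = {
    "flagged": False,
    "original": "Only hobbits live in the shire",
    "toxicity": {
        "label": "NEUTRAL",
        "score": 0.72952238,
        "label_scores": {"TOXICITY": 0.27047762, "NEUTRAL": 0.72952238},
    },
}

print(is_flagged(response, {"TOXICITY": 0.5}))   # False: 0.27 is below 0.5
print(is_flagged(response, {"TOXICITY": 0.25}))  # True: 0.27 meets the lower threshold
```

Note how the scores themselves never change; only the threshold you compare them against determines the outcome.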
When using review queues, the flagged field is
often used to filter content and prioritize items that may require your attention
or moderation. By default, only flagged content enters the review queue. You can
also use the flagged field in your application code to make immediate decisions about
how to handle content (e.g., pushing it for review or blocking it outright). This
mechanism enables you to adjust thresholds on the fly without making extensive code
changes.
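As a sketch of this pattern in application code, the example below routes content based on the flagged field. The moderate(), send_to_review_queue(), and publish() helpers are hypothetical placeholders for your own API call and business logic:

```python
# Minimal sketch of acting on the flagged field in application code.
# moderate() stands in for your actual Moderation API call; the routing
# helpers are hypothetical placeholders for your own business logic.

def moderate(text: str) -> dict:
    # Placeholder: a real implementation would call the Moderation API
    # and return a response like the JSON example above.
    return {"flagged": False, "original": text}

def send_to_review_queue(text: str, response: dict) -> None:
    print(f"queued for review: {text!r}")

def publish(text: str) -> None:
    print(f"published: {text!r}")

def handle_user_content(text: str) -> None:
    response = moderate(text)
    if response["flagged"]:
        # Which content lands here is controlled entirely by your thresholds,
        # so you can tune moderation without touching this code.
        send_to_review_queue(text, response)
    else:
        publish(text)

handle_user_content("Only hobbits live in the shire")
```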
Some models, such as profanity detectors or email matchers, return a list of matched terms rather than probability scores. In these cases, the item is flagged if the model finds any matching terms. This means no explicit threshold configuration is necessary for these “recognizer” models—they simply flag content when certain triggers are detected.
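A sketch of that behavior is shown below. The matched_terms field name and the result shapes are assumptions made for illustration; any non-empty list of matches is enough to flag the item:

```python
# Minimal sketch for "recognizer"-style models. The matched_terms field and
# the example results are assumed here for illustration.

def recognizer_flagged(result: dict) -> bool:
    """Flag whenever the recognizer returned at least one matched term."""
    return len(result.get("matched_terms", [])) > 0

profanity_result = {"matched_terms": ["darn"]}  # hypothetical match
email_result = {"matched_terms": []}            # nothing detected

print(recognizer_flagged(profanity_result))  # True: any match flags the item
print(recognizer_flagged(email_result))      # False: no matches, not flagged
```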