False positives and negatives
When you’re relying on a content moderation system, a key concept to understand is the trade-off between false positives and false negatives:
- False positives occur when a system flags content as violating policy when it is actually compliant.
- False negatives occur when a system fails to flag content that does violate policy.
- Adjust thresholds (covered below) to tightly align with your standards and risk appetite.
- Supplement automated checks with human moderation for borderline cases to reduce both types of errors; the sketch after this list shows one way to measure them against human-reviewed labels.
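If you keep a sample of human-reviewed decisions, you can measure both error types directly and track them as you tune your setup. Below is a minimal sketch in TypeScript; the `Review` shape and its field names are assumptions for illustration, not part of any particular API.

```typescript
// Minimal sketch: counting false positives and false negatives against
// human-labeled ground truth. The Review shape is hypothetical.
interface Review {
  flaggedByModel: boolean; // what the automated system decided
  violatesPolicy: boolean; // what a human moderator decided
}

function errorRates(reviews: Review[]) {
  let falsePositives = 0;
  let falseNegatives = 0;

  for (const r of reviews) {
    if (r.flaggedByModel && !r.violatesPolicy) falsePositives++;
    if (!r.flaggedByModel && r.violatesPolicy) falseNegatives++;
  }

  return {
    falsePositiveRate: falsePositives / reviews.length,
    falseNegativeRate: falseNegatives / reviews.length,
  };
}
```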
Adjusting flagging thresholds
The Moderation API lets you configure how strictly it flags content. By experimenting with threshold scores, you can tune the sensitivity (see the sketch after this list):
- Lowering the threshold: This will reduce false negatives (because more content is flagged), but potentially increase false positives.
- Raising the threshold: This will reduce false positives (because the system flags fewer pieces of content), but potentially increase false negatives.
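As a rough illustration of how a threshold could be applied in your own code, the sketch below assumes a hypothetical response shape with per-category scores between 0 and 1; check your actual API response for the real field names.

```typescript
// Minimal sketch: deciding whether to flag content based on a configurable
// threshold. The ModerationResult shape is an assumption for illustration.
interface ModerationResult {
  scores: Record<string, number>; // e.g. { toxicity: 0.82, spam: 0.12 }
}

function shouldFlag(result: ModerationResult, threshold: number): boolean {
  // Lowering `threshold` flags more content (fewer false negatives,
  // potentially more false positives); raising it does the opposite.
  return Object.values(result.scores).some((score) => score >= threshold);
}

// Example: the same score is flagged under a lower threshold but not a higher one.
const strict = shouldFlag({ scores: { toxicity: 0.55 } }, 0.5);  // true
const lenient = shouldFlag({ scores: { toxicity: 0.55 } }, 0.8); // false
```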
Adding or removing models
Your workflow or platform could be improved by adding specialized models or removing unnecessary ones (a sketch of a tandem setup follows this list):
- Adding specialized models: If your use case deals with a specific domain (e.g., medical or legal), consider training or deploying a model tuned for that domain. It can work in tandem with the general Moderation API to reduce domain-specific false positives.
- Removing unneeded models: If your current pipeline includes checks from overlapping or redundant models, consolidating or removing them can reduce complexity and conflicting results, and streamline moderation decisions.
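A tandem setup could look something like the sketch below. Both checker functions are stand-ins for your real calls (a general moderation request and a hypothetical domain-tuned classifier); the profanity regex is only a placeholder.

```typescript
// Minimal sketch: running a general moderation check alongside a
// domain-specific classifier and combining their verdicts.
// Both checker functions below are placeholders for real calls.
type Verdict = { flagged: boolean; reason?: string };

async function generalModeration(text: string): Promise<Verdict> {
  // Placeholder for a call to a general-purpose moderation model.
  const flagged = /damn|hell/i.test(text);
  return { flagged, reason: flagged ? "profanity" : undefined };
}

async function medicalClassifier(text: string): Promise<Verdict> {
  // Placeholder for a domain-tuned model that recognizes clinical language.
  return { flagged: false };
}

async function moderate(text: string): Promise<Verdict> {
  const [general, domain] = await Promise.all([
    generalModeration(text),
    medicalClassifier(text),
  ]);

  // Let the domain model clear content the general model over-flags,
  // e.g. clinical terminology mistaken for a violation.
  if (general.flagged && !domain.flagged) {
    return { flagged: false, reason: "cleared by domain model" };
  }
  return general;
}
```

In practice you would decide, per category, which model gets the final say; the point of the sketch is only that the two checks run together and their results are reconciled in one place.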
Using AI agents
AI agents function as custom models that you can configure with your own guidelines. Potential uses include (see the escalation sketch after this list):
- Automatically reviewing flagged content again with a different approach or additional data.
- Analyzing user behavior or past content to add context before making a final moderation decision.
- Escalating borderline content to human reviewers, re-checking the system’s decision, or adding notes for context.
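One possible shape for such an agent-style second pass is sketched below; the score bands, the `FlaggedItem` fields, and the review-queue function are assumptions for illustration.

```typescript
// Minimal sketch: a second pass that escalates borderline content to human
// review instead of making a hard automated decision.
interface FlaggedItem {
  text: string;
  score: number;           // 0..1 confidence that the content violates policy
  authorHistory: string[]; // recent messages, used as extra context
}

type Decision = "allow" | "remove" | "escalate";

function secondPass(item: FlaggedItem): Decision {
  if (item.score >= 0.9) return "remove"; // clear violation
  if (item.score <= 0.3) return "allow";  // likely a false positive

  // Borderline: attach context notes and hand off to a human reviewer.
  queueForHumanReview(item, {
    note: `Score ${item.score.toFixed(2)}; author has ${item.authorHistory.length} recent messages on record.`,
  });
  return "escalate";
}

function queueForHumanReview(item: FlaggedItem, meta: { note: string }) {
  // Placeholder: push to your own review queue or moderation dashboard.
  console.log("Escalated for review:", item.text, meta.note);
}
```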
Enabling context awareness
Many moderation challenges arise when the system doesn’t understand the broader context behind the text:
- Certain terms might be acceptable in an educational or reclaiming context (e.g., quoting a slur to explain its original meaning).
- Cultural or community-specific language usage might not translate well with a general-purpose model.
- Provide additional metadata or preceding conversation snippets along with your text to give the Moderation API or other classifiers a better understanding of what’s being said.
- Enable Context Awareness in your project settings and include `authorId` and `contextId` in your requests so the system can reference previous messages. See Submitting content to Moderation API for more details; a request sketch follows this list.
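A context-aware request might look roughly like the sketch below. The endpoint URL, headers, and body fields are placeholders, not the real API surface; the Submitting content to Moderation API page referenced above documents the actual request format.

```typescript
// Minimal sketch: sending text together with authorId and contextId so the
// moderation service can take earlier messages in the same conversation
// into account. Endpoint, headers, and field names are placeholders.
async function moderateWithContext(text: string, authorId: string, contextId: string) {
  const response = await fetch("https://example.com/moderation/text", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer YOUR_API_KEY", // placeholder credential
    },
    body: JSON.stringify({
      value: text, // the content to analyze
      authorId,    // stable ID for the user who wrote it
      contextId,   // e.g. the conversation or thread ID
    }),
  });
  return response.json();
}
```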
Training custom models
If you find that standard models are not meeting your performance requirements, consider building a custom fine-tuned model (a dataset-preparation sketch follows this list). This can help if:
- You have a substantial dataset specific to your industry or type of content.
- You need higher precision for borderline or ambiguous cases.
- You want to reduce reliance on manual moderation for specialized content.
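If you go this route, the first step is usually turning human-reviewed decisions into a labeled dataset. The sketch below writes examples to a JSONL file; the format and label names are assumptions and should match whatever your training pipeline expects.

```typescript
// Minimal sketch: assembling human-reviewed decisions into a labeled
// JSONL dataset for fine-tuning. Format and labels are assumptions.
import { writeFileSync } from "node:fs";

interface LabeledExample {
  text: string;
  label: "compliant" | "violating";
}

function writeTrainingFile(examples: LabeledExample[], path: string) {
  const jsonl = examples.map((example) => JSON.stringify(example)).join("\n");
  writeFileSync(path, jsonl + "\n", "utf8");
}

writeTrainingFile(
  [
    { text: "Great write-up, thanks for sharing!", label: "compliant" },
    { text: "Buy followers now at spam.example", label: "violating" },
  ],
  "training-data.jsonl"
);
```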
Get help from our team
If you need help with moderation setup, fine-tuning, or advanced configurations, reach out to our support team. We can help you:
- Determine the right thresholds for your workflow.
- Explore sample code to integrate moderation into your application.
- Identify potential pitfalls with domain-specific moderation.