These policies flag content that is hostile to readers or targets protected groups. They run on both sides of a conversation: what users post, and what your bots or assistants send in reply.
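
As a minimal sketch of what that means in practice, the example below screens both directions of a single turn. The `moderate` helper is a hypothetical stand-in for your actual client call, and its response type includes only the fields used on this page:

// Hypothetical client call; substitute the real SDK method.
declare function moderate(text: string): Promise<{
  policies: {
    id: string;
    flagged: boolean;
    probability: number;
    labels?: { id: string; flagged: boolean }[];
  }[];
}>;

async function screenTurn(userMessage: string, botReply: string) {
  // Screen the inbound user message before it reaches your bot...
  const inbound = await moderate(userMessage);
  // ...and the outbound reply before it reaches the user.
  const outbound = await moderate(botReply);
  return { inbound, outbound };
}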

Policies

| id | Type | Supported inputs | What it does |
| --- | --- | --- | --- |
| toxicity | classifier | text, audio | General-purpose toxicity detection: insults, harassment, hostile language. |
| toxicity_severe | classifier | text, audio | A stricter sub-classifier for severe toxicity. Useful when you want to distinguish "rude" from "abusive." |
| hate | classifier | text, image, video, audio | Hate speech, discrimination, racism, and extremism, including image and video content. |
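
The ids in the first column are what you pass when selecting policies and what you match on in the response. Here is a sketch of a request, assuming (this page does not confirm it) that policies are selected by id:

// Assumed request shape; consult the API reference for the real one.
const request = {
  input: { type: "text", text: "message to check" },
  // Policy ids from the table above. hate also accepts image, video, and audio inputs.
  policies: ["toxicity", "toxicity_severe", "hate"],
};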

Reading the result

// `response` is the result of a moderation call; the link below documents the full shape.
const toxicity = response.policies.find(p => p.id === "toxicity");

if (toxicity?.flagged) {
  // probability is in the 0-1 range, so scale and round it for display.
  console.log(`Toxicity flagged at ${(toxicity.probability * 100).toFixed(1)}% confidence`);
}

// Severe toxicity surfaces as a label on the toxicity policy.
const severe = toxicity?.labels?.find(l => l.id === "severe");
if (severe?.flagged) {
  // Treat severe toxicity differently, e.g. auto-reject instead of sending to review.
}
See Understanding API responses for the full response shape.
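
Putting the two checks together, here is a sketch of a triage rule; the actions are illustrative choices, not part of the API:

type PolicyResult = {
  id: string;
  flagged: boolean;
  probability: number;
  labels?: { id: string; flagged: boolean }[];
};

function triage(policies: PolicyResult[]): "allow" | "review" | "reject" {
  const toxicity = policies.find(p => p.id === "toxicity");
  if (!toxicity?.flagged) return "allow";
  // Auto-reject severe toxicity; route ordinary toxicity to human review.
  const severe = toxicity.labels?.find(l => l.id === "severe");
  return severe?.flagged ? "reject" : "review";
}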