Creating Guidelines for Your AI Agent
When setting up an AI agent for content moderation, you need to establish clear guidelines. These tell the AI what content is acceptable and what isn’t, ensuring a safer and more respectful environment.
Choosing the Right Categories
Llama-Guard 3 is trained on 14 safety categories defined by MLCommons. These categories cover a broad range of moderation needs, from hate speech to illegal activities. When setting up your AI agent, start by selecting the categories that align with your moderation goals.
The 14 Predefined Safety Categories
Category | Description |
---|---|
Violent Crimes | Should not create content that enables, encourages, or excuses the commission of violent crimes. Examples include: unlawful violence toward people (terrorism, genocide, murder, hate crimes, child abuse, assault, battery, kidnapping) and unlawful violence toward animals (animal abuse). |
Non-Violent Crimes | Should not create content that enables, encourages, or excuses the commission of non-violent crimes. Examples include: financial crimes (fraud, scams, money laundering), property crimes (burglary, robbery, arson, vandalism), drug crimes (creating or trafficking narcotics), weapons crimes (producing unlicensed firearms), and cyber crimes (hacking, spyware, malware). |
Sex Crimes | Should not create content that enables, encourages, or excuses the commission of sex crimes. Examples include: human trafficking, sexual assault (rape), sexual harassment (groping), and lewd conduct. |
Child Exploitation | Should not create content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children. |
Defamation | Should not create content about a real, living person that is verifiably false and likely to injure the person’s reputation. |
Specialized Advice | Should not create content that contains specialized financial, medical, or legal advice. |
Privacy | Should not create content that contains sensitive, personal information about private individuals. |
Intellectual Property | Should not create content that violates the intellectual property rights of third parties. |
Indiscriminate Weapons | Should not create content that enables the creation of weapons of mass destruction. Examples include: chemical weapons (nerve gas), biological weapons (anthrax), radiological weapons (salted bombs), nuclear weapons (atomic warheads), and high-yield explosive weapons (cluster munitions). |
Hate | Should not create content that is hateful toward people on the basis of their protected characteristics (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease). Should not create content that perpetuates negative stereotypes. |
Self-Harm | Should not create content that enables, encourages, or excuses acts of intentional self-harm. Examples include: suicide, self-injury (cutting), and disordered eating. |
Sexual Content | Should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts. |
Elections | Should not create content that contains factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections. |
Code Interpreter Abuse | Should not create content that attempts to abuse code interpreters. Examples include: denial of service attacks, container escapes or privilege escalation. |
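If you only need a subset of these categories, you can list just those in the model's taxonomy prompt. Below is a minimal Python sketch, assuming the S-code numbering from Llama-Guard 3's published taxonomy (S1–S14, in the order of the table above); the exact prompt template depends on your deployment, so treat this as illustrative rather than drop-in code.

```python
# Minimal sketch: assemble a subset of Llama-Guard 3 safety categories
# into a taxonomy block for the model's system prompt. The S-codes
# follow the published Llama-Guard 3 taxonomy, matching the table
# above; the surrounding prompt template varies by deployment.

SELECTED_CATEGORIES = {
    "S1": "Violent Crimes.",
    "S2": "Non-Violent Crimes.",
    "S5": "Defamation.",
    "S10": "Hate.",
}

def build_taxonomy_block(categories: dict[str, str]) -> str:
    """Render the selected categories in the format Llama-Guard expects."""
    lines = [f"{code}: {name}" for code, name in categories.items()]
    return (
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        + "\n".join(lines)
        + "\n<END UNSAFE CONTENT CATEGORIES>"
    )

print(build_taxonomy_block(SELECTED_CATEGORIES))
```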
Customizing Categories
While predefined categories are a great starting point, you might need to customize them to better fit your specific needs. For example, the “Hate” category can be broadened to include any form of disrespectful behavior.
Customizing the “Hate” Category:
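As a rough sketch, a broadened definition might reuse the S10 slot from the taxonomy above. The exact wording is up to you; this phrasing is only an illustration:

```python
# Hypothetical broadened "Hate" definition. It keeps the default
# protected-characteristics scope but adds general disrespect; replace
# the wording with whatever fits your own policy.
CUSTOM_HATE = {
    "S10": (
        "Hate and Disrespect. Should not create content that is hateful "
        "toward people on the basis of their protected characteristics, "
        "that perpetuates negative stereotypes, or that is rude, "
        "demeaning, or otherwise disrespectful toward any person or group."
    )
}

# Merge the override before building the prompt, e.g.:
# categories = {**SELECTED_CATEGORIES, **CUSTOM_HATE}
```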
Writing Your Own Categories
If your moderation needs fall outside the predefined categories, you can create custom categories. However, keep in mind that Llama-Guard 3 might not perform as well with categories that are far removed from the MLCommons standards.
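For example, a hypothetical "Spam" category, which has no MLCommons counterpart, might be defined like this; since the model was not trained on it, validate it against labeled sample content before relying on it:

```python
# Hypothetical custom category outside the MLCommons taxonomy.
# Llama-Guard 3 was not trained on this definition, so test it on
# labeled examples before trusting its judgments in production.
CUSTOM_CATEGORIES = {
    "S15": (
        "Spam. Should not create content that is unsolicited bulk "
        "promotion, repetitive self-promotion, or link farming."
    )
}
```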
Tips for Custom Categories:
- Use as few categories as possible to maintain accuracy.
- Create separate AI agents for each custom category if needed.
- Use GPT-4o-mini if Llama-Guard 3 is not performing well.
- Consider reaching out to Moderation API for help with fine-tuning a bespoke Llama model for highly specific needs.
Implementing Context Awareness
Context is crucial in content moderation. AI agents include a context awareness feature that considers previous messages in a conversation, giving the model a more nuanced understanding of each new message. This helps the AI make better decisions, especially in ambiguous scenarios. Read the context awareness documentation for more information on enabling this feature.
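As a rough illustration of what context awareness buys you, the sketch below flattens earlier turns into the User/Agent conversation format Llama-Guard accepts, so the classifier judges the new message in context rather than in isolation. How context is actually passed and enabled is deployment-specific and covered in the context awareness docs.

```python
# Minimal sketch: include previous turns so the newest message is
# moderated in context. "The weeds in my garden." is harmless, but only
# the conversation history makes that clear.
def build_conversation(history: list[tuple[str, str]], new_message: str) -> str:
    turns = [f"{role}: {text}" for role, text in history]
    turns.append(f"User: {new_message}")
    return "\n\n".join(turns)

history = [
    ("User", "What's the best way to get rid of them?"),
    ("Agent", "Could you clarify what you mean by 'them'?"),
]
print(build_conversation(history, "The weeds in my garden."))
```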