Creating Guidelines for Your AI Agent
When setting up an AI agent for content moderation, you need to establish clear guidelines. These tell the AI what content is acceptable and what isn’t, ensuring a safer and more respectful environment.
Choosing the Right Categories
Llama-Guard 3 is trained on 14 safety categories defined by MLCommons. These categories cover a broad range of moderation needs, from hate speech to illegal activities. When setting up your AI agent, start by selecting the categories that align with your moderation goals.
The 14 Predefined Safety Categories
Category | Description |
---|---|
Violent Crimes | Should not create content that enables, encourages, or excuses the commission of violent crimes. Examples include: unlawful violence toward people (terrorism, genocide, murder, hate crimes, child abuse, assault, battery, kidnapping) and unlawful violence toward animals (animal abuse). |
Non-Violent Crimes | Should not create content that enables, encourages, or excuses the commission of non-violent crimes. Examples include: financial crimes (fraud, scams, money laundering), property crimes (burglary, robbery, arson, vandalism), drug crimes (creating or trafficking narcotics), weapons crimes (producing unlicensed firearms), and cyber crimes (hacking, spyware, malware). |
Sex Crimes | Should not create content that enables, encourages, or excuses the commission of sex crimes. Examples include: human trafficking, sexual assault (rape), sexual harassment (groping), and lewd conduct. |
Child Exploitation | Should not create content that depicts child nudity or that enables, encourages, excuses, or depicts the sexual abuse of children. |
Defamation | Should not create content about a real, living person that is verifiably false and likely to injure the person’s reputation. |
Specialized Advice | Should not create content that contains specialized financial, medical, or legal advice. |
Privacy | Should not create content that contains sensitive, personal information about private individuals. |
Intellectual Property | Should not create content that violates the intellectual property rights of third parties. |
Indiscriminate Weapons | Should not create content that enables the creation of weapons of mass destruction. Examples include: chemical weapons (nerve gas), biological weapons (anthrax), radiological weapons (salted bombs), nuclear weapons (atomic warheads), and high-yield explosive weapons (cluster munitions). |
Hate | Should not create content that is hateful toward people on the basis of their protected characteristics (race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity and serious disease). Should not create content that perpetuates negative stereotypes. |
Self-Harm | Should not create content that enables, encourages, or excuses acts of intentional self-harm. Examples include: suicide, self-injury (cutting), and disordered eating. |
Sexual Content | Should not create depictions of nude adults or content that contains erotic descriptions or explicit depictions of sex acts. |
Elections | Should not create content that contains factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections. |
Code Interpreter Abuse | Should not create content that attempts to abuse code interpreters. Examples include: denial of service attacks, container escapes or privilege escalation. |
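If you only need a subset of these categories, you can list just those in the model's taxonomy prompt. Below is a minimal Python sketch, assuming the S-code numbering from Llama-Guard 3's published taxonomy (S1–S14, in the order of the table above); the exact prompt template depends on your deployment, so treat this as illustrative rather than drop-in code.

```python
# Minimal sketch: assemble a subset of Llama-Guard 3 safety categories
# into a taxonomy block for the model's system prompt. The S-codes
# follow the published Llama-Guard 3 taxonomy, matching the table
# above; the surrounding prompt template varies by deployment.

SELECTED_CATEGORIES = {
    "S1": "Violent Crimes.",
    "S2": "Non-Violent Crimes.",
    "S5": "Defamation.",
    "S10": "Hate.",
}

def build_taxonomy_block(categories: dict[str, str]) -> str:
    """Render the selected categories in the format Llama-Guard expects."""
    lines = [f"{code}: {name}" for code, name in categories.items()]
    return (
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        + "\n".join(lines)
        + "\n<END UNSAFE CONTENT CATEGORIES>"
    )

print(build_taxonomy_block(SELECTED_CATEGORIES))
```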
Customizing Categories
While predefined categories are a great starting point, you might need to customize them to better fit your specific needs. For example, the “Hate” category can be broadened to include any form of disrespectful behavior.
Customizing the “Hate” Category:
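As a rough sketch, a broadened definition might reuse the S10 slot from the taxonomy above. The exact wording is up to you; this phrasing is only an illustration:

```python
# Hypothetical broadened "Hate" definition. It keeps the default
# protected-characteristics scope but adds general disrespect; replace
# the wording with whatever fits your own policy.
CUSTOM_HATE = {
    "S10": (
        "Hate and Disrespect. Should not create content that is hateful "
        "toward people on the basis of their protected characteristics, "
        "that perpetuates negative stereotypes, or that is rude, "
        "demeaning, or otherwise disrespectful toward any person or group."
    )
}

# Merge the override before building the prompt, e.g.:
# categories = {**SELECTED_CATEGORIES, **CUSTOM_HATE}
```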
Writing Your Own Categories
If your moderation needs fall outside the predefined categories, you can create custom categories. However, keep in mind that Llama-Guard 3 might not perform as well with categories that are far removed from the MLCommons standards.
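For example, a hypothetical "Spam" category, which has no MLCommons counterpart, might be defined like this; since the model was not trained on it, validate it against labeled sample content before relying on it:

```python
# Hypothetical custom category outside the MLCommons taxonomy.
# Llama-Guard 3 was not trained on this definition, so test it on
# labeled examples before trusting its judgments in production.
CUSTOM_CATEGORIES = {
    "S15": (
        "Spam. Should not create content that is unsolicited bulk "
        "promotion, repetitive self-promotion, or link farming."
    )
}
```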
Tips for Custom Categories:
- Use as few categories as possible to maintain accuracy.
- Create separate AI agents for each custom category if needed.
- Use GPT-4o-mini if Llama-Guard 3 is not performing well.
- Consider reaching out to Moderation API for help with fine-tuning a bespoke Llama model for highly specific needs.
Implementing Context Awareness
Context is crucial in content moderation. AI agents include a context awareness feature that considers previous messages in a conversation, giving the model a more nuanced understanding of each new message. This helps the AI make better decisions, especially in ambiguous scenarios. Read the context awareness documentation for more information on enabling this feature.
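As a rough illustration of what context awareness buys you, the sketch below flattens earlier turns into the User/Agent conversation format Llama-Guard accepts, so the classifier judges the new message in context rather than in isolation. How context is actually passed and enabled is deployment-specific and covered in the context awareness docs.

```python
# Minimal sketch: include previous turns so the newest message is
# moderated in context. "The weeds in my garden." is harmless, but only
# the conversation history makes that clear.
def build_conversation(history: list[tuple[str, str]], new_message: str) -> str:
    turns = [f"{role}: {text}" for role, text in history]
    turns.append(f"User: {new_message}")
    return "\n\n".join(turns)

history = [
    ("User", "What's the best way to get rid of them?"),
    ("Agent", "Could you clarify what you mean by 'them'?"),
]
print(build_conversation(history, "The weeds in my garden."))
```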