Content Moderation

Content moderation screens every message a user sends to Pria before the conversation reaches the LLM. When a message is flagged as unsafe, Pria returns a polite refusal instead of forwarding the prompt to the model — protecting your learners, your brand, and the underlying providers from policy violations. This page is the Admin-side reference for the Moderation feature. End users do not configure anything; they only see the refusal copy when their message trips the filter.

What Moderation Does

When enabled on a Digital Twin, every user message is sent to a dedicated moderation model before any other processing. The moderation model returns a classification across several unsafe categories:

Hate or harassment
Violence or graphic content
Self-harm
Sexual content
Illicit or illegal acts

If any category exceeds the provider’s threshold, the message is rejected with a safe response and never reaches the conversation model, the assistant’s prompt, your tools, or RAG. The block is also logged so Admins can see who tripped the filter and why.

Moderation runs against user inputs. For providers that support them (Bedrock Guardrails, Google safety filters), provider-side safety filters and the moderation hook on streaming responses also catch unsafe outputs.
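
The ordering described in this section (classify first, call the conversation model only when the message is clean) can be sketched as follows. This is a minimal illustration, not Pria's implementation: `moderate_then_answer`, `GateResult`, the refusal text, and the two callables are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical names throughout: this is not Pria's internal API.
REFUSAL = "Sorry, I can't help with that request."


@dataclass
class GateResult:
    blocked: bool
    reply: str


def moderate_then_answer(
    message: str,
    classify: Callable[[str], Dict[str, bool]],
    answer: Callable[[str], str],
) -> GateResult:
    """Run the moderation check first; call the conversation model only if clean."""
    verdict = classify(message)  # e.g. {"hate": False, "violence": True}
    if any(verdict.values()):
        # Blocked: the conversation model is never invoked for this turn.
        return GateResult(blocked=True, reply=REFUSAL)
    return GateResult(blocked=False, reply=answer(message))
```

Because the `answer` callable is only reached on the clean path, a blocked turn cannot touch the assistant prompt, tools, or retrieval in this sketch.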

Enabling Moderation

Moderation is enabled per Digital Twin:

Go to Admin → Configuration.
Find the Enable Moderation toggle under the safety section.
Save.

The setting is on by default for new instances. Disabling it sends every prompt straight to the conversation model — only do this for narrowly scoped internal instances where users are fully trusted.

Selecting a Moderation Model

The Moderation Model field on the Digital Twin selects which model performs the safety check. Different providers offer different moderation surfaces:

OpenAI Moderation

OpenAI’s hosted moderation endpoint (omni-moderation-latest and earlier variants). Free to call, multilingual, fast, and well documented. Categories: hate, harassment, self-harm, sexual content, violence, illicit. This is the default and the easiest to operate.
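
For orientation, the endpoint's JSON response carries a `results` list whose entries hold a top-level `flagged` boolean, a `categories` map of booleans, and a `category_scores` map of floats. The helper below is a minimal sketch of reading such a payload. The function name and the abbreviated sample payload are illustrative, and nothing here describes how Pria itself consumes the response.

```python
from typing import Any, Dict, List


def flagged_categories(response: Dict[str, Any]) -> List[str]:
    """Return the sorted names of categories marked true in a moderation response."""
    names = set()
    for result in response.get("results", []):
        for name, hit in result.get("categories", {}).items():
            if hit:
                names.add(name)
    return sorted(names)


# Illustrative payload (abbreviated; a real response lists every category).
sample = {
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": True,
            "categories": {"harassment": False, "violence": True, "illicit": True},
            "category_scores": {"harassment": 0.01, "violence": 0.93, "illicit": 0.71},
        }
    ],
}

print(flagged_categories(sample))  # ['illicit', 'violence']
```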

Bedrock Guardrails

AWS Bedrock supports moderation through Guardrails — configurable policies you author in the AWS console and attach to model invocations. Useful when your organization already has Bedrock guardrails defined for compliance or you need fine-grained category controls.
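
To make the Guardrails path concrete, here is a minimal sketch of checking one user message against a Guardrail with the `ApplyGuardrail` operation of the Bedrock runtime API. The helper name is hypothetical, the guardrail ID and version are placeholders, and this shows how such a check can be performed in general rather than how Pria's integration works.

```python
from typing import Any


def guardrail_blocks(client: Any, guardrail_id: str, version: str, text: str) -> bool:
    """Return True when the Guardrail intervenes on the given user input.

    `client` is expected to behave like boto3.client("bedrock-runtime").
    """
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,  # placeholder: your Guardrail ID or ARN
        guardrailVersion=version,          # e.g. "1" or "DRAFT"
        source="INPUT",                    # screen user input, not model output
        content=[{"text": {"text": text}}],
    )
    return response.get("action") == "GUARDRAIL_INTERVENED"
```

With `boto3` installed and AWS credentials configured, a call might look like `guardrail_blocks(boto3.client("bedrock-runtime"), "your-guardrail-id", "1", message)`. Passing the client in as a parameter keeps the helper easy to exercise with a stub.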

When you leave the Moderation Model field empty, Pria falls back to the platform default configured for your tenant.

What Happens When Content Is Flagged

When the moderation model returns “unsafe”:

The user sees a brief refusal message in the chat. The original prompt is not forwarded to the LLM, so no tokens are billed against your AI budget for the rejected turn.
The event is recorded so Admins can review it in Histories — flagged turns appear with a moderation marker so you can spot patterns (e.g. a user repeatedly probing the filter).
If you have a contact email set on the Digital Twin, your safety contact is notified by email so the team can intervene early when policies are stress-tested.

The refusal copy is intentionally neutral and does not name the category that triggered the block — this avoids giving bad actors a template for evasion.
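
The three outcomes above (neutral refusal, recorded event, optional email) can be sketched as a single handler. Every name here is hypothetical, including the refusal text and the record fields. The point of the sketch is that the triggering category lands in the admin-facing record and never in the user-facing reply.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

REFUSAL = "Sorry, I can't help with that request."  # neutral: names no category


@dataclass
class FlagEvent:
    user_id: str
    categories: List[str]
    at: datetime


@dataclass
class FlagOutcome:
    reply: str
    event: FlagEvent
    notify: Optional[str]  # contact email to alert, or None if unset


def handle_flagged(
    user_id: str, categories: List[str], contact_email: Optional[str]
) -> FlagOutcome:
    """Build the refusal, the audit record, and the optional notification target."""
    event = FlagEvent(user_id, list(categories), datetime.now(timezone.utc))
    # The category goes into the admin-facing record, never into the reply.
    return FlagOutcome(reply=REFUSAL, event=event, notify=contact_email or None)
```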

Customizing Thresholds

The OpenAI Moderation endpoint returns a per-category yes/no decision alongside raw category scores, but there is no admin-tunable threshold in the Pria UI. Bedrock Guardrails are configured inside your AWS console, and Pria invokes whichever Guardrail policy your moderation model points to. To tighten or loosen thresholds, edit the Guardrail in AWS; Pria picks up the new policy on the next request. Guardrails are the right place for organizations that need to allow specific clinical, legal, or educational language that a general-purpose moderator would block.

Moderation vs. Praxis Shield

Moderation and Praxis Shield are complementary safety layers:

Moderation (this page) blocks individual unsafe messages in the moment, before they reach the AI model.
Praxis Shield monitors conversations after the fact and raises security incidents — with severity, status, and category — in a dedicated triage panel where admins review them, add notes, take action, and mark false positives.

If you are looking for the dashboard where flagged activity is reviewed and triaged, that’s Praxis Shield. This page covers the per-message blocking configuration only.

Acting on Violations

A flagged conversation is a signal, not necessarily grounds for punishment. Use the Histories view to:

See the user’s full conversation surrounding the flagged turn.
Determine whether the user was testing the filter, asking a legitimate question that was misclassified, or genuinely attempting misuse.
Contact the user via the email associated with their account if intervention is needed.
Adjust your Digital Twin’s onboarding or assistant prompt if a category of legitimate question keeps tripping the filter.

For institutional instances, consider pairing moderation with a clear acceptable-use statement during user onboarding — this reduces accidental violations and gives you a clean basis for enforcement.
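
When reviewing many flagged turns at once, a quick tally per user helps separate a one-off misclassification from repeated probing. The sketch below assumes you have flagged-turn records available as dictionaries. The `user_id` and `flagged` field names and the threshold of three are assumptions for illustration, not a documented Histories export format.

```python
from collections import Counter
from typing import Dict, Iterable, List


def repeat_flaggers(turns: Iterable[Dict], min_flags: int = 3) -> List[str]:
    """Return user IDs with at least `min_flags` flagged turns, most flagged first."""
    counts = Counter(t["user_id"] for t in turns if t.get("flagged"))
    return [user for user, n in counts.most_common() if n >= min_flags]


# Hypothetical rows; the field names are assumptions.
turns = [
    {"user_id": "ana", "flagged": True},
    {"user_id": "ben", "flagged": False},
    {"user_id": "ana", "flagged": True},
    {"user_id": "cy", "flagged": True},
    {"user_id": "ana", "flagged": True},
]
print(repeat_flaggers(turns))  # ['ana']
```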

Limitations

No moderation model is perfect. A few honest caveats:

False positives happen — clinical, legal, harm-reduction, and historical contexts can trip safety filters even when the underlying intent is educational. Bedrock Guardrails give you the most control here; OpenAI Moderation is a take-it-or-leave-it surface.
False negatives also happen — a determined adversary can phrase requests to slip past any moderator. Moderation is a layer, not a fortress.
Output moderation is provider-dependent. OpenAI’s moderation endpoint scores user inputs; Bedrock and Google can also score model outputs. For the providers that don’t, lean on the assistant’s prompt to constrain behaviour.

The most resilient setup combines: (1) Moderation on inputs, (2) an Assistant prompt that explicitly refuses out-of-scope or unsafe requests, (3) clear onboarding guidance to users about what the Digital Twin will and won’t do.

Compliance Considerations

If your deployment is subject to a regulatory regime, moderation interacts with it as follows:

FERPA — flagged turns include student inputs and are stored in Histories. Treat them with the same access controls as the rest of the conversation log.
GDPR — under “right to be forgotten” requests, flagged turns must be deleted along with the user’s other conversation data. Pria’s standard user-deletion flow covers this.
COPPA — for products serving children under 13, keep moderation on and configure your assistant to refuse personal-data prompts. The audit trail in Histories supports the operator’s record-keeping obligations.

If you operate under an explicit Data Processing Agreement, contact the Praxis AI team at humans@praxis-ai.com for region-specific guidance.

Related

Praxis Shield — review and triage security incidents detected across conversations.
AI Models — pick the conversation, moderation, and supporting models for your Digital Twin.
Feedback — review user-reported issues alongside automatic moderation flags.
Histories — inspect the conversations that produced a moderation event.
Privacy & Educational Consent — the policy framing for student data and flagged content.
