Content moderation screens every message a user sends to Pria before the conversation reaches the LLM. When a message is flagged as unsafe, Pria returns a polite refusal instead of forwarding the prompt to the model, protecting your learners, your brand, and the underlying providers from policy violations. This page is the Admin-side reference for the Moderation feature. End users do not configure anything; they only see the refusal copy when their message trips the filter.
What Moderation Does
When enabled on a Digital Twin, every user message is sent to a dedicated moderation model before any other processing. The moderation model returns a classification across several unsafe categories:
- Hate or harassment
- Violence or graphic content
- Self-harm
- Sexual content
- Illicit or illegal acts
Moderation runs against user inputs. For providers that support output checks (Bedrock Guardrails, Google safety filters), provider-side safety filters and the moderation hook on streaming responses also catch unsafe outputs.
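The ordering described above (moderate first, call the LLM only for clean messages) can be sketched as follows. This is a minimal illustration, not Pria's implementation: `handle_message`, `ModerationResult`, the category labels, and the refusal text are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical labels mirroring the category list above; not Pria identifiers.
CATEGORIES = ("hate_or_harassment", "violence_or_graphic", "self_harm", "sexual", "illicit")

REFUSAL = "Sorry, I can't help with that request."  # placeholder refusal copy


@dataclass
class ModerationResult:
    flagged: bool
    categories: list[str] = field(default_factory=list)


def handle_message(
    message: str,
    moderate: Callable[[str], ModerationResult],
    call_llm: Callable[[str], str],
    moderation_enabled: bool = True,
) -> str:
    """Run the moderation check first; only clean messages reach the LLM."""
    if moderation_enabled:
        result = moderate(message)
        if result.flagged:
            return REFUSAL  # the prompt is never forwarded to the model
    return call_llm(message)
```

Passing the moderation and LLM calls in as functions keeps the sketch independent of any particular provider.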
Enabling Moderation
Moderation is enabled per Digital Twin:
- Go to Admin → Configuration.
- Find the Enable Moderation toggle under the safety section.
- Save.
Selecting a Moderation Model
The Moderation Model field on the Digital Twin selects which model performs the safety check. Different providers offer different moderation surfaces:
OpenAI Moderation
OpenAI’s hosted moderation endpoint (omni-moderation-latest and earlier variants). Free to call, multilingual, fast, and well documented. Categories: hate, harassment, self-harm, sexual content, violence, illicit. This is the default and the easiest to operate.
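For orientation, here is a hedged sketch of reading a moderation result. The sample payload follows the general shape of OpenAI's documented moderation response (`results[0].flagged`, `categories`, `category_scores`), but it is trimmed and its values are invented for illustration. The helper `flagged_categories` is a hypothetical name, and how Pria itself consumes the response is not shown here.

```python
# Illustrative payload shaped like an OpenAI moderation response.
# Trimmed to a few categories; scores are made up for the example.
SAMPLE_RESPONSE = {
    "id": "modr-example",
    "model": "omni-moderation-latest",
    "results": [
        {
            "flagged": True,
            "categories": {
                "harassment": False,
                "hate": False,
                "illicit": False,
                "self-harm": False,
                "sexual": False,
                "violence": True,
            },
            "category_scores": {
                "harassment": 0.01,
                "hate": 0.00,
                "illicit": 0.02,
                "self-harm": 0.00,
                "sexual": 0.00,
                "violence": 0.93,
            },
        }
    ],
}


def flagged_categories(response: dict) -> list[str]:
    """Return the names of categories marked true, or [] if nothing was flagged."""
    result = response["results"][0]
    if not result["flagged"]:
        return []
    return sorted(name for name, hit in result["categories"].items() if hit)


print(flagged_categories(SAMPLE_RESPONSE))
```

With the official `openai` Python package, a live result would come from `client.moderations.create(model="omni-moderation-latest", input=text)`; the sample dictionary stands in for that call so the sketch runs offline.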
Bedrock Guardrails
AWS Bedrock supports moderation through Guardrails — configurable policies you author in the AWS console and attach to model invocations. Useful when your organization already has Bedrock guardrails defined for compliance or you need fine-grained category controls.
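As a rough sketch of what a Guardrail check on a user message can look like, the function below calls the Bedrock Runtime `apply_guardrail` operation on a client passed in by the caller. The function name `guardrail_blocks`, the stub client, and the identifier values are assumptions made for this example; the source does not say how Pria invokes Guardrails internally.

```python
from typing import Any


def guardrail_blocks(client: Any, guardrail_id: str, guardrail_version: str, text: str) -> bool:
    """Return True when the Guardrail intervenes on the user's input."""
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",  # score the user's message rather than model output
        content=[{"text": {"text": text}}],
    )
    return response["action"] == "GUARDRAIL_INTERVENED"


class StubBedrockClient:
    """Stand-in for a bedrock-runtime client so the sketch runs without AWS access."""

    def __init__(self, blocked_word: str) -> None:
        self.blocked_word = blocked_word

    def apply_guardrail(self, **kwargs: Any) -> dict:
        text = kwargs["content"][0]["text"]["text"]
        action = "GUARDRAIL_INTERVENED" if self.blocked_word in text else "NONE"
        return {"action": action}


# Hypothetical identifiers; a real deployment would use its own Guardrail ID and version.
stub = StubBedrockClient(blocked_word="forbidden")
print(guardrail_blocks(stub, "example-guardrail-id", "1", "a forbidden request"))
```

In a real setup the client would typically be created with `boto3.client("bedrock-runtime")`, and the policy decision would come from the Guardrail you authored in the AWS console.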
What Happens When Content Is Flagged
When the moderation model returns “unsafe”:
- The user sees a brief refusal message in the chat. The original prompt is not forwarded to the LLM, so no tokens are billed against your AI budget for the rejected turn.
- The event is recorded so Admins can review it in Histories — flagged turns appear with a moderation marker so you can spot patterns (e.g. a user repeatedly probing the filter).
- If you have a contact email set on the Digital Twin, your safety contact is notified by email so the team can intervene early when policies are stress-tested.
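The three outcomes listed above (refusal, recorded event, optional email) can be pictured with a small sketch. Everything here is hypothetical: `handle_flagged_turn`, `FlaggedTurn`, the in-memory `history` and `outbox` lists, and the refusal text are illustrative stand-ins, not Pria's storage or mail mechanisms.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

REFUSAL = "Sorry, I can't help with that request."  # placeholder refusal copy


@dataclass
class FlaggedTurn:
    user_id: str
    message: str
    categories: list[str]
    flagged_at: str
    moderation_flagged: bool = True  # the marker an Admin view could filter on


def handle_flagged_turn(
    user_id: str,
    message: str,
    categories: list[str],
    history: list,
    outbox: list,
    contact_email: Optional[str] = None,
) -> str:
    """Record the flagged turn, notify the safety contact if set, return the refusal."""
    # 1. Record the event for later review.
    now = datetime.now(timezone.utc).isoformat()
    history.append(FlaggedTurn(user_id, message, categories, now))
    # 2. Queue a notification only when a contact email is configured.
    if contact_email:
        outbox.append({
            "to": contact_email,
            "subject": f"Moderation flag for user {user_id}",
            "categories": categories,
        })
    # 3. Return the refusal; the caller does not forward the prompt to the LLM.
    return REFUSAL
```

Because the function returns before any model call, the rejected turn consumes no LLM tokens in this sketch, matching the behaviour described above.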
Customizing Thresholds
The OpenAI Moderation endpoint exposes a single per-category yes/no decision; there is no admin-tunable threshold in the Pria UI. Bedrock Guardrails are configured inside your AWS console, and Pria invokes whichever Guardrail policy your moderation model points to. To tighten or loosen thresholds, edit the Guardrail in AWS; Pria picks up the new policy on the next request. This is the right place for organizations that need to allow specific clinical, legal, or educational language that a general-purpose moderator would block.
Acting on Violations
A flagged conversation is a signal, not always a punishment. Use the Histories view to:
- See the user’s full conversation surrounding the flagged turn.
- Determine whether the user was testing the filter, asking a legitimate question that was misclassified, or genuinely attempting misuse.
- Contact the user via the email associated with their account if intervention is needed.
- Adjust your Digital Twin’s onboarding or assistant prompt if a category of legitimate question keeps tripping the filter.
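If you review flagged turns outside the UI, a simple count per user can help separate one-off misclassifications from repeated probing. The sketch below assumes a hypothetical list of turn records with `user_id` and `moderation_flagged` fields; the record shape and the `repeat_flaggers` helper are assumptions for illustration, not a documented Histories export format.

```python
from collections import Counter


def repeat_flaggers(history: list[dict], min_flags: int = 3) -> dict[str, int]:
    """Return users with at least `min_flags` flagged turns, with their counts."""
    counts = Counter(
        turn["user_id"] for turn in history if turn.get("moderation_flagged")
    )
    return {user: n for user, n in counts.items() if n >= min_flags}


# Hypothetical records for illustration only.
sample_history = [
    {"user_id": "alice", "moderation_flagged": True},
    {"user_id": "alice", "moderation_flagged": True},
    {"user_id": "alice", "moderation_flagged": True},
    {"user_id": "bob", "moderation_flagged": True},
    {"user_id": "bob", "moderation_flagged": False},
]

print(repeat_flaggers(sample_history))
```

The threshold of three is arbitrary; a count like this only suggests where to look, and the surrounding conversation still decides whether a flag reflects misuse.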
Limitations
No moderation model is perfect. A few honest caveats:
- False positives happen: clinical, legal, harm-reduction, and historical contexts can trip safety filters even when the underlying intent is educational. Bedrock Guardrails give you the most control here; OpenAI Moderation is a take-it-or-leave-it surface.
- False negatives also happen: a determined adversary can phrase requests to slip past any moderator. Moderation is a layer, not a fortress.
- Output moderation is provider-dependent. OpenAI’s moderation endpoint scores user inputs; Bedrock and Google can also score model outputs. For the providers that don’t, lean on the assistant’s prompt to constrain behaviour.
Compliance Considerations
If your deployment is subject to a regulatory regime, moderation interacts with it as follows:
- FERPA: flagged turns include student inputs and are stored in Histories. Treat them with the same access controls as the rest of the conversation log.
- GDPR: under “right to be forgotten” requests, flagged turns must be deleted along with the user’s other conversation data. Pria’s standard user-deletion flow covers this.
- COPPA: for products serving children under 13, keep moderation on and configure your assistant to refuse personal-data prompts. The audit trail in Histories supports the operator’s record-keeping obligations.
Related
- AI Models — pick the conversation, moderation, and supporting models for your Digital Twin.
- Feedback — review user-reported issues alongside automatic moderation flags.
- Histories — inspect the conversations that produced a moderation event.
- Privacy & Educational Consent — the policy framing for student data and flagged content.