Realtime Voice & Avatars

Convo Mode is Pria’s live two-way voice surface. Audio streams from the microphone to a speech-to-text engine, through your selected LLM, and back out as spoken audio — with optional animated avatars on top. This page covers the Admin-side setup. For the end-user experience see the user-guide page on Convo Mode.

What Convo Mode Is

Once enabled, users see a microphone affordance in the Pria interface that opens a live audio session. The session runs entirely browser-to-provider after Pria mints a short-lived session token — your long-lived API keys never reach the browser. What a session does end-to-end:

Captures microphone audio and streams it to the provider.
Streams partial transcripts back to the UI (optional).
Runs your assistant’s prompt, tools, RAG, and personalization on every turn.
Speaks the response back through the provider’s TTS voice.
Optionally renders an animated avatar synced to the audio.

Audio frames never traverse the Pria backend in steady state. Pria brokers the session token, then the browser opens a direct connection to the voice provider.
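The token-brokering step can be illustrated with a minimal sketch. Everything in it is hypothetical: the function names, the HMAC-signed token format, and the 60-second lifetime are assumptions for illustration, not Pria's or any provider's actual mechanism. The point is only that the browser receives a credential that expires quickly, never the long-lived key.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical server-side secret; stands in for the long-lived key
# that must never be sent to the browser.
SERVER_SECRET = b"long-lived-secret-kept-on-the-server"


def mint_session_token(user_id: str, ttl_seconds: int = 60,
                       now: Optional[float] = None) -> str:
    """Return a signed token that expires ttl_seconds after `now`."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued + ttl_seconds}).encode()
    signature = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_session_token(token: str, now: Optional[float] = None) -> bool:
    """Check the signature and expiry of a token from mint_session_token."""
    current = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        # Wrong number of parts or invalid base64.
        return False
    expected = hmac.new(SERVER_SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    return json.loads(payload)["exp"] > current
```

A token minted this way is useless once its lifetime has passed, which limits the damage if it leaks from the browser.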

Who Can Use Convo Mode

Two toggles on the Digital Twin gate access (Enable Convo Mode and Admin-only); two more control what users see inside the Convo Mode panel.

Enable Convo Mode (rtEnabled)

Master switch for the entire Digital Twin. When off, the microphone affordance is hidden from every user. Default: off.

Admin-only (rtAdminOnly)

When on, only Admin users see Convo Mode. Useful while you’re testing a new provider, tuning voices, or troubleshooting an avatar setup before exposing it to learners. Default: on — flip to off once you’re ready for everyone.

Text input alongside voice (rtTextInputEnabled)

Lets users type into the Convo Mode panel as well as speak. Helpful for noisy environments or accessibility.

Show running transcript (rtTranscriptEnabled)

Renders the live STT transcript inside the Convo Mode widget so users can see what was heard.
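The way the two access toggles combine can be sketched as a small visibility check. The flag names come from this page; the dataclass, the function name, and the defaults for the two UI toggles are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TwinRealtimeFlags:
    # Defaults mirror the ones stated above: Convo Mode off, Admin-only on.
    rtEnabled: bool = False
    rtAdminOnly: bool = True
    rtTextInputEnabled: bool = False   # default not documented; assumed off
    rtTranscriptEnabled: bool = False  # default not documented; assumed off


def can_see_convo_mode(flags: TwinRealtimeFlags, user_is_admin: bool) -> bool:
    """Return True when the microphone affordance should be shown to this user."""
    if not flags.rtEnabled:
        # Master switch off: hidden from every user, Admins included.
        return False
    if flags.rtAdminOnly:
        return user_is_admin
    return True
```

With the default flags nobody sees Convo Mode; enabling rtEnabled alone exposes it to Admins only.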

Choosing a Provider

Pick a provider by setting the Realtime Model field on the Digital Twin. The model string determines which provider Pria routes to.

OpenAI Realtime

Lowest latency, broadest model coverage, built-in OpenAI voices. The default choice for most Digital Twins. Uses WebRTC.

ElevenLabs

Premium voices via the ElevenLabs ConvAI agent bridge. Use when voice quality is the priority and your agent is already configured in the ElevenLabs dashboard.

Gemini Live

Google’s audio-native session with thinking support. Uses WebSocket; 30 prebuilt voices.

xAI Realtime

Grok voice with five built-in voices (eve, ara, rex, sal, leo). Uses WebSocket.

Anam Avatar

Animated avatar driven by your selected LLM. Anam owns mic, STT, TTS, and video; Pria supplies the assistant turn text — so your prompts, RAG, and tools still run on Pria.

LemonSlice Avatar

Legacy avatar path via Daily.co. Prefer Anam for new setups; LemonSlice is maintained while existing tenants migrate.
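As a rough illustration of how a model string could select a provider, the sketch below matches the model strings listed under Per-Provider Configuration. The function name and the returned provider labels are hypothetical, and Pria's actual routing logic may differ.

```python
def resolve_provider(realtime_model: str) -> str:
    """Map a Realtime Model string to a provider label (labels are hypothetical)."""
    model = realtime_model.strip().lower()

    # Providers selected by one exact model string.
    exact = {
        "elevenlabs": "elevenlabs",
        "anam_pria_custom_llm": "anam",
        "lemonslice": "lemonslice",
    }
    if model in exact:
        return exact[model]

    # Providers selected by a model-name prefix.
    prefixes = [
        ("gpt-realtime-", "openai"),
        ("gpt-4o-realtime-", "openai"),
        ("gemini-", "gemini"),
        ("grok-", "xai"),
    ]
    for prefix, provider in prefixes:
        if model.startswith(prefix):
            return provider

    raise ValueError(f"No realtime provider matches model {realtime_model!r}")
```

Exact strings are checked before prefixes so that a fixed identifier such as `lemonslice` can never be shadowed by a prefix rule.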

Per-Provider Configuration

OpenAI Realtime

Field	Purpose
Realtime Model	A `gpt-realtime-` or `gpt-4o-realtime-` model from the catalog.
Voice	One of the OpenAI realtime voices (e.g. `marin`, `cedar`). Invalid voices fall back to `marin`.
VAD Eagerness	`low` / `medium` / `high` — how aggressively the model decides the user has finished speaking.
Noise Reduction	`near_field` for headsets / close mics, `far_field` for laptops / conference rooms. Leave blank to disable.
Transcription Language	Optional hint (e.g. `fr`, `es`) that improves STT accuracy when your users speak a single non-English language. Leave blank for English or multilingual rooms.

OpenAI API key resolves in this order: per-model key in Custom Models → Digital Twin’s openai_api_key → platform fallback.
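The resolution order above amounts to "first non-empty key wins". A minimal sketch, assuming each level is available as an optional string (the function name and the source labels are hypothetical):

```python
from typing import Optional, Tuple


def resolve_openai_key(
    per_model_key: Optional[str],
    twin_key: Optional[str],
    platform_key: Optional[str],
) -> Tuple[str, str]:
    """Return (key, source) for the first configured key, in priority order."""
    candidates = [
        ("custom_model", per_model_key),  # per-model key in Custom Models
        ("digital_twin", twin_key),       # the Digital Twin's openai_api_key
        ("platform", platform_key),       # platform fallback
    ]
    for source, key in candidates:
        # Treat None, empty, and whitespace-only values as "not configured".
        if key and key.strip():
            return key.strip(), source
    raise LookupError("No OpenAI API key configured at any level")
```

Returning the source alongside the key makes it easy to tell whether usage lands on your own provider account or on the platform fallback.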

ElevenLabs

Field	Purpose
Realtime Model	Set to `elevenlabs`.
Agent ID	The ConvAI agent ID from your ElevenLabs dashboard. The agent itself defines the voice, prompt, and tool surface.
Connection Method	`webrtc` or `websocket` (default).
API Key	Per-tenant ElevenLabs key. Falls back to platform default.

ElevenLabs is the only provider where “agent” is a vendor-side concept — your ElevenLabs agent does the talking. Configure it to call Pria back as a Custom LLM if you want Pria’s prompts, RAG, and tools in the loop.

Gemini Live

Field	Purpose
Realtime Model	A `gemini-*` realtime model (e.g. `gemini-2.5-flash-native-audio-preview-12-2025`).
Gemini Voice	One of 30 prebuilt voices (`Puck`, `Charon`, `Kore`, `Fenrir`, `Aoede`, `Zephyr`, …), each labelled with its character. Leave on User’s Choice to let users pick.
Gemini API Key	Per-tenant key. Falls back to platform default.

Gemini sessions are capped around 10 minutes per connection by Google; Pria configures a sliding context window so longer conversations keep working across reconnects.
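The sliding-window idea can be pictured as a bounded buffer of recent turns that survives a reconnect. The sketch below shows the concept only; it is not the Gemini API or Pria's implementation, and the class name and the 20-turn limit are assumptions.

```python
from collections import deque
from typing import Dict, List


class SlidingContext:
    """Keep only the most recent turns so context stays bounded."""

    def __init__(self, max_turns: int = 20):
        # A deque with maxlen silently drops the oldest turn when full.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, role: str, text: str) -> None:
        self._turns.append({"role": role, "text": text})

    def snapshot(self) -> List[Dict[str, str]]:
        """Return the retained turns, oldest first, e.g. to seed a new connection."""
        return list(self._turns)
```

After a reconnect, the retained turns give the new connection enough context to continue the conversation without replaying the whole session.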

xAI Realtime

Field	Purpose
Realtime Model	A `grok-*` model (e.g. `grok-3-fast`).
xAI Voice	`eve`, `ara`, `rex`, `sal`, or `leo`.
xAI API Key	Per-tenant key. Falls back to platform default.

Note: image markdown is sometimes garbled in spoken output on grok-fast — this is a current model limitation on xAI’s side.

Anam Avatar

Field	Purpose
Realtime Model	Set to `anam_pria_custom_llm`.
Avatar ID	Vendor avatar identifier from Anam.
Voice ID	Vendor voice identifier from Anam.
Placeholder Image URL	Shown while the video stream is loading.
Loading Video URL	Optional looping animation while a turn is in flight.
Intro Message	Optional line the avatar speaks on session start.
Conversation Model	Optional override for the LLM that powers voice turns. Leave blank to use the Digital Twin’s default conversation model.
Anam API Key	Per-tenant key. Falls back to platform default.

Pria runs the LLM, RAG, tools, and personalization for every turn — Anam just renders the avatar and handles mic/TTS.

LemonSlice Avatar

Field	Purpose
Realtime Model	Set to `lemonslice`.
Agent ID	LemonSlice agent identifier.
Model	Optional per-tenant model override for voice turns.
Placeholder Image URL	Shown before the video stream starts.
Loading Video URL	Optional loop while turns are in flight.
Intro Message	Optional line spoken on session start.
Allow Imagine	When on, lets the avatar generate images mid-session via the imagination tool.
LemonSlice API Key	Per-tenant key. Falls back to platform default.

LemonSlice is being deprecated in favour of Anam. Use Anam for any new Digital Twin setup.
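Before saving a configuration, it can help to check that the vendor identifiers are filled in. The sketch below assumes that fields not marked Optional in the tables above are required, and that API keys can be omitted because they fall back to the platform default. The snake_case keys and the function name are hypothetical.

```python
from typing import Dict, List

# Assumption: fields not marked "Optional" in the tables above are required.
# The snake_case keys are hypothetical names for the Admin fields.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "elevenlabs": ["agent_id"],
    "anam": ["avatar_id", "voice_id"],
    "lemonslice": ["agent_id"],
}


def missing_fields(provider: str, config: dict) -> List[str]:
    """List required fields that are absent or blank for the given provider."""
    required = REQUIRED_FIELDS.get(provider, [])
    return [name for name in required if not str(config.get(name) or "").strip()]
```

For example, `missing_fields("anam", {"avatar_id": "a1"})` reports that `voice_id` still needs a value.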

Voice Activity Detection (VAD)

VAD is how the provider decides when a user has finished speaking. Two knobs apply to OpenAI Realtime:

VAD Eagerness: low waits longer before deciding the user has finished, so users can pause mid-thought (good for thoughtful conversations); high ends the turn sooner and responds faster (good for quick Q&A drills).
Noise Reduction: near_field cleans up headset audio; far_field cleans up laptop mics in rooms with background noise. Leave blank if your users are on quality hardware.

Gemini, xAI, ElevenLabs, Anam, and LemonSlice manage VAD internally — those toggles are OpenAI-only.
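The OpenAI-only settings can be validated with a small normalizer. This is a sketch under stated assumptions: the function name and the output keys are hypothetical, and the voice set lists only the two voices named on this page, not the full catalog.

```python
from typing import Optional

# Only the two voices named on this page; the real catalog is larger.
KNOWN_VOICES = {"marin", "cedar"}
FALLBACK_VOICE = "marin"
VAD_EAGERNESS = {"low", "medium", "high"}
NOISE_REDUCTION = {"near_field", "far_field"}


def normalize_openai_settings(
    voice: str,
    vad_eagerness: str,
    noise_reduction: Optional[str] = None,
) -> dict:
    """Apply the documented fallback and reject unknown option values."""
    if voice not in KNOWN_VOICES:
        # Invalid voices fall back to marin, as described in the table above.
        voice = FALLBACK_VOICE
    if vad_eagerness not in VAD_EAGERNESS:
        raise ValueError(f"vad_eagerness must be one of {sorted(VAD_EAGERNESS)}")
    # A blank value disables noise reduction.
    reduction = noise_reduction or None
    if reduction is not None and reduction not in NOISE_REDUCTION:
        raise ValueError(f"noise_reduction must be one of {sorted(NOISE_REDUCTION)}")
    return {
        "voice": voice,
        "vad_eagerness": vad_eagerness,
        "noise_reduction": reduction,
    }
```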

Transcription Language

If your users speak a single non-English language, set the Transcription Language field (e.g. fr, es, de, ja). It biases the STT engine and reduces transcription errors. Leave blank for English or multilingual rooms. This setting applies to OpenAI Realtime only.

Avatar Imagination Prompts

LemonSlice’s Allow Imagine toggle lets users ask the avatar to generate images during a voice session. The image is produced via your Digital Twin’s Image Generation model and shown alongside the avatar. Disable if you don’t want generated images appearing during voice conversations. Anam supports tools (including image generation) through the per-turn voice handler — no separate toggle needed.

Testing Your Setup

Keep Convo Mode admin-only while testing

Leave Admin-only on while you tune voice, VAD, and avatar settings. You’ll be the only one who sees the microphone affordance.

Open Convo Mode

Verify the basics

Confirm: the avatar (if any) appears, your voice is transcribed, the assistant responds in the chosen voice, and your assistant’s tools still fire mid-conversation.

Tune VAD if needed

If the assistant interrupts users mid-thought, drop VAD Eagerness to low. If responses feel sluggish, bump it to high.

Flip Admin-only off

Once you’re satisfied, turn off Admin-only so every user can use Convo Mode.

Cost Considerations

Every provider bills for Convo Mode usage, but the billing unit and the resulting cost vary widely:

OpenAI Realtime and Gemini Live are billed by the provider on input and output audio tokens, so cost scales with session length.
ElevenLabs charges per character of TTS output plus session minutes.
xAI Realtime is billed on audio minutes.
Anam and LemonSlice add an avatar surcharge on top of the underlying LLM cost.

If you bring your own provider API keys via the Custom Models flow, those minutes are billed directly to your provider account. Otherwise they go through Pria’s platform billing.

Avatars (Anam, LemonSlice) are the most expensive option per minute. If cost is a concern and you don’t need a visible persona, OpenAI Realtime or Gemini Live with a quality voice will deliver excellent results at a fraction of the cost.

Related

Voice & Realtime Providers (integration overview): compare every provider (OpenAI, ElevenLabs, Gemini Live, xAI, Anam, LemonSlice) and how each connects
Convo Mode (User Guide)
Configuration
AI Models
Personalization
