When you drop a file into your IP Vault, it doesn’t become instantly searchable. Pria runs it through a five-phase pipeline that extracts text, breaks it into searchable segments, cleans it up, embeds it as vectors, and finalizes it for retrieval. Knowing what each phase does — and what each status means — helps you spot trouble early and reprocess only what needs reprocessing.

The journey from upload to searchable

Upload  →  Extract  →  Chunk  →  Sanitize  →  Embed  →  Finalize  →  Included
Each phase commits its work before the next one starts, so a failure surgically retries only the failed phase — you don’t lose progress on a 200-page PDF because the last batch of embeddings timed out. You can watch the journey live on the file card in the vault. The progress bar advances through the phases, and the percentage shows real work done (not interpolated time).
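
The commit-then-advance behavior can be pictured with a small sketch. This is not Pria's implementation: `run_pipeline`, the `checkpoint` dictionary, and the retry budget of 3 are hypothetical names and values chosen only to illustrate how committed phases are skipped on a retry.

```python
from typing import Callable, Dict

PHASES = ["extract", "chunk", "sanitize", "embed", "finalize"]


def run_pipeline(
    steps: Dict[str, Callable[[object], object]],
    source: object,
    checkpoint: Dict[str, object],
    max_attempts: int = 3,  # hypothetical retry budget, not a documented value
) -> object:
    """Run the phases in order, committing each result before moving on.

    A phase whose result is already in `checkpoint` is skipped, so a rerun
    after a failure resumes at the failed phase instead of starting over.
    """
    data = source
    for phase in PHASES:
        if phase in checkpoint:  # committed on an earlier run: reuse it
            data = checkpoint[phase]
            continue
        last_error = None
        for _ in range(max_attempts):
            try:
                result = steps[phase](data)
                break
            except Exception as exc:  # sketch only: retry on any error
                last_error = exc
        else:
            raise RuntimeError(
                f"{phase} failed after {max_attempts} attempts"
            ) from last_error
        checkpoint[phase] = result  # commit before the next phase starts
        data = result
    return data
```

If the embed step raises on every attempt, the results of extract, chunk, and sanitize stay in `checkpoint`, and calling `run_pipeline` again with the same dictionary starts directly at embed.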

The 5 phases

Extract

Pria reads the source bytes and turns them into clean text.

| Source type | What Extract does |
| --- | --- |
| PDF / DOCX / PPTX / XLSX | Parses text directly. Image-based PDFs fall back to OCR. |
| Images (JPG, PNG, etc.) | Vision model description + OCR for any embedded text. |
| Audio / video | Transcribes speech to text. Files over 25 MB are auto-split, transcribed in parallel, and re-stitched. |
| Email (.eml, .msg) | Headers + body, including inline attachments. |
| HTML / web pages | Strips navigation, ads, and boilerplate; keeps article text. |
| Jupyter notebooks | Flattens markdown and code cells. |
| ZIP / IMSCC archives | Each member becomes its own file and re-enters the pipeline. |

The card sits between 5% and 40% during extract. Long videos and large PDFs spend most of their time here.

Chunk

The extracted text is split into segments of about 8,000 characters each (roughly 1,500–2,000 words). Segments overlap by ~5% so a sentence that lands on a boundary still appears whole in one of the two neighbours.
Segment boundaries snap to the nearest paragraph or sentence end, so chunks never tear mid-thought.
The card runs 40% to 50% during chunking. Even huge files finish this phase in seconds.
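
A minimal sketch of overlap-plus-snapping chunking, assuming boundaries are detected with a simple regular expression. The function name `chunk_text`, the boundary pattern, and the way the overlap is snapped back to a sentence start are illustrative assumptions, not Pria's actual splitter.

```python
import bisect
import re
from typing import List

# A cut is allowed right after a blank line or after sentence-ending punctuation.
_BOUNDARY = re.compile(r"\n\s*\n|(?<=[.!?])\s+")


def chunk_text(text: str, target: int = 8000, overlap: float = 0.05) -> List[str]:
    """Split text into segments of roughly `target` characters.

    Each segment ends on the boundary nearest the ideal cut, and the next
    segment starts on a boundary at least `target * overlap` characters
    before that cut, so neighbouring segments share whole sentences.
    """
    cuts = [m.end() for m in _BOUNDARY.finditer(text)]
    back_off = int(target * overlap)
    segments: List[str] = []
    start = 0
    while start < len(text):
        ideal = start + target
        if ideal >= len(text):  # the remainder fits in one segment
            segments.append(text[start:])
            break
        # Look at the boundary just before and just after the ideal cut.
        i = bisect.bisect_right(cuts, ideal)
        near = [c for c in cuts[max(i - 1, 0): i + 1] if c > start]
        end = min(near, key=lambda c: abs(c - ideal)) if near else ideal
        segments.append(text[start:end])
        if end >= len(text):
            break
        # Step back by the overlap, then snap to the boundary at or before it.
        j = bisect.bisect_right(cuts, end - back_off)
        start = cuts[j - 1] if j > 0 and cuts[j - 1] > start else end
    return segments
```

With the defaults, the overlap is about 400 characters, which is usually enough to repeat the last few sentences of one segment at the start of the next.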

Sanitize

Each segment is screened by a quick heuristic. Clean prose passes through untouched: no LLM call, no cost. Segments that look noisy (OCR garbage, broken formatting, navigation residue) are rewritten by your institution’s summary model into cleaner prose.
Sanitize is enabled by default, but your administrator can turn it off per institution.
The card runs 50% to 75% during sanitize. The bar advances in real time as each batch comes back from the model.
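
The documentation does not say which signals the heuristic uses. The sketch below shows one plausible shape for such a screen, with made-up signals and thresholds (share of alphabetic characters, share of very short lines, presence of the Unicode replacement character). `looks_noisy`, `sanitize`, and the `rewrite` callback are hypothetical names; `rewrite` stands in for the call to the summary model.

```python
from typing import Callable, List


def looks_noisy(
    segment: str,
    min_alpha_ratio: float = 0.6,       # hypothetical threshold
    max_short_line_ratio: float = 0.5,  # hypothetical threshold
) -> bool:
    """Cheap screen: True when a segment looks like OCR debris or nav residue."""
    if not segment.strip():
        return True
    if "\ufffd" in segment:  # replacement character left by a bad decode
        return True
    visible = [ch for ch in segment if not ch.isspace()]
    alpha_ratio = sum(ch.isalpha() for ch in visible) / len(visible)
    lines = [ln for ln in segment.splitlines() if ln.strip()]
    short_ratio = sum(len(ln.split()) <= 2 for ln in lines) / len(lines)
    # Many one- or two-word lines in a row usually means menus or link lists.
    menu_like = len(lines) >= 4 and short_ratio > max_short_line_ratio
    return alpha_ratio < min_alpha_ratio or menu_like


def sanitize(segments: List[str], rewrite: Callable[[str], str]) -> List[str]:
    """Pass clean segments through untouched; send only noisy ones to `rewrite`."""
    return [rewrite(s) if looks_noisy(s) else s for s in segments]
```

The point of the design is cost control: the model is called only for segments that fail the cheap check, so a clean document costs nothing in this phase.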

Embed

Every segment is converted to a vector embedding, a high-dimensional numerical representation of its meaning. Embeddings are what make semantic search possible: your Digital Twin compares the vector of your question against every segment vector to find the closest matches.
Pria optionally enriches each segment with 2–3 AI-generated keywords plus a document-level summary header before embedding. This makes single-segment matches more findable.
The card runs 75% to 95% during embed.
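
To make "closest matches" concrete, here is a minimal sketch that ranks segments by cosine similarity. Cosine similarity is an assumption (the documentation does not name the distance measure), and `cosine`, `top_matches`, and the toy two-dimensional vectors are illustrative only; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(
    query_vec: Sequence[float],
    segment_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every segment against the query and return the k best, highest first."""
    scored = [(seg_id, cosine(query_vec, vec)) for seg_id, vec in segment_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A production system would normally use a vector index rather than scoring every segment in a loop; the brute-force version is shown because it matches the description most directly.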

Finalize

The final phase is cheap closing work:
  1. Vault Health Grade is computed from the final segment set.
  2. The file’s status flips from inactive to Included, which is the trigger every retrieval path checks.
  3. Scratch data is cleaned up.
  4. Token usage is rolled into the billing record.
Until the status flips to Included, retrieval will skip the file entirely. The card jumps from 95% to 100% during finalize.

Reading the progress bar

The bar on each file card shows the active phase and the percentage of work completed within the pipeline. The percentage reflects actual segments processed, not interpolated time — so 50% really means half the segments are through.

| Status chip | What it means |
| --- | --- |
| Queued | Waiting for a worker to pick it up. Usually a second or two. |
| Extract | Pulling text from the source. |
| Chunk | Splitting text into segments. |
| Sanitize | Cleaning noisy segments. |
| Embed | Generating vector embeddings. |
| Finalize | Wrapping up and computing health. |
| Included (green) | Done. Searchable. Cited in chat. |
| Error (red) | A phase failed and exceeded its retry budget. Click for details. |

The counter updates in batches, not one segment at a time. On short files the bar may pause briefly before jumping ahead. That’s normal — work is still in flight.
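
The per-phase ranges quoted under "The 5 phases" (5–40% for extract, 40–50% for chunk, 50–75% for sanitize, 75–95% for embed, 95–100% for finalize) can be combined with a batch counter as in the sketch below. The linear interpolation inside each range and the name `card_percent` are assumptions for illustration; the documentation only states the ranges and that the figure tracks real work.

```python
from typing import Dict, Tuple

# Percentage range each phase occupies on the file card, as listed in this page.
PHASE_BANDS: Dict[str, Tuple[int, int]] = {
    "extract": (5, 40),
    "chunk": (40, 50),
    "sanitize": (50, 75),
    "embed": (75, 95),
    "finalize": (95, 100),
}


def card_percent(phase: str, done: int, total: int) -> int:
    """Map 'done of total units in this phase' onto the phase's range."""
    low, high = PHASE_BANDS[phase]
    fraction = done / total if total > 0 else 0.0
    fraction = min(max(fraction, 0.0), 1.0)  # clamp against bad counters
    return round(low + (high - low) * fraction)
```

Because `done` only changes when a batch completes, the value moves in steps, which is why a short file can appear to pause and then jump.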

Vault Health Grade

After finalize, every file gets a Vault Health Grade — a quick read on how usable the file is for retrieval. The badge appears on the file card and contributes to your overall Vault Health letter grade (A–F).

What affects the grade

Text quality

Clean, well-structured prose scores highest. OCR-heavy text, garbled characters, and code-dominant content lower the grade.

Chunk coverage

Files that produce a reasonable number of well-sized segments grade better than tiny one-segment files or oversized monolithic blobs.

OCR accuracy

Scanned PDFs with high OCR confidence grade higher than blurry image-based documents.

Metadata richness

Files with extractable titles, authors, and structure (headings, sections) help Pria retrieve more precisely.
If a file scores poorly, open it in the File Preview to see what’s flagged, then run Sanitize All on its segments or re-upload a cleaner source.

Common issues

| Symptom | Likely cause | What to do |
| --- | --- | --- |
| File too large | Single file over 150 MB, or batch over 500 MB | Split the file, or upload in smaller batches. |
| Unsupported format | The file type isn’t in the supported list | Convert to a supported format (PDF, DOCX, MP4, etc.) before upload. |
| OCR failed / empty extract | Scanned PDF with very low image quality, or a corrupt source | Try a higher-resolution scan, or transcribe the document manually. |
| Audio too long / no speech | Transcription couldn’t find usable speech | Check the audio plays back; for music-heavy tracks, consider not indexing. |
| Embed failed mid-batch | Embedding provider hit a rate limit or 5xx | Pria retries automatically. If it keeps failing, Reprocess Content → Re-embed. |
| Stuck in one phase | A worker crashed mid-job | The sweeper re-queues stuck jobs automatically within ~90 seconds. If still stuck, Reprocess Content → Full Re-extract. |
| Status: Error | All retries exhausted | Open the preview to see the failure reason; try Reprocess or Re-ingest from URL if applicable. |
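
To avoid the "File too large" case before uploading, a local script can group files into batches that respect both limits. This is a client-side sketch, not a Pria tool: `plan_batches` is a hypothetical helper, and it assumes decimal megabytes (1 MB = 1,000,000 bytes), which the documentation does not specify.

```python
from typing import Dict, List, Tuple

MB = 1_000_000            # assumption: decimal megabytes
MAX_FILE = 150 * MB       # single-file limit from the table above
MAX_BATCH = 500 * MB      # per-batch limit from the table above


def plan_batches(sizes: Dict[str, int]) -> Tuple[List[List[str]], List[str]]:
    """Group files into upload batches under the batch limit.

    `sizes` maps file name to size in bytes. Returns (batches, too_large),
    where `too_large` lists files that exceed the single-file limit and
    must be split before they can be uploaded at all.
    """
    too_large = [name for name, size in sizes.items() if size > MAX_FILE]
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sizes.items():
        if size > MAX_FILE:
            continue
        if current and current_size + size > MAX_BATCH:
            batches.append(current)  # close the batch before it overflows
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches, too_large
```

The grouping is greedy and keeps the input order; it does not try to find the smallest possible number of batches.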

Re-processing a file

Open the file’s action menu and pick Reprocess Content. You get four modes:

| Mode | What it does | When to use |
| --- | --- | --- |
| Re-embed | Keeps the segments, regenerates embeddings | After changing the embedding model |
| Full Re-extract | Wipes everything, restarts from Extract | When extraction was wrong (bad PDF, OCR upgrade) |
| Re-ingest from URL | Re-downloads the source and runs the full pipeline | URL imports where the source changed |
| Re-extract Knowledge Graph | Re-runs the graph extractor against existing segments | When entity/relationship counts look off |

Re-ingest preserves the file’s identifiers, so any links you shared keep resolving.

Confidential files

When you upload a file with the Confidential flag:
  • Only you, the uploader, can see it. Even other admins in your institution can’t read its contents.
  • The file is excluded from the shared vault — it doesn’t appear in Instance or Account searches for anyone else.
  • Other users searching content see a locked placeholder where the snippet would have been, with no preview text.
  • Your Digital Twin still cites it when you are chatting — citations to confidential files work normally for the owner.
This is the right level for personal notes, draft documents, and material you want indexed for your own conversations but not exposed to teammates.
Confidential is per-file and irreversible without re-uploading. If you mark a file confidential, then later want to share it, you’ll need to upload a new copy (or ask your administrator to assist).
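
The visibility rules above can be summarized in a few lines. This sketch models only the Confidential flag: `VaultFile`, `can_read`, and `search_result` are hypothetical names, and non-confidential files are treated as readable by anyone, which ignores whatever other sharing rules apply in a real vault.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class VaultFile:
    file_id: str
    owner_id: str       # the uploader
    confidential: bool
    snippet: str        # text that would be shown in a search result


def can_read(file: VaultFile, user_id: str) -> bool:
    """Only the uploader can read a confidential file; no role check applies,
    because even administrators cannot read its contents."""
    return (not file.confidential) or file.owner_id == user_id


def search_result(file: VaultFile, user_id: str) -> Dict[str, Union[str, bool, None]]:
    """Return the snippet for readers, or a locked placeholder with no preview text."""
    snippet: Optional[str] = file.snippet if can_read(file, user_id) else None
    return {"file_id": file.file_id, "locked": snippet is None, "snippet": snippet}
```

For example, the owner's own search returns the snippet and `locked` is `False`, while any other user gets `locked` set to `True` and no text.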