When you drop a file into your IP Vault, it doesn’t become instantly searchable. Pria runs it through a five-phase pipeline that extracts text, breaks it into searchable segments, cleans it up, embeds it as vectors, and finalizes it for retrieval. Knowing what each phase does, and what each status means, helps you spot trouble early and reprocess only what needs reprocessing.
The journey from upload to searchable
The 5 phases
1. Extract — pulling text out of the source
Pria reads the source bytes and turns them into clean text.
The card sits between 5% and 40% during extract. Long videos and large PDFs spend most of their time here.
| Source type | What Extract does |
|---|---|
| PDF / DOCX / PPTX / XLSX | Parses text directly. Image-based PDFs fall back to OCR. |
| Images (JPG, PNG, etc.) | Vision model description + OCR for any embedded text. |
| Audio / video | Transcribes speech to text. Files over 25 MB are auto-split, transcribed in parallel, and re-stitched. |
| Email (.eml, .msg) | Headers + body, including inline attachments. |
| HTML / web pages | Strips navigation, ads, and boilerplate; keeps article text. |
| Jupyter notebooks | Flattens markdown and code cells. |
| ZIP / IMSCC archives | Each member becomes its own file and re-enters the pipeline. |
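
The audio/video row describes a split, transcribe-in-parallel, re-stitch flow. The sketch below illustrates that shape only. `plan_parts`, `transcribe_part`, and the worker count are hypothetical names and values, 25 MB is assumed to mean 25 × 1024 × 1024 bytes, and a real splitter would presumably cut on audio boundaries rather than raw byte offsets.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PART_BYTES = 25 * 1024 * 1024  # assumed meaning of the 25 MB threshold


def plan_parts(total_bytes: int, max_part: int = MAX_PART_BYTES) -> list:
    """Return (start, end) byte ranges, each at most max_part long."""
    if total_bytes <= max_part:
        return [(0, total_bytes)]
    return [
        (start, min(start + max_part, total_bytes))
        for start in range(0, total_bytes, max_part)
    ]


def transcribe_part(part: tuple) -> str:
    """Hypothetical stand-in for a speech-to-text call on one part."""
    start, end = part
    return f"[transcript of bytes {start}-{end}]"


def transcribe(total_bytes: int) -> str:
    parts = plan_parts(total_bytes)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() returns results in input order, so the stitched text
        # stays in sequence even if parts finish out of order.
        texts = list(pool.map(transcribe_part, parts))
    return " ".join(texts)
```

Under these assumptions, a 60 MB recording is planned as three parts, and the stitched transcript keeps them in their original order.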
2. Chunk — splitting into searchable segments
The extracted text is split into segments of about 8,000 characters each (roughly 1,500–2,000 words). Segments overlap by ~5% so a sentence that lands on a boundary still appears whole in one of the two neighbours.
Segment boundaries snap to the nearest paragraph or sentence end, so chunks never tear mid-thought.
The card runs 40% to 50% during chunking. Even huge files finish this phase in seconds.
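
To make the chunking rules concrete, here is a minimal sketch of a chunker with an 8,000-character target, a 5% overlap, and boundary snapping. It is not Pria's implementation. The boundary regex, the snapping window (`target // 4`), and the function names are assumptions chosen for illustration.

```python
import bisect
import re
from typing import Optional

# Positions right after a paragraph break, or after . ! ? plus whitespace.
BOUNDARY = re.compile(r"\n\s*\n|(?<=[.!?])\s+")


def nearest_cut(cuts: list, pos: int, tolerance: int) -> Optional[int]:
    """Return the boundary closest to pos within tolerance, or None."""
    i = bisect.bisect_left(cuts, pos)
    near = [c for c in cuts[max(i - 1, 0):i + 1] if abs(c - pos) <= tolerance]
    return min(near, key=lambda c: abs(c - pos)) if near else None


def chunk_text(text: str, target: int = 8000, overlap_ratio: float = 0.05) -> list:
    overlap = int(target * overlap_ratio)
    tolerance = target // 4  # assumed snapping window
    cuts = [m.end() for m in BOUNDARY.finditer(text)]
    chunks = []
    start = 0
    while start < len(text):
        ideal = start + target
        if ideal >= len(text):
            chunks.append(text[start:])  # the remainder fits in one segment
            break
        # Snap to a nearby boundary; fall back to a hard cut if none exists.
        end = nearest_cut(cuts, ideal, tolerance) or ideal
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by the overlap, then back again to a sentence start.
        back = end - overlap
        j = bisect.bisect_right(cuts, back)
        prev_cut = cuts[j - 1] if j else 0
        start = prev_cut if prev_cut > start + target // 2 else back
    return chunks
```

Because each new segment starts roughly 400 characters before the previous one ended, a sentence near a boundary appears whole in at least one of the two neighbours.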
3. Sanitize — cleaning OCR noise and enriching context
Each segment is screened by a quick heuristic. Clean prose passes through untouched: no LLM call, no cost. Segments that look noisy (OCR garbage, broken formatting, navigation residue) are rewritten by your institution’s summary model into cleaner prose.
Sanitize is enabled by default, but your administrator can turn it off per institution.
The card runs 50% to 75% during sanitize. The bar advances in real time as each batch comes back from the model.
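
The screening step can be pictured as a cheap score computed before any model call. The heuristic below is an invented example, not the one Pria uses. The two signals, the 0.15 threshold, and the `rewrite` callable (standing in for the summary model) are all assumptions.

```python
import re

ALLOWED_PUNCTUATION = ".,;:!?'\"()-"


def noise_score(segment: str) -> float:
    """Score from 0.0 (clean prose) to 1.0 (mostly noise)."""
    if not segment.strip():
        return 1.0
    chars = [c for c in segment if not c.isspace()]
    odd = sum(1 for c in chars if not (c.isalnum() or c in ALLOWED_PUNCTUATION))
    symbol_ratio = odd / len(chars)
    words = segment.split()
    # Letters fused with digits ("br0wn") are a common sign of OCR damage.
    broken = sum(1 for w in words if re.search(r"[A-Za-z]\d|\d[A-Za-z]", w))
    broken_ratio = broken / len(words)
    return max(symbol_ratio, broken_ratio)


def sanitize(segments: list, rewrite, threshold: float = 0.15) -> tuple:
    """Rewrite only the noisy segments; return (segments, model_calls)."""
    cleaned, calls = [], 0
    for segment in segments:
        if noise_score(segment) > threshold:
            cleaned.append(rewrite(segment))
            calls += 1
        else:
            cleaned.append(segment)  # clean prose: no model call, no cost
    return cleaned, calls
```

The point of the design is that the model is called once per noisy segment and never for clean ones, which is why clean documents pass through this phase at no cost.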
4. Embed — turning each chunk into a vector
Every segment is converted to a vector embedding, a high-dimensional numerical representation of its meaning. Embeddings are what make semantic search possible: your Digital Twin compares the vector of your question against every segment vector to find the closest matches.
Pria optionally enriches each segment with 2–3 AI-generated keywords plus a document-level summary header before embedding. This makes single-segment matches more findable.
The card runs 75% to 95% during embed.
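
As a rough illustration of "closest matches", the sketch below ranks segment vectors against a question vector using cosine similarity. Cosine similarity is a common choice for this kind of search, but the metric Pria uses is not stated here, so treat it as an assumption. The three-dimensional vectors and segment names are toy values; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query_vec: list, segment_vecs: dict, k: int = 3) -> list:
    """Return the k segment ids most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), seg_id) for seg_id, vec in segment_vecs.items()]
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored[:k]]


# Toy data: hypothetical segment ids with made-up vectors.
segments = {
    "intro": [1.0, 0.0, 0.0],
    "pricing": [0.0, 1.0, 0.0],
    "mixed": [0.7, 0.7, 0.0],
}
question = [0.9, 0.1, 0.0]
best = top_matches(question, segments, k=2)  # ["intro", "mixed"]
```

The question vector points mostly along the first axis, so the segment that shares that direction ranks first and the unrelated one drops out of the top two.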
5. Finalize — making the file searchable
The final phase is cheap closing work:
- Vault Health Grade is computed from the final segment set.
- The file’s status flips from inactive to Included, which is the trigger every retrieval path checks.
- Scratch data is cleaned up.
- Token usage is rolled into the billing record.
Reading the progress bar
The bar on each file card shows the active phase and the percentage of work completed within the pipeline. The percentage reflects actual segments processed, not interpolated time, so 50% really means half the segments are through.

| Status chip | What it means |
|---|---|
| Queued | Waiting for a worker to pick it up. Usually a second or two. |
| Extract | Pulling text from the source. |
| Chunk | Splitting text into segments. |
| Sanitize | Cleaning noisy segments. |
| Embed | Generating vector embeddings. |
| Finalize | Wrapping up and computing health. |
| Included (green) | Done. Searchable. Cited in chat. |
| Error (red) | A phase failed and exceeded its retry budget. Click for details. |
The counter updates in batches, not one segment at a time. On short files the bar may pause briefly before jumping ahead. That’s normal — work is still in flight.
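
One way to picture how a phase and a segment count turn into a card percentage is linear interpolation inside the phase's range. The ranges for Extract through Embed come from the phase descriptions earlier on this page. The Finalize range and the interpolation itself are assumptions, and `card_percent` is a hypothetical name.

```python
# Card ranges per phase. Finalize (95-100) is an assumed range.
PHASE_RANGES = {
    "extract": (5, 40),
    "chunk": (40, 50),
    "sanitize": (50, 75),
    "embed": (75, 95),
    "finalize": (95, 100),
}


def card_percent(phase: str, done: int, total: int) -> float:
    """Map segments processed within a phase onto that phase's card range."""
    low, high = PHASE_RANGES[phase]
    fraction = done / total if total else 0.0
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to the phase's range
    return low + (high - low) * fraction
```

With 10 of 40 segments sanitized, this gives 56.25. Because the counter arrives in batches, `done` would jump by a whole batch at a time, which is why the bar can sit still and then leap forward.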
Vault Health Grade
After finalize, every file gets a Vault Health Grade, a quick read on how usable the file is for retrieval. The badge appears on the file card and contributes to your overall Vault Health letter grade (A–F).
What affects the grade
Text quality
Clean, well-structured prose scores highest. OCR-heavy text, garbled characters, and code-dominant content lower the grade.
Chunk coverage
Files that produce a reasonable number of well-sized segments grade better than tiny one-segment files or oversized monolithic blobs.
OCR accuracy
Scanned PDFs with high OCR confidence grade higher than blurry image-based documents.
Metadata richness
Files with extractable titles, authors, and structure (headings, sections) help Pria retrieve more precisely.
Common issues
| Symptom | Likely cause | What to do |
|---|---|---|
| File too large | Single file over 150 MB, or batch over 500 MB | Split the file, or upload in smaller batches. |
| Unsupported format | The file type isn’t in the supported list | Convert to a supported format (PDF, DOCX, MP4, etc.) before upload. |
| OCR failed / empty extract | Scanned PDF with very low image quality, or a corrupt source | Try a higher-resolution scan, or transcribe the document manually. |
| Audio too long / no speech | Transcription couldn’t find usable speech | Check the audio plays back; for music-heavy tracks, consider not indexing. |
| Embed failed mid-batch | Embedding provider hit a rate limit or 5xx | Pria retries automatically. If it keeps failing, Reprocess Content → Re-embed. |
| Stuck in one phase | A worker crashed mid-job | The sweeper re-queues stuck jobs automatically within ~90 seconds. If still stuck, Reprocess Content → Full Re-extract. |
| Status: Error | All retries exhausted | Open the preview to see the failure reason; try Reprocess or Re-ingest from URL if applicable. |
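
If you upload from a script, you can catch the "File too large" case before sending anything. The check below is a sketch built on the 150 MB and 500 MB limits in the table. It assumes 1 MB means 1024 × 1024 bytes, and `preflight` is a hypothetical helper, not part of any Pria client.

```python
MB = 1024 * 1024  # assumed definition of a megabyte
MAX_FILE_BYTES = 150 * MB
MAX_BATCH_BYTES = 500 * MB


def preflight(sizes: dict) -> list:
    """Return a list of problems for a batch of {file_name: size_in_bytes}."""
    problems = [
        f"{name}: over 150 MB, split the file"
        for name, size in sizes.items()
        if size > MAX_FILE_BYTES
    ]
    if sum(sizes.values()) > MAX_BATCH_BYTES:
        problems.append("batch: over 500 MB, upload in smaller batches")
    return problems
```

An empty list means the batch passes both limits. Note that four 140 MB files each pass the per-file check but still fail as a batch.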
Re-processing a file
Open the file’s action menu and pick Reprocess Content. You get four modes:

| Mode | What it does | When to use |
|---|---|---|
| Re-embed | Keeps the segments, regenerates embeddings | After changing the embedding model |
| Full Re-extract | Wipes everything, restarts from Extract | When extraction was wrong (bad PDF, OCR upgrade) |
| Re-ingest from URL | Re-downloads the source and runs the full pipeline | URL imports where the source changed |
| Re-extract Knowledge Graph | Re-runs the graph extractor against existing segments | When entity/relationship counts look off |
Confidential files
When you upload a file with the Confidential flag:
- Only you, the uploader, can see it. Even other admins in your institution can’t read its contents.
- The file is excluded from the shared vault — it doesn’t appear in Instance or Account searches for anyone else.
- Other users searching content see a locked placeholder where the snippet would have been, with no preview text.
- Your Digital Twin still cites it when you are chatting — citations to confidential files work normally for the owner.
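
The visibility rules above reduce to a single ownership check. The sketch below models them with a hypothetical `Hit` record and `render_hit` function. The field names and the shape of the locked placeholder are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    file_name: str
    owner: str
    confidential: bool
    snippet: str


def render_hit(hit: Hit, viewer: str, viewer_is_admin: bool = False) -> dict:
    """Decide what a given viewer sees for one content-search hit."""
    # viewer_is_admin is deliberately ignored: admins get no special
    # access to another user's confidential file.
    if hit.confidential and viewer != hit.owner:
        return {"locked": True, "snippet": None}  # placeholder, no preview text
    return {"locked": False, "file_name": hit.file_name, "snippet": hit.snippet}
```

The owner gets the snippet as usual, while any other viewer, admin or not, gets only the locked placeholder.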
Related
- Managing Files — uploading, organizing, and the file action menu
- IP Vault — vault overview and access tiers
- Searching Inside Your Files — content search across the vault
- Knowledge Modes — how processed files are used at chat time