When you drop a file into your IP Vault, it doesn’t become instantly searchable. Pria runs it through a five-phase pipeline that extracts text, breaks it into searchable segments, cleans it up, embeds it as vectors, and finalizes it for retrieval. Knowing what each phase does — and what each status means — helps you spot trouble early and reprocess only what needs reprocessing.

The journey from upload to searchable

Upload  →  Extract  →  Chunk  →  Sanitize  →  Embed  →  Finalize  →  Included
Each phase commits its work before the next one starts, so after a failure Pria retries only the phase that failed: you don’t lose progress on a 200-page PDF because the last batch of embeddings timed out. You can watch the journey live on the file card in the vault. The progress bar advances through the phases, and the percentage shows real work done (not interpolated time).
Files uploaded with Skip Indexing checked bypass this pipeline entirely — they’re stored for safekeeping and download but never become searchable (and never incur indexing cost). Run Reprocess Content later if you change your mind.
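The commit-then-continue behaviour described above can be pictured with a short sketch. This is only an illustration under stated assumptions, not Pria's actual implementation: the function name `run_pipeline`, the `state` dictionary, and the retry budget of 3 are all hypothetical.

```python
from typing import Callable, Dict, List

PHASES = ["extract", "chunk", "sanitize", "embed", "finalize"]


def run_pipeline(steps: Dict[str, Callable[[dict], None]],
                 state: dict, max_retries: int = 3) -> dict:
    """Run phases in order, skipping any phase already committed in `state`.

    Hypothetical sketch: `state` stands in for whatever durable record the
    real pipeline keeps; max_retries is an assumed retry budget.
    """
    committed: List[str] = state.setdefault("committed", [])
    for phase in PHASES:
        if phase in committed:
            continue  # earlier work survives a later failure
        for attempt in range(1, max_retries + 1):
            try:
                steps[phase](state)
                committed.append(phase)  # commit before the next phase starts
                break
            except Exception as exc:
                if attempt == max_retries:
                    # Retry budget exhausted: record the failure and stop.
                    state["status"] = "Error"
                    state["failed_phase"] = phase
                    state["error"] = str(exc)
                    return state
    state["status"] = "Included"
    return state
```

If the embed step fails twice and then succeeds, only the embed step is called three times; extract, chunk, and sanitize each run once. Calling `run_pipeline` again with the same `state` after an error resumes at the failed phase.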

The 5 phases

Extract

Pria reads the source bytes and turns them into clean text.

| Source type | What Extract does |
| --- | --- |
| PDF / DOCX / PPTX / XLSX | Parses text directly. Image-based PDFs fall back to OCR. |
| Images (JPG, PNG, etc.) | Vision model description + OCR for any embedded text. |
| Audio / video | Transcribes speech to text. Files over 25 MB are auto-split, transcribed in parallel, and re-stitched. |
| Email (.eml, .msg) | Headers + body, including inline attachments. |
| HTML / web pages | Strips navigation, ads, and boilerplate; keeps article text. |
| Jupyter notebooks | Flattens markdown and code cells. |
| ZIP / IMSCC archives | Each member becomes its own file and re-enters the pipeline. |
The card sits between 5% and 40% during extract. Long videos and large PDFs spend most of their time here.
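To make the split, transcribe in parallel, and re-stitch idea for large audio and video concrete, here is a minimal sketch. It is a simplification under assumptions: `transcribe_large` and `split_bytes` are hypothetical names, `transcribe` is any function you supply, and a real splitter would cut media at frame or silence boundaries rather than at raw byte offsets.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# 25 MB threshold from the table above (decimal megabytes assumed).
MAX_PART_BYTES = 25 * 1000 * 1000


def split_bytes(data: bytes, part_size: int = MAX_PART_BYTES) -> List[bytes]:
    """Cut data into consecutive parts of at most part_size bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]


def transcribe_large(data: bytes,
                     transcribe: Callable[[bytes], str],
                     part_size: int = MAX_PART_BYTES) -> str:
    """Transcribe each part concurrently, then join the texts in source order."""
    parts = split_bytes(data, part_size)
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, so the stitched
        # transcript keeps the original sequence even if parts finish out of order.
        texts = list(pool.map(transcribe, parts))
    return " ".join(t for t in texts if t)
```

The ordering guarantee is the point of the sketch: parallel work can finish in any order, but the re-stitched transcript must read in the order of the recording.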
Chunk

The extracted text is split into segments of about 8,000 characters each (roughly 1,500–2,000 words). Segments overlap by ~5% so a sentence that lands on a boundary still appears whole in one of the two neighbours.
Segment boundaries snap to the nearest paragraph or sentence end, so chunks never tear mid-thought.
The card runs 40% to 50% during chunking. Even huge files finish this phase in seconds.
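A rough sketch of boundary-snapping chunking with overlap follows. It illustrates the idea only and is not Pria's chunker: the function name `chunk_text`, the regular expression used to find sentence and paragraph ends, and the search window around the target length are all assumptions.

```python
import re
from bisect import bisect_left, bisect_right
from typing import List


def chunk_text(text: str, size: int = 8000, overlap: float = 0.05) -> List[str]:
    """Split text into ~size-character segments that end on a sentence or
    paragraph break, with each segment re-using roughly the last
    `overlap` share of the previous one."""
    ov = int(size * overlap)
    # Candidate cut points: just after a paragraph break or sentence end.
    breaks = [m.end() for m in re.finditer(r"\n\s*\n|[.!?]\s+", text)]
    segments: List[str] = []
    start = 0
    while start < len(text):
        target = start + size
        if target >= len(text):
            segments.append(text[start:])  # the remainder fits in one segment
            break
        # Snap to the break nearest the target length (assumed search window);
        # fall back to a hard cut if the window contains no break at all.
        window = breaks[bisect_left(breaks, start + size // 2):
                        bisect_right(breaks, target + size // 4)]
        end = min(window, key=lambda b: abs(b - target)) if window else target
        segments.append(text[start:end])
        if end >= len(text):
            break
        # Begin the next segment on a break about `ov` characters back,
        # so the overlap consists of whole sentences.
        back = breaks[bisect_left(breaks, end - 2 * ov):
                      bisect_right(breaks, end - ov)]
        start = max(back[-1] if back else end, start + 1)
    return segments
```

With the defaults, the overlap is about 400 characters. When no break falls in the overlap range, this sketch simply starts the next segment where the previous one ended.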
Sanitize

Each segment is screened by a quick heuristic. Clean prose passes through untouched: no LLM call, no cost. Segments that look noisy (OCR garbage, broken formatting, navigation residue) are rewritten by your institution’s summary model into cleaner prose.
Sanitize is enabled by default, but your administrator can turn it off per institution.
The card runs 50% to 75% during sanitize. The bar advances in real time as each batch comes back from the model.
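The screen-first, rewrite-only-if-noisy pattern might look like the sketch below. The heuristic shown (share of unusual symbols) and the 0.15 threshold are invented for illustration; the article does not say which signals Pria actually checks. `rewrite` stands in for the call to the summary model.

```python
from typing import Callable, List, Tuple

# Characters treated as normal prose punctuation (an assumed list).
_PROSE_PUNCT = set(".,;:!?'\"()-")


def noise_score(segment: str) -> float:
    """Share of characters that are not letters, digits, whitespace,
    or ordinary punctuation. Hypothetical heuristic."""
    if not segment.strip():
        return 1.0
    odd = sum(1 for ch in segment
              if not (ch.isalnum() or ch.isspace() or ch in _PROSE_PUNCT))
    return odd / len(segment)


def sanitize(segments: List[str],
             rewrite: Callable[[str], str],
             threshold: float = 0.15) -> Tuple[List[str], int]:
    """Pass clean segments through untouched; rewrite only the noisy ones.
    Returns the new segment list and how many model calls were made."""
    out: List[str] = []
    model_calls = 0
    for seg in segments:
        if noise_score(seg) > threshold:
            out.append(rewrite(seg))  # only noisy segments cost a model call
            model_calls += 1
        else:
            out.append(seg)
    return out, model_calls
```

The design choice worth noticing is that the cheap check runs on every segment while the expensive rewrite runs only on the few that fail it.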
Embed

Every segment is converted to a vector embedding, a high-dimensional numerical representation of its meaning. Embeddings are what make semantic search possible: your Digital Twin compares the vector of your question against every segment vector to find the closest matches.
Pria optionally enriches each segment with 2–3 AI-generated keywords plus a document-level summary header before embedding. This makes single-segment matches more findable.
The card runs 75% to 95% during embed.
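To show what "closest matches" can mean in practice, here is a small sketch that ranks segment vectors against a question vector. Cosine similarity is an assumption (the article does not name the distance measure), and the three-dimensional vectors are toy values; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_matches(query_vec: Sequence[float],
                segment_vecs: Dict[str, Sequence[float]],
                k: int = 3) -> List[Tuple[str, float]]:
    """Score every segment against the query and return the k best."""
    scored = [(seg_id, cosine(query_vec, vec))
              for seg_id, vec in segment_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy example with made-up segment ids and vectors.
segments = {
    "seg-a": [1.0, 0.0, 0.0],
    "seg-b": [0.9, 0.1, 0.0],
    "seg-c": [0.0, 0.0, 1.0],
}
best = top_matches([1.0, 0.0, 0.0], segments, k=2)
```

Here `best` lists `seg-a` first and `seg-b` second, because their vectors point in nearly the same direction as the query.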
Finalize

The final phase is cheap closing work:
  1. Vault Health Grade is computed from the final segment set.
  2. The file’s status flips from inactive to Included, which is the trigger every retrieval path checks.
  3. Scratch data is cleaned up.
  4. Token usage is rolled into the billing record.
Until the status flips to Included, retrieval will skip the file entirely. The card jumps from 95% to 100% during finalize.
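The status gate can be sketched in a few lines. The dictionary shape and the function name `searchable_files` are hypothetical; the only detail taken from the text above is that retrieval considers a file only once its status is Included.

```python
from typing import Iterable, List


def searchable_files(files: Iterable[dict]) -> List[str]:
    """Return the names of files that retrieval is allowed to consider."""
    return [f["name"] for f in files if f.get("status") == "Included"]


# Made-up vault contents for illustration.
vault = [
    {"name": "syllabus.pdf", "status": "Included"},
    {"name": "lecture-3.mp4", "status": "inactive"},  # still in the pipeline
    {"name": "notes.docx", "status": "Error"},
]
visible = searchable_files(vault)
```

In this example `visible` contains only `syllabus.pdf`; the other two files are skipped until they finish the pipeline.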

Reading the progress bar

The bar on each file card shows the active phase and the percentage of work completed within the pipeline. Inside a phase, the bar moves with actual segments processed, not interpolated time: a bar halfway through the Sanitize band (about 62%) means roughly half the segments have been screened.
| Status chip | What it means |
| --- | --- |
| Queued | Waiting for a worker to pick it up. Usually a second or two. |
| Extract | Pulling text from the source. |
| Chunk | Splitting text into segments. |
| Sanitize | Cleaning noisy segments. |
| Embed | Generating vector embeddings. |
| Finalize | Wrapping up and computing health. |
| Included (green) | Done. Searchable. Cited in chat. |
| Error (red) | A phase failed and exceeded its retry budget. Click for details. |
The counter updates in batches, not one segment at a time. On short files the bar may pause briefly before jumping ahead. That’s normal — work is still in flight.
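One way to read the percentage is as a position inside the active phase's band. The sketch below uses the band limits quoted in the phase descriptions (5–40, 40–50, 50–75, 75–95, 95–100); the linear mapping inside each band and the name `card_percent` are assumptions, not a documented formula.

```python
from typing import Dict, Tuple

# Band limits as stated in the phase descriptions.
PHASE_BANDS: Dict[str, Tuple[int, int]] = {
    "extract": (5, 40),
    "chunk": (40, 50),
    "sanitize": (50, 75),
    "embed": (75, 95),
    "finalize": (95, 100),
}


def card_percent(phase: str, done: int, total: int) -> float:
    """Map units of work finished in the active phase onto the card's
    overall percentage (assumed to be linear within the band)."""
    low, high = PHASE_BANDS[phase]
    fraction = done / total if total else 0.0
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to the band
    return low + (high - low) * fraction
```

For example, `card_percent("sanitize", 5, 10)` gives 62.5: half the segments screened puts the bar halfway through the 50–75 band.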

Vault Health Grade

After finalize, every file gets a Vault Health Grade — a quick read on how usable the file is for retrieval. The badge appears on the file card and contributes to your overall Vault Health letter grade (A–F).

What affects the grade

Text quality

Clean, well-structured prose scores highest. OCR-heavy text, garbled characters, and code-dominant content lower the grade.

Chunk coverage

Files that produce a reasonable number of well-sized segments grade better than tiny one-segment files or oversized monolithic blobs.

OCR accuracy

Scanned PDFs with high OCR confidence grade higher than blurry image-based documents.

Metadata richness

Files with extractable titles, authors, and structure (headings, sections) help Pria retrieve more precisely.
If a file scores poorly, open it in the File Preview to see what’s flagged, then run Sanitize All on its segments or re-upload a cleaner source.

Common issues

| Symptom | Likely cause | What to do |
| --- | --- | --- |
| File too large | Single file over 150 MB, or batch over 500 MB | Split the file, or upload in smaller batches. |
| Unsupported format | The file type isn’t in the supported list | Convert to a supported format (PDF, DOCX, MP4, etc.) before upload. |
| OCR failed / empty extract | Scanned PDF with very low image quality, or a corrupt source | Try a higher-resolution scan, or transcribe the document manually. |
| Audio too long / no speech | Transcription couldn’t find usable speech | Check the audio plays back; for music-heavy tracks, consider not indexing. |
| Embed failed mid-batch | Embedding provider hit a rate limit or 5xx | Pria retries automatically. If it keeps failing, Reprocess Content → Re-embed. |
| Stuck in one phase | A worker crashed mid-job | The sweeper re-queues stuck jobs automatically within ~90 seconds. If still stuck, Reprocess Content → Full Re-extract. |
| Status: Error | All retries exhausted | Open the preview to see the failure reason; try Reprocess or Re-ingest from URL if applicable. |
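The sweeper behaviour in the table can be pictured as a periodic scan for jobs that have gone quiet. This is a sketch under assumptions: the heartbeat field, the job dictionary shape, and the name `sweep` are hypothetical, and the article only states that stuck jobs are re-queued within about 90 seconds, not how they are detected.

```python
import time
from typing import List, Optional

STUCK_AFTER_SECONDS = 90  # from "within ~90 seconds" in the table above


def sweep(jobs: List[dict], now: Optional[float] = None,
          stuck_after: float = STUCK_AFTER_SECONDS) -> List[str]:
    """Re-queue running jobs whose last heartbeat is older than stuck_after.

    Returns the ids of the jobs that were re-queued.
    """
    now = time.time() if now is None else now
    requeued: List[str] = []
    for job in jobs:
        silent_for = now - job["last_heartbeat"]
        if job["status"] == "running" and silent_for > stuck_after:
            job["status"] = "queued"  # another worker can pick it up again
            requeued.append(job["id"])
    return requeued
```

Because each phase commits its work, a re-queued job would resume at the phase that was interrupted rather than starting over.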

Re-processing a file

Open the file’s action menu and pick Reprocess Content. You get four modes:
| Mode | What it does | When to use |
| --- | --- | --- |
| Re-embed | Keeps the segments, regenerates embeddings | After changing the embedding model |
| Full Re-extract | Wipes everything, restarts from Extract | When extraction was wrong (bad PDF, OCR upgrade) |
| Re-ingest from URL | Re-downloads the source and runs the full pipeline | URL imports where the source changed |
| Re-extract Knowledge Graph | Re-runs the graph extractor against existing segments | When entity/relationship counts look off |
Re-ingest preserves the file’s identifiers, so any links you shared keep resolving.
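As a way to compare the four modes, the sketch below maps each one to the work it would trigger. The mode keys, the `download` and `graph_extract` step names, and the assumption that Re-embed is followed by Finalize are illustrative guesses; only the general scope of each mode comes from the table above.

```python
from typing import Dict, List

PHASES = ["extract", "chunk", "sanitize", "embed", "finalize"]

# Hypothetical mapping from reprocess mode to the steps it re-runs.
REPROCESS_PLAN: Dict[str, List[str]] = {
    "re_embed": ["embed", "finalize"],             # segments are kept
    "full_re_extract": list(PHASES),               # restart from Extract
    "re_ingest_from_url": ["download"] + PHASES,   # fetch the source again first
    "re_extract_knowledge_graph": ["graph_extract"],  # existing segments reused
}


def steps_for(mode: str) -> List[str]:
    """Return the steps a reprocess mode would run, or raise for unknown modes."""
    try:
        return REPROCESS_PLAN[mode]
    except KeyError:
        raise ValueError(f"unknown reprocess mode: {mode!r}") from None
```

Read this way, Re-embed is the cheapest choice when the text is already fine, and the two full-pipeline modes are the ones that repeat extraction.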

Confidential files

Confidential is a per-file flag for shared-vault uploads (Instance or Account). When set:
  • Only you, the uploader, can open the file. Other members (including admins) can’t preview or download it; where its details would appear they see a “🔒 CONFIDENTIAL” marker instead.
  • Processing is unchanged. Confidential files go through the same pipeline and stay indexed — their content can still inform the Digital Twin’s answers for other members, but those members only ever see a short, redacted snippet in the retrieved-passages panel, never the document itself.
  • For you, everything works normally — full citations, full preview, full retrieval in your own conversations.
You can toggle Confidential before upload (in the upload dialogue) or any time after (file action menu → Make Confidential / Make Public). See What Confidential really means for the full picture.
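The visibility rules above can be summarised in a short sketch. Field names, the function `view_for`, and the use of simple truncation to stand in for the redacted snippet are all assumptions made for illustration.

```python
def view_for(member_id: str, file: dict, snippet_len: int = 120) -> dict:
    """Decide what a vault member sees for a retrieved passage.

    Hypothetical sketch: the uploader (or anyone, for non-confidential
    files) gets the full passage; other members get a marker and a short
    snippet, and cannot open the file.
    """
    is_uploader = member_id == file["uploader_id"]
    if not file.get("confidential") or is_uploader:
        return {"can_open": True,
                "label": file["name"],
                "passage": file["passage"]}
    snippet = file["passage"][:snippet_len].rstrip() + "…"
    return {"can_open": False,
            "label": "🔒 CONFIDENTIAL",
            "passage": snippet}
```

Note that the confidential branch still returns passage text: the file stays indexed and can inform answers, but other members never get the document itself.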