File Processing Status

When you drop a file into your IP Vault, it doesn’t become instantly searchable. Pria runs it through a five-phase pipeline that extracts text, breaks it into searchable segments, cleans it up, embeds it as vectors, and finalizes it for retrieval. Knowing what each phase does — and what each status means — helps you spot trouble early and reprocess only what needs reprocessing.

The journey from upload to searchable

Upload  →  Extract  →  Chunk  →  Sanitize  →  Embed  →  Finalize  →  Included

Each phase commits its work before the next one starts, so after a failure Pria retries only the phase that failed. You don’t lose progress on a 200-page PDF because the last batch of embeddings timed out. You can watch the journey live on the file card in the vault. The progress bar advances through the phases, and the percentage shows real work done (not interpolated time).
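To make the commit-per-phase idea concrete, here is a minimal sketch in Python. It is not Pria’s implementation: the names `run_pipeline`, `handlers`, and `committed`, and the retry budget of 3, are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List

PHASES = ["extract", "chunk", "sanitize", "embed", "finalize"]


def run_pipeline(handlers: Dict[str, Callable[[], None]],
                 committed: List[str],
                 max_attempts: int = 3) -> str:
    """Run the phases in order, skipping any phase already committed.

    `committed` stands in for durable storage (hypothetical): a phase name
    is appended only after its handler returns without raising.
    """
    for phase in PHASES:
        if phase in committed:
            continue  # work from an earlier run is kept, not redone
        for attempt in range(1, max_attempts + 1):
            try:
                handlers[phase]()
                committed.append(phase)  # commit before moving on
                break
            except Exception:
                if attempt == max_attempts:
                    return "error"  # retry budget exhausted
    return "included"
```

If `embed` exhausts its retries, a later call with the same `committed` list resumes at `embed`; `extract`, `chunk`, and `sanitize` are not run again.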

Files uploaded with Skip Indexing checked bypass this pipeline entirely — they’re stored for safekeeping and download but never become searchable (and never incur indexing cost). Run Reprocess Content later if you change your mind.

The 5 phases

1. Extract — pulling text out of the source

Pria reads the source bytes and turns them into clean text.

Source type	What Extract does
PDF / DOCX / PPTX / XLSX	Parses text directly. Image-based PDFs fall back to OCR.
Images (JPG, PNG, etc.)	Vision model description + OCR for any embedded text.
Audio / video	Transcribes speech to text. Files over 25 MB are auto-split, transcribed in parallel, and re-stitched.
Email (`.eml`, `.msg`)	Headers + body, including inline attachments.
HTML / web pages	Strips navigation, ads, and boilerplate; keeps article text.
Jupyter notebooks	Flattens markdown and code cells.
ZIP / IMSCC archives	Each member becomes its own file and re-enters the pipeline.

The card sits between 5% and 40% during extract. Long videos and large PDFs spend most of their time here.
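The split, transcribe, and re-stitch flow for large audio can be sketched roughly as follows. This is an illustration under stated assumptions, not Pria’s code: `transcribe` is a hypothetical caller-supplied function, the worker count is arbitrary, and slicing raw bytes is only a stand-in (a real splitter would cut on time boundaries so each part stays a valid media file).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

LIMIT = 25 * 1024 * 1024  # 25 MB, taken from the table above


def split_parts(data: bytes, limit: int = LIMIT) -> List[bytes]:
    """Cut the payload into consecutive parts of at most `limit` bytes."""
    return [data[i:i + limit] for i in range(0, len(data), limit)]


def transcribe_large(data: bytes,
                     transcribe: Callable[[bytes], str],
                     limit: int = LIMIT) -> str:
    """Transcribe each part in parallel, then join the texts in order."""
    parts = split_parts(data, limit)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Executor.map yields results in input order, so the
        # transcript is stitched back in the original sequence.
        texts = list(pool.map(transcribe, parts))
    return " ".join(t for t in texts if t)
```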

2. Chunk — splitting into searchable segments

The extracted text is split into segments of about 8,000 characters each (roughly 1,500–2,000 words). Segments overlap by ~5% so a sentence that lands on a boundary still appears whole in one of the two neighbours. Segment boundaries snap to the nearest paragraph or sentence end, so chunks never tear mid-thought.

The card runs 40% to 50% during chunking. Even huge files finish this phase in seconds.
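A rough sketch of overlap-plus-snapping in Python, for readers who want to see the mechanics. The 8,000-character target and 5% overlap come from the description above; everything else is an assumption, including the function name `chunk_text`, the crude sentence-end regex, and the choice to snap backward to the last sentence end inside the window rather than to the nearest one.

```python
import re
from typing import List


def chunk_text(text: str, target: int = 8000,
               overlap_ratio: float = 0.05) -> List[str]:
    """Split text into ~target-character segments with a small overlap."""
    overlap = int(target * overlap_ratio)
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + target, len(text))
        if end < len(text):
            window = text[start:end]
            # Candidate cut points: sentence ends or blank-line paragraph ends.
            ends = [m.end() for m in re.finditer(r"[.!?](?=\s)|\n\n", window)]
            # Ignore cut points so early that the chunk would not advance.
            usable = [e for e in ends if e > overlap]
            if usable:
                end = start + usable[-1]
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break
        start = end - overlap  # step back so neighbours share a little text
    return chunks
```

When no sentence end falls inside a window, this sketch falls back to a hard cut at the target length.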

3. Sanitize — cleaning OCR noise and enriching context

Each segment is screened by a quick heuristic. Clean prose passes through untouched, with no LLM call and no cost. Segments that look noisy (OCR garbage, broken formatting, navigation residue) are rewritten by your institution’s summary model into cleaner prose.

Sanitize is enabled by default, but your administrator can turn it off per institution.

The card runs 50% to 75% during sanitize. The bar advances in real time as each batch comes back from the model.
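The screen-then-rewrite pattern might look something like the sketch below. The actual heuristic is not documented here, so the symbol-ratio test, its 0.15 threshold, and the names `looks_noisy` and `sanitize` are invented stand-ins; `rewrite` is a placeholder for the call to the summary model.

```python
from typing import Callable, List, Tuple

COMMON_PUNCTUATION = set(".,;:!?'\"()-")


def looks_noisy(segment: str, threshold: float = 0.15) -> bool:
    """Flag a segment when too many visible characters are odd symbols."""
    visible = [c for c in segment if not c.isspace()]
    if not visible:
        return False
    odd = sum(1 for c in visible
              if not (c.isalnum() or c in COMMON_PUNCTUATION))
    return odd / len(visible) > threshold


def sanitize(segments: List[str],
             rewrite: Callable[[str], str]) -> Tuple[List[str], int]:
    """Rewrite only the noisy segments; return the results and call count."""
    cleaned: List[str] = []
    model_calls = 0
    for seg in segments:
        if looks_noisy(seg):
            cleaned.append(rewrite(seg))  # the only step that costs a model call
            model_calls += 1
        else:
            cleaned.append(seg)  # clean prose passes through untouched
    return cleaned, model_calls
```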

4. Embed — turning each chunk into a vector

Every segment is converted to a vector embedding, a high-dimensional numerical representation of its meaning. Embeddings are what make semantic search possible: your Digital Twin compares the vector of your question against every segment vector to find the closest matches.

Pria optionally enriches each segment with 2–3 AI-generated keywords plus a document-level summary header before embedding. This makes single-segment matches more findable.

The card runs 75% to 95% during embed.
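As an illustration of "closest match", the sketch below ranks segment vectors against a question vector using cosine similarity. Cosine similarity is a common choice for this kind of comparison, but which metric Pria uses is not stated here, so treat it as an assumption; the function names and the toy two-dimensional vectors are also hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, -1.0 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query: Sequence[float],
                segments: Dict[str, Sequence[float]],
                k: int = 3) -> List[str]:
    """Return the ids of the k segments most similar to the query."""
    ranked = sorted(segments.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [segment_id for segment_id, _ in ranked[:k]]
```

Real embeddings have hundreds or thousands of dimensions, and production systems typically use a vector index rather than scanning every segment, but the comparison is the same idea.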

5. Finalize — making the file searchable

The final phase is cheap closing work:

Vault Health Grade is computed from the final segment set.
The file’s status flips from inactive to Included, which is the trigger every retrieval path checks.
Scratch data is cleaned up.
Token usage is rolled into the billing record.

Until the status flips to Included, retrieval will skip the file entirely. The card jumps from 95% to 100% during finalize.

Reading the progress bar

The bar on each file card shows the active phase and the percentage of work completed within the pipeline. The percentage reflects actual segments processed, not interpolated time — so 50% really means half the segments are through.

Status chip	What it means
Queued	Waiting for a worker to pick it up. Usually a second or two.
Extract	Pulling text from the source.
Chunk	Splitting text into segments.
Sanitize	Cleaning noisy segments.
Embed	Generating vector embeddings.
Finalize	Wrapping up and computing health.
Included (green)	Done. Searchable. Cited in chat.
Error (red)	A phase failed and exceeded its retry budget. Click for details.

The counter updates in batches, not one segment at a time. On short files the bar may pause briefly before jumping ahead. That’s normal — work is still in flight.
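One way to picture how per-phase progress could map onto a single bar is sketched below. The percentage bands are the ones quoted for each phase in this article; the linear mapping inside each band and the name `card_percent` are assumptions, not a description of Pria’s code.

```python
# Band of the overall bar covered by each phase (from the phase descriptions).
BANDS = {
    "extract": (5, 40),
    "chunk": (40, 50),
    "sanitize": (50, 75),
    "embed": (75, 95),
    "finalize": (95, 100),
}


def card_percent(phase: str, done: int, total: int) -> int:
    """Map segments done within a phase onto that phase's band of the bar."""
    low, high = BANDS[phase]
    fraction = done / total if total else 0.0
    fraction = max(0.0, min(1.0, fraction))  # clamp to the band
    return round(low + (high - low) * fraction)
```

Because `done` only changes when a batch completes, a bar driven this way would hold still between batches and then jump, which matches the behaviour described above.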

Vault Health Grade

After finalize, every file gets a Vault Health Grade — a quick read on how usable the file is for retrieval. The badge appears on the file card and contributes to your overall Vault Health letter grade (A–F).

What affects the grade

Text quality

Clean, well-structured prose scores highest. OCR-heavy text, garbled characters, and code-dominant content lower the grade.

Chunk coverage

Files that produce a reasonable number of well-sized segments grade better than tiny one-segment files or oversized monolithic blobs.

OCR accuracy

Scanned PDFs with high OCR confidence grade higher than blurry image-based documents.

Metadata richness

Files with extractable titles, authors, and structure (headings, sections) help Pria retrieve more precisely.

If a file scores poorly, open it in the File Preview to see what’s flagged, then run Sanitize All on its segments or re-upload a cleaner source.

Common issues

Symptom	Likely cause	What to do
File too large	Single file over 150 MB, or batch over 500 MB	Split the file, or upload in smaller batches.
Unsupported format	The file type isn’t in the supported list	Convert to a supported format (PDF, DOCX, MP4, etc.) before upload.
OCR failed / empty extract	Scanned PDF with very low image quality, or a corrupt source	Try a higher-resolution scan, or transcribe the document manually.
Audio too long / no speech	Transcription couldn’t find usable speech	Check the audio plays back; for music-heavy tracks, consider not indexing.
Embed failed mid-batch	Embedding provider hit a rate limit or 5xx	Pria retries automatically. If it keeps failing, Reprocess Content → Re-embed.
Stuck in one phase	A worker crashed mid-job	The sweeper re-queues stuck jobs automatically within ~90 seconds. If still stuck, Reprocess Content → Full Re-extract.
Status: Error	All retries exhausted	Open the preview to see the failure reason; try Reprocess or Re-ingest from URL if applicable.
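If you script your uploads, a pre-flight size check can catch the first row of the table before anything is sent. The sketch below uses the 150 MB per-file and 500 MB per-batch limits listed above; the function name `check_batch` is hypothetical, and treating 1 MB as 1,048,576 bytes is an assumption (the limits may be defined in decimal megabytes).

```python
from typing import Dict, List

MB = 1024 * 1024  # assumption: binary megabytes
FILE_LIMIT = 150 * MB
BATCH_LIMIT = 500 * MB


def check_batch(sizes: Dict[str, int]) -> List[str]:
    """Return a list of problems for a batch given {filename: size in bytes}."""
    problems = [
        f"{name}: over the 150 MB per-file limit"
        for name, size in sizes.items()
        if size > FILE_LIMIT
    ]
    if sum(sizes.values()) > BATCH_LIMIT:
        problems.append("batch: over the 500 MB batch limit")
    return problems
```

An empty list means the batch passes both checks.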

Re-processing a file

Open the file’s action menu and pick Reprocess Content. You get four modes:

Mode	What it does	When to use
Re-embed	Keeps the segments, regenerates embeddings	After changing the embedding model
Full Re-extract	Wipes everything, restarts from Extract	When extraction was wrong (bad PDF, OCR upgrade)
Re-ingest from URL	Re-downloads the source and runs the full pipeline	URL imports where the source changed
Re-extract Knowledge Graph	Re-runs the graph extractor against existing segments	When entity/relationship counts look off

Re-ingest preserves the file’s identifiers, so any links you shared keep resolving.

Confidential files

Confidential is a per-file flag for shared-vault uploads (Instance or Account). When set:

Only you, the uploader, can open the file. Other members, including admins, can’t preview or download it; where its details would appear they see a “🔒 CONFIDENTIAL” marker instead.
Processing is unchanged. Confidential files go through the same pipeline and stay indexed — their content can still inform the Digital Twin’s answers for other members, but those members only ever see a short, redacted snippet in the retrieved-passages panel, never the document itself.
For you, everything works normally — full citations, full preview, full retrieval in your own conversations.
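The three rules above can be summarised in a small sketch. It is illustrative only: the `VaultFile` type, the function names, the 80-character snippet length, and the use of simple truncation to stand in for redaction are all assumptions, as is the premise that non-confidential files in a shared vault are open to every member.

```python
from dataclasses import dataclass

MARKER = "🔒 CONFIDENTIAL"


@dataclass
class VaultFile:
    name: str
    uploader: str
    confidential: bool = False


def can_open(file: VaultFile, user: str) -> bool:
    """Only the uploader may preview or download a confidential file."""
    return (not file.confidential) or user == file.uploader


def card_label(file: VaultFile, user: str) -> str:
    """Show the file's details, or the marker if the user may not open it."""
    return file.name if can_open(file, user) else MARKER


def passage_for(file: VaultFile, user: str, passage: str,
                snippet_chars: int = 80) -> str:
    """Return the full retrieved passage, or a short stand-in snippet."""
    if can_open(file, user):
        return passage
    return passage[:snippet_chars].rstrip() + " [redacted]"
```

Note that retrieval still runs for everyone in this sketch; only what is displayed differs by user.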

You can toggle Confidential before upload (in the upload dialogue) or any time after (file action menu → Make Confidential / Make Public). See What Confidential really means for the full picture.

Related

Managing Files — uploading, organizing, and the file action menu
IP Vault — vault overview and access tiers
Searching Inside Your Files — content search across the vault
Knowledge Modes — how processed files are used at chat time
