What is the IP Vault?

The IP Vault is your personal file storage. When you upload documents here, your digital twin can read and reference them when answering your questions. Simple workflow:
  1. Upload your files (PDFs, documents, spreadsheets, etc.)
  2. Ask questions about your content
  3. Get answers that reference your uploaded materials
This is useful for uploading course materials, research papers, company policies, product documentation, or any content you want your digital twin to know about.
File Security: All files in the IP Vault are protected by a three-tier access control system — Public, Private (default), and Confidential. See File Access Control for details.

When Does My Digital Twin Use My Files?

Your digital twin automatically searches your uploaded files when you:
  • Ask questions related to content you’ve uploaded
  • Select specific files to include in a conversation
  • Reference topics covered in your documents

Relevance Scores

When your digital twin finds matching content in your files, it shows a relevance score indicating how semantically close the match is to your question. Scores are computed by MongoDB Atlas Vector Search using Euclidean distance between vector embeddings, normalized to a 0-100% scale using the formula score = 1 / (1 + distance).
| Score | Euclidean Distance | Meaning |
| --- | --- | --- |
| 90–100% | 0.00 – 0.11 | Excellent match — vectors are very close; the content directly addresses your question |
| 70–89% | 0.12 – 0.43 | Good match — semantically related content that should help answer your question |
| 50–69% | 0.44 – 1.00 | Partial match — loosely related background information |
| Below 50% | > 1.00 | Weak match — distant vectors with limited semantic overlap |
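The normalization in the table above can be sketched in a few lines; this is a minimal illustration of the score formula, not the production implementation:

```python
def relevance_score(distance: float) -> float:
    """Normalize a Euclidean distance to a 0-100% relevance score
    using score = 1 / (1 + distance)."""
    return 100.0 / (1.0 + distance)

# A distance of 0 is a perfect match; larger distances decay toward 0%.
print(relevance_score(0.0))  # 100.0
print(relevance_score(1.0))  # 50.0 (the lower boundary of the "Partial match" band)
```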
By default, results below 45% relevance are filtered out. Administrators can adjust this minimum relevance threshold to control the trade-off between recall (more results) and precision (more relevant results).
For administrators: You can configure the minimum relevance threshold in your institution settings. You can also disable file uploads for users while retaining upload capabilities yourself, to control what content enters the knowledge base.

Your digital twin uses a technology called RAG (Retrieval Augmented Generation) to search your files. Here’s what happens behind the scenes:

1. File Processing

When you upload a file, the system breaks it into smaller sections (segments) and converts each section into a numerical representation called a vector embedding — a high-dimensional array of numbers that captures the semantic meaning of the text. Embeddings are generated by a dedicated AI model (e.g., OpenAI text-embedding-3-small, Amazon titan-embed-text-v2, or Google gemini-embedding-001).
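As a rough sketch of this step (the real pipeline's segmentation rules and embedding calls are internal; the helper below is hypothetical), a file's text is cut into bounded segments before each one is embedded:

```python
def segment_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into fixed-size segments (a naive strategy; production
    chunkers usually also respect sentence or paragraph boundaries)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Each segment would then be sent to the configured embedding model,
# e.g. embeddings = [embed_model.embed(seg) for seg in segment_text(doc)]
```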

2. Question Analysis

Your question is converted into a vector embedding using the same model, allowing the system to mathematically compare your question against all your file sections in vector space.

3. Finding Matches

The system uses MongoDB Atlas Vector Search with Exact Nearest Neighbor (ENN) to find file sections whose embeddings are closest to your question’s embedding, measured by Euclidean distance. Results are ranked by relevance score, computed as 1 / (1 + distance) — where shorter distance means higher relevance.
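In miniature, exact nearest-neighbor ranking over embeddings looks like this (a self-contained sketch using toy 2-D vectors; Atlas performs the same comparison over high-dimensional embeddings at scale):

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_segments(query_vec, segment_vecs):
    """Score every segment embedding against the query embedding and
    sort by relevance = 1 / (1 + distance), highest first."""
    scored = [(seg_id, 1 / (1 + euclidean(query_vec, vec)))
              for seg_id, vec in segment_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy example: the segment closest to the query ranks first.
ranked = rank_segments([0.0, 0.0], {"intro": [0.0, 0.0], "appendix": [3.0, 4.0]})
print(ranked[0][0])  # intro
```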

4. Relevance Threshold

Results below a minimum relevance threshold are filtered out (default: 45%). Administrators can adjust how strict the matching is:
  • Higher threshold: Only highly relevant content is included — better precision
  • Lower threshold: More content is included — better recall, but may be less directly relevant
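The threshold step reduces to a simple filter; here is a sketch assuming scores are expressed as fractions (0.45 = 45%) and segment identifiers are illustrative:

```python
def filter_by_threshold(results, min_relevance=0.45):
    """Keep only matches at or above the minimum relevance threshold."""
    return [(seg, score) for seg, score in results if score >= min_relevance]

matches = [("policy.pdf#3", 0.91), ("notes.txt#7", 0.52), ("old-memo.docx#1", 0.30)]
print(filter_by_threshold(matches))
# [('policy.pdf#3', 0.91), ('notes.txt#7', 0.52)]
```

Raising `min_relevance` tightens the filter (precision); lowering it admits more loosely related segments (recall).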

Segment Size and Search Accuracy

When a file is processed, it is split into segments (also called “chunks”) before being converted into vector embeddings. The size of these segments has a direct impact on how accurately your digital twin retrieves relevant content.

Research findings from studies by Chroma Research and multi-dataset analyses consistently show an inverse relationship between segment size and retrieval precision:
| Segment Size | Precision | Recall | Trade-off |
| --- | --- | --- | --- |
| Small (~200 tokens) | Highest | Good | Best precision, but may fragment context |
| Medium (~400 tokens) | Moderate | Highest | Balanced precision and recall |
| Large (~800 tokens) | Lowest | Good | More context per segment, but more noise in results |
Key insights from the research:
  • Smaller segments match queries more precisely because the vector embedding represents a focused piece of content rather than a broad summary of many topics.
  • Larger segments preserve more surrounding context but introduce noise — the embedding averages over more content, diluting the signal for any single topic.
  • Industry defaults (e.g., 800 tokens with 400-token overlap) were found to score below average across all retrieval metrics.
  • The optimal size depends on query type: fact-based questions benefit from smaller segments (256-512 tokens), while analytical questions requiring broader context may benefit from larger ones (1024+ tokens).
At Praxis, we have determined that segment length should not exceed 8,000 characters, limiting each segment to approximately 2,000 tokens maximum. This cap accommodates the input limits of all supported embedding models — including OpenAI, Amazon Titan, and Google Gemini — while keeping segments small enough to preserve semantic search accuracy. This balance ensures that retrieval precision remains high without fragmenting content so aggressively that context is lost.
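The 8,000-character cap maps to roughly 2,000 tokens under the common rule of thumb of about 4 characters per English token (an approximation; exact counts depend on each model’s tokenizer):

```python
CHARS_PER_TOKEN = 4  # rough average for English text; varies by tokenizer

def estimate_tokens(segment: str) -> int:
    """Approximate a segment's token count from its character length."""
    return len(segment) // CHARS_PER_TOKEN

print(estimate_tokens("x" * 8000))  # 2000, the per-segment maximum
```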
You can view and edit your file’s segments using the RAG Segments action in your file management options. Each segment shows its character count and the embedding model used to index it.