What is the IP Vault?
The IP Vault is your personal file storage. When you upload documents here, your digital twin can read and reference them when answering your questions. Simple workflow:
- Upload your files (PDFs, documents, spreadsheets, etc.)
- Ask questions about your content
- Get answers that reference your uploaded materials
File Security: All files in the IP Vault are protected by a three-tier access control system — Public, Private (default), and Confidential. See File Access Control for details.

When Does My Digital Twin Use My Files?
Your digital twin automatically searches your uploaded files when you:
- Ask questions related to content you’ve uploaded
- Select specific files to include in a conversation
- Reference topics covered in your documents
Relevance Scores
When your digital twin finds matching content in your files, it shows a relevance score indicating how semantically close the match is to your question. Scores are computed by MongoDB Atlas Vector Search using Euclidean distance between vector embeddings, normalized to a 0-100% scale using the formula score = 1 / (1 + distance).
| Score | Euclidean Distance | Meaning |
|---|---|---|
| 90-100% | 0.00 – 0.11 | Excellent match — vectors are very close; the content directly addresses your question |
| 70-89% | 0.12 – 0.43 | Good match — semantically related content that should help answer your question |
| 50-69% | 0.45 – 1.00 | Partial match — loosely related background information |
| Below 50% | > 1.00 | Weak match — distant vectors with limited semantic overlap |
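The score bands in the table follow directly from the formula. A minimal sketch in Python, evaluating the formula at a few sample distances:

```python
def relevance_score(distance: float) -> float:
    """Convert a Euclidean distance to a 0-100% relevance score
    using the documented formula: score = 1 / (1 + distance)."""
    return 100.0 / (1.0 + distance)

# Sample distances and the bands they fall into
print(round(relevance_score(0.1), 1))  # 90.9 -> excellent match
print(round(relevance_score(0.5), 1))  # 66.7 -> partial match
print(round(relevance_score(1.0), 1))  # 50.0 -> boundary of weak match
```

Note that the score decreases smoothly as distance grows: a distance of 0 maps to 100%, and a distance of 1.0 maps to exactly 50%.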
For administrators: You can configure the minimum relevance threshold in your institution settings. You can also disable file uploads for users while retaining upload capabilities yourself, to control what content enters the knowledge base.
Technical Details: How File Search Works
Your digital twin uses a technology called RAG (Retrieval Augmented Generation) to search your files. Here’s what happens behind the scenes:
File Processing
When you upload a file, the system breaks it into smaller sections (segments) and converts each section into a numerical representation called a vector embedding — a high-dimensional array of numbers that captures the semantic meaning of the text. Embeddings are generated by a dedicated AI model (e.g., OpenAI text-embedding-3-small, Amazon titan-embed-text-v2, or Google gemini-embedding-001).
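The indexing step can be sketched in a few lines. This is a toy illustration, not the platform's actual pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (which returns a high-dimensional vector, not a three-element keyword count), and the segment size is a made-up default.

```python
def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model: counts a few
    keywords instead of producing a learned high-dimensional vector."""
    keywords = ["tuition", "housing", "grades"]
    words = text.lower().split()
    return [float(words.count(k)) for k in keywords]

def index_file(text: str, segment_size: int = 400) -> list[tuple[str, list[float]]]:
    """Split a file's text into fixed-size segments (word-based here,
    token-based in practice) and attach an embedding to each."""
    words = text.split()
    segments = [" ".join(words[i:i + segment_size])
                for i in range(0, len(words), segment_size)]
    return [(seg, toy_embed(seg)) for seg in segments]

# A 900-word document splits into three 400/400/100-word segments
index = index_file("tuition is due " * 300)
print(len(index))  # 3
```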
Question Analysis
Your question is converted into a vector embedding using the same model, allowing the system to mathematically compare your question against all your file sections in vector space.
Finding Matches
The system uses MongoDB Atlas Vector Search with Exact Nearest Neighbor (ENN) to find file sections whose embeddings are closest to your question’s embedding, measured by Euclidean distance. Results are ranked by relevance score, computed as 1 / (1 + distance), where a shorter distance means a higher relevance.
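The matching step above amounts to an exhaustive nearest-neighbour scan. A minimal sketch, using made-up two-dimensional embeddings in place of real model output:

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(question_vec: list[float],
           segments: list[tuple[str, list[float]]]) -> list[tuple[float, str]]:
    """Exact nearest-neighbour search: score every segment against the
    question embedding and rank by relevance = 1 / (1 + distance)."""
    scored = [(1.0 / (1.0 + euclidean(question_vec, emb)), text)
              for text, emb in segments]
    scored.sort(reverse=True)
    return scored

# Toy index: two segments with hand-picked 2-D embeddings
segments = [
    ("tuition policy", [0.9, 0.1]),
    ("campus map", [0.1, 0.9]),
]
results = search([1.0, 0.0], segments)
print(results[0][1])  # -> tuition policy (closest to the question vector)
```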
Relevance Threshold
Results below a minimum relevance threshold are filtered out (default: 45%). Administrators can adjust how strict the matching is:
- Higher threshold: Only highly relevant content is included — better precision
- Lower threshold: More content is included — better recall, but may be less directly relevant
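The threshold filter itself is a one-liner. A sketch, assuming scored results as (score, text) pairs on a 0-1 scale:

```python
def filter_by_threshold(results: list[tuple[float, str]],
                        threshold: float = 0.45) -> list[tuple[float, str]]:
    """Drop matches whose relevance score falls below the minimum
    threshold (default 45%, matching the documented default)."""
    return [(score, text) for score, text in results if score >= threshold]

matches = [(0.91, "direct answer"), (0.52, "background"), (0.30, "weak match")]
print(filter_by_threshold(matches))  # the 30% match is dropped
```

Raising the threshold toward 0.7 would also drop the "background" match: precision goes up, recall goes down.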
Segment Size and Search Accuracy
When a file is processed, it is split into segments (also called “chunks”) before being converted into vector embeddings. The size of these segments has a direct impact on how accurately your digital twin retrieves relevant content. Research findings from studies by Chroma Research and multi-dataset analyses consistently show an inverse relationship between segment size and retrieval precision:

| Segment Size | Precision | Recall | Trade-off |
|---|---|---|---|
| Small (~200 tokens) | Highest | Good | Best precision, but may fragment context |
| Medium (~400 tokens) | Moderate | Highest | Balanced precision and recall |
| Large (~800 tokens) | Lowest | Good | More context per segment, but more noise in results |
- Smaller segments match queries more precisely because the vector embedding represents a focused piece of content rather than a broad summary of many topics.
- Larger segments preserve more surrounding context but introduce noise — the embedding averages over more content, diluting the signal for any single topic.
- Industry defaults (e.g., 800 tokens with 400-token overlap) were found to score below average across all retrieval metrics.
- The optimal size depends on query type: fact-based questions benefit from smaller segments (256-512 tokens), while analytical questions requiring broader context may benefit from larger ones (1024+ tokens).
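Segmentation with overlap, as described above, can be sketched as follows. This uses a plain list of tokens and made-up size/overlap defaults for illustration; real pipelines operate on model-specific tokenizer output.

```python
def chunk_with_overlap(tokens: list[str],
                       size: int = 400,
                       overlap: int = 50) -> list[list[str]]:
    """Split a token list into segments of `size` tokens, with `overlap`
    tokens shared between consecutive segments to preserve context
    across chunk boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final segment already reaches the end
    return chunks

# 1000 tokens with size=400, overlap=50 -> segments starting at 0, 350, 700
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens)
print(len(chunks))  # 3
```

The overlap means the last 50 tokens of one segment reappear at the start of the next, so a sentence split by a chunk boundary is still seen whole in at least one segment.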