Chapter 03: Knowledge Base System - Vector Search and Semantic Retrieval
The General Bots Knowledge Base (gbkb) system implements a state-of-the-art semantic search infrastructure that enables intelligent document retrieval through vector embeddings and neural information retrieval. This chapter provides comprehensive technical documentation on the architecture, implementation, and optimization of the knowledge base subsystem.
Executive Summary
The knowledge base system transforms unstructured documents into queryable semantic representations, enabling natural language understanding and context-aware information retrieval. Unlike traditional keyword-based search systems, the gbkb implementation leverages dense vector representations to capture semantic meaning, supporting cross-lingual retrieval, conceptual similarity matching, and intelligent context augmentation for language model responses.
System Architecture Overview
Core Components and Data Flow
The knowledge base architecture implements a multi-stage pipeline for document processing and retrieval, detailed phase by phase below.
Technical Specifications
Document Processing Pipeline
Phase 1: Document Ingestion and Extraction
The system implements format-specific extractors for comprehensive document support. The PDF processing component provides advanced extraction capabilities with layout preservation, including:
- Text Layer Extraction: Direct extraction of embedded text from PDF documents
- OCR Processing: Optical character recognition for scanned documents
- Table Detection: Identification and extraction of tabular data
- Image Extraction: Retrieval of embedded images and figures
- Metadata Preservation: Author, creation date, and document properties
- Structure Detection: Identification of sections, headings, and document hierarchy
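To make the text-layer-plus-OCR flow concrete, here is a minimal Python sketch. The libraries (pypdf, pdf2image, pytesseract) are illustrative stand-ins for the Apache PDFBox + Tesseract stack listed below, not the production implementation:

```python
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def extract_pdf_text(path: str) -> str:
    """Prefer the embedded text layer; fall back to OCR for scanned pages."""
    reader = PdfReader(path)
    pages = []
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if not text:
            # No embedded text layer: rasterize just this page and OCR it.
            image = convert_from_path(path, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return "\n\n".join(pages)
```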
Supported File Formats and Parsers
| Format | Parser Library | Features | Max Size | Processing Time |
|---|---|---|---|---|
| PDF | Apache PDFBox + Tesseract | Text, OCR, Tables, Images | 500MB | ~10s/MB |
| DOCX | Apache POI + python-docx | Formatted text, Styles, Comments | 100MB | ~5s/MB |
| XLSX | Apache POI + openpyxl | Sheets, Formulas, Charts | 100MB | ~8s/MB |
| PPTX | Apache POI + python-pptx | Slides, Notes, Shapes | 200MB | ~7s/MB |
| HTML | BeautifulSoup + lxml | DOM parsing, CSS extraction | 50MB | ~3s/MB |
| Markdown | CommonMark + mistune | GFM support, Tables, Code | 10MB | ~1s/MB |
| Plain Text | Native UTF-8 decoder | Encoding detection | 100MB | <1s/MB |
| RTF | python-rtf | Formatted text, Images | 50MB | ~4s/MB |
| CSV/TSV | pandas + csv module | Tabular data, Headers | 1GB | ~2s/MB |
| JSON | ujson + jsonschema | Nested structures, Validation | 100MB | ~1s/MB |
| XML | lxml + xmlschema | XPath, XSLT, Validation | 100MB | ~3s/MB |
Storage Mathematics: The Hidden Reality of Vector Databases
Important Note: Unlike traditional databases where 1TB of documents remains roughly 1TB in storage, vector databases require significantly more space due to embedding generation, indexing, and metadata. This section reveals the true mathematics behind LLM storage requirements that big tech companies rarely discuss openly.
The Storage Multiplication Factor
- Original documents: 1.0 TB
- Vector DB total: 3.5 TB (a 3.5× multiplication factor)
- Production total with redundancy/backup (2× replica): 7.0 TB

Reality: you need 3.5-7× your document storage.
Storage Calculation Formula
The actual storage requirement for a vector database can be calculated using:
Total Storage = D × (1 + E + I + M + C)
Where:
- D = Original document size
- E = Embedding storage factor (typically 1.2-1.5×)
- I = Index overhead factor (typically 0.6-0.8×)
- M = Metadata factor (typically 0.4-0.5×)
- C = Cache/auxiliary factor (typically 0.3-0.5×)
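A quick sanity check of the formula in Python. The defaults below use the low end of each typical range, which reproduces the 3.5× rule exactly:

```python
def vector_db_storage(d_tb: float, e: float = 1.2, i: float = 0.6,
                      m: float = 0.4, c: float = 0.3) -> float:
    """Total storage in TB: D x (1 + E + I + M + C)."""
    return d_tb * (1 + e + i + m + c)

print(vector_db_storage(1.0))      # 3.5 TB for 1 TB of documents
print(vector_db_storage(1.0) * 2)  # 7.0 TB with a 2x production replica
```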
Real-World Storage Examples for Self-Hosted Infrastructure
| Your Document Storage | Vector DB Required | With Redundancy (2×) | Recommended Local Storage |
|---|---|---|---|
| 100 GB | 350 GB | 700 GB | 1 TB NVMe SSD |
| 500 GB | 1.75 TB | 3.5 TB | 4 TB NVMe SSD |
| 1 TB | 3.5 TB | 7 TB | 8 TB NVMe SSD |
| 5 TB | 17.5 TB | 35 TB | 40 TB SSD Array |
| 10 TB | 35 TB | 70 TB | 80 TB SSD Array |
| 50 TB | 175 TB | 350 TB | 400 TB Storage Server |
Note: Self-hosting your vector database gives you complete control over your data, eliminates recurring cloud costs, and ensures data sovereignty. The initial hardware investment typically pays for itself within 6-12 months compared to cloud alternatives.
Detailed Storage Breakdown by Component
| Component | Size |
|---|---|
| Original documents | 1000 GB |
| Extracted text | 800 GB |
| Embeddings | 1200 GB |
| Index | 600 GB |
| Metadata | 400 GB |
| Cache | 500 GB |
| Vector DB total (all components except originals) | 3.5 TB |

Why This Matters: Planning Your Infrastructure
Critical Insights:
- The 3.5× Rule: For every 1TB of documents, plan for at least 3.5TB of vector database storage
- Memory Requirements: Vector operations require significant RAM (typically 10-15% of index size must fit in memory)
- Backup Strategy: Production systems need 2-3× redundancy, effectively making it 7-10.5× original size
- Growth Planning: Vector databases don't compress well - plan storage linearly with document growth
Self-Hosted Infrastructure Example (1TB Document Collection):
| Component | Requirement | Recommended Hardware | Notes |
|---|---|---|---|
| Document Storage | 1 TB | 2 TB NVMe SSD | Quality drive for source docs |
| Vector Database | 3.5 TB | 4 TB NVMe SSD | High-performance for vectors |
| RAM Requirements | 256 GB | 256 GB DDR4/DDR5 | For index operations |
| Backup Storage | 3.5 TB | 4 TB SATA SSD | Local backup drive |
| Network | 10 Gbps | 10GbE NIC | Fast local network |
| Total Storage | 8 TB | 10 TB usable | Future-proof capacity |
Advantages of Self-Hosting:
- No recurring costs after initial hardware investment
- Complete data privacy - your data never leaves your infrastructure
- Full control over performance tuning and optimization
- No vendor lock-in or surprise price increases
- Faster local access without internet latency
- Compliance-ready for regulations requiring on-premise data
The actual infrastructure needs are 7-10× larger than the original document size when accounting for all components and redundancy, but owning your hardware means predictable costs and total control.
Phase 2: Text Preprocessing and Cleaning
The preprocessing pipeline ensures consistent, high-quality text for embedding through multiple stages:
1. Encoding Normalization
   - Unicode normalization (NFD/NFC)
   - Encoding error correction
   - Character set standardization
2. Whitespace and Formatting
   - Whitespace normalization
   - Control character removal
   - Line break standardization
3. Content Cleaning
   - Boilerplate removal
   - Header/footer cleaning
   - Watermark detection and removal
4. Language-Specific Processing
   - Language detection
   - Language-specific rules application
   - Script normalization
5. Semantic Preservation
   - Named entity preservation
   - Acronym handling
   - Numeric value preservation
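The first two stages are simple enough to sketch with Python's standard library; the later stages (boilerplate removal, language detection, entity preservation) need heavier NLP tooling and are omitted here:

```python
import re
import unicodedata

def normalize_text(raw: str) -> str:
    text = unicodedata.normalize("NFC", raw)  # Unicode normalization (NFC)
    text = re.sub(r"\r\n?", "\n", text)       # standardize line breaks
    text = "".join(ch for ch in text          # remove control characters,
                   if ch in "\n\t"            # keeping newlines and tabs
                   or unicodedata.category(ch)[0] != "C")
    text = re.sub(r"[ \t]+", " ", text)       # normalize whitespace runs
    return text.strip()
```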
Phase 3: Intelligent Chunking Strategy
The chunking engine implements context-aware segmentation with semantic boundary detection:
Boundary Detection Types:
- Paragraph Boundaries: Natural text breaks with highest priority
- Sentence Boundaries: Linguistic sentence detection
- Section Headers: Document structure preservation
- List Items: Maintaining list coherence
- Code Blocks: Preserving code integrity
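As a simplified illustration of boundary-aware segmentation, the sketch below packs whole paragraphs (the highest-priority boundary) into chunks with a small overlap; sentence, header, list, and code-block handling are omitted, and the size and overlap values are assumptions:

```python
def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk forward so context
            # is preserved across the boundary.
            current = current[-overlap:]
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A paragraph longer than max_chars becomes its own oversized chunk here; sentence-level splitting would handle that case in a full implementation.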
Phase 4: Embedding Generation
The system generates dense vector representations using transformer models with optimized batching:
Key Features:
- Batch Processing: Efficient processing of multiple chunks
- Mean Pooling: Token embedding aggregation
- Normalization: L2 normalization for cosine similarity
- Memory Management: Optimized GPU/CPU utilization
- Dynamic Batching: Adaptive batch sizes based on available memory
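A minimal version of this phase using the sentence-transformers library, which performs batching, mean pooling, and L2 normalization internally (the model name comes from the comparison table below):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["General Bots stores knowledge in .gbkb packages.",
          "Vector search retrieves semantically similar chunks."]
embeddings = model.encode(
    chunks,
    batch_size=64,              # process multiple chunks per forward pass
    normalize_embeddings=True,  # L2-normalize for cosine similarity
)
print(embeddings.shape)         # (2, 384) for this 384-dimensional model
```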
Embedding Model Comparison
| Model | Dimensions | Size | Speed | Quality | Memory |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 80MB | 14,200 sent/sec | 0.631 | 290MB |
| all-mpnet-base-v2 | 768 | 420MB | 2,800 sent/sec | 0.634 | 1.2GB |
| multi-qa-MiniLM-L6 | 384 | 80MB | 14,200 sent/sec | 0.618 | 290MB |
| paraphrase-multilingual | 768 | 1.1GB | 2,300 sent/sec | 0.628 | 2.1GB |
| e5-base-v2 | 768 | 440MB | 2,700 sent/sec | 0.642 | 1.3GB |
| bge-base-en | 768 | 440MB | 2,600 sent/sec | 0.644 | 1.3GB |
Phase 5: Vector Index Construction
The system builds high-performance vector indices using the HNSW (Hierarchical Navigable Small World) algorithm:
HNSW Configuration:
- M Parameter: 16 bi-directional links per node
- ef_construction: 200 for build-time accuracy
- ef_search: 100 for query-time accuracy
- Metric: Cosine similarity for semantic matching
- Threading: Multi-threaded construction support
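This configuration maps directly onto hnswlib, shown here as one common HNSW implementation rather than necessarily the gbkb backend:

```python
import hnswlib
import numpy as np

dim, num_elements = 384, 10_000
vectors = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)  # cosine similarity metric
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.set_num_threads(4)                        # multi-threaded construction
index.add_items(vectors, np.arange(num_elements))

index.set_ef(100)                               # ef_search accuracy at query time
labels, distances = index.knn_query(vectors[:1], k=10)
```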
Index Performance Characteristics
| Documents | Index Size | Build Time | Query Latency | Recall@10 |
|---|---|---|---|---|
| 10K | 15MB | 30s | 5ms | 0.99 |
| 100K | 150MB | 5min | 15ms | 0.98 |
| 1M | 1.5GB | 50min | 35ms | 0.97 |
| 10M | 15GB | 8hr | 75ms | 0.95 |
Retrieval System Architecture
Semantic Search Implementation
The retrieval engine implements multi-stage retrieval with re-ranking:
Retrieval Pipeline:
- Query Processing: Expansion and understanding
- Vector Search: HNSW approximate nearest neighbor
- Hybrid Search: Combining dense and sparse retrieval
- Re-ranking: Cross-encoder scoring
- Result Enhancement: Metadata enrichment
Query Processing and Expansion
Sophisticated query understanding includes:
- Language Detection: Multi-lingual query support
- Intent Recognition: Understanding search intent
- Query Expansion: Synonyms and related terms
- Entity Extraction: Named entity recognition
- Spell Correction: Typo and error correction
Hybrid Search and Fusion
Combining dense and sparse retrieval methods:
Dense Retrieval:
- Vector similarity search
- Semantic matching
- Concept-based retrieval
Sparse Retrieval:
- BM25 scoring
- Keyword matching
- Exact phrase search
Fusion Strategies:
- Reciprocal Rank Fusion (RRF)
- Linear combination
- Learning-to-rank models
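Reciprocal Rank Fusion is compact enough to show in full; this sketch fuses any number of ranked ID lists using the conventional k=60 constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document by the sum of 1/(k + rank) across rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]   # from vector search
sparse = ["doc1", "doc9", "doc3"]  # from BM25
print(rrf_fuse([dense, sparse]))   # documents in both lists rise to the top
```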
Re-ranking with Cross-Encoders
Advanced re-ranking for improved precision:
- Cross-attention scoring: Query-document interaction
- Contextual relevance: Fine-grained matching
- Diversity optimization: Result set diversification
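A minimal re-ranking pass with a sentence-transformers CrossEncoder; the checkpoint name is a widely used public model, assumed here for illustration:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "how do I index PDF documents?"
candidates = ["PDF ingestion uses a text-layer extractor with an OCR fallback.",
              "The query cache TTL defaults to one hour."]

# The cross-encoder scores each (query, document) pair jointly,
# capturing fine-grained interactions a bi-encoder misses.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```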
Context Management and Compaction
Context Window Optimization
The optimizer applies four complementary compaction techniques:

| Technique | Approach | Typical Reduction |
|---|---|---|
| Semantic Deduplication | Remove redundant info, merge similar chunks, keep unique facts | 30-40% |
| Relevance Scoring | Score by query match, keep top-k relevant, drop low scores | 40-50% |
| Hierarchical Summary | Extract key points, create abstracts, preserve details | 50-60% |
| Token Optimization | Remove stopwords, compress phrases, use abbreviations | 20-30% |

Compression Level 4 achieves 60-75% reduction while maintaining 95%+ information retention.
Compression Strategies with prompt-compact=4:
- Semantic Deduplication: Removes redundant information across chunks
- Relevance Scoring: Prioritizes chunks by query relevance (threshold: 0.95)
- Hierarchical Summarization: Creates multi-level abstracts
- Token Optimization: Reduces token count while preserving meaning
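The deduplication step, for instance, can be sketched as follows, assuming L2-normalized embeddings so the dot product equals cosine similarity (the 0.95 threshold mirrors the value quoted above):

```python
import numpy as np

def dedupe(chunks: list[str], embeddings: np.ndarray,
           threshold: float = 0.95) -> list[str]:
    kept: list[int] = []
    for i in range(len(chunks)):
        # Skip this chunk if it is near-identical to one already kept.
        if kept and (embeddings[kept] @ embeddings[i]).max() >= threshold:
            continue
        kept.append(i)
    return [chunks[k] for k in kept]
```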
Dynamic Context Strategies
| Strategy | Use Case | Context Efficiency | Quality |
|---|---|---|---|
| Top-K Selection | General queries | High | Good |
| Diversity Sampling | Broad topics | Medium | Better |
| Hierarchical | Long documents | Very High | Best |
| Temporal | Time-sensitive | Medium | Good |
| Entity-centric | Fact-finding | High | Excellent |
Performance Optimization
Caching Architecture
Multi-level caching for improved performance:
Cache Levels:
- Query Cache: Recent query results
- Embedding Cache: Frequently accessed vectors
- Document Cache: Popular documents
- Index Cache: Hot index segments
Cache Configuration:
| Cache Type | Size | TTL | Hit Rate | Latency Reduction |
|---|---|---|---|---|
| Query | 10K entries | 1hr | 35% | 95% |
| Embedding | 100K vectors | 24hr | 60% | 80% |
| Document | 1K docs | 6hr | 45% | 70% |
| Index | 10GB | Static | 80% | 60% |
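For illustration, the Query row above corresponds to a cache like the following minimal sketch; cachetools is an assumed library choice, not necessarily what gbkb uses:

```python
from cachetools import TTLCache

query_cache = TTLCache(maxsize=10_000, ttl=3600)  # 10K entries, 1hr TTL

def cached_search(query: str, search_fn):
    if query in query_cache:
        return query_cache[query]  # cache hit: skip retrieval entirely
    results = search_fn(query)
    query_cache[query] = results
    return results
```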
Index Optimization Techniques
Index Sharding:
- Description: Distribute index across multiple shards
- When to use: Large-scale deployments (>10M documents)
- Configuration:
- shard_count: 8-32 shards
- shard_strategy: hash-based or range-based
- replication_factor: 2-3 for availability
Quantization:
- Description: Reduce vector precision for space/speed
- When to use: Memory-constrained environments
- Configuration:
- type: Product Quantization (PQ)
- subvectors: 48-96
- bits: 8 per subvector
- training_samples: 100K vectors
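These parameters map onto FAISS's product quantizer as follows; FAISS is an illustrative implementation choice:

```python
import faiss
import numpy as np

dim, subvectors, bits = 768, 96, 8  # 768/96 = 8 dimensions per subvector
train = np.random.rand(100_000, dim).astype(np.float32)  # ~100K training samples

index = faiss.IndexPQ(dim, subvectors, bits)
index.train(train)   # learn one 256-entry codebook per subvector
index.add(train)     # stores 96 bytes/vector instead of 3072 (32x smaller)
distances, ids = index.search(train[:1], 10)
```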
Hierarchical Index:
- Description: Multi-level index structure
- When to use: Ultra-large collections (>100M)
- Configuration:
- levels: 2-3 hierarchy levels
- fanout: 100-1000 per level
- rerank_top_k: 100-500 candidates
GPU Acceleration:
- Description: CUDA-accelerated operations
- When to use: High-throughput requirements
- Configuration:
- device: CUDA-capable GPU
- batch_size: 256-1024
- precision: FP16 for speed, FP32 for accuracy
Integration with LLM Systems
Retrieval-Augmented Generation (RAG)
The knowledge base seamlessly integrates with language models:
RAG Pipeline:
- Query Understanding: LLM-based query analysis
- Document Retrieval: Semantic search execution
- Context Assembly: Relevant passage selection
- Prompt Construction: Context injection
- Response Generation: LLM completion
- Citation Tracking: Source attribution
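A schematic of the pipeline in Python; search, compact, and llm_complete are hypothetical placeholders for the retrieval, compaction, and generation components described in this chapter:

```python
def answer(query: str, search, compact, llm_complete, top_k: int = 5) -> dict:
    hits = search(query, k=top_k)                 # document retrieval
    context = compact([h["text"] for h in hits])  # context assembly + compaction
    prompt = ("Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return {
        "answer": llm_complete(prompt, max_tokens=1024),  # llm-n-predict limit
        "sources": [h["id"] for h in hits],               # citation tracking
    }
```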
RAG Configuration (from config.csv):
| Parameter | Value | Purpose |
|---|---|---|
| prompt-compact | 4 | Context compaction level |
| llm-ctx-size | 4096 | LLM context window size |
| llm-n-predict | 1024 | Maximum tokens to generate |
| embedding-model | bge-small-en-v1.5 | Model for semantic embeddings |
| llm-cache | false | Response caching disabled |
| llm-cache-semantic | true | Semantic cache matching enabled |
| llm-cache-threshold | 0.95 | Semantic similarity threshold for cache |
Note: The actual system uses prompt compaction level 4 for efficient context management, with a 4096 token context window and generates up to 1024 tokens per response.
Monitoring and Analytics
Knowledge Base Metrics
Real-time monitoring dashboard tracks:
Collection Statistics:
- Total documents indexed
- Total chunks generated
- Total embeddings created
- Index size (MB/GB)
- Storage utilization
Performance Metrics:
- Indexing rate (docs/sec)
- Query latency (p50, p95, p99)
- Embedding generation latency
- Cache hit rates
- Throughput (queries/sec)
Quality Metrics:
- Mean relevance scores
- Recall@K measurements
- Precision metrics
- User feedback scores
- Query success rates
Health Monitoring
| Metric | Threshold | Alert Level | Action |
|---|---|---|---|
| Query Latency p99 | >100ms | Warning | Scale replicas |
| Cache Hit Rate | <30% | Info | Warm cache |
| Index Fragmentation | >40% | Warning | Rebuild index |
| Memory Usage | >85% | Critical | Add resources |
| Error Rate | >1% | Critical | Investigate logs |
Best Practices and Guidelines
Document Preparation
- Ensure documents are properly formatted
- Remove unnecessary headers/footers before ingestion
- Validate encoding and character sets
- Structure documents with clear sections
Index Maintenance
- Regular index optimization (weekly)
- Periodic full reindexing (monthly)
- Monitor fragmentation levels
- Implement gradual rollout for updates
Query Optimization
- Use specific, contextual queries
- Leverage query expansion for broad searches
- Implement query caching for common patterns
- Monitor and analyze query logs
System Scaling
- Horizontal scaling with index sharding
- Read replicas for high availability
- Load balancing across instances
- Implement circuit breakers for resilience