Document Indexing
Document indexing in BotServer is automatic. When documents are added to .gbkb folders, they are processed and made searchable without any manual configuration.
Automatic Indexing
The system automatically indexes documents when:
- Files are added to any .gbkb folder
- USE KB is called for a collection
- Files are modified or updated
- USE WEBSITE registers websites for crawling (preprocessing) and associates them with sessions (runtime)
How Indexing Works
- Document Detection - System scans .gbkb folders for files
- Text Extraction - Content extracted from PDF, DOCX, HTML, MD, TXT
- Chunking - Text split into manageable segments
- Embedding Generation - Chunks converted to vectors using the BGE embedding model
- Storage - Vectors stored for semantic search
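The chunking step can be illustrated with a minimal Python sketch. This is not BotServer's actual implementation: the function name `chunk_text`, the word-based splitting, and the `chunk_size` and `overlap` values are all assumptions chosen for illustration. It only shows the general idea of cutting extracted text into overlapping segments before embedding.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows (sizes are illustrative only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Overlapping windows are a common choice because a sentence that straddles a boundary still appears whole in at least one chunk.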
Supported File Types
- PDF - Full text extraction
- DOCX - Microsoft Word documents
- TXT - Plain text files
- HTML - Web pages (text only)
- MD - Markdown documents
- CSV - Structured data
Website Indexing
To keep web content fresh, schedule regular crawls:
' In update-docs.bas
SET SCHEDULE "0 2 * * *" ' Run daily at 2 AM
USE WEBSITE "https://docs.example.com"
' Website is registered for crawling during preprocessing
' At runtime, it associates the crawled content with the session
Scheduling Options
SET SCHEDULE "0 * * * *" ' Every hour
SET SCHEDULE "*/30 * * * *" ' Every 30 minutes
SET SCHEDULE "0 0 * * 0" ' Weekly on Sunday
SET SCHEDULE "0 0 1 * *" ' Monthly on the 1st
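To make the schedule strings easier to read, here is a small Python sketch that checks whether a given time matches an expression. It assumes the strings follow the conventional five-field cron layout (minute, hour, day of month, month, day of week), which the examples above suggest but the source does not state outright. The helper names `field_matches` and `schedule_matches` are hypothetical, and only `*`, `*/n`, and single integers are handled.

```python
from datetime import datetime


def field_matches(field: str, value: int, low: int = 0) -> bool:
    """Match one cron field; supports '*', '*/n', and a single integer only."""
    if field == "*":
        return True
    if field.startswith("*/"):
        # Steps count from the field's lowest legal value (0 or 1).
        return (value - low) % int(field[2:]) == 0
    return int(field) == value


def schedule_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on the schedule described by `expr`."""
    minute, hour, day, month, weekday = expr.split()
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day, low=1)
        and field_matches(month, when.month, low=1)
        # cron counts Sunday as 0; isoweekday() counts it as 7
        and field_matches(weekday, when.isoweekday() % 7)
    )


print(schedule_matches("0 2 * * *", datetime(2024, 1, 1, 2, 0)))
print(schedule_matches("*/30 * * * *", datetime(2024, 1, 1, 5, 15)))
```

The sketch requires every field to match, which agrees with standard cron only when at most one of the two day fields is restricted, as in all the examples above.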
Real-Time Updates
Documents are re-indexed automatically when:
- File content changes
- New files appear in folders
- Files are deleted (removed from index)
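One common way to detect these three cases is to compare content fingerprints between two folder scans. The Python sketch below illustrates that idea. It is an assumption about how such detection could work, not a description of BotServer's internals, and the names `fingerprint` and `diff_snapshots` are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Hash file content so that any change yields a different value."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(previous: dict[str, str], current: dict[str, str]):
    """Compare {path: fingerprint} maps taken from two scans of a folder."""
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return added, changed, deleted


before = {"manual-v1.pdf": fingerprint(b"v1"), "specs.docx": fingerprint(b"specs")}
after = {"manual-v1.pdf": fingerprint(b"v2"), "updates.html": fingerprint(b"news")}
print(diff_snapshots(before, after))
```

Files in `added` and `changed` would be sent through the indexing pipeline again, while entries in `deleted` would be removed from the index.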
Using Indexed Content
Once indexed, content is automatically available:
USE KB "documentation"
' All documents in the documentation folder are now searchable
' The LLM will use this knowledge when answering questions
You don't need to explicitly search - the system does it automatically when generating responses.
Configuration
Indexing uses settings from config.csv:
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
The BGE embedding model can be replaced with any compatible model.
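For scripts that need to inspect these settings outside BotServer, the sketch below reads the two keys with Python's standard `csv` module. It assumes config.csv is a plain two-column key,value file with no header row, as the excerpt above suggests; the real file may contain other columns or entries. `load_settings` is a hypothetical helper, not part of BotServer.

```python
import csv
import io


def load_settings(csv_text: str) -> dict[str, str]:
    """Parse 'key,value' rows into a dict, skipping rows with fewer than two cells."""
    settings = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:
            settings[row[0].strip()] = row[1].strip()
    return settings


sample = (
    "embedding-url,http://localhost:8082\n"
    "embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf\n"
)
settings = load_settings(sample)
print(settings["embedding-url"])
print(settings["embedding-model"])
```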
Performance Optimization
The system optimizes indexing by:
- Processing only changed files
- Caching embeddings
- Parallel processing when possible
- Incremental updates
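Embedding caching can be pictured as a lookup keyed by chunk content, so that unchanged text is never embedded twice. The Python sketch below is only an illustration under that assumption. `EmbeddingCache` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that returns toy numbers rather than calling a real embedding model.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Remember embeddings by content hash so repeated chunks are not recomputed."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self._embed = embed
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the embedding function was called

    def get(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(chunk)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real model call: returns two toy features."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
cache.get("hello world")
cache.get("hello world")  # served from the cache, no second model call
cache.get("another chunk")
print(cache.misses)
```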
Example: Knowledge Base Maintenance
Structure your knowledge base:
company.gbkb/
├── products/
│ ├── manual-v1.pdf
│ └── specs.docx
├── policies/
│ ├── hr-policy.pdf
│ └── it-policy.md
└── news/
└── updates.html
Schedule regular web updates:
' In maintenance.bas
SET SCHEDULE "0 1 * * *"
' Register websites for crawling
USE WEBSITE "https://company.com/news"
USE WEBSITE "https://company.com/products"
' Websites are crawled by background service
Best Practices
- Organize documents by topic in separate folders
- Schedule updates for web content
- Keep files updated - system handles re-indexing
- Monitor folder sizes - very large collections may impact performance
- Use clear naming - helps with organization
Troubleshooting
Documents Not Appearing
- Check file is in a .gbkb folder
- Verify file type is supported
- Ensure USE KB was called for that collection
Slow Indexing
- Large PDFs may take time to process
- Consider splitting very large documents
- Check available system resources
Outdated Content
- Set up scheduled crawls for web content
- Ensure files are being updated
- Check that re-indexing is triggered
Remember: Indexing is automatic - just add documents to .gbkb folders and call USE KB to activate them!