botserver/docs/src/chapter-03/indexing.md
Rodrigo Rodriguez (Pragmatismo) f40cb6c7b4 Fix typos in bot file extensions and keyword names
Changed incorrect references to .vbs files to .bas and corrected
USE_WEBSITE keyword naming. Also added missing fields to API response
structure and clarified that start.bas is optional for bots.
2025-11-26 22:54:22 -03:00

3.8 KiB

Document Indexing

Document indexing in BotServer is automatic. When documents are added to .gbkb folders, they are processed and made searchable without any manual configuration.

Automatic Indexing

The system automatically indexes documents when:

  • Files are added to any .gbkb folder
  • USE KB is called for a collection
  • Files are modified or updated
  • USE WEBSITE registers websites for crawling (preprocessing) and associates them with sessions (runtime)

How Indexing Works

  1. Document Detection - System scans .gbkb folders for files
  2. Text Extraction - Content extracted from PDF, DOCX, HTML, MD, TXT
  3. Chunking - Text split into manageable segments
  4. Embedding Generation - Chunks converted to vectors using BGE model
  5. Storage - Vectors stored for semantic search

Supported File Types

  • PDF - Full text extraction
  • DOCX - Microsoft Word documents
  • TXT - Plain text files
  • HTML - Web pages (text only)
  • MD - Markdown documents
  • CSV - Structured data

Website Indexing

To keep web content fresh, schedule regular crawls:

' In update-docs.bas
SET SCHEDULE "0 2 * * *"  ' Run daily at 2 AM

USE WEBSITE "https://docs.example.com"
' Website is registered for crawling during preprocessing
' At runtime, it associates the crawled content with the session

Scheduling Options

SET SCHEDULE "0 * * * *"     ' Every hour
SET SCHEDULE "*/30 * * * *"  ' Every 30 minutes
SET SCHEDULE "0 0 * * 0"     ' Weekly on Sunday
SET SCHEDULE "0 0 1 * *"     ' Monthly on the 1st

Real-Time Updates

Documents are re-indexed automatically when:

  • File content changes
  • New files appear in folders
  • Files are deleted (removed from index)

Using Indexed Content

Once indexed, content is automatically available:

USE KB "documentation"
' All documents in the documentation folder are now searchable
' The LLM will use this knowledge when answering questions

You don't need to explicitly search - the system does it automatically when generating responses.

Configuration

Indexing uses settings from config.csv:

embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf

The BGE embedding model can be replaced with any compatible model.

Performance Optimization

The system optimizes indexing by:

  • Processing only changed files
  • Caching embeddings
  • Parallel processing when possible
  • Incremental updates

Example: Knowledge Base Maintenance

Structure your knowledge base:

company.gbkb/
├── products/
│   ├── manual-v1.pdf
│   └── specs.docx
├── policies/
│   ├── hr-policy.pdf
│   └── it-policy.md
└── news/
    └── updates.html

Schedule regular web updates:

' In maintenance.bas
SET SCHEDULE "0 1 * * *"

' Register websites for crawling
USE WEBSITE "https://company.com/news"
USE WEBSITE "https://company.com/products"
' Websites are crawled by background service

Best Practices

  1. Organize documents by topic in separate folders
  2. Schedule updates for web content
  3. Keep files updated - system handles re-indexing
  4. Monitor folder sizes - very large collections may impact performance
  5. Use clear naming - helps with organization

Troubleshooting

Documents Not Appearing

  • Check file is in a .gbkb folder
  • Verify file type is supported
  • Ensure USE KB was called for that collection

Slow Indexing

  • Large PDFs may take time to process
  • Consider splitting very large documents
  • Check available system resources

Outdated Content

  • Set up scheduled crawls for web content
  • Ensure files are being updated
  • Check that re-indexing is triggered

Remember: Indexing is automatic - just add documents to folders and use USE KB to activate them!