Rodrigo Rodriguez (Pragmatismo) f40cb6c7b4 Fix typos in bot file extensions and keyword names

Changed incorrect references to .vbs files to .bas and corrected
USE_WEBSITE keyword naming. Also added missing fields to API response
structure and clarified that start.bas is optional for bots.

2025-11-26 22:54:22 -03:00


Document Indexing

Document indexing in BotServer is automatic. When documents are added to .gbkb folders, they are processed and made searchable without any manual configuration.

Automatic Indexing

The system automatically indexes documents when:

Files are added to any .gbkb folder
USE KB is called for a collection
Files are modified or updated
USE WEBSITE registers a website for crawling (at preprocessing) and associates it with the session (at runtime)

How Indexing Works

1. Document Detection - The system scans .gbkb folders for files
2. Text Extraction - Content is extracted from PDF, DOCX, HTML, MD, and TXT files
3. Chunking - The text is split into manageable segments
4. Embedding Generation - Chunks are converted to vectors using the BGE model
5. Storage - The vectors are stored for semantic search
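
The chunking step can be pictured with a minimal Python sketch. The source does not state BotServer's chunk size, overlap, or splitting strategy, so the word-based splitting and the chunk_size and overlap parameters below are illustrative assumptions, not the actual implementation.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping, word-based chunks.

    chunk_size and overlap are hypothetical values chosen for illustration;
    they are not taken from BotServer.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

Overlapping chunks are a common choice because a sentence that straddles a boundary still appears whole in at least one chunk.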

Supported File Types

PDF - Full text extraction
DOCX - Microsoft Word documents
TXT - Plain text files
HTML - Web pages (text only)
MD - Markdown documents
CSV - Structured data
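
As a rough illustration of the detection rule, the following Python sketch decides whether a path would qualify for indexing. The extension set mirrors the list above; matching on the file suffix and on a parent folder name ending in .gbkb is an assumption about how detection might work, and is_indexable is a hypothetical helper, not a BotServer function.

```python
from pathlib import Path

# Extensions taken from the list above; matching by suffix is an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html", ".md", ".csv"}


def is_indexable(path):
    """Return True if the file has a supported extension and lives under a .gbkb folder."""
    p = Path(path)
    # Look at every directory component, excluding the file name itself.
    in_kb = any(part.endswith(".gbkb") for part in p.parts[:-1])
    return in_kb and p.suffix.lower() in SUPPORTED_EXTENSIONS
```

The same check is a quick first diagnostic when a document does not show up in search results.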

Website Indexing

To keep web content fresh, schedule regular crawls:

' In update-docs.bas
SET SCHEDULE "0 2 * * *"  ' Run daily at 2 AM

USE WEBSITE "https://docs.example.com"
' The website is registered for crawling during preprocessing
' At runtime, the crawled content is associated with the session

Scheduling Options

SET SCHEDULE "0 * * * *"     ' Every hour, on the hour
SET SCHEDULE "*/30 * * * *"  ' Every 30 minutes
SET SCHEDULE "0 0 * * 0"     ' Weekly, Sunday at midnight
SET SCHEDULE "0 0 1 * *"     ' Monthly, on the 1st at midnight
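
The schedule strings above are consistent with the standard five-field cron layout: minute, hour, day of month, month, day of week. The Python sketch below only labels the fields to make an expression easier to read. Whether SET SCHEDULE accepts other cron variants is not stated in the source, and describe_cron is a hypothetical helper.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map a five-field cron expression to named fields (no validation of values)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))
```

For example, describe_cron("0 2 * * *") labels 0 as the minute and 2 as the hour, which matches the "daily at 2 AM" schedule used earlier.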

Real-Time Updates

Documents are re-indexed automatically when:

File content changes
New files appear in folders
Files are deleted (removed from index)
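
One common way to detect these three cases is to compare content fingerprints between two scans. The Python sketch below assumes SHA-256 hashes of file content and a simple path-to-hash map; the source does not say how BotServer tracks changes, so fingerprint and diff_snapshots are hypothetical names.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Hash file content so a change can be detected without storing the file."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(previous, current):
    """Compare two {path: fingerprint} maps and classify what changed."""
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "modified": modified, "deleted": deleted}
```

Added and modified files would then be re-indexed, and deleted files removed from the index.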

Using Indexed Content

Once indexed, content is automatically available:

USE KB "documentation"
' All documents in the documentation folder are now searchable
' The LLM will use this knowledge when answering questions

You don't need to issue an explicit search: the system retrieves relevant content automatically when generating responses.
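
Conceptually, that automatic lookup is a semantic search: the question is embedded and compared against the stored chunk vectors. The Python sketch below uses cosine similarity over toy two-dimensional vectors. The similarity metric, the value of k, and the data layout are assumptions for illustration, since the source does not describe BotServer's retrieval internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query.

    indexed_chunks is a list of (text, vector) pairs; the layout is hypothetical.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

The retrieved chunks are what the LLM would receive as extra context alongside the user's question.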

Configuration

Indexing uses settings from config.csv:

embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf

The BGE embedding model can be replaced with any compatible embedding model by changing the embedding-model setting.
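
To show how these name,value lines can be read, here is a minimal Python sketch. It assumes one setting per line with the name before the first comma; the real config.csv may use a header row or quoting that this hypothetical parse_config helper does not handle.

```python
def parse_config(text):
    """Parse 'name,value' lines into a dict, splitting on the first comma only."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(",", 1)
        settings[name.strip()] = value.strip()
    return settings
```

Splitting on the first comma only keeps values that themselves contain commas intact.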

Performance Optimization

The system optimizes indexing by:

Processing only changed files
Caching embeddings
Parallel processing when possible
Incremental updates
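
Embedding caching can be sketched as a lookup keyed by a hash of the chunk text, so an unchanged chunk is never embedded twice. This is an illustrative Python sketch, not BotServer's cache: EmbeddingCache and the embed_fn callable are hypothetical, and the in-memory dictionary stands in for whatever storage the system actually uses.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by content hash so unchanged chunks are not re-embedded."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical callable: str -> list of floats
        self._store = {}
        self.misses = 0  # number of times the embedding function was called

    def get(self, chunk):
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(chunk)
        return self._store[key]
```

Because the key depends only on the chunk's content, editing one paragraph of a large document would trigger new embeddings only for the chunks that actually changed.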

Example: Knowledge Base Maintenance

Structure your knowledge base:

company.gbkb/
├── products/
│   ├── manual-v1.pdf
│   └── specs.docx
├── policies/
│   ├── hr-policy.pdf
│   └── it-policy.md
└── news/
    └── updates.html

Schedule regular web updates:

' In maintenance.bas
SET SCHEDULE "0 1 * * *"  ' Run daily at 1 AM

' Register websites for crawling
USE WEBSITE "https://company.com/news"
USE WEBSITE "https://company.com/products"
' Websites are crawled by a background service

Best Practices

Organize documents by topic in separate folders
Schedule regular crawls for web content
Keep files up to date: the system handles re-indexing
Monitor folder sizes: very large collections may slow indexing and search
Use clear file and folder names: they make collections easier to manage

Troubleshooting

Documents Not Appearing

Check that the file is in a .gbkb folder
Verify that the file type is supported
Ensure USE KB was called for that collection

Slow Indexing

Large PDFs may take time to process
Consider splitting very large documents
Check available system resources

Outdated Content

Set up scheduled crawls for web content
Ensure files are being updated
Check that re-indexing is triggered

Remember: indexing is automatic. Just add documents to .gbkb folders and call USE KB to activate them.
