botserver/docs/KB_AND_TOOLS.md

542 lines
16 KiB
Markdown
Raw Normal View History

2025-10-18 18:20:02 -03:00
# Knowledge Base (KB) and Tools System
## Overview
This document describes the comprehensive Knowledge Base (KB) and BASIC Tools compilation system integrated into the botserver. This system enables:
1. **Dynamic Knowledge Base Management**: Monitor MinIO buckets for document changes and automatically index them in Qdrant vector database
2. **BASIC Tool Compilation**: Compile BASIC scripts into AST and generate MCP/OpenAI tool definitions
3. **Intelligent Context Processing**: Enhance prompts with relevant KB documents and available tools based on answer mode
4. **Temporary Website Indexing**: Crawl and index web pages for session-specific knowledge
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Bot Server │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ KB Manager │ │ MinIO │ │ Qdrant │ │
│ │ │◄──►│ Handler │◄──►│ Client │ │
│ │ │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ ▲ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ BASIC │ │ Embeddings │ │
│ │ Compiler │ │ Generator │ │
│ │ │ │ │ │
│ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Prompt Processor │ │
│ │ (Integrates KB + Tools based on Answer Mode) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
## Components
### 1. KB Manager (`src/kb/mod.rs`)
The KB Manager coordinates MinIO monitoring and Qdrant indexing:
- **Watches collections**: Monitors `.gbkb/` folders for document changes
- **Detects changes**: Uses file hashing (SHA256) to detect modified files
- **Indexes documents**: Splits documents into chunks and generates embeddings
- **Stores metadata**: Maintains document information in PostgreSQL
#### Key Functions
```rust
// Add a KB collection to be monitored
kb_manager.add_collection(bot_id, "enrollpdfs").await?;
// Remove a collection
kb_manager.remove_collection("enrollpdfs").await?;
// Start the monitoring service
let kb_handle = kb_manager.spawn();
```
### 2. MinIO Handler (`src/kb/minio_handler.rs`)
Monitors MinIO buckets for file changes:
- **Polling**: Checks for changes every 15 seconds
- **Event detection**: Identifies created, modified, and deleted files
- **State tracking**: Maintains file ETags and sizes for change detection
#### File Change Events
```rust
pub enum FileChangeEvent {
Created { path: String, size: i64, etag: String },
Modified { path: String, size: i64, etag: String },
Deleted { path: String },
}
```
### 3. Qdrant Client (`src/kb/qdrant_client.rs`)
Manages vector database operations:
- **Collection management**: Create, delete, and check collections
- **Point operations**: Upsert and delete vector points
- **Search**: Semantic search using cosine similarity
#### Example Usage
```rust
let client = get_qdrant_client(&state)?;
// Create collection
client.create_collection("kb_bot123_enrollpdfs", 1536).await?;
// Search
let results = client.search("kb_bot123_enrollpdfs", query_vector, 5).await?;
```
### 4. Embeddings Generator (`src/kb/embeddings.rs`)
Handles text embedding and document indexing:
- **Chunking**: Splits documents into 512-character chunks with 50-char overlap
- **Embedding**: Generates vectors using local LLM server
- **Indexing**: Stores chunks with metadata in Qdrant
#### Document Processing
```rust
// Index a document
index_document(&state, "kb_bot_collection", "file.pdf", &content).await?;
// Search for similar documents
let results = search_similar(&state, "kb_bot_collection", "query", 5).await?;
```
### 5. BASIC Compiler (`src/basic/compiler/mod.rs`)
Compiles BASIC scripts and generates tool definitions:
#### Input: BASIC Script with Metadata
```basic
PARAM name AS string LIKE "Abreu Silva" DESCRIPTION "Required full name"
PARAM birthday AS date LIKE "23/09/2001" DESCRIPTION "Birth date in DD/MM/YYYY"
PARAM email AS string LIKE "user@example.com" DESCRIPTION "Email address"
DESCRIPTION "Enrollment process for new users"
// Script logic here
SAVE "enrollments.csv", id, name, birthday, email
TALK "Thanks, you are enrolled!"
SET_KB "enrollpdfs"
```
#### Output: Multiple Files
1. **enrollment.ast**: Compiled Rhai AST
2. **enrollment.mcp.json**: MCP tool definition
3. **enrollment.tool.json**: OpenAI tool definition
#### MCP Tool Format
```json
{
"name": "enrollment",
"description": "Enrollment process for new users",
"input_schema": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Required full name",
"example": "Abreu Silva"
},
"birthday": {
"type": "string",
"description": "Birth date in DD/MM/YYYY",
"example": "23/09/2001"
}
},
"required": ["name", "birthday", "email"]
}
}
```
### 6. Prompt Processor (`src/context/prompt_processor.rs`)
Enhances queries with context based on answer mode:
#### Answer Modes
| Mode | Value | Description |
|------|-------|-------------|
| Direct | 0 | No additional context, direct LLM response |
| WithTools | 1 | Include available tools in prompt |
| DocumentsOnly | 2 | Search KB only, no LLM generation |
| WebSearch | 3 | Include web search results |
| Mixed | 4 | Combine KB documents + tools (context-aware) |
#### Mixed Mode Flow
```
User Query
┌─────────────────────────┐
│ Prompt Processor │
│ (Answer Mode: Mixed) │
└─────────────────────────┘
├──► Search KB Documents (Qdrant)
│ └─► Returns relevant chunks
├──► Get Available Tools (Session Context)
│ └─► Returns tool definitions
┌─────────────────────────┐
│ Enhanced Prompt │
│ • System Prompt │
│ • Document Context │
│ • Available Tools │
│ • User Query │
└─────────────────────────┘
```
## BASIC Keywords
### SET_KB
Activates a KB collection for the current session.
```basic
SET_KB "enrollpdfs"
```
- Creates/ensures Qdrant collection exists
- Updates session context with active collection
- Documents in `.gbkb/enrollpdfs/` are indexed
### ADD_KB
Adds an additional KB collection (can have multiple).
```basic
ADD_KB "productbrochurespdfsanddocs"
```
### ADD_TOOL
Compiles and registers a BASIC tool.
```basic
ADD_TOOL "enrollment.bas"
```
Downloads from MinIO (`.gbdialog/enrollment.bas`), compiles to:
- `./work/{bot_id}.gbai/{bot_id}.gbdialog/enrollment.ast`
- `./work/{bot_id}.gbai/{bot_id}.gbdialog/enrollment.mcp.json`
- `./work/{bot_id}.gbai/{bot_id}.gbdialog/enrollment.tool.json`
#### With MCP Endpoint
```basic
ADD_TOOL "enrollment.bas" as MCP
```
Creates an HTTP endpoint at `/default/enrollment` that:
- Accepts JSON matching the tool schema
- Executes the BASIC script
- Returns the result
### ADD_WEBSITE
Crawls and indexes a website for the current session.
```basic
ADD_WEBSITE "https://example.com/docs"
```
- Fetches HTML content
- Extracts readable text (removes scripts, styles)
- Creates temporary Qdrant collection
- Indexes content with embeddings
- Available for remainder of session
## Database Schema
### kb_documents
Stores metadata about indexed documents:
```sql
CREATE TABLE kb_documents (
id UUID PRIMARY KEY,
bot_id UUID NOT NULL,
collection_name TEXT NOT NULL,
file_path TEXT NOT NULL,
file_size BIGINT NOT NULL,
file_hash TEXT NOT NULL,
first_published_at TIMESTAMPTZ NOT NULL,
last_modified_at TIMESTAMPTZ NOT NULL,
indexed_at TIMESTAMPTZ,
metadata JSONB DEFAULT '{}',
UNIQUE(bot_id, collection_name, file_path)
);
```
### kb_collections
Stores KB collection information:
```sql
CREATE TABLE kb_collections (
id UUID PRIMARY KEY,
bot_id UUID NOT NULL,
name TEXT NOT NULL,
folder_path TEXT NOT NULL,
qdrant_collection TEXT NOT NULL,
document_count INTEGER NOT NULL DEFAULT 0,
UNIQUE(bot_id, name)
);
```
### basic_tools
Stores compiled BASIC tools:
```sql
CREATE TABLE basic_tools (
id UUID PRIMARY KEY,
bot_id UUID NOT NULL,
tool_name TEXT NOT NULL,
file_path TEXT NOT NULL,
ast_path TEXT NOT NULL,
mcp_json JSONB,
tool_json JSONB,
compiled_at TIMESTAMPTZ NOT NULL,
is_active BOOLEAN NOT NULL DEFAULT true,
UNIQUE(bot_id, tool_name)
);
```
## Workflow Examples
### Example 1: Enrollment with KB
**File Structure:**
```
bot.gbai/
├── .gbkb/
│ └── enrollpdfs/
│ ├── enrollment_guide.pdf
│ ├── requirements.pdf
│ └── faq.pdf
├── .gbdialog/
│ ├── start.bas
│ └── enrollment.bas
```
**start.bas:**
```basic
ADD_TOOL "enrollment.bas" as MCP
ADD_KB "enrollpdfs"
```
**enrollment.bas:**
```basic
PARAM name AS string LIKE "John Doe" DESCRIPTION "Full name"
PARAM email AS string LIKE "john@example.com" DESCRIPTION "Email"
DESCRIPTION "Enrollment process with KB support"
// Validate input
IF name = "" THEN
TALK "Please provide your name"
EXIT
END IF
// Save to database
SAVE "enrollments.csv", name, email
// Set KB for enrollment docs
SET_KB "enrollpdfs"
TALK "Thanks! You can now ask me about enrollment procedures."
```
**User Interaction:**
1. User: "I want to enroll"
2. Bot calls `enrollment` tool, collects parameters
3. After enrollment, SET_KB activates `enrollpdfs` collection
4. User: "What documents do I need?"
5. Bot searches KB (mode=2 or 4), finds relevant PDFs, responds with info
### Example 2: Product Support with Web Content
**support.bas:**
```basic
PARAM product AS string LIKE "fax" DESCRIPTION "Product name"
DESCRIPTION "Get product information"
// Find in database
price = -1
productRecord = FIND "products.csv", "name = ${product}"
IF productRecord THEN
price = productRecord.price
END IF
// Add product documentation website
ADD_WEBSITE "https://example.com/products/${product}"
// Add product brochures KB
SET_KB "productbrochurespdfsanddocs"
RETURN price
```
**User Flow:**
1. User: "What's the price of a fax machine?"
2. Tool executes, finds price in CSV
3. ADD_WEBSITE indexes product page
4. SET_KB activates brochures collection
5. User: "How do I set it up?"
6. Prompt processor (Mixed mode) searches both:
- Temporary website collection
- Product brochures KB
7. Returns setup instructions from indexed sources
## Configuration
### Environment Variables
```bash
# Qdrant Configuration
QDRANT_URL=http://localhost:6333
# LLM for Embeddings
LLM_URL=http://localhost:8081
# MinIO Configuration (from config)
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
MINIO_ORG_PREFIX=org1_
# Database
DATABASE_URL=postgresql://user:pass@localhost/botserver
```
### Answer Mode Selection
Set in session's `answer_mode` field:
```rust
// Example: Update session to Mixed mode
session.answer_mode = 4;
```
Or via API when creating session:
```json
POST /sessions
{
"user_id": "...",
"bot_id": "...",
"answer_mode": 4
}
```
## Security Considerations
1. **Path Traversal Protection**: All file paths validated to prevent `..` attacks
2. **Safe Tool Paths**: Tools must be in `.gbdialog/` folder
3. **URL Validation**: ADD_WEBSITE only allows HTTP/HTTPS URLs
4. **Bucket Isolation**: Each organization has separate MinIO bucket
5. **Hash Verification**: File changes detected by SHA256 hash
## Performance Tuning
### KB Manager
- **Poll Interval**: 30 seconds (adjustable in `kb/mod.rs`)
- **Chunk Size**: 512 characters (in `kb/embeddings.rs`)
- **Chunk Overlap**: 50 characters
### MinIO Handler
- **Poll Interval**: 15 seconds (adjustable in `kb/minio_handler.rs`)
- **State Caching**: File states cached in memory
### Qdrant
- **Vector Size**: 1536 (OpenAI ada-002 compatible)
- **Distance Metric**: Cosine similarity
- **Search Limit**: Configurable per query
## Troubleshooting
### Documents Not Indexing
1. Check MinIO handler is watching correct prefix:
```rust
minio_handler.watch_prefix(".gbkb/").await;
```
2. Verify Qdrant connection:
```bash
curl http://localhost:6333/collections
```
3. Check logs for indexing errors:
```
grep "Indexing document" botserver.log
```
### Tools Not Compiling
1. Verify PARAM syntax is correct
2. Check tool file is in `.gbdialog/` folder
3. Ensure work directory exists and is writable
4. Review compilation logs
### KB Search Not Working
1. Verify collection exists in session context
2. Check Qdrant collection created:
```bash
curl http://localhost:6333/collections/{collection_name}
```
3. Ensure embeddings are being generated (check LLM server)
## Future Enhancements
1. **Incremental Indexing**: Only reindex changed chunks
2. **Document Deduplication**: Detect and merge duplicate content
3. **Advanced Crawling**: Follow links, handle JavaScript
4. **Tool Versioning**: Track tool versions and changes
5. **KB Analytics**: Track search queries and document usage
6. **Automatic Tool Discovery**: Scan `.gbdialog/` on startup
7. **Distributed Indexing**: Scale across multiple workers
8. **Real-time Notifications**: WebSocket updates when KB changes
## References
- **Qdrant Documentation**: https://qdrant.tech/documentation/
- **Model Context Protocol**: https://modelcontextprotocol.io/
- **MinIO Documentation**: https://min.io/docs/
- **Rhai Scripting**: https://rhai.rs/book/
## Support
For issues or questions:
- GitHub Issues: https://github.com/GeneralBots/BotServer/issues
- Documentation: https://docs.generalbots.ai/