Caching
BotServer includes automatic caching to improve response times and reduce redundant processing, including semantic caching for LLM responses using Valkey (a Redis-compatible in-memory database).
Features
- Exact Match Caching: Cache responses for identical prompts
- Semantic Similarity Matching: Find and reuse responses for semantically similar prompts
- Configurable TTL: Control how long cached responses remain valid
- Per-Bot Configuration: Enable/disable caching on a per-bot basis
- Embedding-Based Similarity: Use local embedding models for semantic matching
- Statistics & Monitoring: Track cache hits, misses, and performance metrics
How Caching Works
Caching in BotServer is controlled by configuration parameters in config.csv. The system automatically caches LLM responses and manages conversation history.
When enabled, the semantic cache works as follows (a code sketch follows this list):
1. A user asks a question.
2. The system checks whether a semantically similar question has been asked before.
3. If the similarity exceeds the configured threshold (0.95 by default), the cached response is returned.
4. Otherwise, a new response is generated and cached.
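The lookup flow can be sketched in a few lines. The example below is an illustration only, not BotServer's internal implementation: the in-memory `cache` dictionary and the `embed` function are placeholders (a concrete embedding call is sketched under Embedding Service Configuration below).

```python
import hashlib
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # llm-cache-threshold

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(prompt, cache, embed):
    """Return a cached response for an exact or semantically similar prompt."""
    # 1. Exact match: hash of the exact prompt
    exact_key = hashlib.sha256(prompt.encode()).hexdigest()
    if exact_key in cache:
        return cache[exact_key]["response"]

    # 2. Semantic match: compare the prompt embedding against cached embeddings
    query_vec = embed(prompt)
    best = max(
        cache.values(),
        key=lambda entry: cosine_similarity(query_vec, entry["embedding"]),
        default=None,
    )
    if best and cosine_similarity(query_vec, best["embedding"]) > SIMILARITY_THRESHOLD:
        return best["response"]

    # Cache miss: the caller generates a new response and stores it with a TTL
    return None
```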
Configuration
Basic Cache Settings
From default.gbai/default.gbot/config.csv:
llm-cache,false # Enable/disable LLM response caching
llm-cache-ttl,3600 # Cache time-to-live in seconds
llm-cache-semantic,true # Use semantic similarity for cache matching
llm-cache-threshold,0.95 # Similarity threshold for cache hits
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm-cache | boolean | false | Enable/disable LLM response caching |
| llm-cache-ttl | integer | 3600 | Time-to-live for cached entries (in seconds) |
| llm-cache-semantic | boolean | true | Enable semantic similarity matching |
| llm-cache-threshold | float | 0.95 | Similarity threshold for semantic matches (0.0-1.0) |
Embedding Service Configuration
For semantic similarity matching, ensure your embedding service is configured:
embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf
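A minimal client for such a service is sketched below. It assumes the embedding server exposes an OpenAI-compatible /v1/embeddings endpoint (common when serving GGUF models such as bge-small with llama.cpp's server); the path, payload, and model name may differ in your deployment.

```python
import requests

EMBEDDING_URL = "http://localhost:8082"  # embedding-url from config.csv

def embed(text: str) -> list[float]:
    """Request an embedding vector for `text` from the local embedding service."""
    resp = requests.post(
        f"{EMBEDDING_URL}/v1/embeddings",
        json={"input": text, "model": "bge-small-en-v1.5"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]
```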
Conversation History Management
The system manages conversation context through these parameters:
prompt-history,2 # Number of previous messages to include in context
prompt-compact,4 # Compact conversation after N exchanges
What These Settings Do
- prompt-history: Keeps the last 2 exchanges in the conversation context
- prompt-compact: After 4 exchanges, older messages are summarized or removed to save tokens
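The interaction between these two settings can be illustrated with a short sketch; it is a simplification, with `summarize` standing in for whatever mechanism condenses older exchanges.

```python
PROMPT_HISTORY = 2  # prompt-history: exchanges kept verbatim in the context
PROMPT_COMPACT = 4  # prompt-compact: compact once this many exchanges accumulate

def build_context(exchanges, summarize):
    """Assemble the conversation context for the next LLM call.

    `exchanges` is a list of (user_message, bot_reply) pairs, oldest first;
    `summarize` collapses older exchanges into a short text summary.
    """
    if len(exchanges) < PROMPT_COMPACT:
        # Below the compaction point: just keep the most recent exchanges
        return {"summary": None, "recent": exchanges[-PROMPT_HISTORY:]}

    # Past the compaction point: summarize everything except the recent tail
    older, recent = exchanges[:-PROMPT_HISTORY], exchanges[-PROMPT_HISTORY:]
    return {"summary": summarize(older), "recent": recent}
```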
Cache Storage with Valkey
Architecture
Cache Key Structure
The cache uses a multi-level key structure:
- Exact match: Hash of the exact prompt
- Semantic match: Embedding vector stored with semantic index
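One possible key layout is sketched below. The `bot:cache:` prefix matches the pattern used in the Clear Cache section; the per-bot namespace and the `exact`/`semantic` suffixes are illustrative assumptions.

```python
import hashlib

def exact_key(bot_id: str, prompt: str) -> str:
    """Exact-match key: hash of the exact (normalized) prompt, namespaced per bot."""
    digest = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    return f"bot:cache:{bot_id}:exact:{digest}"

def semantic_key(bot_id: str, entry_id: str) -> str:
    """Semantic-match key: the entry's embedding vector is stored under this key
    and indexed for similarity search."""
    return f"bot:cache:{bot_id}:semantic:{entry_id}"
```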
Valkey Integration
Valkey provides:
- Fast in-memory storage: Sub-millisecond response times
- Automatic expiration: TTL-based cache invalidation
- Distributed caching: Share cache across multiple bot instances
- Persistence options: Optional disk persistence for cache durability
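Because Valkey speaks the Redis protocol, any Redis client can talk to it. The sketch below uses the Python `redis` package against a local instance; the host, port, and payload layout are assumptions for illustration, not BotServer's actual wiring.

```python
import json
import redis  # the standard Redis client also works with Valkey

CACHE_TTL = 3600  # llm-cache-ttl (seconds)

# Connect to a local Valkey instance (adjust host/port for your deployment)
valkey = redis.Redis(host="localhost", port=6379, decode_responses=True)

def store(key: str, response: str, embedding: list[float]) -> None:
    """Store a cached entry; Valkey expires it automatically after the TTL."""
    payload = json.dumps({"response": response, "embedding": embedding})
    valkey.setex(key, CACHE_TTL, payload)

def fetch(key: str):
    """Return the cached entry, or None if it was never stored or has expired."""
    raw = valkey.get(key)
    return json.loads(raw) if raw else None
```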
Example Usage
Basic Caching
' Caching happens automatically when enabled
USE KB "policies"
' First user asks: "What's the vacation policy?"
' System generates response and caches it
' Second user asks: "Tell me about vacation rules"
' System finds semantic match (>0.95 similarity) and returns cached response
Tool Response Caching
' Tool responses can also be cached
USE TOOL "weather-api"
' First request: "What's the weather in NYC?"
' Makes API call, caches response for 1 hour
' Second request within TTL: "NYC weather?"
' Returns cached response without API call
Cache Management
The cache operates automatically based on your configuration settings. Cache entries are managed through TTL expiration and Valkey's memory policies.
Best Practices
When to Enable Caching
Enable caching for:
- ✅ FAQ bots with repetitive questions
- ✅ Knowledge base queries
- ✅ API-heavy integrations
- ✅ High-traffic bots
Disable caching for:
- ❌ Real-time data queries
- ❌ Personalized responses
- ❌ Time-sensitive information
- ❌ Development/testing
Tuning Cache Parameters
TTL Settings:
- Short (300s): News, weather, stock prices
- Medium (3600s): General knowledge, FAQs
- Long (86400s): Static documentation, policies
Similarity Threshold:
- High (0.95+): Strict matching, fewer false positives
- Medium (0.85-0.95): Balance between coverage and accuracy
- Low (<0.85): Broad matching, risk of incorrect responses
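For example, a policy FAQ bot serving mostly static content could combine a long TTL with a strict threshold in config.csv:
llm-cache,true
llm-cache-ttl,86400
llm-cache-semantic,true
llm-cache-threshold,0.95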
Memory Management
Valkey automatically manages memory through:
- Eviction policies: LRU (Least Recently Used) by default
- Max memory limits: Configure in Valkey settings
- Key expiration: Automatic cleanup of expired entries
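Memory limits and the eviction policy are configured on the Valkey side rather than in config.csv, for example with valkey-cli (the values shown are illustrative):
valkey-cli CONFIG SET maxmemory 256mb
valkey-cli CONFIG SET maxmemory-policy allkeys-lru
Changes made with CONFIG SET apply at runtime only; add the same directives to valkey.conf to make them permanent.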
Performance Impact
Typical performance improvements with caching enabled:
| Metric | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Response Time | 2-5s | 50-200ms | 10-100x faster |
| API Calls | Every request | First request only | 90%+ reduction |
| Token Usage | Full context | Cached response | 95%+ reduction |
| Cost | $0.02/request | $0.001/request | 95% cost saving |
Troubleshooting
Cache Not Working
Check:
- Valkey is running: `ps aux | grep valkey`
- Caching is enabled in config.csv: `llm-cache,true`
- The cache TTL has not expired
- The similarity threshold is not set too high
Clear Cache
To clear the cache manually:
# Clear all bot cache (flushes the current Valkey database)
valkey-cli FLUSHDB
# Clear only the bot cache keys (DEL does not expand wildcards, so match them with SCAN)
valkey-cli --scan --pattern "bot:cache:*" | xargs valkey-cli DEL
Summary
The semantic caching system in BotServer provides intelligent response caching that:
- Reduces response latency by 10-100x
- Cuts API costs by 90%+
- Maintains response quality through semantic matching
- Scales automatically with Valkey
Configure caching based on your bot's needs, monitor performance metrics, and tune parameters for optimal results.