
Caching

BotServer includes automatic caching to improve response times and reduce redundant processing, including semantic caching for LLM responses using Valkey (a Redis-compatible in-memory database).

Features

  • Exact Match Caching: Cache responses for identical prompts
  • Semantic Similarity Matching: Find and reuse responses for semantically similar prompts
  • Configurable TTL: Control how long cached responses remain valid
  • Per-Bot Configuration: Enable/disable caching on a per-bot basis
  • Embedding-Based Similarity: Use local embedding models for semantic matching
  • Statistics & Monitoring: Track cache hits, misses, and performance metrics

How Caching Works

Caching in BotServer is controlled by configuration parameters in config.csv. The system automatically caches LLM responses and manages conversation history.

When enabled, semantic caching works as follows:

  1. A user asks a question
  2. The system checks whether a semantically similar question was asked before
  3. If the similarity exceeds the configured threshold (0.95 by default), the cached response is returned
  4. Otherwise, a new response is generated and cached
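The steps above can be sketched as a small in-process model. This is hypothetical illustration code, not BotServer's actual implementation: `toy_embed` stands in for the real embedding model configured in config.csv, and the cache is a plain list rather than a Valkey index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def toy_embed(text):
    """Stand-in for a real embedding model: a crude 2-d keyword vector."""
    return (text.count("vacation") + 1.0, text.count("weather") + 1.0)

class SemanticCache:
    """Toy in-process stand-in for the Valkey-backed semantic cache."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold
        self.entries = []           # list of (embedding, response) pairs

    def lookup(self, prompt):
        query = self.embed(prompt)
        for vector, response in self.entries:
            if cosine_similarity(query, vector) >= self.threshold:
                return response     # cache hit: similar question seen before
        return None                 # cache miss

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

With `threshold=0.95`, a paraphrase of a cached question returns the cached answer, while a question on a different topic falls through to generation.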

Configuration

Basic Cache Settings

From default.gbai/default.gbot/config.csv:

llm-cache,false              # Enable/disable LLM response caching
llm-cache-ttl,3600          # Cache time-to-live in seconds
llm-cache-semantic,true     # Use semantic similarity for cache matching
llm-cache-threshold,0.95    # Similarity threshold for cache hits

Configuration Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| llm-cache | boolean | false | Enable/disable LLM response caching |
| llm-cache-ttl | integer | 3600 | Time-to-live for cached entries (in seconds) |
| llm-cache-semantic | boolean | true | Enable semantic similarity matching |
| llm-cache-threshold | float | 0.95 | Similarity threshold for semantic matches (0.0-1.0) |
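A minimal sketch of how rows like these might be parsed into typed settings. The function name and defaults mirror the table above, but this is an assumption about the loader, not BotServer's actual code (in particular, whether the real config.csv allows inline `#` comments is assumed here).

```python
def parse_cache_config(lines):
    """Parse 'key,value' rows into a typed settings dict (hypothetical loader)."""
    defaults = {
        "llm-cache": False,
        "llm-cache-ttl": 3600,
        "llm-cache-semantic": True,
        "llm-cache-threshold": 0.95,
    }
    settings = dict(defaults)
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop inline comments, if any
        if not line:
            continue
        key, _, value = line.partition(",")
        key, value = key.strip(), value.strip()
        if key not in defaults:
            continue                            # ignore unrelated settings
        # bool must be checked before int: isinstance(True, int) is True
        if isinstance(defaults[key], bool):
            settings[key] = value.lower() == "true"
        elif isinstance(defaults[key], int):
            settings[key] = int(value)
        else:
            settings[key] = float(value)
    return settings
```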

Embedding Service Configuration

For semantic similarity matching, ensure your embedding service is configured:

embedding-url,http://localhost:8082
embedding-model,../../../../data/llm/bge-small-en-v1.5-f32.gguf

Conversation History Management

The system manages conversation context through these parameters:

prompt-history,2    # Number of previous messages to include in context
prompt-compact,4    # Compact conversation after N exchanges

What These Settings Do

  • prompt-history: Keeps the last 2 exchanges in the conversation context
  • prompt-compact: After 4 exchanges, older messages are summarized or removed to save tokens
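The windowing behavior can be sketched as follows. This is a simplified model under stated assumptions: one "exchange" is a user/assistant message pair, and compaction is represented by a placeholder line, whereas the real system summarizes older messages with the LLM.

```python
def build_context(messages, prompt_history=2, prompt_compact=4):
    """Select the message window sent to the LLM.

    messages: list of (role, text) tuples, oldest first.
    Keeps the last `prompt_history` exchanges (2 messages each); once more
    than `prompt_compact` exchanges accumulate, older messages are collapsed
    into a single summary placeholder to save tokens.
    """
    window = messages[-prompt_history * 2:]     # last N exchanges
    if len(messages) > prompt_compact * 2:
        dropped = len(messages) - len(window)
        summary = ("system", f"[summary of {dropped} older messages]")
        return [summary] + window
    return window
```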

Cache Storage with Valkey

Architecture

Caching Architecture Flow

Cache Key Structure

The cache uses a multi-level key structure:

  • Exact match: Hash of the exact prompt
  • Semantic match: Embedding vector stored with semantic index
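The exact-match level can be sketched as a content hash. The key layout below is hypothetical (chosen to match the `bot:cache:*` pattern shown later in this chapter); BotServer's actual key scheme and normalization rules may differ.

```python
import hashlib

def exact_cache_key(bot_id, prompt):
    """Derive a deterministic cache key from the exact prompt text.

    Normalizes whitespace and case before hashing so trivially identical
    prompts map to the same key (an assumption, not confirmed behavior).
    """
    normalized = prompt.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"bot:cache:{bot_id}:{digest}"
```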

Valkey Integration

Valkey provides:

  • Fast in-memory storage: Sub-millisecond response times
  • Automatic expiration: TTL-based cache invalidation
  • Distributed caching: Share cache across multiple bot instances
  • Persistence options: Optional disk persistence for cache durability
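TTL-based expiration works like Valkey's SETEX/GET pair. The pure-Python stand-in below models that behavior (including lazy deletion of expired keys on read) with an injectable clock so it can be tested without waiting; it is an illustration, not a Valkey client.

```python
import time

class TTLCache:
    """In-process sketch of SETEX/GET semantics with TTL expiry."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.store = {}                 # key -> (value, expires_at)

    def setex(self, key, ttl_seconds, value):
        """Store value with a time-to-live, like Valkey's SETEX."""
        self.store[key] = (value, self.clock() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if missing or expired."""
        item = self.store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.store[key]         # lazy expiration on access
            return None
        return value
```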

Example Usage

Basic Caching

' Caching happens automatically when enabled
USE KB "policies"

' First user asks: "What's the vacation policy?"
' System generates response and caches it

' Second user asks: "Tell me about vacation rules"
' System finds semantic match (>0.95 similarity) and returns cached response

Tool Response Caching

' Tool responses can also be cached
USE TOOL "weather-api"

' First request: "What's the weather in NYC?"
' Makes API call, caches response for 1 hour

' Second request within TTL: "NYC weather?"
' Returns cached response without API call

Cache Management

The cache operates automatically based on your configuration settings. Cache entries are managed through TTL expiration and Valkey's memory policies.

Best Practices

When to Enable Caching

Enable caching for:

  • FAQ bots with repetitive questions
  • Knowledge base queries
  • API-heavy integrations
  • High-traffic bots

Disable caching for:

  • Real-time data queries
  • Personalized responses
  • Time-sensitive information
  • Development/testing

Tuning Cache Parameters

TTL Settings:

  • Short (300s): News, weather, stock prices
  • Medium (3600s): General knowledge, FAQs
  • Long (86400s): Static documentation, policies

Similarity Threshold:

  • High (0.95+): Strict matching, fewer false positives
  • Medium (0.85-0.95): Balance between coverage and accuracy
  • Low (<0.85): Broad matching, risk of incorrect responses
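To make the bands concrete, here is cosine similarity computed on illustrative, made-up vectors (not real embeddings): a near-duplicate paraphrase lands above 0.95, while a related-but-different question lands in the 0.85-0.95 band, where a medium threshold would cache-hit but a strict one would not.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

query = (0.9, 0.1, 0.4)

# Illustrative vectors only; real embedding geometry will differ.
paraphrase = cosine(query, (0.88, 0.12, 0.42))  # near-duplicate wording
related = cosine(query, (0.7, 0.4, 0.3))        # same topic, different ask
```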

Memory Management

Valkey automatically manages memory through:

  • Eviction policies: LRU (Least Recently Used) by default
  • Max memory limits: Configure in Valkey settings
  • Key expiration: Automatic cleanup of expired entries
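The LRU policy can be sketched with an ordered map. Note that Valkey's `allkeys-lru` uses approximate, sampled LRU for efficiency; the sketch below is exact LRU, which is close enough to show the access-recency behavior.

```python
from collections import OrderedDict

class LRUCache:
    """Exact-LRU sketch of eviction under a max-entries limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.data = OrderedDict()       # oldest entry first

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)      # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.max_entries:
            self.data.popitem(last=False)   # evict least recently used
```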

Performance Impact

Typical performance improvements with caching enabled:

| Metric | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Response Time | 2-5s | 50-200ms | 10-100x faster |
| API Calls | Every request | First request only | 90%+ reduction |
| Token Usage | Full context | Cached response | 95%+ reduction |
| Cost | $0.02/request | $0.001/request | 95% cost saving |

Troubleshooting

Cache Not Working

Check:

  1. Valkey is running: ps aux | grep valkey
  2. Cache enabled in config: llm-cache,true
  3. TTL not expired
  4. Similarity threshold not too high

Clear Cache

To clear the cache manually:

# Connect to Valkey and clear all keys in the current database
valkey-cli FLUSHDB

# Clear a specific bot's cache (DEL does not accept glob patterns,
# so scan for matching keys and pipe them to DEL)
valkey-cli --scan --pattern 'bot:cache:*' | xargs -r valkey-cli DEL

Summary

The semantic caching system in BotServer provides intelligent response caching that:

  • Reduces response latency by 10-100x
  • Cuts API costs by 90%+
  • Maintains response quality through semantic matching
  • Scales automatically with Valkey

Configure caching based on your bot's needs, monitor performance metrics, and tune parameters for optimal results.