# LLM Providers

General Bots supports multiple Large Language Model (LLM) providers, both cloud-based services and local deployments. This guide helps you choose the right provider for your use case.

## Overview

LLMs are the intelligence behind General Bots' conversational capabilities. You can configure:

- **Cloud Providers** - External APIs (OpenAI, Anthropic, Groq, etc.)
- **Local Models** - Self-hosted models via llama.cpp
- **Hybrid** - Use local for simple tasks, cloud for complex reasoning

## Cloud Providers

### OpenAI (GPT Series)

The most widely known LLM provider, offering GPT-4 and GPT-4o models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| GPT-4o | 128K | General purpose, vision | Fast |
| GPT-4o-mini | 128K | Cost-effective tasks | Very Fast |
| GPT-4 Turbo | 128K | Complex reasoning | Medium |
| o1-preview | 128K | Advanced reasoning, math | Slow |
| o1-mini | 128K | Code, logic tasks | Medium |

**Configuration:**

```csv
llm-provider,openai
llm-api-key,sk-xxxxxxxxxxxxxxxxxxxxxxxx
llm-model,gpt-4o
```

**Strengths:**
- Excellent general knowledge
- Strong code generation
- Good instruction following
- Vision capabilities (GPT-4o)

**Considerations:**
- API costs can add up
- Data sent to external servers
- Rate limits apply

### Anthropic (Claude Series)

Known for safety, helpfulness, and large context windows.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Claude 3.5 Sonnet | 200K | Best balance of capability/speed | Fast |
| Claude 3.5 Haiku | 200K | Quick, everyday tasks | Very Fast |
| Claude 3 Opus | 200K | Most capable, complex tasks | Slow |

**Configuration:**

```csv
llm-provider,anthropic
llm-api-key,sk-ant-xxxxxxxxxxxxxxxx
llm-model,claude-3-5-sonnet-20241022
```

**Strengths:**
- Very large context window (200K tokens)
- Excellent at following complex instructions
- Strong coding abilities
- Better at refusing harmful requests

**Considerations:**
- Premium pricing
- Vision is not available in all models
- Newer provider, smaller ecosystem

### Groq

Ultra-fast inference using custom LPU hardware. Offers open-source models at high speed.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Llama 3.3 70B | 128K | Complex reasoning | Very Fast |
| Llama 3.1 8B | 128K | Quick responses | Extremely Fast |
| Mixtral 8x7B | 32K | Balanced performance | Very Fast |
| Gemma 2 9B | 8K | Lightweight tasks | Extremely Fast |

**Configuration:**

```csv
llm-provider,groq
llm-api-key,gsk_xxxxxxxxxxxxxxxx
llm-model,llama-3.3-70b-versatile
```

**Strengths:**
- Fastest inference speeds (500+ tokens/sec)
- Competitive pricing
- Open-source models
- Great for real-time applications

**Considerations:**
- Limited model selection
- Rate limits on free tier
- Models may be less capable than GPT-4/Claude

### Google (Gemini Series)

Google's multimodal AI models with strong reasoning capabilities.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Gemini 1.5 Pro | 2M | Extremely long documents | Medium |
| Gemini 1.5 Flash | 1M | Fast multimodal | Fast |
| Gemini 2.0 Flash | 1M | Latest capabilities | Fast |

**Configuration:**

```csv
llm-provider,google
llm-api-key,AIzaxxxxxxxxxxxxxxxx
llm-model,gemini-1.5-pro
```

**Strengths:**
- Largest context window (2M tokens)
- Native multimodal (text, image, video, audio)
- Strong at structured data
- Good coding abilities

**Considerations:**
- Newer ecosystem
- Some features region-limited
- API changes more frequently

### Mistral AI

European AI company offering efficient, open-weight models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Mistral Large | 128K | Complex tasks | Medium |
| Mistral Medium | 32K | Balanced performance | Fast |
| Mistral Small | 32K | Cost-effective | Very Fast |
| Codestral | 32K | Code generation | Fast |

**Configuration:**

```csv
llm-provider,mistral
llm-api-key,xxxxxxxxxxxxxxxx
llm-model,mistral-large-latest
```

**Strengths:**
- European data sovereignty (GDPR)
- Excellent code generation (Codestral)
- Open-weight models available
- Competitive pricing

**Considerations:**
- Smaller context than competitors
- Less brand recognition
- Fewer fine-tuning options

### DeepSeek

Chinese AI company known for efficient, capable models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| DeepSeek-V3 | 128K | General purpose | Fast |
| DeepSeek-R1 | 128K | Reasoning, math | Medium |
| DeepSeek-Coder | 128K | Programming | Fast |

**Configuration:**

```csv
llm-provider,deepseek
llm-api-key,sk-xxxxxxxxxxxxxxxx
llm-model,deepseek-chat
llm-server-url,https://api.deepseek.com
```

**Strengths:**
- Extremely cost-effective
- Strong reasoning (R1 model)
- Excellent code generation
- Open-weight versions available

**Considerations:**
- Data processed in China
- Newer provider
- May have content restrictions

## Local Models

Run models on your own hardware for privacy, cost control, and offline operation.

### Setting Up Local LLM

General Bots uses **llama.cpp** server for local inference:

```csv
llm-provider,local
llm-server-url,https://localhost:8081
llm-model,DeepSeek-R1-Distill-Qwen-1.5B
```
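
The llama.cpp server must already be listening at the configured `llm-server-url` when the bot starts. As a minimal sketch, assuming the `llama-server` binary from a recent llama.cpp build is on your `PATH` and the model file is already downloaded (the path below is illustrative), it can be launched with `std::process::Command`:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Spawn the llama.cpp server for the model configured above.
    // The flags (-m, --port, -c) follow recent llama.cpp builds; the model path is illustrative.
    let mut server = Command::new("llama-server")
        .arg("-m")
        .arg("models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf")
        .arg("--port")
        .arg("8081")
        .arg("-c")
        .arg("4096") // context size; raise it if the model and your VRAM allow
        .spawn()?;

    // In practice you would health-check the HTTP endpoint instead of blocking here.
    server.wait()?;
    Ok(())
}
```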

### Recommended Local Models

#### For High-End GPU (24GB+ VRAM)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| GPT-OSS 120B Q4 | 70GB | 48GB+ | Excellent |
| Llama 3.1 70B Q4 | 40GB | 48GB+ | Excellent |
| DeepSeek-R1 32B Q4 | 20GB | 24GB | Very Good |
| Qwen 2.5 72B Q4 | 42GB | 48GB+ | Excellent |

#### For Mid-Range GPU (12-16GB VRAM)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| GPT-OSS 20B F16 | 40GB | 16GB | Very Good |
| Llama 3.1 8B Q8 | 9GB | 12GB | Good |
| DeepSeek-R1-Distill 14B Q4 | 8GB | 12GB | Good |
| Mistral Nemo 12B Q4 | 7GB | 10GB | Good |

#### For Small GPU or CPU (8GB VRAM or less)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| DeepSeek-R1-Distill 1.5B Q4 | 1GB | 4GB | Basic |
| Phi-3 Mini 3.8B Q4 | 2.5GB | 6GB | Acceptable |
| Gemma 2 2B Q8 | 3GB | 6GB | Acceptable |
| Qwen 2.5 3B Q4 | 2GB | 4GB | Basic |

### Model Download URLs

Add model download URLs to the `data_download_list` in `installer.rs`:

```rust
// GPT-OSS 20B - Recommended for small GPU
"https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf"

// DeepSeek R1 Distill - For CPU or minimal GPU
"https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf"

// Llama 3.1 8B - Good balance
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
```

### Embedding Models

For vector search, you need an embedding model:

```csv
embedding-provider,local
embedding-server-url,https://localhost:8082
embedding-model,bge-small-en-v1.5
```
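
As a rough sketch of what a request to the local embedding server looks like, assuming it exposes an OpenAI-compatible `/v1/embeddings` route (recent llama.cpp builds do when embeddings are enabled; adjust the URL and TLS handling to your deployment):

```rust
// Assumed Cargo dependencies: reqwest = { version = "0.12", features = ["blocking", "json"] },
// serde_json = "1".
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The local server configured above serves https with a local certificate, so accept it here.
    let client = reqwest::blocking::Client::builder()
        .danger_accept_invalid_certs(true)
        .build()?;

    let response: Value = client
        .post("https://localhost:8082/v1/embeddings") // embedding-server-url + assumed route
        .json(&json!({
            "model": "bge-small-en-v1.5",
            "input": "How do I reset my password?"
        }))
        .send()?
        .json()?;

    // Dimensionality should match the table below: 384 for bge-small-en-v1.5.
    let dims = response["data"][0]["embedding"]
        .as_array()
        .map_or(0, |v| v.len());
    println!("embedding dimensions: {dims}");
    Ok(())
}
```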

Recommended embedding models:

| Model | Dimensions | Size | Quality |
|-------|------------|------|---------|
| bge-small-en-v1.5 | 384 | 130MB | Good |
| bge-base-en-v1.5 | 768 | 440MB | Better |
| bge-large-en-v1.5 | 1024 | 1.3GB | Best |
| nomic-embed-text | 768 | 550MB | Good |

## Hybrid Configuration

Use different models for different tasks:

```csv
# Primary model for complex conversations
llm-provider,anthropic
llm-model,claude-3-5-sonnet-20241022

# Fast model for simple tasks
llm-fast-provider,groq
llm-fast-model,llama-3.1-8b-instant

# Local fallback for offline operation
llm-fallback-provider,local
llm-fallback-model,DeepSeek-R1-Distill-Qwen-1.5B

# Embeddings always local
embedding-provider,local
embedding-model,bge-small-en-v1.5
```
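
The intent behind the three tiers can be pictured as a simple routing decision. The sketch below is illustrative only, with hypothetical names rather than the actual General Bots dispatch code: offline traffic goes to the local fallback, simple tasks to the fast model, and everything else to the primary model.

```rust
// Illustrative only: these names are hypothetical, not part of the General Bots API.
#[derive(Debug)]
enum Provider {
    Primary,  // anthropic / claude-3-5-sonnet-20241022
    Fast,     // groq / llama-3.1-8b-instant
    Fallback, // local / DeepSeek-R1-Distill-Qwen-1.5B
}

fn choose_provider(is_simple_task: bool, cloud_reachable: bool) -> Provider {
    if !cloud_reachable {
        // Offline: fall back to the local model.
        Provider::Fallback
    } else if is_simple_task {
        // Cheap, latency-sensitive work goes to the fast model.
        Provider::Fast
    } else {
        // Complex conversations use the primary model.
        Provider::Primary
    }
}

fn main() {
    println!("{:?}", choose_provider(true, true));   // Fast
    println!("{:?}", choose_provider(false, true));  // Primary
    println!("{:?}", choose_provider(false, false)); // Fallback
}
```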

## Model Selection Guide

### By Use Case

| Use Case | Recommended | Why |
|----------|-------------|-----|
| Customer support | Claude 3.5 Sonnet | Best at following guidelines |
| Code generation | DeepSeek-Coder, GPT-4o | Specialized for code |
| Document analysis | Gemini 1.5 Pro | 2M context window |
| Real-time chat | Groq Llama 3.1 8B | Fastest responses |
| Privacy-sensitive | Local DeepSeek-R1 | No external data transfer |
| Cost-sensitive | DeepSeek-V3, Local | Lowest cost per token |
| Complex reasoning | Claude 3 Opus, o1 | Best reasoning ability |

### By Budget

| Budget | Recommended Setup |
|--------|-------------------|
| Free | Local models only |
| Low ($10-50/mo) | Groq + Local fallback |
| Medium ($50-200/mo) | GPT-4o-mini + Claude Haiku |
| High ($200+/mo) | GPT-4o + Claude Sonnet |
| Enterprise | Private deployment + premium APIs |

## Configuration Reference

### Environment Variables

```bash
# Primary LLM
LLM_PROVIDER=openai
LLM_API_KEY=sk-xxx
LLM_MODEL=gpt-4o
LLM_SERVER_URL=https://api.openai.com

# Local LLM Server
LLM_LOCAL_URL=https://localhost:8081
LLM_LOCAL_MODEL=DeepSeek-R1-Distill-Qwen-1.5B

# Embedding
EMBEDDING_PROVIDER=local
EMBEDDING_URL=https://localhost:8082
EMBEDDING_MODEL=bge-small-en-v1.5
```

### config.csv Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `llm-provider` | Provider name | `openai`, `anthropic`, `local` |
| `llm-api-key` | API key for cloud providers | `sk-xxx` |
| `llm-model` | Model identifier | `gpt-4o` |
| `llm-server-url` | API endpoint | `https://api.openai.com` |
| `llm-server-ctx-size` | Context window size | `128000` |
| `llm-temperature` | Response randomness (0-2) | `0.7` |
| `llm-max-tokens` | Maximum response length | `4096` |
| `llm-cache-enabled` | Enable semantic caching | `true` |
| `llm-cache-ttl` | Cache time-to-live (seconds) | `3600` |

## Security Considerations

### Cloud Providers

- API keys should be stored in environment variables or a secrets manager
- Consider data residency requirements (EU: Mistral, US: OpenAI)
- Review provider data retention policies
- Use separate keys for production/development

### Local Models

- All data stays on your infrastructure
- No internet required after model download
- Full control over model versions
- Consider GPU security for sensitive deployments

## Performance Optimization

### Caching

Enable semantic caching to reduce API calls:

```csv
llm-cache-enabled,true
llm-cache-ttl,3600
llm-cache-similarity-threshold,0.92
```
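
The similarity threshold controls how close a new prompt's embedding must be to a cached prompt's embedding before the stored answer is reused. A minimal sketch of that lookup, assuming plain cosine similarity over prompt embeddings (the real cache also handles TTL expiry and persistence):

```rust
// Cosine similarity between two prompt embeddings.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns the cached response if any stored prompt is similar enough.
fn lookup<'a>(
    query: &[f32],
    cache: &'a [(Vec<f32>, String)],
    threshold: f32, // llm-cache-similarity-threshold, e.g. 0.92
) -> Option<&'a str> {
    cache
        .iter()
        .find(|(embedding, _)| cosine_similarity(query, embedding) >= threshold)
        .map(|(_, response)| response.as_str())
}

fn main() {
    let cache = vec![(vec![0.1, 0.9, 0.2], "Cached answer".to_string())];
    let query = vec![0.11, 0.88, 0.21]; // nearly identical prompt embedding
    println!("{:?}", lookup(&query, &cache, 0.92)); // Some("Cached answer")
}
```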

### Batching

For bulk operations, use batch APIs when available:

```csv
llm-batch-enabled,true
llm-batch-size,10
```

### Context Management

Optimize context window usage:

```csv
llm-context-compaction,true
llm-max-history-turns,10
llm-summarize-long-contexts,true
```
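
With `llm-max-history-turns,10`, for example, only the ten most recent exchanges are sent with each request; older turns are dropped, or summarized when `llm-summarize-long-contexts` is enabled. A minimal sketch of the trimming step, using hypothetical names for illustration:

```rust
// One user/assistant exchange in the conversation history.
struct Turn {
    user: String,
    assistant: String,
}

/// Keep only the most recent `max_turns` exchanges (llm-max-history-turns).
fn trim_history(history: &mut Vec<Turn>, max_turns: usize) {
    if history.len() > max_turns {
        // Drop the oldest turns; a real implementation might summarize them instead.
        let excess = history.len() - max_turns;
        history.drain(0..excess);
    }
}

fn main() {
    let mut history: Vec<Turn> = (0..15)
        .map(|i| Turn {
            user: format!("question {i}"),
            assistant: format!("answer {i}"),
        })
        .collect();

    trim_history(&mut history, 10);
    assert_eq!(history.len(), 10);
    // The oldest surviving exchange is now turn 5.
    println!("{} / {}", history[0].user, history[0].assistant);
}
```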

## Troubleshooting

### Common Issues

**API Key Invalid**
- Verify the key is correct and not expired
- Check if the key has required permissions
- Ensure billing is active

**Model Not Found**
- Check model name spelling
- Verify the model is available in your region
- Some models require waitlist access

**Rate Limits**
- Implement exponential backoff (see the sketch below)
- Use caching to reduce calls
- Consider upgrading API tier
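
A minimal retry loop with exponential backoff might look like the sketch below; it is illustrative only, and a production version would also honor the provider's `Retry-After` header and add jitter:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible call, doubling the wait after each failure (e.g. HTTP 429).
fn with_backoff<T, E>(mut call: impl FnMut() -> Result<T, E>, max_attempts: u32) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err);
                }
                // Wait, then double the delay: 0.5s, 1s, 2s, 4s, ...
                sleep(delay);
                delay *= 2;
            }
        }
    }
}

fn main() {
    // Simulate an API that is rate-limited for the first two calls.
    let mut calls = 0;
    let result = with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("429 Too Many Requests") } else { Ok("response") }
        },
        5,
    );
    println!("{result:?}"); // Ok("response")
}
```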

**Local Model Slow**
- Check GPU memory usage
- Reduce context size
- Use quantized models (Q4 instead of F16)

### Logging

Enable LLM logging for debugging:

```csv
llm-log-requests,true
llm-log-responses,false
llm-log-timing,true
```

## Next Steps

- [LLM Configuration](../chapter-08-config/llm-config.md) - Detailed configuration guide
- [Semantic Caching](../chapter-03/caching.md) - Cache configuration
- [NVIDIA GPU Setup](../chapter-09-api/nvidia-gpu-setup.md) - GPU configuration for local models