# LLM Providers

General Bots supports multiple Large Language Model (LLM) providers, both cloud-based services and local deployments. This guide helps you choose the right provider for your use case.

## Overview

LLMs are the intelligence behind General Bots' conversational capabilities. You can configure:

- **Cloud Providers** - External APIs (OpenAI, Anthropic, Groq, etc.)
- **Local Models** - Self-hosted models via llama.cpp
- **Hybrid** - Use local for simple tasks, cloud for complex reasoning

## Cloud Providers

### OpenAI (GPT Series)

The most widely known LLM provider, offering GPT-4 and GPT-4o models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| GPT-4o | 128K | General purpose, vision | Fast |
| GPT-4o-mini | 128K | Cost-effective tasks | Very Fast |
| GPT-4 Turbo | 128K | Complex reasoning | Medium |
| o1-preview | 128K | Advanced reasoning, math | Slow |
| o1-mini | 128K | Code, logic tasks | Medium |

**Configuration:**

```csv
llm-provider,openai
llm-api-key,sk-xxxxxxxxxxxxxxxxxxxxxxxx
llm-model,gpt-4o
```

**Strengths:**
- Excellent general knowledge
- Strong code generation
- Good instruction following
- Vision capabilities (GPT-4o)

**Considerations:**
- API costs can add up
- Data sent to external servers
- Rate limits apply

### Anthropic (Claude Series)

Known for safety, helpfulness, and large context windows.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Claude 3.5 Sonnet | 200K | Best balance of capability/speed | Fast |
| Claude 3.5 Haiku | 200K | Quick, everyday tasks | Very Fast |
| Claude 3 Opus | 200K | Most capable, complex tasks | Slow |

**Configuration:**

```csv
llm-provider,anthropic
llm-api-key,sk-ant-xxxxxxxxxxxxxxxx
llm-model,claude-3-5-sonnet-20241022
```

**Strengths:**
- Very large context window (200K tokens)
- Excellent at following complex instructions
- Strong coding abilities
- Better at refusing harmful requests

**Considerations:**
- Premium pricing
- Vision is not available in all models
- Newer provider, smaller ecosystem

### Groq

Ultra-fast inference using custom LPU hardware. Offers open-source models at high speed.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Llama 3.3 70B | 128K | Complex reasoning | Very Fast |
| Llama 3.1 8B | 128K | Quick responses | Extremely Fast |
| Mixtral 8x7B | 32K | Balanced performance | Very Fast |
| Gemma 2 9B | 8K | Lightweight tasks | Extremely Fast |

**Configuration:**

```csv
llm-provider,groq
llm-api-key,gsk_xxxxxxxxxxxxxxxx
llm-model,llama-3.3-70b-versatile
```

**Strengths:**
- Fastest inference speeds (500+ tokens/sec)
- Competitive pricing
- Open-source models
- Great for real-time applications

**Considerations:**
- Limited model selection
- Rate limits on free tier
- Models may be less capable than GPT-4/Claude

### Google (Gemini Series)

Google's multimodal AI models with strong reasoning capabilities.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Gemini 1.5 Pro | 2M | Extremely long documents | Medium |
| Gemini 1.5 Flash | 1M | Fast multimodal | Fast |
| Gemini 2.0 Flash | 1M | Latest capabilities | Fast |

**Configuration:**

```csv
llm-provider,google
llm-api-key,AIzaxxxxxxxxxxxxxxxx
llm-model,gemini-1.5-pro
```

**Strengths:**
- Largest context window (2M tokens)
- Native multimodal (text, image, video, audio)
- Strong at structured data
- Good coding abilities

**Considerations:**
- Newer ecosystem
- Some features region-limited
- API changes more frequently

### Mistral AI

European AI company offering efficient, open-weight models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Mistral Large | 128K | Complex tasks | Medium |
| Mistral Medium | 32K | Balanced performance | Fast |
| Mistral Small | 32K | Cost-effective | Very Fast |
| Codestral | 32K | Code generation | Fast |

**Configuration:**

```csv
llm-provider,mistral
llm-api-key,xxxxxxxxxxxxxxxx
llm-model,mistral-large-latest
```

**Strengths:**
- European data sovereignty (GDPR)
- Excellent code generation (Codestral)
- Open-weight models available
- Competitive pricing

**Considerations:**
- Smaller context than competitors
- Less brand recognition
- Fewer fine-tuning options

### DeepSeek

Chinese AI company known for efficient, capable models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| DeepSeek-V3 | 128K | General purpose | Fast |
| DeepSeek-R1 | 128K | Reasoning, math | Medium |
| DeepSeek-Coder | 128K | Programming | Fast |

**Configuration:**

```csv
llm-provider,deepseek
llm-api-key,sk-xxxxxxxxxxxxxxxx
llm-model,deepseek-chat
llm-server-url,https://api.deepseek.com
```

**Strengths:**
- Extremely cost-effective
- Strong reasoning (R1 model)
- Excellent code generation
- Open-weight versions available

**Considerations:**
- Data processed in China
- Newer provider
- May have content restrictions

## Local Models

Run models on your own hardware for privacy, cost control, and offline operation.

### Setting Up Local LLM

General Bots uses **llama.cpp** server for local inference:

```csv
llm-provider,local
llm-server-url,https://localhost:8081
llm-model,DeepSeek-R1-Distill-Qwen-1.5B
```
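
The llama.cpp server must already be listening at the configured `llm-server-url` when the bot starts. As a minimal sketch, assuming the `llama-server` binary from a recent llama.cpp build is on your `PATH` and the model file is already downloaded (the path below is illustrative), it can be launched with `std::process::Command`:

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Spawn the llama.cpp server for the model configured above.
    // The flags (-m, --port, -c) follow recent llama.cpp builds; the model path is illustrative.
    let mut server = Command::new("llama-server")
        .arg("-m")
        .arg("models/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf")
        .arg("--port")
        .arg("8081")
        .arg("-c")
        .arg("4096") // context size; raise it if the model and your VRAM allow
        .spawn()?;

    // In practice you would health-check the HTTP endpoint instead of blocking here.
    server.wait()?;
    Ok(())
}
```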

### Recommended Local Models

#### For High-End GPU (24GB+ VRAM)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| GPT-OSS 120B Q4 | 70GB | 48GB+ | Excellent |
| Llama 3.1 70B Q4 | 40GB | 48GB+ | Excellent |
| DeepSeek-R1 32B Q4 | 20GB | 24GB | Very Good |
| Qwen 2.5 72B Q4 | 42GB | 48GB+ | Excellent |

#### For Mid-Range GPU (12-16GB VRAM)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| GPT-OSS 20B F16 | 40GB | 16GB | Very Good |
| Llama 3.1 8B Q8 | 9GB | 12GB | Good |
| DeepSeek-R1-Distill 14B Q4 | 8GB | 12GB | Good |
| Mistral Nemo 12B Q4 | 7GB | 10GB | Good |

#### For Small GPU or CPU (8GB VRAM or less)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| DeepSeek-R1-Distill 1.5B Q4 | 1GB | 4GB | Basic |
| Phi-3 Mini 3.8B Q4 | 2.5GB | 6GB | Acceptable |
| Gemma 2 2B Q8 | 3GB | 6GB | Acceptable |
| Qwen 2.5 3B Q4 | 2GB | 4GB | Basic |

### Model Download URLs

Add model download URLs to the `data_download_list` in `installer.rs`:

```rust
// GPT-OSS 20B - Recommended for small GPU
"https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf"

// DeepSeek R1 Distill - For CPU or minimal GPU
"https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf"

// Llama 3.1 8B - Good balance
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"
```

### Embedding Models

For vector search, you need an embedding model:

```csv
embedding-provider,local
embedding-server-url,https://localhost:8082
embedding-model,bge-small-en-v1.5
```
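
As a rough sketch of what a request to the local embedding server looks like, assuming it exposes an OpenAI-compatible `/v1/embeddings` route (recent llama.cpp builds do when embeddings are enabled; adjust the URL and TLS handling to your deployment):

```rust
// Assumed Cargo dependencies: reqwest = { version = "0.12", features = ["blocking", "json"] },
// serde_json = "1".
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The local server configured above serves https with a local certificate, so accept it here.
    let client = reqwest::blocking::Client::builder()
        .danger_accept_invalid_certs(true)
        .build()?;

    let response: Value = client
        .post("https://localhost:8082/v1/embeddings") // embedding-server-url + assumed route
        .json(&json!({
            "model": "bge-small-en-v1.5",
            "input": "How do I reset my password?"
        }))
        .send()?
        .json()?;

    // Dimensionality should match the table below: 384 for bge-small-en-v1.5.
    let dims = response["data"][0]["embedding"]
        .as_array()
        .map_or(0, |v| v.len());
    println!("embedding dimensions: {dims}");
    Ok(())
}
```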

Recommended embedding models:

| Model | Dimensions | Size | Quality |
|-------|------------|------|---------|
| bge-small-en-v1.5 | 384 | 130MB | Good |
| bge-base-en-v1.5 | 768 | 440MB | Better |
| bge-large-en-v1.5 | 1024 | 1.3GB | Best |
| nomic-embed-text | 768 | 550MB | Good |

## Hybrid Configuration

Use different models for different tasks:

```csv
# Primary model for complex conversations
llm-provider,anthropic
llm-model,claude-3-5-sonnet-20241022

# Fast model for simple tasks
llm-fast-provider,groq
llm-fast-model,llama-3.1-8b-instant

# Local fallback for offline operation
llm-fallback-provider,local
llm-fallback-model,DeepSeek-R1-Distill-Qwen-1.5B

# Embeddings always local
embedding-provider,local
embedding-model,bge-small-en-v1.5
```
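
The intent behind the three tiers can be pictured as a simple routing decision. The sketch below is illustrative only, with hypothetical names rather than the actual General Bots dispatch code: offline traffic goes to the local fallback, simple tasks to the fast model, and everything else to the primary model.

```rust
// Illustrative only: these names are hypothetical, not part of the General Bots API.
#[derive(Debug)]
enum Provider {
    Primary,  // anthropic / claude-3-5-sonnet-20241022
    Fast,     // groq / llama-3.1-8b-instant
    Fallback, // local / DeepSeek-R1-Distill-Qwen-1.5B
}

fn choose_provider(is_simple_task: bool, cloud_reachable: bool) -> Provider {
    if !cloud_reachable {
        // Offline: fall back to the local model.
        Provider::Fallback
    } else if is_simple_task {
        // Cheap, latency-sensitive work goes to the fast model.
        Provider::Fast
    } else {
        // Complex conversations use the primary model.
        Provider::Primary
    }
}

fn main() {
    println!("{:?}", choose_provider(true, true));   // Fast
    println!("{:?}", choose_provider(false, true));  // Primary
    println!("{:?}", choose_provider(false, false)); // Fallback
}
```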

## Model Selection Guide

### By Use Case

| Use Case | Recommended | Why |
|----------|-------------|-----|
| Customer support | Claude 3.5 Sonnet | Best at following guidelines |
| Code generation | DeepSeek-Coder, GPT-4o | Specialized for code |
| Document analysis | Gemini 1.5 Pro | 2M context window |
| Real-time chat | Groq Llama 3.1 8B | Fastest responses |
| Privacy-sensitive | Local DeepSeek-R1 | No external data transfer |
| Cost-sensitive | DeepSeek-V3, Local | Lowest cost per token |
| Complex reasoning | Claude 3 Opus, o1 | Best reasoning ability |

### By Budget

| Budget | Recommended Setup |
|--------|-------------------|
| Free | Local models only |
| Low ($10-50/mo) | Groq + Local fallback |
| Medium ($50-200/mo) | GPT-4o-mini + Claude Haiku |
| High ($200+/mo) | GPT-4o + Claude Sonnet |
| Enterprise | Private deployment + premium APIs |

## Configuration Reference

### Environment Variables

```bash
# Primary LLM
LLM_PROVIDER=openai
LLM_API_KEY=sk-xxx
LLM_MODEL=gpt-4o
LLM_SERVER_URL=https://api.openai.com

# Local LLM Server
LLM_LOCAL_URL=https://localhost:8081
LLM_LOCAL_MODEL=DeepSeek-R1-Distill-Qwen-1.5B

# Embedding
EMBEDDING_PROVIDER=local
EMBEDDING_URL=https://localhost:8082
EMBEDDING_MODEL=bge-small-en-v1.5
```

### config.csv Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `llm-provider` | Provider name | `openai`, `anthropic`, `local` |
| `llm-api-key` | API key for cloud providers | `sk-xxx` |
| `llm-model` | Model identifier | `gpt-4o` |
| `llm-server-url` | API endpoint | `https://api.openai.com` |
| `llm-server-ctx-size` | Context window size | `128000` |
| `llm-temperature` | Response randomness (0-2) | `0.7` |
| `llm-max-tokens` | Maximum response length | `4096` |
| `llm-cache-enabled` | Enable semantic caching | `true` |
| `llm-cache-ttl` | Cache time-to-live (seconds) | `3600` |

## Security Considerations

### Cloud Providers

- API keys should be stored in environment variables or a secrets manager
- Consider data residency requirements (EU: Mistral, US: OpenAI)
- Review provider data retention policies
- Use separate keys for production/development

### Local Models

- All data stays on your infrastructure
- No internet required after model download
- Full control over model versions
- Consider GPU security for sensitive deployments

## Performance Optimization

### Caching

Enable semantic caching to reduce API calls:

```csv
llm-cache-enabled,true
llm-cache-ttl,3600
llm-cache-similarity-threshold,0.92
```
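
The similarity threshold controls how close a new prompt's embedding must be to a cached prompt's embedding before the stored answer is reused. A minimal sketch of that lookup, assuming plain cosine similarity over prompt embeddings (the real cache also handles TTL expiry and persistence):

```rust
// Cosine similarity between two prompt embeddings.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns the cached response if any stored prompt is similar enough.
fn lookup<'a>(
    query: &[f32],
    cache: &'a [(Vec<f32>, String)],
    threshold: f32, // llm-cache-similarity-threshold, e.g. 0.92
) -> Option<&'a str> {
    cache
        .iter()
        .find(|(embedding, _)| cosine_similarity(query, embedding) >= threshold)
        .map(|(_, response)| response.as_str())
}

fn main() {
    let cache = vec![(vec![0.1, 0.9, 0.2], "Cached answer".to_string())];
    let query = vec![0.11, 0.88, 0.21]; // nearly identical prompt embedding
    println!("{:?}", lookup(&query, &cache, 0.92)); // Some("Cached answer")
}
```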

### Batching

For bulk operations, use batch APIs when available:

```csv
llm-batch-enabled,true
llm-batch-size,10
```

### Context Management

Optimize context window usage:

```csv
llm-context-compaction,true
llm-max-history-turns,10
llm-summarize-long-contexts,true
```
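
With `llm-max-history-turns,10`, for example, only the ten most recent exchanges are sent with each request; older turns are dropped, or summarized when `llm-summarize-long-contexts` is enabled. A minimal sketch of the trimming step, using hypothetical names for illustration:

```rust
// One user/assistant exchange in the conversation history.
struct Turn {
    user: String,
    assistant: String,
}

/// Keep only the most recent `max_turns` exchanges (llm-max-history-turns).
fn trim_history(history: &mut Vec<Turn>, max_turns: usize) {
    if history.len() > max_turns {
        // Drop the oldest turns; a real implementation might summarize them instead.
        let excess = history.len() - max_turns;
        history.drain(0..excess);
    }
}

fn main() {
    let mut history: Vec<Turn> = (0..15)
        .map(|i| Turn {
            user: format!("question {i}"),
            assistant: format!("answer {i}"),
        })
        .collect();

    trim_history(&mut history, 10);
    assert_eq!(history.len(), 10);
    // The oldest surviving exchange is now turn 5.
    println!("{} / {}", history[0].user, history[0].assistant);
}
```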

## Troubleshooting

### Common Issues

**API Key Invalid**
- Verify the key is correct and not expired
- Check if the key has required permissions
- Ensure billing is active

**Model Not Found**
- Check model name spelling
- Verify the model is available in your region
- Some models require waitlist access

**Rate Limits**
- Implement exponential backoff (see the sketch below)
- Use caching to reduce calls
- Consider upgrading API tier
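
A minimal retry loop with exponential backoff might look like the sketch below; it is illustrative only, and a production version would also honor the provider's `Retry-After` header and add jitter:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible call, doubling the wait after each failure (e.g. HTTP 429).
fn with_backoff<T, E>(mut call: impl FnMut() -> Result<T, E>, max_attempts: u32) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(err);
                }
                // Wait, then double the delay: 0.5s, 1s, 2s, 4s, ...
                sleep(delay);
                delay *= 2;
            }
        }
    }
}

fn main() {
    // Simulate an API that is rate-limited for the first two calls.
    let mut calls = 0;
    let result = with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("429 Too Many Requests") } else { Ok("response") }
        },
        5,
    );
    println!("{result:?}"); // Ok("response")
}
```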

**Local Model Slow**
- Check GPU memory usage
- Reduce context size
- Use quantized models (Q4 instead of F16)

### Logging

Enable LLM logging for debugging:

```csv
llm-log-requests,true
llm-log-responses,false
llm-log-timing,true
```

## Next Steps

- [LLM Configuration](../chapter-08-config/llm-config.md) - Detailed configuration guide
- [Semantic Caching](../chapter-03/caching.md) - Cache configuration
- [NVIDIA GPU Setup](../chapter-09-api/nvidia-gpu-setup.md) - GPU configuration for local models