# LLM Providers
General Bots supports multiple Large Language Model (LLM) providers, both cloud-based services and local deployments. This guide helps you choose the right provider for your use case.
## Overview

LLMs are the intelligence behind General Bots' conversational capabilities. You can configure:

- **Cloud Providers** - External APIs (OpenAI, Anthropic, Groq, etc.)
- **Local Models** - Self-hosted models via llama.cpp
- **Hybrid** - Use local for simple tasks, cloud for complex reasoning
## Cloud Providers

### OpenAI (GPT Series)

The most widely known LLM provider, offering the GPT-4, GPT-4o, and o1 model series.
| Model | Context | Best For | Speed |
|---|---|---|---|
| GPT-4o | 128K | General purpose, vision | Fast |
| GPT-4o-mini | 128K | Cost-effective tasks | Very Fast |
| GPT-4 Turbo | 128K | Complex reasoning | Medium |
| o1-preview | 128K | Advanced reasoning, math | Slow |
| o1-mini | 128K | Code, logic tasks | Medium |
**Configuration:**

```
llm-provider,openai
llm-api-key,sk-xxxxxxxxxxxxxxxxxxxxxxxx
llm-model,gpt-4o
```

**Strengths:**
- Excellent general knowledge
- Strong code generation
- Good instruction following
- Vision capabilities (GPT-4o)
**Considerations:**
- API costs can add up
- Data sent to external servers
- Rate limits apply
### Anthropic (Claude Series)
Known for safety, helpfulness, and large context windows.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | Best balance of capability/speed | Fast |
| Claude 3.5 Haiku | 200K | Quick, everyday tasks | Very Fast |
| Claude 3 Opus | 200K | Most capable, complex tasks | Slow |
**Configuration:**

```
llm-provider,anthropic
llm-api-key,sk-ant-xxxxxxxxxxxxxxxx
llm-model,claude-3-5-sonnet-20241022
```

**Strengths:**
- Large 200K-token context window
- Excellent at following complex instructions
- Strong coding abilities
- Better at refusing harmful requests
**Considerations:**
- Premium pricing
- Vision support varies across models
- Newer provider, smaller ecosystem
### Groq
Ultra-fast inference using custom LPU hardware. Offers open-source models at high speed.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Llama 3.3 70B | 128K | Complex reasoning | Very Fast |
| Llama 3.1 8B | 128K | Quick responses | Extremely Fast |
| Mixtral 8x7B | 32K | Balanced performance | Very Fast |
| Gemma 2 9B | 8K | Lightweight tasks | Extremely Fast |
**Configuration:**

```
llm-provider,groq
llm-api-key,gsk_xxxxxxxxxxxxxxxx
llm-model,llama-3.3-70b-versatile
```

**Strengths:**
- Fastest inference speeds (500+ tokens/sec)
- Competitive pricing
- Open-source models
- Great for real-time applications
**Considerations:**
- Limited model selection
- Rate limits on free tier
- Models may be less capable than GPT-4/Claude
### Google (Gemini Series)
Google's multimodal AI models with strong reasoning capabilities.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | Extremely long documents | Medium |
| Gemini 1.5 Flash | 1M | Fast multimodal | Fast |
| Gemini 2.0 Flash | 1M | Latest capabilities | Fast |
**Configuration:**

```
llm-provider,google
llm-api-key,AIzaxxxxxxxxxxxxxxxx
llm-model,gemini-1.5-pro
```

**Strengths:**
- Largest context window (2M tokens)
- Native multimodal (text, image, video, audio)
- Strong at structured data
- Good coding abilities
**Considerations:**
- Newer ecosystem
- Some features region-limited
- API changes more frequently
### Mistral AI
European AI company offering efficient, open-weight models.
| Model | Context | Best For | Speed |
|---|---|---|---|
| Mistral Large | 128K | Complex tasks | Medium |
| Mistral Medium | 32K | Balanced performance | Fast |
| Mistral Small | 32K | Cost-effective | Very Fast |
| Codestral | 32K | Code generation | Fast |
**Configuration:**

```
llm-provider,mistral
llm-api-key,xxxxxxxxxxxxxxxx
llm-model,mistral-large-latest
```

**Strengths:**
- European data sovereignty (GDPR)
- Excellent code generation (Codestral)
- Open-weight models available
- Competitive pricing
**Considerations:**
- Smaller context windows than some competitors
- Less brand recognition
- Fewer fine-tuning options
### DeepSeek
Chinese AI company known for efficient, capable models.
| Model | Context | Best For | Speed |
|---|---|---|---|
| DeepSeek-V3 | 128K | General purpose | Fast |
| DeepSeek-R1 | 128K | Reasoning, math | Medium |
| DeepSeek-Coder | 128K | Programming | Fast |
**Configuration:**

```
llm-provider,deepseek
llm-api-key,sk-xxxxxxxxxxxxxxxx
llm-model,deepseek-chat
llm-server-url,https://api.deepseek.com
```

**Strengths:**
- Extremely cost-effective
- Strong reasoning (R1 model)
- Excellent code generation
- Open-weight versions available
**Considerations:**
- Data processed in China
- Newer provider
- May have content restrictions
## Local Models

Run models on your own hardware for privacy, cost control, and offline operation.

### Setting Up Local LLM

General Bots uses the llama.cpp server for local inference:

```
llm-provider,local
llm-server-url,https://localhost:8081
llm-model,DeepSeek-R1-Distill-Qwen-1.5B
```
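Before pointing the bot at a local server, it helps to confirm the endpoint is actually reachable. Below is a minimal sketch, assuming the `reqwest` crate (with its `blocking` feature) and the `/health` route that recent llama.cpp server builds expose; it is a standalone connectivity check, not part of General Bots itself:

```rust
use std::time::Duration;

// Minimal connectivity check for a local llama.cpp server.
// Assumes the `reqwest` crate (blocking feature) and the /health
// endpoint exposed by recent llama-server builds.
fn check_local_llm(base_url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(5))
        .build()?;
    let resp = client.get(format!("{base_url}/health")).send()?;
    if resp.status().is_success() {
        println!("llama.cpp server is up at {base_url}");
        Ok(())
    } else {
        Err(format!("server responded with {}", resp.status()).into())
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    check_local_llm("https://localhost:8081")
}
```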
### Recommended Local Models

#### For High-End GPU (24GB+ VRAM)
| Model | Size | VRAM | Quality |
|---|---|---|---|
| GPT-OSS 120B Q4 | 70GB | 48GB+ | Excellent |
| Llama 3.1 70B Q4 | 40GB | 48GB+ | Excellent |
| DeepSeek-R1 32B Q4 | 20GB | 24GB | Very Good |
| Qwen 2.5 72B Q4 | 42GB | 48GB+ | Excellent |
#### For Mid-Range GPU (12-16GB VRAM)
| Model | Size | VRAM | Quality |
|---|---|---|---|
| GPT-OSS 20B F16 | 40GB | 16GB | Very Good |
| Llama 3.1 8B Q8 | 9GB | 12GB | Good |
| DeepSeek-R1-Distill 14B Q4 | 8GB | 12GB | Good |
| Mistral Nemo 12B Q4 | 7GB | 10GB | Good |
#### For Small GPU or CPU (8GB VRAM or less)
| Model | Size | VRAM | Quality |
|---|---|---|---|
| DeepSeek-R1-Distill 1.5B Q4 | 1GB | 4GB | Basic |
| Phi-3 Mini 3.8B Q4 | 2.5GB | 6GB | Acceptable |
| Gemma 2 2B Q8 | 3GB | 6GB | Acceptable |
| Qwen 2.5 3B Q4 | 2GB | 4GB | Basic |
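The Q4, Q8, and F16 suffixes in these tables describe quantization: Q4 and Q8 are 4-bit and 8-bit quantized GGUF builds that shrink file size and VRAM requirements at a small quality cost, while F16 keeps full 16-bit weights. This is also why the troubleshooting advice later in this chapter suggests Q4 over F16 when a local model runs slowly.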
### Model Download URLs

Add models to the `data_download_list` in `installer.rs`:

```rust
// GPT-OSS 20B - Recommended for small GPU
"https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf",

// DeepSeek R1 Distill - For CPU or minimal GPU
"https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf",

// Llama 3.1 8B - Good balance
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
```
### Embedding Models

For vector search, you need an embedding model:

```
embedding-provider,local
embedding-server-url,https://localhost:8082
embedding-model,bge-small-en-v1.5
```
Recommended embedding models:
| Model | Dimensions | Size | Quality |
|---|---|---|---|
| bge-small-en-v1.5 | 384 | 130MB | Good |
| bge-base-en-v1.5 | 768 | 440MB | Better |
| bge-large-en-v1.5 | 1024 | 1.3GB | Best |
| nomic-embed-text | 768 | 550MB | Good |
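Dimensions matter because vector search compares these embeddings directly, typically by cosine similarity. The following sketch shows that comparison for illustration only; it is not General Bots' internal implementation:

```rust
// Cosine similarity between two embedding vectors, e.g. the
// 384-dimensional outputs of bge-small-en-v1.5. Illustrative only.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}
```

Higher-dimensional models such as bge-large-en-v1.5 capture more nuance per vector, but each stored vector and each comparison costs proportionally more.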
## Hybrid Configuration

Use different models for different tasks:

```
# Primary model for complex conversations
llm-provider,anthropic
llm-model,claude-3-5-sonnet-20241022

# Fast model for simple tasks
llm-fast-provider,groq
llm-fast-model,llama-3.1-8b-instant

# Local fallback for offline operation
llm-fallback-provider,local
llm-fallback-model,DeepSeek-R1-Distill-Qwen-1.5B

# Embeddings always local
embedding-provider,local
embedding-model,bge-small-en-v1.5
```
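Conceptually, a hybrid setup is just per-request routing across the tiers configured above. The sketch below illustrates one possible decision rule; the tier names mirror the `llm-*`, `llm-fast-*`, and `llm-fallback-*` settings, but the heuristics (token count, connectivity) are hypothetical rather than General Bots' actual logic:

```rust
// Which configured model tier to route a request to. Tier names
// mirror the llm-*, llm-fast-*, and llm-fallback-* settings above;
// the routing heuristics here are hypothetical.
enum Tier { Primary, Fast, Fallback }

fn route(prompt_tokens: usize, online: bool) -> Tier {
    if !online {
        Tier::Fallback            // offline: local model only
    } else if prompt_tokens < 256 {
        Tier::Fast                // short, simple prompt: cheap, fast model
    } else {
        Tier::Primary             // everything else: most capable model
    }
}
```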
## Model Selection Guide

### By Use Case
| Use Case | Recommended | Why |
|---|---|---|
| Customer support | Claude 3.5 Sonnet | Best at following guidelines |
| Code generation | DeepSeek-Coder, GPT-4o | Specialized for code |
| Document analysis | Gemini 1.5 Pro | 2M context window |
| Real-time chat | Groq Llama 3.1 8B | Fastest responses |
| Privacy-sensitive | Local DeepSeek-R1 | No external data transfer |
| Cost-sensitive | DeepSeek-V3, Local | Lowest cost per token |
| Complex reasoning | Claude 3 Opus, o1 | Best reasoning ability |
### By Budget
| Budget | Recommended Setup |
|---|---|
| Free | Local models only |
| Low ($10-50/mo) | Groq + Local fallback |
| Medium ($50-200/mo) | GPT-4o-mini + Claude Haiku |
| High ($200+/mo) | GPT-4o + Claude Sonnet |
| Enterprise | Private deployment + premium APIs |
## Configuration Reference

### Environment Variables

```
# Primary LLM
LLM_PROVIDER=openai
LLM_API_KEY=sk-xxx
LLM_MODEL=gpt-4o
LLM_SERVER_URL=https://api.openai.com

# Local LLM Server
LLM_LOCAL_URL=https://localhost:8081
LLM_LOCAL_MODEL=DeepSeek-R1-Distill-Qwen-1.5B

# Embedding
EMBEDDING_PROVIDER=local
EMBEDDING_URL=https://localhost:8082
EMBEDDING_MODEL=bge-small-en-v1.5
```
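These are ordinary process environment variables, so any component can read them with `std::env`. A minimal sketch, assuming only the variable names listed above and illustrative defaults:

```rust
use std::env;

// Read the primary LLM settings from the environment, falling back
// to illustrative defaults where a variable is unset.
fn llm_config_from_env() -> (String, Option<String>, String) {
    let provider = env::var("LLM_PROVIDER").unwrap_or_else(|_| "local".into());
    let api_key = env::var("LLM_API_KEY").ok(); // absent for local models
    let model = env::var("LLM_MODEL").unwrap_or_else(|_| "gpt-4o".into());
    (provider, api_key, model)
}
```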
### config.csv Parameters

| Parameter | Description | Example |
|---|---|---|
| `llm-provider` | Provider name | `openai`, `anthropic`, `local` |
| `llm-api-key` | API key for cloud providers | `sk-xxx` |
| `llm-model` | Model identifier | `gpt-4o` |
| `llm-server-url` | API endpoint | `https://api.openai.com` |
| `llm-server-ctx-size` | Context window size | `128000` |
| `llm-temperature` | Response randomness (0-2) | `0.7` |
| `llm-max-tokens` | Maximum response length | `4096` |
| `llm-cache-enabled` | Enable semantic caching | `true` |
| `llm-cache-ttl` | Cache time-to-live (seconds) | `3600` |
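Putting several parameters together, a minimal `config.csv` for a cloud provider with caching enabled might look like the following; all values are placeholders:

```
llm-provider,openai
llm-api-key,sk-xxxxxxxxxxxxxxxxxxxxxxxx
llm-model,gpt-4o
llm-temperature,0.7
llm-max-tokens,4096
llm-cache-enabled,true
llm-cache-ttl,3600
```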
## Security Considerations

### Cloud Providers

- Store API keys in environment variables or a secrets manager
- Consider data residency requirements (EU: Mistral, US: OpenAI)
- Review provider data retention policies
- Use separate keys for production/development
### Local Models
- All data stays on your infrastructure
- No internet required after model download
- Full control over model versions
- Consider GPU security for sensitive deployments
## Performance Optimization

### Caching

Enable semantic caching to reduce API calls:

```
llm-cache-enabled,true
llm-cache-ttl,3600
llm-cache-similarity-threshold,0.92
```
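Semantic caching serves a stored response when a new prompt's embedding is close enough to a previously answered one, which is what the similarity threshold controls. A sketch of that lookup, reusing the `cosine_similarity` helper from the embeddings section (illustrative only; the real cache implementation may differ):

```rust
struct CacheEntry {
    embedding: Vec<f32>,
    response: String,
}

// Return a cached response if any stored prompt embedding is within
// the configured similarity threshold (0.92 above). Illustrative only.
fn lookup<'a>(cache: &'a [CacheEntry], query: &[f32], threshold: f32) -> Option<&'a str> {
    cache
        .iter()
        .map(|e| (cosine_similarity(&e.embedding, query), e))
        .filter(|(sim, _)| *sim >= threshold)
        .max_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, e)| e.response.as_str())
}
```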
### Batching

For bulk operations, use batch APIs when available:

```
llm-batch-enabled,true
llm-batch-size,10
```
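Batching simply groups pending prompts and submits each group in one call. A minimal sketch of the chunking step; `submit_batch` is a hypothetical stand-in for whichever batch API your provider offers:

```rust
// Split pending prompts into groups of `batch_size` (10 above) and
// hand each group to a batch submission routine.
fn process_in_batches(prompts: &[String], batch_size: usize) {
    for batch in prompts.chunks(batch_size) {
        submit_batch(batch);
    }
}

// Hypothetical stand-in for a provider's batch API.
fn submit_batch(batch: &[String]) {
    println!("submitting {} prompts as one batch", batch.len());
}
```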
### Context Management

Optimize context window usage with episodic memory:

```
episodic-memory-enabled,true
episodic-memory-threshold,4
episodic-memory-history,2
episodic-memory-auto-summarize,true
```
See Episodic Memory for details.
## Troubleshooting

### Common Issues

**API Key Invalid**
- Verify key is correct and not expired
- Check if key has required permissions
- Ensure billing is active
**Model Not Found**
- Check model name spelling
- Verify model is available in your region
- Some models require waitlist access
**Rate Limits**
- Implement exponential backoff (see the sketch below)
- Use caching to reduce calls
- Consider upgrading API tier
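A minimal sketch of that backoff loop; `call_llm` and the error type are hypothetical stand-ins for your actual provider call, which should surface HTTP 429 responses as a rate-limit error:

```rust
use std::{thread, time::Duration};

// Hypothetical error type; RateLimited corresponds to HTTP 429.
enum LlmError { RateLimited, Other(String) }

// Retry a rate-limited call with exponential backoff: 1s, 2s, 4s, ...
fn call_with_backoff(prompt: &str, max_retries: u32) -> Result<String, LlmError> {
    let mut delay = Duration::from_secs(1);
    for _ in 0..max_retries {
        match call_llm(prompt) {
            Err(LlmError::RateLimited) => {
                thread::sleep(delay);
                delay *= 2; // double the wait on each rate-limit hit
            }
            other => return other,
        }
    }
    call_llm(prompt) // final attempt after exhausting retries
}

// Hypothetical placeholder for the real provider call.
fn call_llm(_prompt: &str) -> Result<String, LlmError> {
    Err(LlmError::RateLimited)
}
```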
**Local Model Slow**
- Check GPU memory usage
- Reduce context size
- Use quantized models (Q4 instead of F16)
### Logging

Enable LLM logging for debugging:

```
llm-log-requests,true
llm-log-responses,false
llm-log-timing,true
```
## Next Steps
- LLM Configuration - Detailed configuration guide
- Semantic Caching - Cache configuration
- NVIDIA GPU Setup - GPU configuration for local models