LLM Providers

General Bots supports multiple Large Language Model (LLM) providers, both cloud-based services and local deployments. This guide helps you choose the right provider for your use case.

Overview

LLMs are the intelligence behind General Bots' conversational capabilities. You can configure:

  • Cloud Providers - External APIs (OpenAI, Anthropic, Groq, etc.)
  • Local Models - Self-hosted models via llama.cpp
  • Hybrid - Use local for simple tasks, cloud for complex reasoning

Cloud Providers

OpenAI (GPT Series)

The most widely known LLM provider, offering GPT-4 and GPT-4o models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| GPT-4o | 128K | General purpose, vision | Fast |
| GPT-4o-mini | 128K | Cost-effective tasks | Very Fast |
| GPT-4 Turbo | 128K | Complex reasoning | Medium |
| o1-preview | 128K | Advanced reasoning, math | Slow |
| o1-mini | 128K | Code, logic tasks | Medium |

Configuration:

llm-provider,openai
llm-api-key,sk-xxxxxxxxxxxxxxxxxxxxxxxx
llm-model,gpt-4o

Strengths:

  • Excellent general knowledge
  • Strong code generation
  • Good instruction following
  • Vision capabilities (GPT-4o)

Considerations:

  • API costs can add up
  • Data sent to external servers
  • Rate limits apply

Anthropic (Claude Series)

Known for safety, helpfulness, and large context windows.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Claude 3.5 Sonnet | 200K | Best balance of capability/speed | Fast |
| Claude 3.5 Haiku | 200K | Quick, everyday tasks | Very Fast |
| Claude 3 Opus | 200K | Most capable, complex tasks | Slow |

Configuration:

llm-provider,anthropic
llm-api-key,sk-ant-xxxxxxxxxxxxxxxx
llm-model,claude-3-5-sonnet-20241022

Strengths:

  • Very large context window (200K tokens)
  • Excellent at following complex instructions
  • Strong coding abilities
  • Better at refusing harmful requests

Considerations:

  • Premium pricing
  • Vision support not available in all models
  • Newer provider, smaller ecosystem

Groq

Ultra-fast inference using custom LPU hardware. Offers open-source models at high speed.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Llama 3.3 70B | 128K | Complex reasoning | Very Fast |
| Llama 3.1 8B | 128K | Quick responses | Extremely Fast |
| Mixtral 8x7B | 32K | Balanced performance | Very Fast |
| Gemma 2 9B | 8K | Lightweight tasks | Extremely Fast |

Configuration:

llm-provider,groq
llm-api-key,gsk_xxxxxxxxxxxxxxxx
llm-model,llama-3.3-70b-versatile

Strengths:

  • Fastest inference speeds (500+ tokens/sec)
  • Competitive pricing
  • Open-source models
  • Great for real-time applications

Considerations:

  • Limited model selection
  • Rate limits on free tier
  • Models may be less capable than GPT-4/Claude

Google (Gemini Series)

Google's multimodal AI models with strong reasoning capabilities.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Gemini 1.5 Pro | 2M | Extremely long documents | Medium |
| Gemini 1.5 Flash | 1M | Fast multimodal | Fast |
| Gemini 2.0 Flash | 1M | Latest capabilities | Fast |

Configuration:

llm-provider,google
llm-api-key,AIzaxxxxxxxxxxxxxxxx
llm-model,gemini-1.5-pro

Strengths:

  • Largest context window (2M tokens)
  • Native multimodal (text, image, video, audio)
  • Strong at structured data
  • Good coding abilities

Considerations:

  • Newer ecosystem
  • Some features region-limited
  • API changes more frequently

Mistral AI

European AI company offering efficient, open-weight models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| Mistral Large | 128K | Complex tasks | Medium |
| Mistral Medium | 32K | Balanced performance | Fast |
| Mistral Small | 32K | Cost-effective | Very Fast |
| Codestral | 32K | Code generation | Fast |

Configuration:

llm-provider,mistral
llm-api-key,xxxxxxxxxxxxxxxx
llm-model,mistral-large-latest

Strengths:

  • European data sovereignty (GDPR)
  • Excellent code generation (Codestral)
  • Open-weight models available
  • Competitive pricing

Considerations:

  • Smaller context than competitors
  • Less brand recognition
  • Fewer fine-tuning options

DeepSeek

Chinese AI company known for efficient, capable models.

| Model | Context | Best For | Speed |
|-------|---------|----------|-------|
| DeepSeek-V3 | 128K | General purpose | Fast |
| DeepSeek-R1 | 128K | Reasoning, math | Medium |
| DeepSeek-Coder | 128K | Programming | Fast |

Configuration:

llm-provider,deepseek
llm-api-key,sk-xxxxxxxxxxxxxxxx
llm-model,deepseek-chat
llm-server-url,https://api.deepseek.com

Strengths:

  • Extremely cost-effective
  • Strong reasoning (R1 model)
  • Excellent code generation
  • Open-weight versions available

Considerations:

  • Data processed in China
  • Newer provider
  • May have content restrictions

Local Models

Run models on your own hardware for privacy, cost control, and offline operation.

Setting Up Local LLM

General Bots uses llama.cpp server for local inference:

llm-provider,local
llm-server-url,https://localhost:8081
llm-model,DeepSeek-R1-Distill-Qwen-1.5B
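
As a quick connectivity check, the sketch below posts a single chat request to the local server. It assumes the llama.cpp server's OpenAI-compatible /v1/chat/completions endpoint is reachable over plain HTTP on port 8081 and that the reqwest (blocking, json features) and serde_json crates are available; the scheme, path, and model name are assumptions to adjust to your deployment.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Request body in the OpenAI chat format accepted by llama.cpp's server.
    let body = json!({
        "model": "DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{ "role": "user", "content": "Say hello in one short sentence." }],
        "max_tokens": 64
    });

    // Adjust scheme/host/port to match llm-server-url in your configuration.
    let response = reqwest::blocking::Client::new()
        .post("http://localhost:8081/v1/chat/completions")
        .json(&body)
        .send()?
        .error_for_status()?;

    println!("{}", response.text()?);
    Ok(())
}
```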

For High-End GPU (24GB+ VRAM)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| GPT-OSS 120B Q4 | 70GB | 48GB+ | Excellent |
| Llama 3.1 70B Q4 | 40GB | 48GB+ | Excellent |
| DeepSeek-R1 32B Q4 | 20GB | 24GB | Very Good |
| Qwen 2.5 72B Q4 | 42GB | 48GB+ | Excellent |

For Mid-Range GPU (12-16GB VRAM)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| GPT-OSS 20B F16 | 40GB | 16GB | Very Good |
| Llama 3.1 8B Q8 | 9GB | 12GB | Good |
| DeepSeek-R1-Distill 14B Q4 | 8GB | 12GB | Good |
| Mistral Nemo 12B Q4 | 7GB | 10GB | Good |

For Small GPU or CPU (8GB VRAM or less)

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| DeepSeek-R1-Distill 1.5B Q4 | 1GB | 4GB | Basic |
| Phi-3 Mini 3.8B Q4 | 2.5GB | 6GB | Acceptable |
| Gemma 2 2B Q8 | 3GB | 6GB | Acceptable |
| Qwen 2.5 3B Q4 | 2GB | 4GB | Basic |

Model Download URLs

Add models to installer.rs data_download_list:

// GPT-OSS 20B - Recommended for small GPU
"https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf",

// DeepSeek R1 Distill - For CPU or minimal GPU
"https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/resolve/main/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf",

// Llama 3.1 8B - Good balance
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",

Embedding Models

For vector search, you need an embedding model:

embedding-provider,local
embedding-server-url,https://localhost:8082
embedding-model,bge-small-en-v1.5

Recommended embedding models:

| Model | Dimensions | Size | Quality |
|-------|------------|------|---------|
| bge-small-en-v1.5 | 384 | 130MB | Good |
| bge-base-en-v1.5 | 768 | 440MB | Better |
| bge-large-en-v1.5 | 1024 | 1.3GB | Best |
| nomic-embed-text | 768 | 550MB | Good |
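
To make the role of these models concrete, the sketch below ranks stored text chunks against a query by cosine similarity of their embedding vectors. It is a minimal illustration of the vector-search step, not the actual General Bots retrieval code; the toy four-dimensional vectors stand in for real embedding output (for example, the 384-dimensional vectors produced by bge-small-en-v1.5).

```rust
/// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

fn main() {
    // Toy 4-dimensional vectors stand in for real embedding output.
    let query: Vec<f32> = vec![0.1, 0.9, 0.0, 0.3];
    let documents: Vec<(&str, Vec<f32>)> = vec![
        ("refund policy", vec![0.2, 0.8, 0.1, 0.2]),
        ("shipping times", vec![0.9, 0.1, 0.4, 0.0]),
    ];

    // Rank stored chunks by similarity to the query embedding.
    let mut ranked: Vec<(&str, f32)> = documents
        .iter()
        .map(|(name, emb)| (*name, cosine_similarity(&query, emb)))
        .collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));

    for (name, score) in ranked {
        println!("{name}: {score:.3}");
    }
}
```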

Hybrid Configuration

Use different models for different tasks:

# Primary model for complex conversations
llm-provider,anthropic
llm-model,claude-3-5-sonnet-20241022

# Fast model for simple tasks
llm-fast-provider,groq
llm-fast-model,llama-3.1-8b-instant

# Local fallback for offline operation
llm-fallback-provider,local
llm-fallback-model,DeepSeek-R1-Distill-Qwen-1.5B

# Embeddings always local
embedding-provider,local
embedding-model,bge-small-en-v1.5
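
How requests get routed between these tiers is up to the engine; the sketch below is only a hedged illustration of the idea, with made-up heuristics (prompt length and a connectivity flag) standing in for whatever rules your deployment actually applies. The tier names mirror the llm-, llm-fast-, and llm-fallback- settings above.

```rust
// Hypothetical tiers mirroring the primary, fast, and fallback settings above.
enum ModelTier {
    Primary,  // e.g. claude-3-5-sonnet-20241022 via Anthropic
    Fast,     // e.g. llama-3.1-8b-instant via Groq
    Fallback, // e.g. local DeepSeek-R1-Distill-Qwen-1.5B
}

/// Pick a tier from simple request features. The heuristics are illustrative,
/// not the engine's actual routing rules.
fn choose_tier(prompt: &str, online: bool) -> ModelTier {
    if !online {
        ModelTier::Fallback
    } else if prompt.len() < 200 {
        ModelTier::Fast
    } else {
        ModelTier::Primary
    }
}

fn main() {
    for (prompt, online) in [
        ("What are your opening hours?", true),
        ("Summarize this 30-page contract and list every termination clause with its section number.", true),
        ("Hello?", false),
    ] {
        match choose_tier(prompt, online) {
            ModelTier::Fast => println!("fast tier: {prompt}"),
            ModelTier::Primary => println!("primary tier: {prompt}"),
            ModelTier::Fallback => println!("local fallback: {prompt}"),
        }
    }
}
```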

Model Selection Guide

By Use Case

| Use Case | Recommended | Why |
|----------|-------------|-----|
| Customer support | Claude 3.5 Sonnet | Best at following guidelines |
| Code generation | DeepSeek-Coder, GPT-4o | Specialized for code |
| Document analysis | Gemini 1.5 Pro | 2M context window |
| Real-time chat | Groq Llama 3.1 8B | Fastest responses |
| Privacy-sensitive | Local DeepSeek-R1 | No external data transfer |
| Cost-sensitive | DeepSeek-V3, Local | Lowest cost per token |
| Complex reasoning | Claude 3 Opus, o1 | Best reasoning ability |

By Budget

| Budget | Recommended Setup |
|--------|-------------------|
| Free | Local models only |
| Low ($10-50/mo) | Groq + Local fallback |
| Medium ($50-200/mo) | GPT-4o-mini + Claude Haiku |
| High ($200+/mo) | GPT-4o + Claude Sonnet |
| Enterprise | Private deployment + premium APIs |

Configuration Reference

Environment Variables

# Primary LLM
LLM_PROVIDER=openai
LLM_API_KEY=sk-xxx
LLM_MODEL=gpt-4o
LLM_SERVER_URL=https://api.openai.com

# Local LLM Server
LLM_LOCAL_URL=https://localhost:8081
LLM_LOCAL_MODEL=DeepSeek-R1-Distill-Qwen-1.5B

# Embedding
EMBEDDING_PROVIDER=local
EMBEDDING_URL=https://localhost:8082
EMBEDDING_MODEL=bge-small-en-v1.5

config.csv Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| llm-provider | Provider name | openai, anthropic, local |
| llm-api-key | API key for cloud providers | sk-xxx |
| llm-model | Model identifier | gpt-4o |
| llm-server-url | API endpoint | https://api.openai.com |
| llm-server-ctx-size | Context window size | 128000 |
| llm-temperature | Response randomness (0-2) | 0.7 |
| llm-max-tokens | Maximum response length | 4096 |
| llm-cache-enabled | Enable semantic caching | true |
| llm-cache-ttl | Cache time-to-live (seconds) | 3600 |

Security Considerations

Cloud Providers

  • Store API keys in environment variables or a secrets manager (see the sketch after this list)
  • Consider data residency requirements (EU: Mistral, US: OpenAI)
  • Review provider data retention policies
  • Use separate keys for production/development
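
A minimal sketch of the first point, assuming the LLM_API_KEY variable listed in the Configuration Reference section: the process reads the key from its environment at startup instead of relying on a value committed to config files or source control.

```rust
use std::env;

fn main() {
    // LLM_API_KEY matches the environment variable in the Configuration Reference.
    match env::var("LLM_API_KEY") {
        Ok(key) if !key.is_empty() => println!("API key loaded ({} characters)", key.len()),
        _ => eprintln!("LLM_API_KEY is missing; refusing to start with a hardcoded key"),
    }
}
```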

Local Models

  • All data stays on your infrastructure
  • No internet required after model download
  • Full control over model versions
  • Consider GPU security for sensitive deployments

Performance Optimization

Caching

Enable semantic caching to reduce API calls:

llm-cache-enabled,true
llm-cache-ttl,3600
llm-cache-similarity-threshold,0.92
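
Conceptually, a semantic cache compares the embedding of an incoming prompt against the embeddings of previously answered prompts and reuses a stored response when the similarity clears the threshold. The sketch below illustrates that lookup; it is an assumption-level illustration rather than the engine's cache implementation, and the 0.92 threshold mirrors llm-cache-similarity-threshold above.

```rust
/// A cached entry: the embedding of a previous prompt plus its stored response.
struct CacheEntry {
    embedding: Vec<f32>,
    response: String,
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return a cached response if any stored prompt is similar enough to the new one.
fn lookup<'a>(cache: &'a [CacheEntry], query_embedding: &[f32], threshold: f32) -> Option<&'a str> {
    cache
        .iter()
        .map(|e| (e, cosine_similarity(&e.embedding, query_embedding)))
        .filter(|(_, score)| *score >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(e, _)| e.response.as_str())
}

fn main() {
    let cache = vec![CacheEntry {
        embedding: vec![0.1, 0.9, 0.3],
        response: "Our store opens at 9am.".into(),
    }];
    // A near-duplicate question produces a nearby embedding, so the cache answers it
    // without a new API call.
    println!("{:?}", lookup(&cache, &[0.12, 0.88, 0.31], 0.92));
}
```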

Batching

For bulk operations, use batch APIs when available:

llm-batch-enabled,true
llm-batch-size,10
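
The sketch below shows only the grouping step: prompts are split into batches of ten, matching llm-batch-size above, before submission. The send call is left as a comment because batch endpoints differ per provider; treat this as an illustration rather than a working client.

```rust
fn main() {
    // Fifty hypothetical prompts to process in bulk.
    let prompts: Vec<String> = (0..50).map(|i| format!("Classify support ticket #{i}")).collect();

    // Group the work into batches of ten and submit each batch with a single
    // call where the provider offers a batch API.
    for (n, batch) in prompts.chunks(10).enumerate() {
        println!("batch {n}: {} prompts", batch.len());
        // send_batch(batch) would go here; the endpoint differs per provider.
    }
}
```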

Context Management

Optimize context window usage with episodic memory:

episodic-memory-enabled,true
episodic-memory-threshold,4
episodic-memory-history,2
episodic-memory-auto-summarize,true

See Episodic Memory for details.

Troubleshooting

Common Issues

API Key Invalid

  • Verify key is correct and not expired
  • Check if key has required permissions
  • Ensure billing is active

Model Not Found

  • Check model name spelling
  • Verify model is available in your region
  • Some models require waitlist access

Rate Limits

  • Implement exponential backoff (see the sketch after this list)
  • Use caching to reduce calls
  • Consider upgrading API tier
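
A minimal backoff sketch in Rust, under the assumption that the failing call is synchronous and safe to retry; with_backoff and the simulated 429 error are illustrative names, not part of General Bots.

```rust
use std::{thread, time::Duration};

/// Retry a fallible call with exponential backoff: wait 1s, 2s, 4s, 8s, ...
fn with_backoff<T, E>(mut call: impl FnMut() -> Result<T, E>, max_retries: u32) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                // Double the delay on every failed attempt before retrying.
                thread::sleep(Duration::from_secs(1u64 << attempt));
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulate a request that is rate limited twice and then succeeds.
    let mut failures = 2;
    let result = with_backoff(
        || {
            if failures > 0 {
                failures -= 1;
                Err("429 Too Many Requests")
            } else {
                Ok("response text")
            }
        },
        5,
    );
    println!("{result:?}");
}
```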

Local Model Slow

  • Check GPU memory usage
  • Reduce context size
  • Use quantized models (Q4 instead of F16)

Logging

Enable LLM logging for debugging:

llm-log-requests,true
llm-log-responses,false
llm-log-timing,true

Next Steps