2025-12-03 19:56:35 -03:00
# Scaling and Load Balancing
General Bots is designed to scale from a single instance to a distributed cluster using LXC containers. This chapter covers auto-scaling, load balancing, sharding strategies, and failover systems.
## Scaling Architecture
General Bots uses a **horizontal scaling** approach with LXC containers:
```
                    ┌─────────────────┐
                    │   Caddy Proxy   │
                    │ (Load Balancer) │
                    └────────┬────────┘
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│  LXC Container  │ │  LXC Container  │ │  LXC Container  │
│   botserver-1   │ │   botserver-2   │ │   botserver-3   │
└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   PostgreSQL    │ │      Redis      │ │     Qdrant      │
│    (Primary)    │ │    (Cluster)    │ │    (Cluster)    │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
## Auto-Scaling Configuration
### config.csv Parameters
Configure auto-scaling behavior in your bot's `config.csv`:
```csv
# Auto-scaling settings
scale-enabled,true
scale-min-instances,1
scale-max-instances,10
scale-cpu-threshold,70
scale-memory-threshold,80
scale-request-threshold,1000
scale-cooldown-seconds,300
scale-check-interval,30
```
| Parameter | Description | Default |
|-----------|-------------|---------|
| `scale-enabled` | Enable auto-scaling | `false` |
| `scale-min-instances` | Minimum container count | `1` |
| `scale-max-instances` | Maximum container count | `10` |
| `scale-cpu-threshold` | CPU % to trigger scale-up | `70` |
| `scale-memory-threshold` | Memory % to trigger scale-up | `80` |
| `scale-request-threshold` | Requests/min to trigger scale-up | `1000` |
| `scale-cooldown-seconds` | Wait time between scaling events | `300` |
| `scale-check-interval` | Seconds between metric checks | `30` |
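
The way these parameters interact can be sketched in a few lines of Python. This is only an illustration, not botserver's actual logic: the `ScaleConfig` class and `decide_scale_up` function are hypothetical names, and the sketch assumes that exceeding any single threshold is enough to trigger a scale-up once the cooldown has elapsed.

```python
from dataclasses import dataclass


@dataclass
class ScaleConfig:
    """Hypothetical mirror of the config.csv keys, using the values shown above."""
    min_instances: int = 1
    max_instances: int = 10
    cpu_threshold: float = 70.0
    memory_threshold: float = 80.0
    request_threshold: float = 1000.0
    cooldown_seconds: int = 300


def decide_scale_up(cfg, instances, cpu, memory, requests_per_min,
                    now, last_scale_at):
    """Return True when another instance should be started.

    `now` and `last_scale_at` are Unix timestamps in seconds;
    `last_scale_at` is None if no scaling event has happened yet.
    """
    # Never exceed the configured maximum
    if instances >= cfg.max_instances:
        return False
    # Respect the cooldown between scaling events
    if last_scale_at is not None and now - last_scale_at < cfg.cooldown_seconds:
        return False
    # Assumption: any one metric above its threshold triggers a scale-up
    return (
        cpu > cfg.cpu_threshold
        or memory > cfg.memory_threshold
        or requests_per_min > cfg.request_threshold
    )
```

Passing the clock in as `now` keeps the decision deterministic, which makes threshold and cooldown settings easy to check before enabling `scale-enabled` in production.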
### Scaling Rules
Define custom scaling rules:
```csv
# Scale up when average response time exceeds 2 seconds
scale-rule-response-time,2000
scale-rule-response-action,up
# Scale down when CPU drops below 30%
scale-rule-cpu-low,30
scale-rule-cpu-low-action,down
# Scale up on queue depth
scale-rule-queue-depth,100
scale-rule-queue-action,up
```
## LXC Container Management
### Creating Scaled Instances
```bash
# Create additional botserver containers
for i in {2..5}; do
    lxc launch images:debian/12 "botserver-$i"
    lxc config device add "botserver-$i" "port-$((8080+i))" proxy \
        listen=tcp:0.0.0.0:$((8080+i)) connect=tcp:127.0.0.1:8080
done
```
### Container Resource Limits
Set resource limits per container:
```bash
# CPU limits (number of cores)
lxc config set botserver-1 limits.cpu 4
# Memory limits
lxc config set botserver-1 limits.memory 8GB
# Disk I/O priority (0-10)
lxc config set botserver-1 limits.disk.priority 5
# Network bandwidth (ingress/egress)
lxc config device set botserver-1 eth0 limits.ingress 100Mbit
lxc config device set botserver-1 eth0 limits.egress 100Mbit
```
### Auto-Scaling Script
Create `/opt/gbo/scripts/autoscale.sh`:
```bash
#!/bin/bash
# Auto-scaler for botserver LXC containers.
# Assumes the launched image (or a later provisioning step) provides
# /opt/gbo/bin/botserver; a stock Debian image does not include it.

# Configuration
MIN_INSTANCES=1
MAX_INSTANCES=10
CPU_THRESHOLD=70
CPU_LOW_THRESHOLD=30
SCALE_COOLDOWN=300
CHECK_INTERVAL=30
LAST_SCALE_FILE="/tmp/last_scale_time"

list_containers() {
    lxc list -c n --format csv | grep "^botserver-"
}

# Approximates CPU usage as the 1-minute load average x 100.
# Load average is not a true CPU percentage, so tune the thresholds accordingly.
get_avg_cpu() {
    local total=0
    local count=0
    local cpu
    for container in $(list_containers); do
        cpu=$(lxc exec "$container" -- cat /proc/loadavg | awk '{print $1}')
        total=$(echo "$total + $cpu" | bc)
        count=$((count + 1))
    done
    # Avoid dividing by zero when no containers exist
    if [ "$count" -eq 0 ]; then
        echo 0
        return
    fi
    echo "scale=2; $total / $count * 100" | bc
}

get_instance_count() {
    lxc list -c n --format csv | grep -c "^botserver-"
}

# Succeeds only when the cooldown since the last scaling event has elapsed
can_scale() {
    if [ ! -f "$LAST_SCALE_FILE" ]; then
        return 0
    fi
    local last_scale now diff
    last_scale=$(cat "$LAST_SCALE_FILE")
    now=$(date +%s)
    diff=$((now - last_scale))
    [ "$diff" -gt "$SCALE_COOLDOWN" ]
}

scale_up() {
    local current new_id
    current=$(get_instance_count)
    if [ "$current" -ge "$MAX_INSTANCES" ]; then
        echo "Already at max instances ($MAX_INSTANCES)"
        return 1
    fi
    new_id=$((current + 1))
    echo "Scaling up: creating botserver-$new_id"
    lxc launch images:debian/12 "botserver-$new_id"
    lxc config set "botserver-$new_id" limits.cpu 4
    lxc config set "botserver-$new_id" limits.memory 8GB
    # Copy configuration (-p creates /opt/gbo/conf in the container if missing)
    lxc file push -p /opt/gbo/conf/botserver.env "botserver-$new_id/opt/gbo/conf/"
    # Start botserver in the background
    lxc exec "botserver-$new_id" -- /opt/gbo/bin/botserver &
    # Update load balancer
    update_load_balancer
    date +%s > "$LAST_SCALE_FILE"
    echo "Scale up complete"
}

scale_down() {
    local current target
    current=$(get_instance_count)
    if [ "$current" -le "$MIN_INSTANCES" ]; then
        echo "Already at min instances ($MIN_INSTANCES)"
        return 1
    fi
    # Remove highest numbered instance
    target="botserver-$current"
    echo "Scaling down: removing $target"
    # Drain connections
    lxc exec "$target" -- /opt/gbo/bin/botserver drain
    sleep 30
    # Stop and delete
    lxc stop "$target"
    lxc delete "$target"
    # Update load balancer
    update_load_balancer
    date +%s > "$LAST_SCALE_FILE"
    echo "Scale down complete"
}

update_load_balancer() {
    # Build a space-separated upstream list from each container's first IPv4 address
    local upstreams=""
    local ip
    for container in $(list_containers); do
        # Anchor the name so botserver-1 does not also match botserver-10
        ip=$(lxc list "^${container}\$" -c 4 --format csv | cut -d' ' -f1)
        [ -n "$ip" ] && upstreams="$upstreams $ip:8080"
    done
    # Write a Caddyfile snippet. The main Caddyfile is assumed to import this
    # file and to use "import botserver_upstreams" inside its site block.
    cat > /opt/gbo/conf/caddy/upstream.conf << EOF
(botserver_upstreams) {
    reverse_proxy$upstreams {
        lb_policy round_robin
        health_uri /api/health
        health_interval 10s
    }
}
EOF
    # Reload Caddy (uses the admin API, so "admin off" must not be set)
    lxc exec proxy-1 -- caddy reload --config /etc/caddy/Caddyfile
}

# Main loop
while true; do
    avg_cpu=$(get_avg_cpu)
    echo "Average CPU: $avg_cpu%"
    if can_scale; then
        if (( $(echo "$avg_cpu > $CPU_THRESHOLD" | bc -l) )); then
            scale_up
        elif (( $(echo "$avg_cpu < $CPU_LOW_THRESHOLD" | bc -l) )); then
            scale_down
        fi
    fi
    sleep "$CHECK_INTERVAL"
done
```
## Load Balancing
### Caddy Configuration
Primary load balancer configuration (`/opt/gbo/conf/caddy/Caddyfile`):
```caddyfile
{
    # Keep the admin API on localhost: "caddy reload" depends on it
    admin localhost:2019
    # Automatic HTTPS is enabled by default, so no option is needed to turn it on
}

(common) {
    encode gzip zstd
    header {
        -Server
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        Referrer-Policy "strict-origin-when-cross-origin"
    }
}

bot.example.com {
    import common

    # Health check endpoint (no load balancing)
    handle /api/health {
        reverse_proxy localhost:8080
    }

    # WebSocket connections (sticky sessions)
    handle /ws* {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy cookie
            lb_try_duration 5s
            health_uri /api/health
            health_interval 10s
            health_timeout 5s
        }
    }

    # API requests (round robin)
    handle /api/* {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy round_robin
            lb_try_duration 5s
            health_uri /api/health
            health_interval 10s
            fail_duration 30s
        }
    }

    # Static files (any instance)
    handle {
        reverse_proxy botserver-1:8080 botserver-2:8080 botserver-3:8080 {
            lb_policy first
        }
    }
}
```
### Load Balancing Policies
| Policy | Description | Use Case |
|--------|-------------|----------|
| `round_robin` | Rotate through backends | General API requests |
| `first` | Use first available | Static content |
| `least_conn` | Fewest active connections | Long-running requests |
| `ip_hash` | Consistent by client IP | Session affinity |
| `cookie` | Sticky sessions via cookie | WebSocket, stateful |
| `random` | Random selection | Testing |
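
To make the difference between the policies concrete, here is a minimal Python sketch of three of them. It only illustrates the selection idea and is not how Caddy implements these policies internally; the class and function names are hypothetical, and the hash used for `ip_hash` is an arbitrary choice for the example.

```python
import hashlib
from itertools import count


class RoundRobin:
    """Rotate through backends in order, wrapping around at the end."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = count()

    def pick(self):
        return self.backends[next(self._next) % len(self.backends)]


def least_conn(active):
    """Pick the backend with the fewest open connections.

    `active` maps backend name -> current connection count.
    """
    return min(active, key=active.get)


def ip_hash(backends, client_ip):
    """Map a client IP to a backend so the same client keeps the same backend."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

Note that `ip_hash` in this form remaps most clients whenever the backend list changes, which is one reason cookie-based stickiness is preferred for WebSocket traffic behind an auto-scaled pool.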
### Rate Limiting
Configure rate limits in `config.csv`:
```csv
# Rate limiting
rate-limit-enabled,true
rate-limit-requests,100
rate-limit-window,60
rate-limit-burst,20
rate-limit-by,ip
# Per-endpoint limits
rate-limit-api-chat,30
rate-limit-api-files,50
rate-limit-api-auth,10
```
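
One common way to combine a sustained rate with a burst allowance is a token bucket. The sketch below is an assumption about how `rate-limit-requests`, `rate-limit-window`, and `rate-limit-burst` could interact, not a description of botserver's implementation; `TokenBucket` and `allow_request` are hypothetical names.

```python
class TokenBucket:
    """Token bucket: refills at requests/window per second, holds at most `burst`."""

    def __init__(self, requests, window_seconds, burst):
        self.rate = requests / window_seconds  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)             # start with a full bucket
        self.updated = None                    # timestamp of the last call

    def allow(self, now):
        """Return True and consume a token if a request at time `now` is allowed."""
        if self.updated is not None:
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client key, matching rate-limit-by,ip
buckets = {}


def allow_request(client_ip, now, requests=100, window_seconds=60, burst=20):
    bucket = buckets.setdefault(
        client_ip, TokenBucket(requests, window_seconds, burst)
    )
    return bucket.allow(now)
```

With the values from the configuration above, a client can send 20 requests back to back, after which requests are admitted at roughly 100 per minute.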
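Rate limiting in Caddy. The `rate_limit` directive is not part of the standard Caddy build; it is provided by a third-party module such as `mholt/caddy-ratelimit`, which must be compiled into the binary (for example with `xcaddy`):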
```caddyfile
bot.example.com {
    # Global rate limit
    rate_limit {
        zone global {
            key {remote_host}
            events 100
            window 1m
        }
    }

    # Stricter limit for auth endpoints
    handle /api/auth/* {
        rate_limit {
            zone auth {
                key {remote_host}
                events 10
                window 1m
            }
        }
        reverse_proxy botserver:8080
    }
}
```
## Sharding Strategies
### Database Sharding Options
#### Option 1: Tenant-Based Sharding
Each tenant gets their own database:
```
┌─────────────────┐
│  Router/Proxy   │
└────────┬────────┘
    ┌────┴────┬─────────┐
    │         │         │
    ▼         ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐
│Tenant1│ │Tenant2│ │Tenant3│
│  DB   │ │  DB   │ │  DB   │
└───────┘ └───────┘ └───────┘
```
Configuration:
```csv
# Tenant sharding
shard-strategy,tenant
shard-tenant-db-prefix,gb_tenant_
shard-auto-create,true
```
#### Option 2: Hash-Based Sharding
Distribute data by hash of primary key:
```
User ID: 12345
Hash: 12345 % 4 = 1
Shard: shard-1
```
Configuration:
```csv
# Hash sharding
shard-strategy,hash
shard-count,4
shard-key,user_id
shard-algorithm,modulo
```
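
The modulo routing above, and its main drawback, can be shown with a short Python sketch. The function names are hypothetical, and using CRC32 for non-integer keys is an assumption made for the example.

```python
import zlib


def shard_for(key, shard_count):
    """Return the shard index for a key using modulo hashing."""
    if isinstance(key, int):
        return key % shard_count
    # Non-integer keys are hashed first (CRC32 is an arbitrary stable choice)
    return zlib.crc32(str(key).encode("utf-8")) % shard_count


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


shard = shard_for(12345, 4)              # 12345 % 4 = 1, i.e. shard-1
moved = moved_fraction(range(1000), 4, 5)  # 0.8: 80% of keys change shard
```

Growing from 4 to 5 shards relocates 80% of sequential IDs under plain modulo, so changing `shard-count` on a live system implies a large data migration.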
#### Option 3: Range-Based Sharding
Partition by ID ranges:
```csv
# Range sharding
shard-strategy,range
shard-ranges,0-999999:shard1,1000000-1999999:shard2,2000000-:shard3
```
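
A lookup against the `shard-ranges` value could work roughly as follows. This is a sketch that assumes the format shown above (`low-high:shard`, with an open upper bound written as `low-:shard`); `parse_ranges` and `shard_for_id` are hypothetical helper names.

```python
def parse_ranges(spec):
    """Parse '0-999999:shard1,2000000-:shard3' into sorted (low, high, shard) tuples.

    An empty upper bound means the range is open-ended (high is None).
    """
    ranges = []
    for entry in spec.split(","):
        bounds, shard = entry.split(":")
        low, _, high = bounds.partition("-")
        ranges.append((int(low), int(high) if high else None, shard))
    return sorted(ranges, key=lambda r: r[0])


def shard_for_id(ranges, record_id):
    """Return the shard whose inclusive range contains record_id."""
    for low, high, shard in ranges:
        if record_id >= low and (high is None or record_id <= high):
            return shard
    raise KeyError(f"no shard covers id {record_id}")


RANGES = parse_ranges("0-999999:shard1,1000000-1999999:shard2,2000000-:shard3")
```

Both bounds are treated as inclusive here, so `999999` routes to `shard1` and `1000000` to `shard2`.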
#### Option 4: Geographic Sharding
Route by user location:
```csv
# Geographic sharding
shard-strategy,geo
shard-geo-us,postgres-us.example.com
shard-geo-eu,postgres-eu.example.com
shard-geo-asia,postgres-asia.example.com
shard-default,postgres-us.example.com
```
### Vector Database Sharding (Qdrant)
Qdrant supports automatic sharding:
```csv
# Qdrant sharding
qdrant-shard-count,4
qdrant-replication-factor,2
qdrant-write-consistency,majority
```
Collection creation with sharding:
```rust
// In vectordb code
let collection_config = CreateCollection {
    collection_name: format!("kb_{}", bot_id),
    vectors_config: VectorsConfig::Single(VectorParams {
        size: 384,
        distance: Distance::Cosine,
    }),
    shard_number: Some(4),
    replication_factor: Some(2),
    write_consistency_factor: Some(1),
    ..Default::default()
};
```
### Redis Cluster
For high-availability caching:
```csv
# Redis cluster
cache-mode,cluster
cache-nodes,redis-1:6379,redis-2:6379,redis-3:6379
cache-replicas,1
```
## Failover Systems
### Health Checks
Configure health check endpoints:
```csv
# Health check configuration
health-enabled,true
health-endpoint,/api/health
health-interval,10
health-timeout,5
health-retries,3
```
Health check response:
```json
{
  "status": "healthy",
  "version": "6.1.0",
  "uptime": 86400,
  "checks": {
    "database": "ok",
    "cache": "ok",
    "vectordb": "ok",
    "llm": "ok"
  },
  "metrics": {
    "cpu": 45.2,
    "memory": 62.1,
    "connections": 150
  }
}
```
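
A monitoring client might interpret this response and apply `health-retries` along the following lines. The sketch assumes that an instance counts as healthy only when `status` is `"healthy"` and every entry in `checks` is `"ok"`, and that it is removed from rotation after `health-retries` consecutive failures; both rules and the names used are assumptions for illustration.

```python
import json


def is_healthy(body):
    """Return True if a health response body reports a fully healthy instance."""
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(data, dict):
        return False
    checks = data.get("checks", {})
    if not isinstance(checks, dict):
        return False
    return data.get("status") == "healthy" and all(
        value == "ok" for value in checks.values()
    )


class HealthTracker:
    """Tracks consecutive failed probes for one instance."""

    def __init__(self, retries=3):
        self.retries = retries
        self.failures = 0

    def record(self, healthy):
        """Record one probe result; return True while the instance stays in rotation."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures < self.retries
```

Requiring several consecutive failures avoids ejecting an instance because of a single slow or dropped probe.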
### Automatic Failover
#### Database Failover (PostgreSQL)
Using Patroni for PostgreSQL HA:
```yaml
# patroni.yml
scope: botserver-cluster
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgres-1:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 200
        shared_buffers: 2GB

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgres-1:5432
  data_dir: /var/lib/postgresql/data
  # Patroni does not expand ${...} placeholders itself: render this file with a
  # templating step, or set PATRONI_SUPERUSER_PASSWORD and
  # PATRONI_REPLICATION_PASSWORD in the environment instead.
  authentication:
    superuser:
      username: postgres
      password: ${POSTGRES_PASSWORD}
    replication:
      username: replicator
      password: ${REPLICATION_PASSWORD}
```
#### Cache Failover (Redis Sentinel)
```csv
# Redis Sentinel configuration
cache-mode,sentinel
cache-sentinel-master,mymaster
cache-sentinel-nodes,sentinel-1:26379,sentinel-2:26379,sentinel-3:26379
```
### Circuit Breaker
Prevent cascade failures:
```csv
# Circuit breaker settings
circuit-breaker-enabled,true
circuit-breaker-threshold,5
circuit-breaker-timeout,30
circuit-breaker-half-open-requests,3
```
States:
- **Closed**: Normal operation
- **Open**: Failing, reject requests immediately
- **Half-Open**: Testing if service recovered
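
The three states and the settings above can be tied together in a small state machine. The Python sketch below is illustrative only: it assumes `circuit-breaker-threshold` counts consecutive failures, `circuit-breaker-timeout` is in seconds, and `circuit-breaker-half-open-requests` is the number of successful trial requests needed to close the circuit again. The class name and methods are hypothetical.

```python
class CircuitBreaker:
    def __init__(self, threshold=5, timeout=30, half_open_requests=3):
        self.threshold = threshold
        self.timeout = timeout
        self.half_open_requests = half_open_requests
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.trial_successes = 0

    def allow(self, now):
        """Return True if a request may be attempted at time `now` (seconds)."""
        if self.state == "open":
            if now - self.opened_at >= self.timeout:
                # Timeout elapsed: let trial requests through
                self.state = "half-open"
                self.trial_successes = 0
                return True
            return False
        return True

    def record_success(self):
        if self.state == "half-open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_requests:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self, now):
        if self.state == "half-open":
            # A failed trial request reopens the circuit immediately
            self._open(now)
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures = 0
```

Callers check `allow(now)` before each request and report the outcome with `record_success()` or `record_failure(now)`; while the circuit is open, requests fail fast instead of waiting on a dependency that is already struggling.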
### Graceful Degradation
Configure fallback behavior:
```csv
# Fallback configuration
fallback-llm-enabled,true
fallback-llm-provider,local
fallback-llm-model,DeepSeek-R1-Distill-Qwen-1.5B
fallback-cache-enabled,true
fallback-cache-mode,memory
fallback-vectordb-enabled,true
fallback-vectordb-mode,keyword-search
```
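
Conceptually, each fallback wraps a primary dependency and switches to a simpler substitute when the primary fails. The following Python sketch shows the pattern for `fallback-vectordb-mode,keyword-search`; the `with_fallback` wrapper and the naive `keyword_search` scorer are hypothetical and stand in for whatever botserver does internally.

```python
def with_fallback(primary, fallback, enabled=True):
    """Wrap `primary` so that `fallback` is used when it raises an exception."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            if not enabled:
                raise
            return fallback(*args, **kwargs)
    return call


def keyword_search(documents, query):
    """Rank documents by how often the query terms occur (no embeddings needed)."""
    terms = query.lower().split()
    scored = []
    for doc in documents:
        text = doc.lower()
        score = sum(text.count(term) for term in terms)
        if score:
            scored.append((score, doc))
    # Highest score first; documents with equal scores keep their input order
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked]


def vector_search(documents, query):
    """Stand-in for a vector database query that is currently unavailable."""
    raise ConnectionError("vector database unreachable")


search = with_fallback(vector_search, keyword_search)
```

Results from the fallback are usually less relevant than vector search, but the bot keeps answering instead of returning an error.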
## Monitoring Scaling
### Metrics Collection
Key metrics to monitor:
```csv
# Scaling metrics
metrics-scaling-enabled,true
metrics-container-count,true
metrics-scaling-events,true
metrics-load-distribution,true
```
### Alerting Rules
Configure alerts for scaling issues:
```yaml
# alerting-rules.yml
groups:
  - name: scaling
    rules:
      - alert: HighCPUUsage
        expr: avg(cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"

      - alert: MaxInstancesReached
        expr: container_count >= max_instances
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Maximum instances reached, cannot scale up"

      - alert: ScalingFailed
        expr: scaling_errors > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scaling operation failed"
```
## Best Practices
### Scaling
1. **Start small** - Begin with auto-scaling disabled, monitor patterns first
2. **Set appropriate thresholds** - Too low causes thrashing, too high causes poor performance
3. **Use cooldown periods** - Prevent rapid scale up/down cycles
4. **Test failover** - Regularly test your failover procedures
5. **Monitor costs** - More instances = higher infrastructure costs
### Load Balancing
1. **Use sticky sessions for WebSockets** - Required for real-time features
2. **Enable health checks** - Remove unhealthy instances automatically
3. **Configure timeouts** - Prevent hanging connections
4. **Use connection pooling** - Reduce connection overhead
### Sharding
1. **Choose the right strategy** - Tenant-based is simplest for SaaS
2. **Plan for rebalancing** - Have procedures to move data between shards
3. **Avoid cross-shard queries** - Design to minimize these
4. **Monitor shard balance** - Uneven distribution causes hotspots
## Next Steps
- [Container Deployment](./containers.md) - LXC container basics
- [Architecture Overview](./architecture.md) - System design
- [Monitoring Dashboard](../chapter-04-gbui/monitoring.md) - Observe your cluster