Add Scale chapter with sharding and database optimization docs
New Chapter 18 - Scale:

- README with overview of scaling architecture
- Sharding Architecture: tenant routing, shard config, data model, operations
- Database Optimization: SMALLINT enums, indexing, partitioning, connection pooling
- Enum reference tables for all domain values (Channel, Message, Task, etc.)
- Best practices for billion-user scale deployments
parent c7bc1355c2
commit 6bfab51293
4 changed files with 830 additions and 0 deletions
79
src/21-scale/README.md
Normal file
@ -0,0 +1,79 @@
# Scale

This chapter covers horizontal scaling strategies for General Bots to support millions to billions of users across global deployments.

## Overview

General Bots is designed from the ground up to scale horizontally. The architecture supports:

- **Multi-tenancy**: Complete isolation between organizations
- **Regional sharding**: Data locality for compliance and performance
- **Database partitioning**: Efficient handling of high-volume tables
- **Stateless services**: Easy horizontal pod autoscaling

## Chapter Contents

- [Sharding Architecture](./sharding.md) - How data is distributed across shards
- [Database Optimization](./database-optimization.md) - Schema design for billion-scale
- [Regional Deployment](./regional-deployment.md) - Multi-region setup
- [Performance Tuning](./performance-tuning.md) - Optimization strategies

## Key Concepts

### Tenant Isolation

Every piece of data in General Bots is associated with a `tenant_id`. This enables:

1. Complete data isolation between organizations
2. Per-tenant resource limits and quotas
3. Tenant-specific configurations
4. Easy data export/deletion for compliance

### Shard Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       Load Balancer                         │
└─────────────────────────┬───────────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
        ▼                 ▼                 ▼
   ┌─────────┐       ┌─────────┐       ┌─────────┐
   │ Region  │       │ Region  │       │ Region  │
   │  USA    │       │  EUR    │       │  APAC   │
   └────┬────┘       └────┬────┘       └────┬────┘
        │                 │                 │
   ┌────┴────┐       ┌────┴────┐       ┌────┴────┐
   │ Shard 1 │       │ Shard 2 │       │ Shard 3 │
   │ Shard 4 │       │ Shard 5 │       │ Shard 6 │
   └─────────┘       └─────────┘       └─────────┘
```

### Database Design Principles

1. **SMALLINT enums** instead of VARCHAR for domain values (2 bytes vs 20+ bytes)
2. **Partitioned tables** for high-volume data (messages, sessions, analytics)
3. **Composite primary keys** including `shard_id` for distributed queries
4. **Snowflake-like IDs** for globally unique, time-sortable identifiers

## When to Scale

| Users | Sessions/day | Messages/day | Recommended Setup |
|-------|--------------|--------------|-------------------|
| < 10K | < 100K | < 1M | Single node |
| 10K-100K | 100K-1M | 1M-10M | 2-3 nodes, single region |
| 100K-1M | 1M-10M | 10M-100M | Multi-node, consider sharding |
| 1M-10M | 10M-100M | 100M-1B | Regional shards |
| > 10M | > 100M | > 1B | Global shards with Citus/CockroachDB |

## Quick Start

To enable sharding in your deployment:

1. Configure shard mapping in the `shard_config` table
2. Set the `SHARD_ID` environment variable per instance
3. Deploy region-specific instances
4. Configure load balancer routing rules

See [Sharding Architecture](./sharding.md) for detailed setup instructions.
436
src/21-scale/database-optimization.md
Normal file
@ -0,0 +1,436 @@
# Database Optimization

This document covers database schema design and optimization strategies for billion-user scale deployments.

## Schema Design Principles

### Use SMALLINT Enums Instead of VARCHAR

One of the most impactful optimizations is using integer enums instead of string-based status fields.

**Before (inefficient):**

```sql
CREATE TABLE auto_tasks (
    id UUID PRIMARY KEY,
    status VARCHAR(50) NOT NULL DEFAULT 'pending',
    priority VARCHAR(20) NOT NULL DEFAULT 'normal',
    execution_mode VARCHAR(50) NOT NULL DEFAULT 'supervised',
    CONSTRAINT check_status CHECK (status IN ('pending', 'ready', 'running', 'paused', 'waiting_approval', 'completed', 'failed', 'cancelled'))
);
```

**After (optimized):**

```sql
CREATE TABLE auto_tasks (
    id UUID PRIMARY KEY,
    status SMALLINT NOT NULL DEFAULT 0,          -- 2 bytes
    priority SMALLINT NOT NULL DEFAULT 1,        -- 2 bytes
    execution_mode SMALLINT NOT NULL DEFAULT 1   -- 2 bytes
);
```

### Storage Comparison

| Field Type | Storage | Example Values |
|------------|---------|----------------|
| VARCHAR(50) | 1-51 bytes | 'waiting_approval' = 17 bytes |
| TEXT | 1+ bytes | 'completed' = 10 bytes |
| SMALLINT | 2 bytes | 4 = 2 bytes (always) |
| INTEGER | 4 bytes | 4 = 4 bytes (always) |

**Savings per row with 5 enum fields:**

- VARCHAR: ~50 bytes average
- SMALLINT: 10 bytes fixed
- **Savings: 40 bytes per row = 40 GB per billion rows**
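
The savings figure is easy to sanity-check; a quick sketch in Rust, using the averages assumed above (~50 bytes for five VARCHAR enum fields vs 10 bytes for five SMALLINTs):

```rust
// Rough check of the storage-savings claim. The 50- and 10-byte figures are
// the per-row averages assumed in the comparison table, not measured values.
fn enum_savings_gb(rows: u64, varchar_bytes: u64, smallint_bytes: u64) -> u64 {
    rows * (varchar_bytes - smallint_bytes) / 1_000_000_000
}

fn main() {
    // One billion rows, five enum fields per row
    let gb = enum_savings_gb(1_000_000_000, 50, 10);
    assert_eq!(gb, 40); // ~40 GB per billion rows, as stated above
    println!("saved: {gb} GB");
}
```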

## Enum Value Reference

All domain values in General Bots use SMALLINT. Reference table:

### Channel Types

| Value | Name | Description |
|-------|------|-------------|
| 0 | web | Web chat interface |
| 1 | whatsapp | WhatsApp Business |
| 2 | telegram | Telegram Bot |
| 3 | msteams | Microsoft Teams |
| 4 | slack | Slack |
| 5 | email | Email channel |
| 6 | sms | SMS/Text messages |
| 7 | voice | Voice/Phone |
| 8 | instagram | Instagram DM |
| 9 | api | Direct API |
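
As a sketch of how these SMALLINT values might map onto application code, here is a hypothetical Rust enum with `#[repr(i16)]` and a fallible conversion for values read back from the database. The type and its `TryFrom` impl are illustrative, not the actual General Bots types:

```rust
// Illustrative mapping of the Channel SMALLINT values from the table above.
// An unmapped value is returned as an error rather than panicking, since a
// newer schema version may contain values this build does not know about.
#[derive(Debug, PartialEq, Clone, Copy)]
#[repr(i16)]
enum Channel {
    Web = 0, Whatsapp = 1, Telegram = 2, Msteams = 3, Slack = 4,
    Email = 5, Sms = 6, Voice = 7, Instagram = 8, Api = 9,
}

impl TryFrom<i16> for Channel {
    type Error = i16;
    fn try_from(v: i16) -> Result<Self, i16> {
        Ok(match v {
            0 => Channel::Web, 1 => Channel::Whatsapp, 2 => Channel::Telegram,
            3 => Channel::Msteams, 4 => Channel::Slack, 5 => Channel::Email,
            6 => Channel::Sms, 7 => Channel::Voice, 8 => Channel::Instagram,
            9 => Channel::Api,
            other => return Err(other),
        })
    }
}

fn main() {
    assert_eq!(Channel::try_from(1), Ok(Channel::Whatsapp));
    assert_eq!(Channel::Slack as i16, 4); // write side: enum -> SMALLINT
    assert!(Channel::try_from(42).is_err());
}
```

The same pattern applies to every enum table that follows.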

### Message Role

| Value | Name | Description |
|-------|------|-------------|
| 1 | user | User message |
| 2 | assistant | Bot response |
| 3 | system | System prompt |
| 4 | tool | Tool call/result |
| 9 | episodic | Episodic memory summary |
| 10 | compact | Compacted conversation |

### Message Type

| Value | Name | Description |
|-------|------|-------------|
| 0 | text | Plain text |
| 1 | image | Image attachment |
| 2 | audio | Audio file |
| 3 | video | Video file |
| 4 | document | Document/PDF |
| 5 | location | GPS location |
| 6 | contact | Contact card |
| 7 | sticker | Sticker |
| 8 | reaction | Message reaction |

### LLM Provider

| Value | Name | Description |
|-------|------|-------------|
| 0 | openai | OpenAI API |
| 1 | anthropic | Anthropic Claude |
| 2 | azure_openai | Azure OpenAI |
| 3 | azure_claude | Azure Claude |
| 4 | google | Google AI |
| 5 | local | Local llama.cpp |
| 6 | ollama | Ollama |
| 7 | groq | Groq |
| 8 | mistral | Mistral AI |
| 9 | cohere | Cohere |

### Task Status

| Value | Name | Description |
|-------|------|-------------|
| 0 | pending | Waiting to start |
| 1 | ready | Ready to execute |
| 2 | running | Currently executing |
| 3 | paused | Paused by user |
| 4 | waiting_approval | Needs approval |
| 5 | completed | Successfully finished |
| 6 | failed | Failed with error |
| 7 | cancelled | Cancelled by user |

### Task Priority

| Value | Name | Description |
|-------|------|-------------|
| 0 | low | Low priority |
| 1 | normal | Normal priority |
| 2 | high | High priority |
| 3 | urgent | Urgent |
| 4 | critical | Critical |

### Execution Mode

| Value | Name | Description |
|-------|------|-------------|
| 0 | manual | Manual execution only |
| 1 | supervised | Requires approval |
| 2 | autonomous | Fully automatic |

### Risk Level

| Value | Name | Description |
|-------|------|-------------|
| 0 | none | No risk |
| 1 | low | Low risk |
| 2 | medium | Medium risk |
| 3 | high | High risk |
| 4 | critical | Critical risk |

### Approval Status

| Value | Name | Description |
|-------|------|-------------|
| 0 | pending | Awaiting decision |
| 1 | approved | Approved |
| 2 | rejected | Rejected |
| 3 | expired | Timed out |
| 4 | skipped | Skipped |

### Intent Type

| Value | Name | Description |
|-------|------|-------------|
| 0 | unknown | Unclassified |
| 1 | app_create | Create application |
| 2 | todo | Create task/reminder |
| 3 | monitor | Set up monitoring |
| 4 | action | Execute action |
| 5 | schedule | Create schedule |
| 6 | goal | Set goal |
| 7 | tool | Create tool |
| 8 | query | Query/search |

### Memory Type

| Value | Name | Description |
|-------|------|-------------|
| 0 | short | Short-term |
| 1 | long | Long-term |
| 2 | episodic | Episodic |
| 3 | semantic | Semantic |
| 4 | procedural | Procedural |

### Sync Status

| Value | Name | Description |
|-------|------|-------------|
| 0 | synced | Fully synced |
| 1 | pending | Sync pending |
| 2 | conflict | Conflict detected |
| 3 | error | Sync error |
| 4 | deleted | Marked for deletion |

## Indexing Strategies

### Composite Indexes for Common Queries

```sql
-- Session lookup by user
CREATE INDEX idx_sessions_user ON user_sessions(user_id, created_at DESC);

-- Messages by session (most common query)
CREATE INDEX idx_messages_session ON message_history(session_id, message_index);

-- Active tasks by status and priority
-- (status < 5 excludes completed/failed/cancelled)
CREATE INDEX idx_tasks_status ON auto_tasks(status, priority) WHERE status < 5;

-- Tenant-scoped queries
CREATE INDEX idx_sessions_tenant ON user_sessions(tenant_id, created_at DESC);
```

### Partial Indexes for Active Records

```sql
-- Only index active bots (saves space)
CREATE INDEX idx_bots_active ON bots(tenant_id, is_active) WHERE is_active = true;

-- Only index pending approvals
CREATE INDEX idx_approvals_pending ON task_approvals(task_id, expires_at) WHERE status = 0;

-- Only index unread messages
CREATE INDEX idx_messages_unread ON message_history(user_id, created_at) WHERE is_read = false;
```

### BRIN Indexes for Time-Series Data

```sql
-- BRIN indexes for time-ordered data (much smaller than B-tree)
CREATE INDEX idx_messages_created_brin ON message_history USING BRIN (created_at);
CREATE INDEX idx_analytics_date_brin ON analytics_events USING BRIN (created_at);
```

## Table Partitioning

### Partition High-Volume Tables by Time

```sql
-- Partitioned messages table
CREATE TABLE message_history (
    id UUID NOT NULL,
    session_id UUID NOT NULL,
    tenant_id BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    -- other columns...
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE message_history_2025_01 PARTITION OF message_history
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE message_history_2025_02 PARTITION OF message_history
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- ... continue for each month

-- Default partition for future data
CREATE TABLE message_history_default PARTITION OF message_history DEFAULT;
```

### Automatic Partition Management

```sql
-- Function to create next month's partition
CREATE OR REPLACE FUNCTION create_monthly_partition(
    table_name TEXT,
    partition_date DATE
) RETURNS VOID AS $$
DECLARE
    partition_name TEXT;
    start_date DATE;
    end_date DATE;
BEGIN
    partition_name := table_name || '_' || to_char(partition_date, 'YYYY_MM');
    start_date := date_trunc('month', partition_date);
    end_date := start_date + INTERVAL '1 month';

    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF %I FOR VALUES FROM (%L) TO (%L)',
        partition_name, table_name, start_date, end_date
    );
END;
$$ LANGUAGE plpgsql;

-- Create partitions for the next 3 months
-- (explicit ::DATE cast so the function call resolves)
SELECT create_monthly_partition('message_history', (NOW() + INTERVAL '1 month')::DATE);
SELECT create_monthly_partition('message_history', (NOW() + INTERVAL '2 months')::DATE);
SELECT create_monthly_partition('message_history', (NOW() + INTERVAL '3 months')::DATE);
```
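
The naming and range scheme used by `create_monthly_partition` can be mirrored outside the database, for example when an operator script needs to know which partition a date lands in. A small illustrative Rust helper (`monthly_partition` is a hypothetical name, not part of the actual codebase):

```rust
// Mirrors the <table>_YYYY_MM naming scheme and the half-open
// [first-of-month, first-of-next-month) range used by the SQL function above.
fn monthly_partition(table: &str, year: i32, month: u32) -> (String, String, String) {
    let name = format!("{}_{:04}_{:02}", table, year, month);
    let start = format!("{:04}-{:02}-01", year, month);
    // Roll over December into January of the next year
    let (ny, nm) = if month == 12 { (year + 1, 1) } else { (year, month + 1) };
    let end = format!("{:04}-{:02}-01", ny, nm);
    (name, start, end)
}

fn main() {
    let (name, start, end) = monthly_partition("message_history", 2025, 12);
    assert_eq!(name, "message_history_2025_12");
    assert_eq!(start, "2025-12-01");
    assert_eq!(end, "2026-01-01"); // year rollover handled
    println!("{name}: [{start}, {end})");
}
```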

## Connection Pooling

### PgBouncer Configuration

```ini
; pgbouncer.ini
[databases]
gb_shard1 = host=shard1.db port=5432 dbname=gb
gb_shard2 = host=shard2.db port=5432 dbname=gb

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Pool settings
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 25
reserve_pool_timeout = 3

; Timeouts
server_connect_timeout = 3
server_idle_timeout = 600
server_lifetime = 3600
client_idle_timeout = 0
```

### Application Connection Settings

```toml
# config.toml
[database]
max_connections = 100
min_connections = 10
connection_timeout_secs = 5
idle_timeout_secs = 300
max_lifetime_secs = 1800
```

## Query Optimization

### Use EXPLAIN ANALYZE

```sql
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM message_history
WHERE session_id = 'abc-123'
ORDER BY message_index;
```

### Avoid N+1 Queries

**Bad:**

```sql
-- 1 query for sessions
SELECT * FROM user_sessions WHERE user_id = 'xyz';
-- N queries for messages (one per session)
SELECT * FROM message_history WHERE session_id = ?;
```

**Good:**

```sql
-- Single query with JOIN
SELECT s.*, m.*
FROM user_sessions s
LEFT JOIN message_history m ON m.session_id = s.id
WHERE s.user_id = 'xyz'
ORDER BY s.created_at DESC, m.message_index;
```

### Use Covering Indexes

```sql
-- Index includes all needed columns (no table lookup)
CREATE INDEX idx_sessions_covering ON user_sessions(user_id, created_at DESC)
    INCLUDE (title, message_count, last_activity_at);
```

## Vacuum and Maintenance

### Aggressive Autovacuum for High-Churn Tables

```sql
ALTER TABLE message_history SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 2
);

ALTER TABLE user_sessions SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);
```

### Regular Maintenance Tasks

```sql
-- Weekly: Reindex bloated indexes
REINDEX INDEX CONCURRENTLY idx_messages_session;

-- Monthly: Update statistics
ANALYZE VERBOSE message_history;

-- Quarterly: Cluster heavily-queried tables (takes an exclusive lock)
CLUSTER message_history USING idx_messages_session;
```

## Monitoring Queries

### Table and Index Size Check

```sql
SELECT
    schemaname || '.' || tablename AS table,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
    pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 20;
```

### Slow Query Log

```ini
# postgresql.conf
log_min_duration_statement = 100   # log queries slower than 100 ms
log_statement = 'none'
log_lock_waits = on
```

### Index Usage Statistics

```sql
SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    idx_scan AS scans,
    idx_tup_read AS tuples_read,
    idx_tup_fetch AS tuples_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20;
```

## Best Practices Summary

1. **Use SMALLINT for enums** - 2 bytes vs 10-50 bytes per field
2. **Partition time-series tables** - Monthly partitions for messages/analytics
3. **Create partial indexes** - Only index active/relevant rows
4. **Use connection pooling** - PgBouncer with transaction mode
5. **Enable aggressive autovacuum** - For high-churn tables
6. **Monitor query performance** - Log slow queries, check EXPLAIN plans
7. **Use covering indexes** - Include frequently-accessed columns
8. **Avoid cross-shard queries** - Keep tenant data together
9. **Regular maintenance** - Reindex, analyze, cluster as needed
10. **Test at scale** - Use production-like data volumes in staging
309
src/21-scale/sharding.md
Normal file
@ -0,0 +1,309 @@
# Sharding Architecture

This document describes how General Bots distributes data across multiple database shards for horizontal scaling.

## Overview

Sharding enables General Bots to scale beyond single-database limits by distributing data across multiple database instances. Each shard contains a subset of tenants, and data never crosses shard boundaries during normal operations.

## Shard Configuration

### Shard Config Table

The `shard_config` table defines all available shards:

```sql
CREATE TABLE shard_config (
    shard_id SMALLINT PRIMARY KEY,
    region_code CHAR(3) NOT NULL,       -- ISO 3166-1 alpha-3: USA, BRA, DEU
    datacenter VARCHAR(32) NOT NULL,    -- e.g., 'us-east-1', 'eu-west-1'
    connection_string TEXT NOT NULL,    -- Encrypted connection string
    is_primary BOOLEAN DEFAULT false,
    is_active BOOLEAN DEFAULT true,
    min_tenant_id BIGINT NOT NULL,
    max_tenant_id BIGINT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

### Example Configuration

```sql
-- Americas
INSERT INTO shard_config VALUES
    (1, 'USA', 'us-east-1', 'postgresql://shard1.db:5432/gb', true, true, 1, 1000000),
    (2, 'USA', 'us-west-2', 'postgresql://shard2.db:5432/gb', false, true, 1000001, 2000000),
    (3, 'BRA', 'sa-east-1', 'postgresql://shard3.db:5432/gb', false, true, 2000001, 3000000);

-- Europe
INSERT INTO shard_config VALUES
    (4, 'DEU', 'eu-central-1', 'postgresql://shard4.db:5432/gb', false, true, 3000001, 4000000),
    (5, 'GBR', 'eu-west-2', 'postgresql://shard5.db:5432/gb', false, true, 4000001, 5000000);

-- Asia Pacific
INSERT INTO shard_config VALUES
    (6, 'SGP', 'ap-southeast-1', 'postgresql://shard6.db:5432/gb', false, true, 5000001, 6000000),
    (7, 'JPN', 'ap-northeast-1', 'postgresql://shard7.db:5432/gb', false, true, 6000001, 7000000);
```
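
Given the `min_tenant_id`/`max_tenant_id` ranges above, resolving a tenant to a shard is a simple range lookup. A minimal illustrative sketch (`ShardRange` and `shard_for_tenant` are hypothetical names; a production router would cache this mapping rather than scan on every request):

```rust
// Range-based shard resolution mirroring the shard_config rows above.
struct ShardRange {
    shard_id: i16,
    min_tenant: i64,
    max_tenant: i64, // inclusive, as in shard_config
}

fn shard_for_tenant(ranges: &[ShardRange], tenant_id: i64) -> Option<i16> {
    ranges
        .iter()
        .find(|r| (r.min_tenant..=r.max_tenant).contains(&tenant_id))
        .map(|r| r.shard_id)
}

fn main() {
    // First three rows of the example configuration
    let ranges = vec![
        ShardRange { shard_id: 1, min_tenant: 1, max_tenant: 1_000_000 },
        ShardRange { shard_id: 2, min_tenant: 1_000_001, max_tenant: 2_000_000 },
        ShardRange { shard_id: 3, min_tenant: 2_000_001, max_tenant: 3_000_000 },
    ];
    assert_eq!(shard_for_tenant(&ranges, 1_500_000), Some(2));
    assert_eq!(shard_for_tenant(&ranges, 1), Some(1));
    assert_eq!(shard_for_tenant(&ranges, 9_999_999), None); // unassigned range
}
```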

## Tenant-to-Shard Mapping

### Mapping Table

```sql
CREATE TABLE tenant_shard_map (
    tenant_id BIGINT PRIMARY KEY,
    shard_id SMALLINT NOT NULL REFERENCES shard_config(shard_id),
    region_code CHAR(3) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);
```

### Routing Logic

When a request comes in, the system:

1. Extracts `tenant_id` from the request context
2. Looks up `shard_id` in `tenant_shard_map`
3. Routes the query to the appropriate database connection

```rust
// Rust routing example
pub fn get_shard_connection(tenant_id: i64) -> Result<DbConnection> {
    let shard_id = SHARD_MAP.get(&tenant_id)
        .ok_or_else(|| Error::TenantNotFound(tenant_id))?;

    CONNECTION_POOLS.get(shard_id)
        .ok_or_else(|| Error::ShardNotAvailable(*shard_id))
}
```

## Data Model Requirements

### Every Table Includes Shard Keys

All tables must include `tenant_id` and `shard_id` columns:

```sql
CREATE TABLE user_sessions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id BIGINT NOT NULL,      -- Required for routing
    shard_id SMALLINT NOT NULL,     -- Denormalized for queries
    user_id UUID NOT NULL,
    bot_id UUID NOT NULL
    -- ... other columns
);
```

### Foreign Keys Within Shard Only

Foreign keys may only reference tables within the same shard:

```sql
-- Good: Same-shard reference
ALTER TABLE message_history
    ADD CONSTRAINT fk_session
    FOREIGN KEY (session_id) REFERENCES user_sessions(id);

-- Bad: Cross-shard reference (never do this)
-- FOREIGN KEY (other_tenant_data) REFERENCES other_shard.table(id)
```

## Snowflake ID Generation

For globally unique, time-sortable IDs across shards:

```sql
-- Requires a sequence: CREATE SEQUENCE global_seq;
CREATE OR REPLACE FUNCTION generate_snowflake_id(p_shard_id SMALLINT)
RETURNS BIGINT AS $$
DECLARE
    epoch BIGINT := 1704067200000;  -- 2024-01-01 00:00:00 UTC, in milliseconds
    ts BIGINT;
    seq BIGINT;
BEGIN
    -- 41 bits: timestamp (milliseconds since epoch)
    ts := (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT - epoch;

    -- 10 bits: shard_id (0-1023)
    -- 12 bits: sequence (0-4095)
    seq := nextval('global_seq') & 4095;

    RETURN (ts << 22) | ((p_shard_id & 1023) << 12) | seq;
END;
$$ LANGUAGE plpgsql;
```

### ID Structure

```
64-bit Snowflake ID
┌───────────────────┬───────────────┬─────────────────────┐
│ 41 bits timestamp │ 10 bits shard │ 12 bits sequence    │
│ (69 years range)  │ (1024 shards) │ (4096 IDs/ms/shard) │
└───────────────────┴───────────────┴─────────────────────┘
```
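
The same bit layout can be packed and unpacked in application code; here is a hedged Rust mirror of the SQL function (`make_snowflake` and `decode_snowflake` are illustrative helpers, and sequence allocation is left to the caller):

```rust
// Packs/unpacks the 41/10/12-bit layout described above.
const EPOCH_MS: i64 = 1_704_067_200_000; // 2024-01-01 00:00:00 UTC

fn make_snowflake(ts_ms: i64, shard_id: i64, seq: i64) -> i64 {
    ((ts_ms - EPOCH_MS) << 22) | ((shard_id & 1023) << 12) | (seq & 4095)
}

/// Returns (timestamp_ms, shard_id, sequence).
fn decode_snowflake(id: i64) -> (i64, i64, i64) {
    ((id >> 22) + EPOCH_MS, (id >> 12) & 1023, id & 4095)
}

fn main() {
    let id = make_snowflake(1_704_067_200_123, 7, 42);
    assert_eq!(decode_snowflake(id), (1_704_067_200_123, 7, 42));
    println!("id = {id}");
}
```

Because the timestamp occupies the high bits, IDs sort by creation time, which is what makes them usable as time-ordered primary keys.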

## Shard Operations

### Creating a New Shard

1. Provision a new database instance
2. Run migrations
3. Add it to `shard_config`
4. Update the routing configuration
5. Begin assigning new tenants

```bash
# 1. Run migrations on the new shard
DATABASE_URL="postgresql://new-shard:5432/gb" diesel migration run

# 2. Add shard config
psql -c "INSERT INTO shard_config VALUES (8, 'AUS', 'ap-southeast-2', '...', false, true, 7000001, 8000000);"

# 3. Reload routing
curl -X POST http://localhost:3000/api/admin/reload-shard-config
```

### Tenant Migration Between Shards

Moving a tenant to a different shard (e.g., for data locality):

```sql
-- 1. Set the tenant to read-only mode
UPDATE tenants SET settings = settings || '{"read_only": true}' WHERE id = 12345;

-- 2. Export tenant data (run from a shell; psql \copy is used because
--    pg_dump cannot filter rows by tenant):
--    psql source_db -c "\copy (SELECT * FROM user_sessions WHERE tenant_id = 12345) TO 'tenant_12345_sessions.csv' CSV"
--    psql source_db -c "\copy (SELECT * FROM message_history WHERE tenant_id = 12345) TO 'tenant_12345_messages.csv' CSV"

-- 3. Import into the new shard:
--    psql target_db -c "\copy user_sessions FROM 'tenant_12345_sessions.csv' CSV"
--    psql target_db -c "\copy message_history FROM 'tenant_12345_messages.csv' CSV"

-- 4. Update routing
UPDATE tenant_shard_map SET shard_id = 5, region_code = 'DEU' WHERE tenant_id = 12345;

-- 5. Remove read-only mode
UPDATE tenants SET settings = settings - 'read_only' WHERE id = 12345;

-- 6. Clean up the source shard (after verification)
DELETE FROM user_sessions WHERE tenant_id = 12345;
DELETE FROM message_history WHERE tenant_id = 12345;
```

## Query Patterns

### Single-Tenant Queries (Most Common)

```sql
-- Efficient: Uses shard routing
SELECT * FROM user_sessions
WHERE tenant_id = 12345 AND user_id = 'abc-123';
```

### Cross-Shard Queries (Admin Only)

For global analytics, use a federation layer:

```sql
-- Using postgres_fdw for cross-shard reads
SELECT shard_id, COUNT(*) AS session_count
FROM all_shards.user_sessions
WHERE created_at > NOW() - INTERVAL '1 day'
GROUP BY shard_id;
```

### Scatter-Gather Pattern

For queries that must touch multiple shards:

```rust
async fn get_global_stats() -> Stats {
    let futures: Vec<_> = SHARDS.iter()
        .map(|shard| get_shard_stats(shard.id))
        .collect();

    let results = futures::future::join_all(futures).await;

    results.into_iter().fold(Stats::default(), |acc, s| acc.merge(s))
}
```

## High Availability

### Per-Shard Replication

Each shard should have:

- 1 primary (read/write)
- 1-2 replicas (read-only, failover)
- Async replication with < 1s lag

```
Shard 1 Architecture:
      ┌─────────────┐
      │   Primary   │◄──── Writes
      └──────┬──────┘
             │ Streaming Replication
         ┌───┴───┐
         ▼       ▼
     ┌──────┐ ┌──────┐
     │Rep 1 │ │Rep 2 │◄──── Reads
     └──────┘ └──────┘
```

### Failover Configuration

```text
# config.csv
shard-1-primary,postgresql://shard1-primary:5432/gb
shard-1-replica-1,postgresql://shard1-replica1:5432/gb
shard-1-replica-2,postgresql://shard1-replica2:5432/gb
shard-1-failover-priority,replica-1,replica-2
```

## Monitoring

### Key Metrics Per Shard

| Metric | Warning | Critical |
|--------|---------|----------|
| Connection pool usage | > 70% | > 90% |
| Query latency p99 | > 100ms | > 500ms |
| Replication lag | > 1s | > 10s |
| Disk usage | > 70% | > 85% |
| Tenant count | > 80% capacity | > 95% capacity |

### Shard Health Check

```sql
-- Run on each shard
SELECT
    current_setting('cluster_name') AS shard,
    pg_is_in_recovery() AS is_replica,
    pg_last_wal_receive_lsn() AS wal_position,
    pg_postmaster_start_time() AS uptime_since,
    (SELECT count(*) FROM pg_stat_activity) AS connections,
    (SELECT count(*) FROM tenants) AS tenant_count;
```

## Best Practices

1. **Shard by tenant, not by table** - Keep all tenant data together
2. **Avoid cross-shard transactions** - Design for eventual consistency where needed
3. **Pre-allocate tenant ranges** - Leave room for growth in each shard
4. **Monitor shard hotspots** - Rebalance if one shard gets too busy
5. **Test failover regularly** - Ensure replicas can be promoted
6. **Use connection pooling** - PgBouncer or similar for each shard
7. **Cache shard routing** - Don't query `tenant_shard_map` on every request

## Migration from Single Database

To migrate an existing single-database deployment to a sharded one:

1. Add a `shard_id` column to all tables (default to 1)
2. Deploy shard routing code (disabled)
3. Set up additional shard databases
4. Enable routing for new tenants only
5. Gradually migrate existing tenants during low-traffic windows
6. Decommission the original database when empty

See [Regional Deployment](./regional-deployment.md) for multi-region considerations.
@ -390,6 +390,12 @@
- [Examples](./17-autonomous-tasks/examples.md)
|
||||
- [Designer](./17-autonomous-tasks/designer.md)
|
||||
|
||||
# Part XVII - Scale
|
||||
|
||||
- [Chapter 18: Scale](./21-scale/README.md)
|
||||
- [Sharding Architecture](./21-scale/sharding.md)
|
||||
- [Database Optimization](./21-scale/database-optimization.md)
|
||||
|
||||
# Appendices
|
||||
|
||||
- [Appendix A: Database Model](./15-appendix/README.md)
|
||||
|
|
|