From 6bfab51293805483105150b85c75baf41573030c Mon Sep 17 00:00:00 2001 From: "Rodrigo Rodriguez (Pragmatismo)" Date: Mon, 29 Dec 2025 12:55:32 -0300 Subject: [PATCH] Add Scale chapter with sharding and database optimization docs New Chapter 18 - Scale: - README with overview of scaling architecture - Sharding Architecture: tenant routing, shard config, data model, operations - Database Optimization: SMALLINT enums, indexing, partitioning, connection pooling Enum reference tables for all domain values (Channel, Message, Task, etc.) Best practices for billion-user scale deployments --- src/21-scale/README.md | 79 +++++ src/21-scale/database-optimization.md | 436 ++++++++++++++++++++++++++ src/21-scale/sharding.md | 309 ++++++++++++++++++ src/SUMMARY.md | 6 + 4 files changed, 830 insertions(+) create mode 100644 src/21-scale/README.md create mode 100644 src/21-scale/database-optimization.md create mode 100644 src/21-scale/sharding.md diff --git a/src/21-scale/README.md b/src/21-scale/README.md new file mode 100644 index 00000000..8785d229 --- /dev/null +++ b/src/21-scale/README.md @@ -0,0 +1,79 @@ +# Scale + +This chapter covers horizontal scaling strategies for General Bots to support millions to billions of users across global deployments. + +## Overview + +General Bots is designed from the ground up to scale horizontally. 
The architecture supports: + +- **Multi-tenancy**: Complete isolation between organizations +- **Regional sharding**: Data locality for compliance and performance +- **Database partitioning**: Efficient handling of high-volume tables +- **Stateless services**: Easy horizontal pod autoscaling + +## Chapter Contents + +- [Sharding Architecture](./sharding.md) - How data is distributed across shards +- [Database Optimization](./database-optimization.md) - Schema design for billion-scale +- [Regional Deployment](./regional-deployment.md) - Multi-region setup +- [Performance Tuning](./performance-tuning.md) - Optimization strategies + +## Key Concepts + +### Tenant Isolation + +Every piece of data in General Bots is associated with a `tenant_id`. This enables: + +1. Complete data isolation between organizations +2. Per-tenant resource limits and quotas +3. Tenant-specific configurations +4. Easy data export/deletion for compliance + +### Shard Architecture + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Load Balancer │ +└─────────────────────────┬───────────────────────────────────┘ + │ + ┌─────────────────┼─────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────┐ ┌─────────┐ ┌─────────┐ + │ Region │ │ Region │ │ Region │ + │ USA │ │ EUR │ │ APAC │ + └────┬────┘ └────┬────┘ └────┬────┘ + │ │ │ + ┌────┴────┐ ┌────┴────┐ ┌────┴────┐ + │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ + │ Shard 4 │ │ Shard 5 │ │ Shard 6 │ + └─────────┘ └─────────┘ └─────────┘ +``` + +### Database Design Principles + +1. **SMALLINT enums** instead of VARCHAR for domain values (2 bytes vs 20+ bytes) +2. **Partitioned tables** for high-volume data (messages, sessions, analytics) +3. **Composite primary keys** including `shard_id` for distributed queries +4. 
**Snowflake-like IDs** for globally unique, time-sortable identifiers + +## When to Scale + +| Users | Sessions/day | Messages/day | Recommended Setup | +|-------|--------------|--------------|-------------------| +| < 10K | < 100K | < 1M | Single node | +| 10K-100K | 100K-1M | 1M-10M | 2-3 nodes, single region | +| 100K-1M | 1M-10M | 10M-100M | Multi-node, consider sharding | +| 1M-10M | 10M-100M | 100M-1B | Regional shards | +| > 10M | > 100M | > 1B | Global shards with Citus/CockroachDB | + +## Quick Start + +To enable sharding in your deployment: + +1. Configure shard mapping in `shard_config` table +2. Set `SHARD_ID` environment variable per instance +3. Deploy region-specific instances +4. Configure load balancer routing rules + +See [Sharding Architecture](./sharding.md) for detailed setup instructions. \ No newline at end of file diff --git a/src/21-scale/database-optimization.md b/src/21-scale/database-optimization.md new file mode 100644 index 00000000..377a7919 --- /dev/null +++ b/src/21-scale/database-optimization.md @@ -0,0 +1,436 @@ +# Database Optimization + +This document covers database schema design and optimization strategies for billion-user scale deployments. + +## Schema Design Principles + +### Use SMALLINT Enums Instead of VARCHAR + +One of the most impactful optimizations is using integer enums instead of string-based status fields. 
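In application code, these integer codes are typically wrapped in a typed enum so raw numbers never leak past the persistence layer. A minimal Rust sketch (the `TaskStatus` type and variant names are illustrative, not the actual General Bots types), using the task-status codes tabulated later in this chapter:

```rust
// Illustrative sketch: map SMALLINT task-status codes to a typed Rust enum.
// TaskStatus and its variants are hypothetical names for this example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i16)]
pub enum TaskStatus {
    Pending = 0,
    Ready = 1,
    Running = 2,
    Paused = 3,
    WaitingApproval = 4,
    Completed = 5,
    Failed = 6,
    Cancelled = 7,
}

impl TryFrom<i16> for TaskStatus {
    type Error = i16;

    fn try_from(v: i16) -> Result<Self, Self::Error> {
        match v {
            0 => Ok(Self::Pending),
            1 => Ok(Self::Ready),
            2 => Ok(Self::Running),
            3 => Ok(Self::Paused),
            4 => Ok(Self::WaitingApproval),
            5 => Ok(Self::Completed),
            6 => Ok(Self::Failed),
            7 => Ok(Self::Cancelled),
            other => Err(other), // surface unknown codes instead of guessing
        }
    }
}

fn main() {
    // Round-trip: enum -> SMALLINT column value -> enum
    let stored = TaskStatus::WaitingApproval as i16;
    assert_eq!(stored, 4);
    assert_eq!(TaskStatus::try_from(stored), Ok(TaskStatus::WaitingApproval));
    assert!(TaskStatus::try_from(99).is_err());
}
```

Storing the 2-byte discriminant while matching on a named variant keeps the space savings described below without sacrificing readability at call sites.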
+ +**Before (inefficient):** +```sql +CREATE TABLE auto_tasks ( + id UUID PRIMARY KEY, + status VARCHAR(50) NOT NULL DEFAULT 'pending', + priority VARCHAR(20) NOT NULL DEFAULT 'normal', + execution_mode VARCHAR(50) NOT NULL DEFAULT 'supervised', + CONSTRAINT check_status CHECK (status IN ('pending', 'ready', 'running', 'paused', 'waiting_approval', 'completed', 'failed', 'cancelled')) +); +``` + +**After (optimized):** +```sql +CREATE TABLE auto_tasks ( + id UUID PRIMARY KEY, + status SMALLINT NOT NULL DEFAULT 0, -- 2 bytes + priority SMALLINT NOT NULL DEFAULT 1, -- 2 bytes + execution_mode SMALLINT NOT NULL DEFAULT 1 -- 2 bytes +); +``` + +### Storage Comparison + +| Field Type | Storage | Example Values | +|------------|---------|----------------| +| VARCHAR(50) | 1-51 bytes | 'waiting_approval' = 17 bytes | +| TEXT | 1+ bytes | 'completed' = 10 bytes | +| SMALLINT | 2 bytes | 4 = 2 bytes (always) | +| INTEGER | 4 bytes | 4 = 4 bytes (always) | + +**Savings per row with 5 enum fields:** +- VARCHAR: ~50 bytes average +- SMALLINT: 10 bytes fixed +- **Savings: 40 bytes per row = 40GB per billion rows** + +## Enum Value Reference + +All domain values in General Bots use SMALLINT. 
Reference table: + +### Channel Types +| Value | Name | Description | +|-------|------|-------------| +| 0 | web | Web chat interface | +| 1 | whatsapp | WhatsApp Business | +| 2 | telegram | Telegram Bot | +| 3 | msteams | Microsoft Teams | +| 4 | slack | Slack | +| 5 | email | Email channel | +| 6 | sms | SMS/Text messages | +| 7 | voice | Voice/Phone | +| 8 | instagram | Instagram DM | +| 9 | api | Direct API | + +### Message Role +| Value | Name | Description | +|-------|------|-------------| +| 1 | user | User message | +| 2 | assistant | Bot response | +| 3 | system | System prompt | +| 4 | tool | Tool call/result | +| 9 | episodic | Episodic memory summary | +| 10 | compact | Compacted conversation | + +### Message Type +| Value | Name | Description | +|-------|------|-------------| +| 0 | text | Plain text | +| 1 | image | Image attachment | +| 2 | audio | Audio file | +| 3 | video | Video file | +| 4 | document | Document/PDF | +| 5 | location | GPS location | +| 6 | contact | Contact card | +| 7 | sticker | Sticker | +| 8 | reaction | Message reaction | + +### LLM Provider +| Value | Name | Description | +|-------|------|-------------| +| 0 | openai | OpenAI API | +| 1 | anthropic | Anthropic Claude | +| 2 | azure_openai | Azure OpenAI | +| 3 | azure_claude | Azure Claude | +| 4 | google | Google AI | +| 5 | local | Local llama.cpp | +| 6 | ollama | Ollama | +| 7 | groq | Groq | +| 8 | mistral | Mistral AI | +| 9 | cohere | Cohere | + +### Task Status +| Value | Name | Description | +|-------|------|-------------| +| 0 | pending | Waiting to start | +| 1 | ready | Ready to execute | +| 2 | running | Currently executing | +| 3 | paused | Paused by user | +| 4 | waiting_approval | Needs approval | +| 5 | completed | Successfully finished | +| 6 | failed | Failed with error | +| 7 | cancelled | Cancelled by user | + +### Task Priority +| Value | Name | Description | +|-------|------|-------------| +| 0 | low | Low priority | +| 1 | normal | Normal priority | 
+| 2 | high | High priority | +| 3 | urgent | Urgent | +| 4 | critical | Critical | + +### Execution Mode +| Value | Name | Description | +|-------|------|-------------| +| 0 | manual | Manual execution only | +| 1 | supervised | Requires approval | +| 2 | autonomous | Fully automatic | + +### Risk Level +| Value | Name | Description | +|-------|------|-------------| +| 0 | none | No risk | +| 1 | low | Low risk | +| 2 | medium | Medium risk | +| 3 | high | High risk | +| 4 | critical | Critical risk | + +### Approval Status +| Value | Name | Description | +|-------|------|-------------| +| 0 | pending | Awaiting decision | +| 1 | approved | Approved | +| 2 | rejected | Rejected | +| 3 | expired | Timed out | +| 4 | skipped | Skipped | + +### Intent Type +| Value | Name | Description | +|-------|------|-------------| +| 0 | unknown | Unclassified | +| 1 | app_create | Create application | +| 2 | todo | Create task/reminder | +| 3 | monitor | Set up monitoring | +| 4 | action | Execute action | +| 5 | schedule | Create schedule | +| 6 | goal | Set goal | +| 7 | tool | Create tool | +| 8 | query | Query/search | + +### Memory Type +| Value | Name | Description | +|-------|------|-------------| +| 0 | short | Short-term | +| 1 | long | Long-term | +| 2 | episodic | Episodic | +| 3 | semantic | Semantic | +| 4 | procedural | Procedural | + +### Sync Status +| Value | Name | Description | +|-------|------|-------------| +| 0 | synced | Fully synced | +| 1 | pending | Sync pending | +| 2 | conflict | Conflict detected | +| 3 | error | Sync error | +| 4 | deleted | Marked for deletion | + +## Indexing Strategies + +### Composite Indexes for Common Queries + +```sql +-- Session lookup by user +CREATE INDEX idx_sessions_user ON user_sessions(user_id, created_at DESC); + +-- Messages by session (most common query) +CREATE INDEX idx_messages_session ON message_history(session_id, message_index); + +-- Active tasks by status and priority +CREATE INDEX idx_tasks_status ON 
auto_tasks(status, priority) WHERE status < 5; + +-- Tenant-scoped queries +CREATE INDEX idx_sessions_tenant ON user_sessions(tenant_id, created_at DESC); +``` + +### Partial Indexes for Active Records + +```sql +-- Only index active bots (saves space) +CREATE INDEX idx_bots_active ON bots(tenant_id, is_active) WHERE is_active = true; + +-- Only index pending approvals +CREATE INDEX idx_approvals_pending ON task_approvals(task_id, expires_at) WHERE status = 0; + +-- Only index unread messages +CREATE INDEX idx_messages_unread ON message_history(user_id, created_at) WHERE is_read = false; +``` + +### BRIN Indexes for Time-Series Data + +```sql +-- BRIN index for time-ordered data (much smaller than B-tree) +CREATE INDEX idx_messages_created_brin ON message_history USING BRIN (created_at); +CREATE INDEX idx_analytics_date_brin ON analytics_events USING BRIN (created_at); +``` + +## Table Partitioning + +### Partition High-Volume Tables by Time + +```sql +-- Partitioned messages table +CREATE TABLE message_history ( + id UUID NOT NULL, + session_id UUID NOT NULL, + tenant_id BIGINT NOT NULL, + created_at TIMESTAMPTZ NOT NULL, + -- other columns... + PRIMARY KEY (id, created_at) +) PARTITION BY RANGE (created_at); + +-- Monthly partitions +CREATE TABLE message_history_2025_01 PARTITION OF message_history + FOR VALUES FROM ('2025-01-01') TO ('2025-02-01'); +CREATE TABLE message_history_2025_02 PARTITION OF message_history + FOR VALUES FROM ('2025-02-01') TO ('2025-03-01'); +-- ... 
continue for each month + +-- Default partition for future data +CREATE TABLE message_history_default PARTITION OF message_history DEFAULT; +``` + +### Automatic Partition Management + +```sql +-- Function to create next month's partition +CREATE OR REPLACE FUNCTION create_monthly_partition( + table_name TEXT, + partition_date DATE +) RETURNS VOID AS $$ +DECLARE + partition_name TEXT; + start_date DATE; + end_date DATE; +BEGIN + partition_name := table_name || '_' || to_char(partition_date, 'YYYY_MM'); + start_date := date_trunc('month', partition_date); + end_date := start_date + INTERVAL '1 month'; + + EXECUTE format( + 'CREATE TABLE IF NOT EXISTS %I PARTITION OF %I FOR VALUES FROM (%L) TO (%L)', + partition_name, table_name, start_date, end_date + ); +END; +$$ LANGUAGE plpgsql; + +-- Create partitions for next 3 months +SELECT create_monthly_partition('message_history', NOW() + INTERVAL '1 month'); +SELECT create_monthly_partition('message_history', NOW() + INTERVAL '2 months'); +SELECT create_monthly_partition('message_history', NOW() + INTERVAL '3 months'); +``` + +## Connection Pooling + +### PgBouncer Configuration + +```ini +; pgbouncer.ini +[databases] +gb_shard1 = host=shard1.db port=5432 dbname=gb +gb_shard2 = host=shard2.db port=5432 dbname=gb + +[pgbouncer] +listen_port = 6432 +listen_addr = * +auth_type = md5 +auth_file = /etc/pgbouncer/userlist.txt + +; Pool settings +pool_mode = transaction +max_client_conn = 10000 +default_pool_size = 50 +min_pool_size = 10 +reserve_pool_size = 25 +reserve_pool_timeout = 3 + +; Timeouts +server_connect_timeout = 3 +server_idle_timeout = 600 +server_lifetime = 3600 +client_idle_timeout = 0 +``` + +### Application Connection Settings + +```toml +# config.toml +[database] +max_connections = 100 +min_connections = 10 +connection_timeout_secs = 5 +idle_timeout_secs = 300 +max_lifetime_secs = 1800 +``` + +## Query Optimization + +### Use EXPLAIN ANALYZE + +```sql +EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) +SELECT * FROM 
message_history +WHERE session_id = 'abc-123' +ORDER BY message_index; +``` + +### Avoid N+1 Queries + +**Bad:** +```sql +-- 1 query for sessions +SELECT * FROM user_sessions WHERE user_id = 'xyz'; +-- N queries for messages (one per session) +SELECT * FROM message_history WHERE session_id = ?; +``` + +**Good:** +```sql +-- Single query with JOIN +SELECT s.*, m.* +FROM user_sessions s +LEFT JOIN message_history m ON m.session_id = s.id +WHERE s.user_id = 'xyz' +ORDER BY s.created_at DESC, m.message_index; +``` + +### Use Covering Indexes + +```sql +-- Index includes all needed columns (no table lookup) +CREATE INDEX idx_sessions_covering ON user_sessions(user_id, created_at DESC) +INCLUDE (title, message_count, last_activity_at); +``` + +## Vacuum and Maintenance + +### Aggressive Autovacuum for High-Churn Tables + +```sql +ALTER TABLE message_history SET ( + autovacuum_vacuum_scale_factor = 0.01, + autovacuum_analyze_scale_factor = 0.005, + autovacuum_vacuum_cost_delay = 2 +); + +ALTER TABLE user_sessions SET ( + autovacuum_vacuum_scale_factor = 0.02, + autovacuum_analyze_scale_factor = 0.01 +); +``` + +### Regular Maintenance Tasks + +```sql +-- Weekly: Reindex bloated indexes +REINDEX INDEX CONCURRENTLY idx_messages_session; + +-- Monthly: Update statistics +ANALYZE VERBOSE message_history; + +-- Quarterly: Cluster heavily-queried tables +CLUSTER message_history USING idx_messages_session; +``` + +## Monitoring Queries + +### Table Bloat Check + +```sql +SELECT + schemaname || '.' || tablename AS table, + pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size, + pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size, + pg_size_pretty(pg_indexes_size(schemaname || '.' || tablename)) AS index_size +FROM pg_tables +WHERE schemaname = 'public' +ORDER BY pg_total_relation_size(schemaname || '.' 
|| tablename) DESC
+LIMIT 20;
+```
+
+### Slow Query Log
+
+```ini
+# postgresql.conf
+log_min_duration_statement = 100   # Log queries > 100ms
+log_statement = 'none'
+log_lock_waits = on
+```
+
+### Index Usage Statistics
+
+```sql
+SELECT
+    schemaname || '.' || relname AS table,
+    indexrelname AS index,
+    idx_scan AS scans,
+    idx_tup_read AS tuples_read,
+    idx_tup_fetch AS tuples_fetched,
+    pg_size_pretty(pg_relation_size(indexrelid)) AS size
+FROM pg_stat_user_indexes
+ORDER BY idx_scan DESC
+LIMIT 20;
+```
+
+## Best Practices Summary
+
+1. **Use SMALLINT for enums** - 2 bytes vs 10-50 bytes per field
+2. **Partition time-series tables** - Monthly partitions for messages/analytics
+3. **Create partial indexes** - Only index active/relevant rows
+4. **Use connection pooling** - PgBouncer with transaction mode
+5. **Enable aggressive autovacuum** - For high-churn tables
+6. **Monitor query performance** - Log slow queries, check EXPLAIN plans
+7. **Use covering indexes** - Include frequently-accessed columns
+8. **Avoid cross-shard queries** - Keep tenant data together
+9. **Regular maintenance** - Reindex, analyze, cluster as needed
+10. **Test at scale** - Use production-like data volumes in staging
\ No newline at end of file
diff --git a/src/21-scale/sharding.md b/src/21-scale/sharding.md
new file mode 100644
index 00000000..06fffe0b
--- /dev/null
+++ b/src/21-scale/sharding.md
@@ -0,0 +1,309 @@
+# Sharding Architecture
+
+This document describes how General Bots distributes data across multiple database shards for horizontal scaling.
+
+## Overview
+
+Sharding enables General Bots to scale beyond single-database limits by distributing data across multiple database instances. Each shard contains a subset of tenants, and data never crosses shard boundaries during normal operations.
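The boundary is enforced at request time by resolving each tenant to exactly one shard before any query executes. A minimal Rust sketch of range-based resolution (types and names here are illustrative, not the actual General Bots code; the ranges mirror the `min_tenant_id`/`max_tenant_id` columns of `shard_config` shown below):

```rust
// Illustrative sketch: resolve a tenant to its shard via the contiguous
// tenant-ID ranges configured per shard (as in the shard_config table).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Shard {
    pub shard_id: i16,
    pub min_tenant_id: i64,
    pub max_tenant_id: i64,
}

pub fn resolve_shard(shards: &[Shard], tenant_id: i64) -> Option<i16> {
    shards
        .iter()
        .find(|s| (s.min_tenant_id..=s.max_tenant_id).contains(&tenant_id))
        .map(|s| s.shard_id)
}

fn main() {
    // Ranges mirror the example configuration in this document.
    let shards = [
        Shard { shard_id: 1, min_tenant_id: 1, max_tenant_id: 1_000_000 },
        Shard { shard_id: 2, min_tenant_id: 1_000_001, max_tenant_id: 2_000_000 },
    ];
    assert_eq!(resolve_shard(&shards, 42), Some(1));
    assert_eq!(resolve_shard(&shards, 1_500_000), Some(2));
    assert_eq!(resolve_shard(&shards, 9_999_999), None); // unmapped tenant
}
```

A linear scan is fine for a handful of shards; a production router would cache the mapping in memory rather than consult `tenant_shard_map` on every request.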
+ +## Shard Configuration + +### Shard Config Table + +The `shard_config` table defines all available shards: + +```sql +CREATE TABLE shard_config ( + shard_id SMALLINT PRIMARY KEY, + region_code CHAR(3) NOT NULL, -- ISO 3166-1 alpha-3: USA, BRA, DEU + datacenter VARCHAR(32) NOT NULL, -- e.g., 'us-east-1', 'eu-west-1' + connection_string TEXT NOT NULL, -- Encrypted connection string + is_primary BOOLEAN DEFAULT false, + is_active BOOLEAN DEFAULT true, + min_tenant_id BIGINT NOT NULL, + max_tenant_id BIGINT NOT NULL, + created_at TIMESTAMPTZ DEFAULT NOW() +); +``` + +### Example Configuration + +```sql +-- Americas +INSERT INTO shard_config VALUES +(1, 'USA', 'us-east-1', 'postgresql://shard1.db:5432/gb', true, true, 1, 1000000), +(2, 'USA', 'us-west-2', 'postgresql://shard2.db:5432/gb', false, true, 1000001, 2000000), +(3, 'BRA', 'sa-east-1', 'postgresql://shard3.db:5432/gb', false, true, 2000001, 3000000); + +-- Europe +INSERT INTO shard_config VALUES +(4, 'DEU', 'eu-central-1', 'postgresql://shard4.db:5432/gb', false, true, 3000001, 4000000), +(5, 'GBR', 'eu-west-2', 'postgresql://shard5.db:5432/gb', false, true, 4000001, 5000000); + +-- Asia Pacific +INSERT INTO shard_config VALUES +(6, 'SGP', 'ap-southeast-1', 'postgresql://shard6.db:5432/gb', false, true, 5000001, 6000000), +(7, 'JPN', 'ap-northeast-1', 'postgresql://shard7.db:5432/gb', false, true, 6000001, 7000000); +``` + +## Tenant-to-Shard Mapping + +### Mapping Table + +```sql +CREATE TABLE tenant_shard_map ( + tenant_id BIGINT PRIMARY KEY, + shard_id SMALLINT NOT NULL REFERENCES shard_config(shard_id), + region_code CHAR(3) NOT NULL, + created_at TIMESTAMPTZ DEFAULT NOW() +); +``` + +### Routing Logic + +When a request comes in, the system: + +1. Extracts `tenant_id` from the request context +2. Looks up `shard_id` from `tenant_shard_map` +3. 
Routes the query to the appropriate database connection
+
+```rust
+// Rust routing example (pool and error types shown are illustrative)
+pub fn get_shard_connection(tenant_id: i64) -> Result<&'static DbPool, Error> {
+    let shard_id = SHARD_MAP.get(&tenant_id)
+        .ok_or_else(|| Error::TenantNotFound(tenant_id))?;
+
+    CONNECTION_POOLS.get(shard_id)
+        .ok_or_else(|| Error::ShardNotAvailable(*shard_id))
+}
+```
+
+## Data Model Requirements
+
+### Every Table Includes Shard Keys
+
+All tables must include `tenant_id` and `shard_id` columns:
+
+```sql
+CREATE TABLE user_sessions (
+    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    tenant_id BIGINT NOT NULL,   -- Required for routing
+    shard_id SMALLINT NOT NULL,  -- Denormalized for queries
+    user_id UUID NOT NULL,
+    bot_id UUID NOT NULL,
+    -- ... other columns
+);
+```
+
+### Foreign Keys Within Shard Only
+
+Foreign keys only reference tables within the same shard:
+
+```sql
+-- Good: Same shard reference
+ALTER TABLE message_history
+ADD CONSTRAINT fk_session
+FOREIGN KEY (session_id) REFERENCES user_sessions(id);
+
+-- Bad: Cross-shard reference (never do this)
+-- FOREIGN KEY (other_tenant_data) REFERENCES other_shard.table(id)
+```
+
+## Snowflake ID Generation
+
+For globally unique, time-sortable IDs across shards:
+
+```sql
+CREATE OR REPLACE FUNCTION generate_snowflake_id(p_shard_id SMALLINT)
+RETURNS BIGINT AS $$
+DECLARE
+    epoch BIGINT := 1704067200000; -- 2024-01-01 00:00:00 UTC
+    ts BIGINT;
+    seq BIGINT;
+BEGIN
+    -- 41 bits: timestamp (milliseconds since epoch)
+    ts := (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT - epoch;
+
+    -- 10 bits: shard_id (0-1023)
+    -- 12 bits: sequence (0-4095)
+    seq := nextval('global_seq') & 4095;
+
+    RETURN (ts << 22) | ((p_shard_id & 1023) << 12) | seq;
+END;
+$$ LANGUAGE plpgsql;
+```
+
+### ID Structure
+
+```
+                      64-bit Snowflake ID
+┌─────────────────────────────────────────────────────────────────┐
+│ 41 bits timestamp   │  10 bits shard     │  12 bits sequence    │
+│ (69 years range)    │  (1024 shards)     │  (4096/ms/shard)     │
+└─────────────────────────────────────────────────────────────────┘ +``` + +## Shard Operations + +### Creating a New Shard + +1. Provision new database instance +2. Run migrations +3. Add to `shard_config` +4. Update routing configuration +5. Begin assigning new tenants + +```bash +# 1. Run migrations on new shard +DATABASE_URL="postgresql://new-shard:5432/gb" diesel migration run + +# 2. Add shard config +psql -c "INSERT INTO shard_config VALUES (8, 'AUS', 'ap-southeast-2', '...', false, true, 7000001, 8000000);" + +# 3. Reload routing +curl -X POST http://localhost:3000/api/admin/reload-shard-config +``` + +### Tenant Migration Between Shards + +Moving a tenant to a different shard (e.g., for data locality): + +```sql +-- 1. Set tenant to read-only mode +UPDATE tenants SET settings = settings || '{"read_only": true}' WHERE id = 12345; + +-- 2. Export tenant data +pg_dump -t 'user_sessions' -t 'message_history' --where="tenant_id=12345" source_db > tenant_12345.sql + +-- 3. Import to new shard +psql target_db < tenant_12345.sql + +-- 4. Update routing +UPDATE tenant_shard_map SET shard_id = 5, region_code = 'DEU' WHERE tenant_id = 12345; + +-- 5. Remove read-only mode +UPDATE tenants SET settings = settings - 'read_only' WHERE id = 12345; + +-- 6. 
Clean up source shard (after verification) +DELETE FROM user_sessions WHERE tenant_id = 12345; +DELETE FROM message_history WHERE tenant_id = 12345; +``` + +## Query Patterns + +### Single-Tenant Queries (Most Common) + +```sql +-- Efficient: Uses shard routing +SELECT * FROM user_sessions +WHERE tenant_id = 12345 AND user_id = 'abc-123'; +``` + +### Cross-Shard Queries (Admin Only) + +For global analytics, use a federation layer: + +```sql +-- Using postgres_fdw for cross-shard reads +SELECT shard_id, COUNT(*) as session_count +FROM all_shards.user_sessions +WHERE created_at > NOW() - INTERVAL '1 day' +GROUP BY shard_id; +``` + +### Scatter-Gather Pattern + +For queries that must touch multiple shards: + +```rust +async fn get_global_stats() -> Stats { + let futures: Vec<_> = SHARDS.iter() + .map(|shard| get_shard_stats(shard.id)) + .collect(); + + let results = futures::future::join_all(futures).await; + + results.into_iter().fold(Stats::default(), |acc, s| acc.merge(s)) +} +``` + +## High Availability + +### Per-Shard Replication + +Each shard should have: + +- 1 Primary (read/write) +- 1-2 Replicas (read-only, failover) +- Async replication with < 1s lag + +``` +Shard 1 Architecture: +┌─────────────┐ +│ Primary │◄──── Writes +└──────┬──────┘ + │ Streaming Replication + ┌───┴───┐ + ▼ ▼ +┌──────┐ ┌──────┐ +│Rep 1 │ │Rep 2 │◄──── Reads +└──────┘ └──────┘ +``` + +### Failover Configuration + +```yaml +# config.csv +shard-1-primary,postgresql://shard1-primary:5432/gb +shard-1-replica-1,postgresql://shard1-replica1:5432/gb +shard-1-replica-2,postgresql://shard1-replica2:5432/gb +shard-1-failover-priority,replica-1,replica-2 +``` + +## Monitoring + +### Key Metrics Per Shard + +| Metric | Warning | Critical | +|--------|---------|----------| +| Connection pool usage | > 70% | > 90% | +| Query latency p99 | > 100ms | > 500ms | +| Replication lag | > 1s | > 10s | +| Disk usage | > 70% | > 85% | +| Tenant count | > 80% capacity | > 95% capacity | + +### Shard Health 
Check + +```sql +-- Run on each shard +SELECT + current_setting('cluster_name') as shard, + pg_is_in_recovery() as is_replica, + pg_last_wal_receive_lsn() as wal_position, + pg_postmaster_start_time() as uptime_since, + (SELECT count(*) FROM pg_stat_activity) as connections, + (SELECT count(DISTINCT tenant_id) FROM tenants) as tenant_count; +``` + +## Best Practices + +1. **Shard by tenant, not by table** - Keep all tenant data together +2. **Avoid cross-shard transactions** - Design for eventual consistency where needed +3. **Pre-allocate tenant ranges** - Leave room for growth in each shard +4. **Monitor shard hotspots** - Rebalance if one shard gets too busy +5. **Test failover regularly** - Ensure replicas can be promoted +6. **Use connection pooling** - PgBouncer or similar for each shard +7. **Cache shard routing** - Don't query `tenant_shard_map` on every request + +## Migration from Single Database + +To migrate an existing single-database deployment to sharded: + +1. Add `shard_id` column to all tables (default to 1) +2. Deploy shard routing code (disabled) +3. Set up additional shard databases +4. Enable routing for new tenants only +5. Gradually migrate existing tenants during low-traffic windows +6. Decommission original database when empty + +See [Regional Deployment](./regional-deployment.md) for multi-region considerations. \ No newline at end of file diff --git a/src/SUMMARY.md b/src/SUMMARY.md index f8d320d8..e274a1f7 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -390,6 +390,12 @@ - [Examples](./17-autonomous-tasks/examples.md) - [Designer](./17-autonomous-tasks/designer.md) +# Part XVII - Scale + +- [Chapter 18: Scale](./21-scale/README.md) + - [Sharding Architecture](./21-scale/sharding.md) + - [Database Optimization](./21-scale/database-optimization.md) + # Appendices - [Appendix A: Database Model](./15-appendix/README.md)