diff --git a/AGENTS.md b/AGENTS.md index abbd89fd..5e300291 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -189,8 +189,8 @@ Verify: PostgreSQL 5432, Valkey 6379, BotServer 8080, BotUI 3000 4. Test: cargo check -p botserver, ./restart.sh, verify in browser 5. Commit: clear message with root cause, impact, files, testing notes -Logs: /opt/gbo/logs/err.log (errors) | /opt/gbo/logs/out.log (output) | botserver.log (dev only) | botui.log | [drive_monitor] prefix | CLIENT: prefix -On staging/production: check err.log and out.log in /opt/gbo/logs/ +Logs: check container logs via `sudo incus exec system -- journalctl -u botserver` | botserver.log (dev only) | botui.log | [drive_monitor] prefix | CLIENT: prefix +On staging/production: check logs in container via `sudo incus exec system -- tail -f logs/err.log` > Troubleshooting: botbook/src/12-ecosystem-reference/troubleshooting.md @@ -200,7 +200,7 @@ On staging/production: check err.log and out.log in /opt/gbo/logs/ Push to ALM → CI builds on alm-ci → deploys to system container via SSH NEVER deploy manually. CI path: alm-ci builds → tar+gzip → /opt/gbo/bin/botserver → restart -CI deploy: alm-ci at /opt/gbo/data/botserver/target/debug/botserver → SSH → system container +CI deploy: alm-ci at /opt/gbo/bin/botserver → SSH → system container Runner: gbuser uid 1000, workspace /opt/gbo/data/, SSH key /home/gbuser/.ssh/id_ed25519 > CI/CD details: botbook/src/12-ecosystem-reference/ci-cd.md @@ -242,19 +242,19 @@ ALM port is 4747. Runner token in action_runner_token table. 
- ALWAYS backup files to /tmp before editing ### Infrastructure Paths -- Base: /opt/gbo/ | Data: /opt/gbo/data | Bin: /opt/gbo/bin -- Conf: /opt/gbo/conf | Logs: /opt/gbo/logs +- Base: /opt/gbo/ | Bin: /opt/gbo/bin | Conf: /opt/gbo/conf | Logs: /opt/gbo/logs +- Bots are stored in MinIO (drive), NOT in /opt/gbo/data ### Service Operations -- DNS (CoreDNS): config /opt/gbo/conf/Corefile, zones /opt/gbo/data/domain.zone -- PostgreSQL: data /opt/gbo/data, backup pg_dump, restore pg_restore +- DNS (CoreDNS): config /opt/gbo/conf/Corefile, zones in MinIO +- PostgreSQL: backup pg_dump, restore pg_restore - Email (Stalwart): config /opt/gbo/conf/config.toml, check DKIM TXT records - Proxy (Caddy): config /opt/gbo/conf/config, validate then reload -- MinIO: internal API http://drive-ip:9000, data /opt/gbo/data +- MinIO: internal API http://drive-ip:9000, bots stored as buckets - Bot System: binary /opt/gbo/bin/botserver, Valkey port 6379 - ALM (Forgejo): port 4747, CI runner separate container, token from DB - CI Runner: config /opt/gbo/bin/config.yaml, runs as gbuser, systemd service - sccache at /usr/local/bin/sccache, workspace /opt/gbo/data/ +sccache at /usr/local/bin/sccache, workspace /opt/gbo/data/ ### Network — NAT Port Forwarding External ports DNAT to container IPs via iptables. Rules in /etc/iptables.rules diff --git a/INFRA.md b/INFRA.md new file mode 100644 index 00000000..5ce776b7 --- /dev/null +++ b/INFRA.md @@ -0,0 +1,450 @@ +# Infrastructure Operations Guide — Generic Across Incus Projects + +NEVER INCLUDE CREDENTIALS OR COMPANY INFORMATION — THIS IS COMPANY AGNOSTIC. 
+ +## ENVIRONMENT CONTEXT + +The agent must identify which environment it is operating on by checking the hostname or asking the user: + +| Environment | Chat URL | System Domain | ALM Domain | Login Domain | Subnet | +|-------------|----------|---------------|------------|--------------|--------| +| PROD | chat.domain.com | system.domain.com | alm.domain.com | login.domain.com | 10.0.2.x | +| STAGE | chat.stage.domain.com | system.stage.domain.com | alm.stage.domain.com | login.stage.domain.com | 10.0.3.x | + +URL pattern: chat.{stage.}domain.com/botname for bot access. + +Before editing conf or data files, back them up to /tmp with a datetime suffix. + +Always manage services with systemctl inside the system container. Never run binaries directly — they fail without .env loading. Correct: sudo incus exec system -- systemctl start|stop|restart|status botserver + +--- + +## CRITICAL SAFETY RULES + +- NEVER modify iptables without explicit confirmation +- NEVER touch production without asking first +- ALWAYS backup files to /tmp before editing +- NEVER push secrets (API keys, passwords, tokens) to git +- NEVER commit init.json (contains Vault unseal keys) +- NEVER deploy manually via scp/ssh — always use CI/CD +- ALWAYS push all submodules before main repo +- ALWAYS ask before pushing to ALM +- NEVER include real IPs in documentation — use 10.x.x.x + +--- + +## INFRASTRUCTURE PATHS + +- Base: /opt/gbo/ | Bin: /opt/gbo/bin | Conf: /opt/gbo/conf | Logs: /opt/gbo/logs +- Bots are stored in MinIO (drive), NOT in /opt/gbo/data + +--- + +## CONTAINER ARCHITECTURE + +| Container | Service | Port | Notes | +|-----------|---------|------|-------| +| system | BotServer + Valkey | 8080/6379 | Main API + cache | +| tables | PostgreSQL | 5432 | Primary database | +| vault | Vault | 8200 | Secrets | +| drive | MinIO | 9000/9100 | Object storage | +| directory | Zitadel | 9000 | Identity provider | +| llm | llama.cpp | 8081 | Local LLM | +| vectordb | Qdrant | 6333 | Vector database | +| alm | 
Forgejo | 4747 | Git (NOT 3000!) | +| alm-ci | Runner | - | CI/CD | +| proxy | Caddy | 80/443 | Reverse proxy | +| email | Stalwart | 993/465/587 | Mail | +| dns | CoreDNS | 53 | DNS | +| meet | LiveKit | 7880 | Video | + +> Container deployment details: botbook/src/02-architecture-packages/containers.md +> Backup/recovery procedures: botbook/src/12-ecosystem-reference/backup-recovery.md + +--- + +## NETWORK — NAT PORT FORWARDING + +External ports DNAT to container IPs via iptables. Rules in /etc/iptables.rules. +Always use external interface (-i iface) to avoid loopback issues. + +Port Map: 53=DNS 80/443=HTTP/HTTPS 5432=PostgreSQL 993=IMAPS 465=SMTPS 587=Submission 4747=Forgejo 9000=MinIO 8200=Vault + +--- + +## CONTAINER OPERATIONS + +### Daily Health Check + +```bash +# Container status +sudo incus list + +# Service health - all should show active +sudo incus exec system -- systemctl is-active botserver +sudo incus exec system -- systemctl is-active ui +sudo incus exec tables -- pgrep -f postgres > /dev/null && echo OK || echo DOWN +sudo incus exec drive -- pgrep -f minio > /dev/null && echo OK || echo DOWN +sudo incus exec vault -- curl -ksf https://localhost:8200/v1/sys/health | grep -q 'sealed.*false' && echo "Vault OK" || echo "Vault SEALED" + +# App health endpoint +curl -sf https:///api/health && echo OK || echo FAILED + +# Recent errors (-E enables the | alternation) +sudo incus exec system -- tail -10 /opt/gbo/logs/err.log | grep -iE "error|panic|failed" | head -5 +``` + +### Container Management + +```bash +sudo incus list # List all +sudo incus start|stop|restart # Lifecycle +sudo incus exec -- bash # Shell +sudo incus log --show-log # Logs +sudo incus snapshot create pre-change-$(date +%Y%m%d%H%M%S) # Backup +sudo incus snapshot restore # Restore +``` + +### Service Management (inside container) + +```bash +sudo incus exec -- pgrep -a # Check running +sudo incus exec -- systemctl restart # Restart +sudo incus exec -- ss -tlnp # Ports +``` + +> Full container docs: 
botbook/src/02-architecture-packages/containers.md + +--- + +## VAULT SECURITY ARCHITECTURE + +Vault is the single source of truth for all secrets. Botserver reads VAULT_ADDR and VAULT_TOKEN from /opt/gbo/bin/.env at startup. + +### Global Vault Paths + +| Path | Contents | +|------|----------| +| gbo/tables | PostgreSQL credentials | +| gbo/drive | MinIO access key and secret | +| gbo/cache | Valkey password | +| gbo/llm | LLM URL and API keys | +| gbo/directory | Zitadel config | +| gbo/email | SMTP credentials | +| gbo/vectordb | Qdrant config | +| gbo/jwt | JWT signing secret | +| gbo/encryption | Master encryption key | + +Organization-scoped: gbo/orgs/{org_id}/bots/{bot_id} +Tenant infrastructure: gbo/tenants/{tenant_id}/infrastructure + +### Credential Resolution Order + +org+bot level → default bot path → global path → env vars (dev only) + +### Vault Operations + +```bash +# Health check +sudo incus exec vault -- curl -ksf https://localhost:8200/v1/sys/health + +# Unseal (3 of 5 keys from init.json) +sudo incus exec vault -- vault operator unseal $KEY1 +sudo incus exec vault -- vault operator unseal $KEY2 +sudo incus exec vault -- vault operator unseal $KEY3 + +# Read secret +sudo incus exec vault -- vault kv get secret/gbo/tables + +# Generate new token +sudo incus exec vault -- vault token create -policy="botserver" -ttl="8760h" -format=json +``` + +### Vault Troubleshooting + +- Cannot connect: check systemd, token not expired (vault token lookup), CA cert path, network to vault container +- Secrets missing: vault kv get — if NOT FOUND, add with vault kv put +- Sealed after restart: unseal with 3 keys from init.json +- TLS errors: confirm /opt/gbo/conf/system/certificates/ca/ca.crt exists, copy from vault container if missing +- init.json at /opt/gbo/bin/botserver-stack/conf/vault/vault-conf/ — root token + 5 unseal keys. NEVER commit. + +--- + +## DNS MANAGEMENT + +### Critical Rules + +1. Update serial number in SOA record (format: YYYYMMDDNN) +2. 
Run sync-zones.sh to propagate to secondary nameservers +3. Anonymize IPs and credentials in all documentation + +### Workflow + +```bash +# 1. Edit zone file +sudo incus exec dns -- nano /opt/gbo/data/.zone + +# 2. Update serial +sudo incus exec dns -- sed -i 's/YYYYMMDD01/YYYYMMDD02/' /opt/gbo/data/.zone + +# 3. Reload CoreDNS +sudo incus exec dns -- pkill -HUP coredns + +# 4. Sync to secondary NS +sudo /opt/gbo/bin/sync-zones.sh + +# 5. Verify +dig @9.9.9.9 A +short +``` + +### Adding HTTPS Subdomain + +Order: DNS record → wait propagation → add Caddy config → Caddy auto-obtains Let's Encrypt cert + +```bash +# After DNS propagated, add Caddy config +sudo sh -c 'cat >> /opt/gbo/conf/config << CADDYEOF + +. { import tls_config; reverse_proxy http://: { header_up Host {host}; header_up X-Real-IP {remote}; header_up X-Forwarded-Proto https } } +CADDYEOF' +sudo incus exec proxy -- systemctl restart proxy +``` + +> DNS/Proxy details: botbook/src/02-architecture-packages/containers.md + +--- + +## CI/CD — FORGEJO ALM + +ALM port is 4747 (NOT 3000!). Runner token in action_runner_token table. 
+Runner: gbuser uid 1000, workspace /opt/gbo/data/, SSH key /home/gbuser/.ssh/id_ed25519 + +### CI Status Codes + +0=pending, 1=success, 2=failure, 3=cancelled, 6=running + +### CI Queries (PROD-ALM database) + +```bash +# List recent runs +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -c \ + "SELECT id, title, status, to_timestamp(created) AS created_at FROM action_run ORDER BY id DESC LIMIT 10;" +# Failed run jobs +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -c \ + "SELECT id, name, status, task_id FROM action_run_job WHERE run_id = ;" +# Step status +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -c \ + "SELECT name, status, log_index, log_length FROM action_task_step WHERE task_id = ORDER BY index;" +# Read build log (zstd-compressed) +sudo incus file pull alm/opt/gbo/data/data/actions_log/ /tmp/ci-log.log.zst +zstd -d /tmp/ci-log.log.zst -o /tmp/ci-log.log && cat /tmp/ci-log.log +``` + +### CI Runner Troubleshooting + +| Symptom | Cause | Fix | +|---------|-------|-----| +| Runner not connecting | Wrong ALM port | Use port 4747 | +| /tmp permission denied | Wrong permissions | chmod 1777 /tmp on alm-ci | +| Runner down | Process crashed | pkill -9 forgejo; restart daemon | +| Build stuck at status 6 | DB race condition | Reset status in action_task/action_run | +| GLIBC mismatch | Wrong build env | Rebuild inside system container (Debian 12) | + +### Reset Stuck CI Run + +```sql +UPDATE action_task SET status = 0 WHERE id = ; +UPDATE action_run_job SET status = 0 WHERE run_id = ; +UPDATE action_run SET status = 0 WHERE id = ; +``` + +### Verify Deployment + +```bash +sudo incus exec system -- stat -c '%y' /opt/gbo/bin/botserver +sudo incus exec system -- systemctl status botserver --no-pager +curl -sf https:///api/health && echo OK || echo FAILED +``` + +Build timing: 2-5 min cold, 30-60s incremental, ~5s deploy + +> CI/CD details: botbook/src/12-ecosystem-reference/ci-cd.md + +--- + +## MINIO 
(DRIVE) OPERATIONS + +All bot files live in MinIO buckets. Use mc CLI at /opt/gbo/bin/mc from drive container. + +### Bucket Structure Per Bot + +{bot}.gbai/{bot}.gbdialog/ — BASIC scripts +{bot}.gbai/{bot}.gbot/ — config.csv +{bot}.gbai/{bot}.gbkb/ — knowledge base + +### Common mc Commands + +```bash +# All mc commands need PATH set +sudo incus exec drive -- bash -c 'export PATH=/opt/gbo/bin:$PATH && mc ' + +mc ls local/ # List all buckets +mc ls local/.gbai/ # List bot bucket +mc cat local/.gbai/.gbdialog/start.bas # Read file +mc cp local/.gbai/.gbdialog/file /tmp/ # Download +mc cp /tmp/file local/.gbai/.gbot/config.csv # Upload (triggers DriveMonitor) +mc stat local/.gbai/.gbot/config.csv # Show ETag/metadata +mc mb local/newbot.gbai # Create bucket +mc admin info local # Health check + +# Force re-sync (change ETag without content change) +mc cp local/.gbai/.gbot/config.csv local/.gbai/.gbot/config.csv +``` + +### Upload config.csv workflow: download via mc cat → edit locally → push via mc cp → wait 15s → verify in logs + +--- + +## DRIVEMONITOR & BOT CONFIGURATION + +DriveMonitor watches MinIO buckets and syncs changes to local filesystem and database every 10 seconds. + +Monitors: .gbdialog/ (BASIC scripts, downloads+recompiles), .gbot/ (config.csv, syncs to bot_configuration table), .gbkb/ (KB docs, downloads+indexes for vector search) + +### Database Tables + +- bot_configuration: bot_id, config_key, config_value, config_type, is_encrypted, updated_at +- gbot_config_sync: bot_id, config_file_path, last_sync_at, file_hash, sync_count + +### Config CSV Format + +No header, each line: key,value (e.g. 
llm-provider,groq or theme-color1,#cc0000) + +### Check Config Status + +```bash +sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c \ + "SELECT config_key, config_value FROM bot_configuration WHERE bot_id = (SELECT id FROM bots WHERE name = '') ORDER BY config_key;" +``` + +### Debug DriveMonitor + +```bash +sudo incus exec system -- tail -f /opt/gbo/logs/out.log | grep -E "DRIVE_MONITOR|check_gbot|config" +``` + +Empty gbot_config_sync = DriveMonitor not synced yet. If no log entries after 30s, restart botserver. Force re-sync: mc cp file over itself to change ETag. + +--- + +## DIRECTORY MANAGEMENT (ZITADEL) + +### Access + +- Internal: http://:9000 +- External: https:// +- Console: https:///ui/console +- Always use v2 API (v1 is deprecated) +- Must include -H "Host: " header or API returns 404 + +### Get Admin PAT + +```bash +PAT=$(sudo incus exec directory -- cat /opt/gbo/conf/directory/admin-pat.txt) +``` + +### User Operations (v2) — always include Host header + +Create user: POST /v2/users/human with username, profile, email, password JSON +List users: POST /v2/users with query offset/limit JSON +Update password: POST /v2/users/{id}/password with newPassword JSON +Create org: POST /v2/organizations with name JSON +Add domain: POST /v2/organizations/{org-id}/domains with domainName JSON + +All require: -H "Authorization: Bearer $PAT" -H "Host: " + +> Directory auth details: botbook/src/09-security/ + +--- + + +## ALERT RESPONSE PLAYBOOK + +### No IPv4 → set static IP (sudo incus config device set eth0 ipv4.address ; write /etc/network/interfaces; restart) +### Vault Sealed → unseal with 3 of 5 keys from init.json +### Botserver Down → systemctl restart; check ldd for missing libs +### Email No Internet → fix DNS (nameserver 8.8.8.8); or fix IPv6-only (see No IPv4) +### CI Build Failed → see CI/CD section for log retrieval and stuck run reset + +--- + +## BASIC COMPILATION + +Compilation in BasicCompiler (DriveMonitor) → .ast in 
work/{bot}.gbai/{bot}.gbdialog/. Runtime loads .ast only via ScriptService::run(). No .bas fallback at runtime. Suggestion dedup: Redis SADD, key suggestions:{bot_id}:{session_id}, read SMEMBERS. +--- + +## LOGGING + +```bash +sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -iE "error|panic|failed" # Errors +sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -i "" # Bot activity +sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -iE "drive|config" # DriveMonitor +sudo incus exec system -- tail -f /opt/gbo/logs/err.log | grep -iE "model|llm" # LLM calls +sudo incus exec alm-ci -- tail -f /opt/gbo/logs/forgejo-runner.log # CI runner +``` + +> Full troubleshooting: botbook/src/12-ecosystem-reference/troubleshooting.md + +--- + +## PROGRAM ACCESS + +| Program | Container | Path | Notes | +|---------|-----------|------|-------| +| botserver | system | /opt/gbo/bin/botserver | systemctl only | +| botui | system | /opt/gbo/bin/botui | systemctl only | +| mc | drive | /opt/gbo/bin/mc | PATH=/opt/gbo/bin:$PATH | +| psql | tables | /usr/bin/psql | psql -h localhost -U postgres -d botserver | +| vault | vault | /opt/gbo/bin/vault | Needs VAULT_ADDR, VAULT_TOKEN, VAULT_CACERT | + +### Quick psql + +```bash +# Bot config +sudo incus exec tables -- psql -h localhost -U postgres -d botserver -c \ + "SELECT config_key, config_value FROM bot_configuration WHERE bot_id = (SELECT id FROM bots WHERE name = '') ORDER BY config_key;" +# ALM CI runs +sudo incus exec tables -- psql -h localhost -U postgres -d PROD-ALM -c \ + "SELECT id, status, created FROM action_run ORDER BY id DESC LIMIT 5;" +``` + +--- + +## COMMON ERRORS + +| Error | Cause | Fix | +|-------|-------|-----| +| No IPv4 | DHCP failed | Set static IP | +| /tmp permission denied | Wrong perms | chmod 1777 /tmp | +| Token.Invalid | PAT expired | Regenerate in Zitadel console | +| failed SASL auth | Wrong DB password | Check Vault gbo/tables | +| GLIBC not found | Wrong build env | 
Rebuild in system container (Debian 12) | +| connection refused | Service down | systemctl restart | +| exec format error | Arch mismatch | Recompile for target | +| address in use | Port conflict | lsof -i :port | +| cert verify failed | Wrong CA | Copy from vault container | +| DNS lookup failed | No IPv4 | Check network config | +| botui can't reach server | Wrong URL | BOTSERVER_URL=http://localhost:5858 | +| Suggestions missing | .bas error | Check logs, clear /opt/gbo/work/ AST cache | +| IPv6 DNS timeouts | AAAA no IPv6 | RES_OPTIONS=inet4, IPV6=no | +| Dev paths in logs | Missing .env | DATA_DIR=/opt/gbo/work/ WORK_DIR=/opt/gbo/work/ | + +--- + +## ESCALATION + +1. Capture logs: sudo incus exec system -- tar czf /tmp/debug-$(date +%Y%m%d).tar.gz /opt/gbo/logs/ +2. Check AGENTS.md for dev troubleshooting +3. Review recent commits for breaking changes +4. Snapshot rollback (last resort) diff --git a/bottemplates/default.gbai/default.gbdialog/start.bas b/bottemplates/default.gbai/default.gbdialog/start.bas new file mode 100644 index 00000000..e3c72a3b --- /dev/null +++ b/bottemplates/default.gbai/default.gbdialog/start.bas @@ -0,0 +1,13 @@ +' start.bas - Salesianos Bot Configuration +' This runs once per session + +' Add switchers with colors +ADD SWITCHER tables AS "Tabelas" +ADD SWITCHER list AS "Lista" +ADD SWITCHER cards AS "Cards" + +' Add suggestions +ADD SUGGESTION "Cartas" +ADD SUGGESTION "Procedimentos" +ADD SUGGESTION "Ramais" +ADD SUGGESTION "Todos"