From 09ccb5e0dd83262ffc3d9f6d7962f81808f8970b Mon Sep 17 00:00:00 2001
From: "Rodrigo Rodriguez (Pragmatismo)"
Date: Mon, 1 Dec 2025 08:35:28 -0300
Subject: [PATCH] Update monitoring dashboard with animated SVG visualization

Replace static grid layout with interactive live system view featuring:
- Animated data packets flowing between service nodes
- Real-time metrics panels with HTMX polling
- Service status dots with pulse animations
- Resource utilization bars
- Live activity ticker
- Toggle between Live and Grid views (V key)

Documentation updated to reflect new visualization and API endpoints.
---
 docs/src/chapter-04-gbui/monitoring.md |  262 +++--
 ui/suite/monitoring/monitoring.html    | 1437 ++++++++++++++++++++++--
 2 files changed, 1491 insertions(+), 208 deletions(-)

diff --git a/docs/src/chapter-04-gbui/monitoring.md b/docs/src/chapter-04-gbui/monitoring.md
index 76bae2fb8..8d39c0ba7 100644
--- a/docs/src/chapter-04-gbui/monitoring.md
+++ b/docs/src/chapter-04-gbui/monitoring.md
@@ -1,74 +1,126 @@
 # Monitoring Dashboard

-The Monitoring Dashboard provides real-time visibility into your General Bots deployment, displaying system health, active sessions, and resource utilization in a clean tree-based interface.
+The Monitoring Dashboard provides real-time visibility into your General Bots deployment through an animated, interactive SVG visualization showing system health, active sessions, and resource utilization.

-## Live System Architecture
+## Live System Visualization

-Your General Bots deployment is a living ecosystem of interconnected components. The diagram below shows how all services work together in real-time:
+Live Monitoring Dashboard

-![Live Monitoring Organism](../assets/suite/live-monitoring-organism.svg)
+The dashboard displays BotServer at the center orchestrating all interactions, with animated data packets flowing between components:

-This animated diagram illustrates BotServer at the center orchestrating all interactions, with the data layer on the left comprising PostgreSQL, Qdrant, and MinIO for storage. Services on the right include BotModels, Vault, and Cache for AI and security functionality. Analytics at the bottom shows InfluxDB collecting metrics. The animated connection flows represent real-time data packets moving between components.
+- **Left Side (Data Layer)**: PostgreSQL, Qdrant vector database, and MinIO storage
+- **Right Side (Services)**: BotModels AI, Cache, and Vault security
+- **Center**: BotServer core with pulsing rings indicating activity
+- **Top**: Real-time metrics panels for sessions, messages, and response time
+- **Bottom**: Resource utilization bars and activity ticker

-## Overview
+## Accessing the Dashboard

-Access the Monitoring tab from the Suite interface to view active sessions and conversations, message throughput statistics, system resources including CPU, GPU, and memory utilization, service health status for all components, and bot activity metrics across your deployment.
+Access monitoring from the Suite interface:
+1. Click the apps menu (⋮⋮⋮)
+2. Select **Monitoring**
+3. Or navigate directly to `/monitoring`

-## Dashboard Layout
+## Dashboard Features

-The monitoring interface uses a hierarchical tree view for organized data display. The Sessions panel shows active connections, peak usage, and average duration. The Messages panel displays daily totals, hourly rates, and response times. The Resources panel presents CPU, Memory, GPU, and Disk utilization with visual progress bars. The Services panel indicates health status of PostgreSQL, Qdrant, Cache, Drive, and BotModels. The Active Bots panel lists all running bots with their respective session counts.
+### Animated System Architecture

-## Metrics Explained
+The SVG visualization shows real-time data flow:

-### Sessions
+| Component | Color | Description |
+|-----------|-------|-------------|
+| **BotServer** | Blue/Purple | Central orchestrator with rotating ring |
+| **PostgreSQL** | Blue | Primary database with cylinder icon |
+| **Qdrant** | Purple | Vector database with triangle nodes |
+| **MinIO** | Amber | Object storage with disk icon |
+| **BotModels** | Pink | AI/ML service with neural network icon |
+| **Cache** | Cyan | In-memory cache with lightning icon |
+| **Vault** | Green | Secrets management with lock icon |

-| Metric | Description |
-|--------|-------------|
-| **Active** | Current open conversations |
-| **Peak Today** | Maximum concurrent sessions today |
-| **Avg Duration** | Average session length |
-| **Unique Users** | Distinct users today |
+### Status Indicators

-### Messages
+Each service has a status dot:

-| Metric | Description |
-|--------|-------------|
-| **Today** | Total messages processed today |
-| **This Hour** | Messages in the current hour |
-| **Avg Response** | Average bot response time |
-| **Success Rate** | Percentage of successful responses |
+| Status | Color | Animation |
+|--------|-------|-----------|
+| **Running** | 🟢 Green | Gentle pulse |
+| **Warning** | 🟡 Amber | Fast pulse |
+| **Stopped** | 🔴 Red | No animation |

-### Resources
+### Real-Time Metrics

-| Resource | Description | Warning Threshold |
-|----------|-------------|-------------------|
-| **CPU** | Processor utilization | > 80% |
-| **Memory** | RAM usage | > 85% |
-| **GPU** | Graphics processor (if available) | > 90% |
-| **Disk** | Storage utilization | > 90% |
+Three metric panels at the top update automatically:

-### Services
+| Panel | Update Interval | Description |
+|-------|-----------------|-------------|
+| **Active Sessions** | 5 seconds | Current open conversations with trend |
+| **Messages Today** | 10 seconds | Total messages with hourly rate |
+| **Avg Response** | 10 seconds | Average response time in milliseconds |

-| Status | Indicator | Meaning |
-|--------|-----------|---------|
-| Running | Green dot | Service is healthy |
-| Warning | Yellow dot | Service degraded |
-| Stopped | Red dot | Service unavailable |
+### Resource Utilization

-## Real-Time Updates
+Resource bars show system health:

-The dashboard refreshes automatically using HTMX polling at different intervals depending on the metric type. Session counts update every 5 seconds for immediate visibility into user activity. Message metrics refresh every 10 seconds to show current throughput. Resource usage updates every 15 seconds since hardware metrics change more gradually. Service health checks run every 30 seconds to detect component issues without excessive overhead.
+| Resource | Gradient | Warning Threshold |
+|----------|----------|-------------------|
+| **CPU** | Blue/Purple | > 80% |
+| **Memory** | Green | > 85% |
+| **GPU** | Purple | > 90% |
+| **Disk** | Amber | > 90% |

-## Accessing via API
+### Activity Ticker

-Monitoring data is available programmatically through the REST API:
+A live ticker at the bottom shows the latest system events with a pulsing green indicator.

-```
+## View Modes
+
+Toggle between two views using the grid button or press `V`:
+
+### Live View (Default)
+The animated SVG visualization showing the complete system topology with flowing data packets.
+
+### Grid View
+Traditional panel-based layout with detailed tree views for each metric category:
+- Sessions panel with active, peak, and duration
+- Messages panel with counts and rates
+- Resources panel with progress bars
+- Services panel with health status
+- Bots panel with active bot list
+
+## Keyboard Shortcuts
+
+| Shortcut | Action |
+|----------|--------|
+| `V` | Toggle between Live and Grid view |
+| `R` | Refresh all metrics |
+
+## HTMX Integration
+
+The dashboard uses HTMX for real-time updates without full page reloads:
+
+| Endpoint | Interval | Data |
+|----------|----------|------|
+| `/api/monitoring/metric/sessions` | 5s | Session count |
+| `/api/monitoring/metric/messages` | 10s | Message count |
+| `/api/monitoring/metric/response_time` | 10s | Avg response |
+| `/api/monitoring/resources/bars` | 15s | Resource SVG bars |
+| `/api/monitoring/services/status` | 30s | Service health JSON |
+| `/api/monitoring/activity/latest` | 5s | Activity text |
+| `/api/monitoring/timestamp` | 5s | Last updated time |
+
+## API Access
+
+Access monitoring data programmatically:
+
+### Get Full Status
+
+```/dev/null/monitoring-api.txt
 GET /api/monitoring/status
 ```

 **Response:**
-```json
+
+```/dev/null/monitoring-response.json
 {
   "sessions": {
     "active": 12,
@@ -81,8 +133,8 @@ GET /api/monitoring/status
     "avg_response_ms": 1200
   },
   "resources": {
-    "cpu_percent": 78,
-    "memory_percent": 62,
+    "cpu_percent": 65,
+    "memory_percent": 72,
     "gpu_percent": 45,
     "disk_percent": 28
   },
@@ -91,14 +143,22 @@ GET /api/monitoring/status
     "qdrant": "running",
     "cache": "running",
     "drive": "running",
-    "botmodels": "stopped"
+    "botmodels": "running",
+    "vault": "running"
   }
 }
 ```

-## Understanding Component Health
+### Service-Specific Endpoints

-Each component in the system has specific health indicators that help identify potential issues before they impact users.
+| Endpoint | Returns |
+|----------|---------|
+| `/api/monitoring/services` | All service details |
+| `/api/monitoring/bots` | Active bot list |
+| `/api/monitoring/history?period=24h` | Historical metrics |
+| `/api/monitoring/prometheus` | Prometheus format export |
+
+## Component Health Details

 | Component | Health Check | Warning Signs |
 |-----------|--------------|---------------|
@@ -108,30 +168,12 @@ Each component in the system has specific health indicators that help identify p
 | **BotModels** | Token usage, response latency | > 2s response time |
 | **Vault** | Seal status, policy count | Unsealed without auth |
 | **Cache** | Hit rate, memory usage | < 80% hit rate |
-| **InfluxDB** | Write rate, retention | Write failures |
-
-## Console Mode
-
-In console mode, monitoring displays as text output suitable for terminal environments and SSH sessions:
-
-```bash
-./botserver --console --monitor
-```
-
-Output:
-```
-[MONITOR] 2024-01-15 14:32:00
-Sessions: 12 active (peak: 47)
-Messages: 1,234 today (89/hour)
-CPU: 78% | MEM: 62% | GPU: 45%
-Services: 4/5 running
-```

 ## Alerts Configuration

-Configure alert thresholds in `config.csv` to receive notifications when metrics exceed acceptable levels:
+Configure alert thresholds in `config.csv`:

-```csv
+```/dev/null/config-alerts.csv
 name,value
 alert-cpu-threshold,80
 alert-memory-threshold,85
@@ -140,50 +182,78 @@ alert-response-time-ms,5000
 alert-email,admin@example.com
 ```

-These are example configuration values that should be adjusted based on your infrastructure capacity and operational requirements.
+## Console Mode

-## Bot-Specific Metrics
+For terminal-based monitoring:

-View metrics for individual bots by querying the bot-specific endpoint:
-
-```
-GET /api/monitoring/bots/{bot_id}
+```/dev/null/console-command.bash
+./botserver --console --monitor
 ```

-This returns message count for the specific bot, active sessions currently connected to it, average response time for that bot's interactions, knowledge base query statistics showing search performance, and tool execution counts indicating which tools are used most frequently.
-
-## Historical Data
-
-Access historical metrics to analyze trends and patterns over time:
-
-```
-GET /api/monitoring/history?period=24h
+Output:
+```/dev/null/console-output.txt
+[MONITOR] 2025-01-15 14:32:00
+Sessions: 12 active (peak: 47)
+Messages: 1,234 today (89/hour)
+CPU: 65% | MEM: 72% | GPU: 45%
+Services: 6/6 running
 ```

-Supported periods include `1h` for the last hour with minute granularity, `24h` for the last 24 hours with hourly granularity, `7d` for the last 7 days with daily granularity, and `30d` for the last 30 days with daily granularity. Historical data helps identify patterns and plan capacity improvements.
+## Tips & Best Practices

-## Performance Tips
+💡 **Watch the data packets** - Flowing animations indicate active communication between components

-When experiencing high CPU usage, check for complex BASIC scripts that may be computationally expensive, review LLM call frequency to identify unnecessary AI invocations, and consider enabling semantic caching to reduce redundant processing.
+💡 **Monitor trends** - The session trend indicator (↑/↓) shows direction of change

-For high memory usage, reduce the `max-context-messages` configuration to limit conversation history size, clear unused KB collections that consume memory for vector storage, and restart services periodically to clear accumulated caches.
+
+💡 **Click services** - Click any service node in Live view to see detailed status

-When response times are slow, enable semantic caching to serve repeated queries instantly, optimize KB document sizes by splitting large documents, and consider using faster LLM models that trade some quality for speed.
+💡 **Set up alerts** - Configure thresholds before issues become critical

-## Integration with External Tools
+💡 **Use keyboard shortcuts** - Press `R` for quick refresh, `V` to toggle views

-Export metrics in Prometheus format for integration with external monitoring systems:
+💡 **Check historical data** - Query `/api/monitoring/history` for trend analysis

-```
-GET /api/monitoring/prometheus
-```
+## Troubleshooting

-This endpoint is compatible with Prometheus for metrics collection, Grafana for visualization dashboards, Datadog for cloud monitoring, and New Relic for application performance management.
+### Dashboard not loading

-## Monitoring Best Practices
+**Possible causes:**
+1. WebSocket connection failed
+2. API endpoint unreachable
+3. Browser blocking HTMX

-Check the live diagram regularly since the animated SVG shows real-time data flow and helps visualize system behavior. Set up alerts early rather than waiting for problems to occur before configuring notifications. Monitor trends in addition to absolute values because a gradual increase in CPU usage can be as significant as a sudden spike. Keep historical data by configuring InfluxDB retention policies to maintain useful history for capacity planning. Correlate metrics when troubleshooting since high response time combined with high CPU usually indicates a need for scaling.
+**Solution:**
+1. Check browser console for errors
+2. Verify `/api/monitoring/status` returns data
+3. Refresh the page
+
+### Metrics showing "--"
+
+**Possible causes:**
+1. Initial load in progress
+2. API timeout
+3. Service unavailable
+
+**Solution:**
+1. Wait 5-10 seconds for first update
+2. Check network tab for failed requests
+3. Verify services are running
+
+### Animations stuttering
+
+**Possible causes:**
+1. High CPU usage
+2. Many browser tabs open
+3. Hardware acceleration disabled
+
+**Solution:**
+1. Close unused tabs
+2. Enable hardware acceleration in browser
+3. Use Grid view for lower resource usage

 ## See Also

-The monitoring tutorial provides step-by-step guidance for monitoring your bot effectively. Console mode documentation covers the command-line interface for terminal-based monitoring. Configuration options explain all available settings. The Monitoring API reference provides complete endpoint documentation for programmatic access.
\ No newline at end of file
+- [Monitoring API Reference](../chapter-10-api/monitoring-api.md)
+- [Console Mode](./console-mode.md)
+- [Configuration Options](../chapter-08-config/README.md)
+- [Analytics App](./apps/analytics.md)
\ No newline at end of file
diff --git a/ui/suite/monitoring/monitoring.html b/ui/suite/monitoring/monitoring.html
index dda0382aa..b246b2951 100644
--- a/ui/suite/monitoring/monitoring.html
+++ b/ui/suite/monitoring/monitoring.html
@@ -13,9 +13,9 @@
  cy="12"
  r="10"
  stroke="currentColor"
- stroke-width="1.5"
+ stroke-width="1.5"
  fill="none"
- opacity="0.3"
+ opacity="0.3"
  />
  Monitoring Dashboard
- --
+
[Remainder of the monitoring.html hunk: markup lost in extraction, only text nodes survive. Service nodes: BotServer (● Running), PostgreSQL, Qdrant, MinIO, BotModels, Cache (⚡), Vault. Metric panels: ACTIVE SESSIONS (--, ↑ 0%), MESSAGES TODAY (--, 0/hr), AVG RESPONSE (--, ms). Resource bars: CPU 65%, MEM 72%, GPU 45%, DISK 28%. Activity ticker: "System monitoring active...". Legend: Running, Warning, Stopped.]
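
Reviewer note: the documentation hunk above lists warning thresholds for the resource bars (CPU > 80%, Memory > 85%, GPU > 90%, Disk > 90%) and a `resources` object in the `/api/monitoring/status` response. A minimal sketch of how a client might combine the two, assuming the thresholds map onto the `*_percent` fields as named below. `THRESHOLDS` and `over_threshold` are hypothetical names for illustration and are not part of BotServer.

```python
import json

# Warning thresholds from the "Resource Utilization" table in monitoring.md.
# Mapping them to the *_percent response fields is an assumption.
THRESHOLDS = {
    "cpu_percent": 80,
    "memory_percent": 85,
    "gpu_percent": 90,
    "disk_percent": 90,
}


def over_threshold(status: dict) -> list[str]:
    """Return the resource fields whose value exceeds its warning threshold."""
    resources = status.get("resources", {})
    return [
        name
        for name, limit in THRESHOLDS.items()
        if resources.get(name, 0) > limit  # strictly greater, as in "> 80%"
    ]


# Resource values copied from the documented sample response.
sample = json.loads(
    '{"resources": {"cpu_percent": 65, "memory_percent": 72,'
    ' "gpu_percent": 45, "disk_percent": 28}}'
)

print(over_threshold(sample))  # []
print(over_threshold({"resources": {"cpu_percent": 81, "disk_percent": 90}}))  # ['cpu_percent']
```

The comparison is strict, so a disk reading of exactly 90 does not count as a warning. Whether the server treats the boundary the same way is not stated in the patch.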