- ADD SUGGESTION TOOL "name" WITH params AS "text" for pre-filled
params
- Add secrets module for Vault integration with minimal .env approach
- Update LLM providers documentation with model recommendations
- Refactor template dialogs for consistency:
- Use PARAM with proper types and DESCRIPTION
- Use WITH blocks for structured data
- Simplify TALK messages (remove emoji prefixes)
- Add RETURN statements to tools
- Add proper CLEAR SUGGESTIONS and ADD TOOL patterns
- Add analytics-dashboard template demonstrating KB Statistics usage ```
484 lines
No EOL
11 KiB
Markdown
484 lines
No EOL
11 KiB
Markdown
# Observability
|
|
|
|
General Bots uses a comprehensive observability stack for monitoring, logging, and metrics collection. This chapter explains how logging works and how Vector integrates without requiring code changes.
|
|
|
|
## Architecture Overview
|
|
|
|

|
|
|
|
*Vector Agent collects logs from BotServer without requiring any code changes.*
|
|
|
|
## No Code Changes Required
|
|
|
|
**You do NOT need to replace `log::trace!()`, `log::info!()`, `log::error!()` calls.**
|
|
|
|
Vector works by:
|
|
|
|
1. **Tailing log files** - Reads from `./botserver-stack/logs/`
|
|
2. **Parsing log lines** - Extracts level, timestamp, message
|
|
3. **Routing by level** - Sends errors to alerts, metrics to InfluxDB
|
|
4. **Enriching data** - Adds hostname, service name, etc.
|
|
|
|
Log directory structure:
|
|
- `logs/system/` - BotServer application logs
|
|
- `logs/drive/` - MinIO logs
|
|
- `logs/tables/` - PostgreSQL logs
|
|
- `logs/cache/` - Redis logs
|
|
- `logs/llm/` - LLM server logs
|
|
- `logs/email/` - Stalwart logs
|
|
- `logs/directory/` - Zitadel logs
|
|
- `logs/vectordb/` - Qdrant logs
|
|
- `logs/meet/` - LiveKit logs
|
|
- `logs/alm/` - Forgejo logs
|
|
|
|
This approach:
|
|
- Requires zero code changes
|
|
- Works with existing logging
|
|
- Can be added/removed without recompilation
|
|
- Scales independently from the application
|
|
|
|
## Vector Configuration
|
|
|
|
### Installation
|
|
|
|
Vector is installed as the **observability** component:
|
|
|
|
```bash
|
|
./botserver install observability
|
|
```
|
|
|
|
### Configuration File
|
|
|
|
Configuration is at `./botserver-stack/conf/monitoring/vector.toml`:
|
|
|
|
```toml
|
|
# Vector Configuration for General Bots
|
|
# Collects logs without requiring code changes
|
|
# Component: observability (Vector)
|
|
# Config: ./botserver-stack/conf/monitoring/vector.toml
|
|
|
|
#
|
|
# SOURCES - Where logs come from
|
|
#
|
|
|
|
[sources.botserver_logs]
|
|
type = "file"
|
|
include = ["./botserver-stack/logs/system/*.log"]
|
|
read_from = "beginning"
|
|
|
|
[sources.drive_logs]
|
|
type = "file"
|
|
include = ["./botserver-stack/logs/drive/*.log"]
|
|
read_from = "beginning"
|
|
|
|
[sources.tables_logs]
|
|
type = "file"
|
|
include = ["./botserver-stack/logs/tables/*.log"]
|
|
read_from = "beginning"
|
|
|
|
[sources.cache_logs]
|
|
type = "file"
|
|
include = ["./botserver-stack/logs/cache/*.log"]
|
|
read_from = "beginning"
|
|
|
|
[sources.llm_logs]
|
|
type = "file"
|
|
include = ["./botserver-stack/logs/llm/*.log"]
|
|
read_from = "beginning"
|
|
|
|
[sources.service_logs]
|
|
type = "file"
|
|
include = [
|
|
"./botserver-stack/logs/email/*.log",
|
|
"./botserver-stack/logs/directory/*.log",
|
|
"./botserver-stack/logs/vectordb/*.log",
|
|
"./botserver-stack/logs/meet/*.log",
|
|
"./botserver-stack/logs/alm/*.log"
|
|
]
|
|
read_from = "beginning"
|
|
|
|
#
|
|
# TRANSFORMS - Parse and enrich logs
|
|
#
|
|
|
|
[transforms.parse_botserver]
|
|
type = "remap"
|
|
inputs = ["botserver_logs"]
|
|
source = '''
|
|
# Parse standard log format: [TIMESTAMP LEVEL target] message
|
|
. = parse_regex!(.message, r'^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z?)\s+(?P<level>\w+)\s+(?P<target>\S+)\s+(?P<message>.*)$')
|
|
|
|
# Convert timestamp
|
|
.timestamp = parse_timestamp!(.timestamp, "%Y-%m-%dT%H:%M:%S%.fZ")
|
|
|
|
# Normalize level
|
|
.level = downcase!(.level)
|
|
|
|
# Add service name
|
|
.service = "botserver"
|
|
|
|
# Extract session_id if present
|
|
if contains(string!(.message), "session") {
|
|
session_match = parse_regex(.message, r'session[:\s]+(?P<session_id>[a-f0-9-]+)') ?? {}
|
|
if exists(session_match.session_id) {
|
|
.session_id = session_match.session_id
|
|
}
|
|
}
|
|
|
|
# Extract user_id if present
|
|
if contains(string!(.message), "user") {
|
|
user_match = parse_regex(.message, r'user[:\s]+(?P<user_id>[a-f0-9-]+)') ?? {}
|
|
if exists(user_match.user_id) {
|
|
.user_id = user_match.user_id
|
|
}
|
|
}
|
|
'''
|
|
|
|
[transforms.parse_service_logs]
|
|
type = "remap"
|
|
inputs = ["service_logs"]
|
|
source = '''
|
|
# Basic parsing for service logs
|
|
.timestamp = now()
|
|
.level = "info"
|
|
|
|
# Detect errors
|
|
if contains(string!(.message), "ERROR") || contains(string!(.message), "error") {
|
|
.level = "error"
|
|
}
|
|
if contains(string!(.message), "WARN") || contains(string!(.message), "warn") {
|
|
.level = "warn"
|
|
}
|
|
|
|
# Extract service name from file path
|
|
.service = replace(string!(.file), r'.*/(\w+)\.log$', "$1")
|
|
'''
|
|
|
|
#
|
|
# FILTERS - Route by log level
|
|
#
|
|
|
|
[transforms.filter_errors]
|
|
type = "filter"
|
|
inputs = ["parse_botserver", "parse_service_logs"]
|
|
condition = '.level == "error"'
|
|
|
|
[transforms.filter_warnings]
|
|
type = "filter"
|
|
inputs = ["parse_botserver", "parse_service_logs"]
|
|
condition = '.level == "warn"'
|
|
|
|
[transforms.filter_info]
|
|
type = "filter"
|
|
inputs = ["parse_botserver"]
|
|
condition = '.level == "info" || .level == "debug"'
|
|
|
|
#
|
|
# METRICS - Convert logs to metrics
|
|
#
|
|
|
|
[transforms.log_to_metrics]
|
|
type = "log_to_metric"
|
|
inputs = ["parse_botserver"]
|
|
|
|
[[transforms.log_to_metrics.metrics]]
|
|
type = "counter"
|
|
field = "level"
|
|
name = "log_events_total"
|
|
tags.level = "{{level}}"
|
|
tags.service = "{{service}}"
|
|
|
|
[[transforms.log_to_metrics.metrics]]
|
|
type = "counter"
|
|
field = "message"
|
|
name = "errors_total"
|
|
tags.service = "{{service}}"
|
|
increment_by_value = false
|
|
|
|
#
|
|
# SINKS - Where logs go
|
|
#
|
|
|
|
# All logs to file (backup)
|
|
[sinks.file_backup]
|
|
type = "file"
|
|
inputs = ["parse_botserver", "parse_service_logs"]
|
|
path = "./botserver-stack/logs/vector/all-%Y-%m-%d.log"
|
|
encoding.codec = "json"
|
|
|
|
# Metrics to InfluxDB
|
|
[sinks.influxdb]
|
|
type = "influxdb_metrics"
|
|
inputs = ["log_to_metrics"]
|
|
endpoint = "http://localhost:8086"
|
|
org = "pragmatismo"
|
|
bucket = "metrics"
|
|
token = "${INFLUXDB_TOKEN}"
|
|
|
|
# Errors to alerting (webhook)
|
|
[sinks.alert_webhook]
|
|
type = "http"
|
|
inputs = ["filter_errors"]
|
|
uri = "http://localhost:8080/api/admin/alerts"
|
|
method = "post"
|
|
encoding.codec = "json"
|
|
|
|
# Console output (for debugging)
|
|
[sinks.console]
|
|
type = "console"
|
|
inputs = ["filter_errors"]
|
|
encoding.codec = "text"
|
|
```
|
|
|
|
## Log Format
|
|
|
|
BotServer uses the standard Rust `log` crate format:
|
|
|
|
```
|
|
2024-01-15T10:30:45.123Z INFO botserver::core::bot Processing message for session: abc-123
|
|
2024-01-15T10:30:45.456Z DEBUG botserver::llm::cache Cache hit for prompt hash: xyz789
|
|
2024-01-15T10:30:45.789Z ERROR botserver::drive::upload Failed to upload file: permission denied
|
|
```
|
|
|
|
Vector parses this automatically without code changes.
|
|
|
|
## Metrics Collection
|
|
|
|
### Automatic Metrics
|
|
|
|
Vector converts log events to metrics:
|
|
|
|
| Metric | Description |
|
|
|--------|-------------|
|
|
| `log_events_total` | Total log events by level |
|
|
| `errors_total` | Error count by service |
|
|
| `warnings_total` | Warning count by service |
|
|
|
|
### Application Metrics
|
|
|
|
BotServer also exposes metrics via `/api/metrics` (Prometheus format):
|
|
|
|
```
|
|
# HELP botserver_sessions_active Current active sessions
|
|
# TYPE botserver_sessions_active gauge
|
|
botserver_sessions_active 42
|
|
|
|
# HELP botserver_messages_total Total messages processed
|
|
# TYPE botserver_messages_total counter
|
|
botserver_messages_total{channel="web"} 1234
|
|
botserver_messages_total{channel="whatsapp"} 567
|
|
|
|
# HELP botserver_llm_latency_seconds LLM response latency
|
|
# TYPE botserver_llm_latency_seconds histogram
|
|
botserver_llm_latency_seconds_bucket{le="0.5"} 100
|
|
botserver_llm_latency_seconds_bucket{le="1.0"} 150
|
|
botserver_llm_latency_seconds_bucket{le="2.0"} 180
|
|
```
|
|
|
|
Vector can scrape these directly:
|
|
|
|
```toml
|
|
[sources.prometheus_metrics]
|
|
type = "prometheus_scrape"
|
|
endpoints = ["http://localhost:8080/api/metrics"]
|
|
scrape_interval_secs = 15
|
|
```
|
|
|
|
## Alerting
|
|
|
|
### Error Alerts
|
|
|
|
Vector sends errors to a webhook for alerting:
|
|
|
|
```toml
|
|
[sinks.alert_webhook]
|
|
type = "http"
|
|
inputs = ["filter_errors"]
|
|
uri = "http://localhost:8080/api/admin/alerts"
|
|
method = "post"
|
|
encoding.codec = "json"
|
|
```
|
|
|
|
### Slack Integration
|
|
|
|
```toml
|
|
[sinks.slack_alerts]
|
|
type = "http"
|
|
inputs = ["filter_errors"]
|
|
uri = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
|
|
method = "post"
|
|
encoding.codec = "json"
|
|
|
|
[sinks.slack_alerts.request]
|
|
headers.content-type = "application/json"
|
|
|
|
[sinks.slack_alerts.encoding]
|
|
codec = "json"
|
|
```
|
|
|
|
### Email Alerts
|
|
|
|
Use with an SMTP relay or webhook-to-email service:
|
|
|
|
```toml
|
|
[sinks.email_alerts]
|
|
type = "http"
|
|
inputs = ["filter_errors"]
|
|
uri = "http://localhost:8025/api/send"
|
|
method = "post"
|
|
encoding.codec = "json"
|
|
```
|
|
|
|
## Grafana Dashboards
|
|
|
|
### Pre-built Dashboard
|
|
|
|
Import the General Bots dashboard from `templates/grafana-dashboard.json`:
|
|
|
|
1. Open Grafana at `http://localhost:3000`
|
|
2. Go to Dashboards → Import
|
|
3. Upload `grafana-dashboard.json`
|
|
4. Select InfluxDB data source
|
|
|
|
### Key Panels
|
|
|
|
| Panel | Query |
|
|
|-------|-------|
|
|
| Active Sessions | `from(bucket:"metrics") \|> filter(fn: (r) => r._measurement == "sessions_active")` |
|
|
| Messages/Minute | `from(bucket:"metrics") \|> filter(fn: (r) => r._measurement == "messages_total") \|> derivative()` |
|
|
| Error Rate | `from(bucket:"metrics") \|> filter(fn: (r) => r.level == "error") \|> count()` |
|
|
| LLM Latency P95 | `from(bucket:"metrics") \|> filter(fn: (r) => r._measurement == "llm_latency") \|> quantile(q: 0.95)` |
|
|
|
|
## Configuration Options
|
|
|
|
### config.csv Settings
|
|
|
|
```csv
|
|
# Observability settings
|
|
observability-enabled,true
|
|
observability-log-level,info
|
|
observability-metrics-endpoint,/api/metrics
|
|
observability-vector-enabled,true
|
|
```
|
|
|
|
### Log Levels
|
|
|
|
| Level | When to Use |
|
|
|-------|-------------|
|
|
| `error` | Something failed, requires attention |
|
|
| `warn` | Unexpected but handled, worth noting |
|
|
| `info` | Normal operations, key events |
|
|
| `debug` | Detailed flow, development |
|
|
| `trace` | Very detailed, performance impact |
|
|
|
|
Set in config.csv:
|
|
|
|
```csv
|
|
log-level,info
|
|
```
|
|
|
|
Or environment:
|
|
|
|
```bash
|
|
RUST_LOG=info ./botserver
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
### Vector Not Collecting Logs
|
|
|
|
```bash
|
|
# Check Vector status
|
|
systemctl status gbo-observability
|
|
|
|
# View Vector logs
|
|
journalctl -u gbo-observability -f
|
|
|
|
# Test configuration
|
|
vector validate ./botserver-stack/conf/monitoring/vector.toml
|
|
```
|
|
|
|
### Missing Metrics in InfluxDB
|
|
|
|
```bash
|
|
# Check InfluxDB connection
|
|
curl http://localhost:8086/health
|
|
|
|
# Verify bucket exists
|
|
influx bucket list
|
|
|
|
# Check Vector sink status
|
|
vector top
|
|
```
|
|
|
|
### High Log Volume
|
|
|
|
If logs are too verbose:
|
|
|
|
1. Increase log level in config.csv
|
|
2. Add filters in Vector to drop debug logs
|
|
3. Set retention policies in InfluxDB
|
|
|
|
```toml
|
|
# Drop debug logs before sending to InfluxDB
|
|
[transforms.drop_debug]
|
|
type = "filter"
|
|
inputs = ["parse_botserver"]
|
|
condition = '.level != "debug" && .level != "trace"'
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
### 1. Don't Log Sensitive Data
|
|
|
|
```rust
|
|
// Bad
|
|
log::info!("User password: {}", password);
|
|
|
|
// Good
|
|
log::info!("User {} authenticated successfully", user_id);
|
|
```
|
|
|
|
### 2. Use Structured Context
|
|
|
|
```rust
|
|
// Better for parsing
|
|
log::info!("session={} user={} action=message_sent", session_id, user_id);
|
|
```
|
|
|
|
### 3. Set Appropriate Levels
|
|
|
|
```rust
|
|
// Errors: things that failed
|
|
log::error!("Database connection failed: {}", err);
|
|
|
|
// Warnings: unusual but handled
|
|
log::warn!("Retrying LLM request after timeout");
|
|
|
|
// Info: normal operations
|
|
log::info!("Session {} started", session_id);
|
|
|
|
// Debug: development details
|
|
log::debug!("Cache lookup for key: {}", key);
|
|
|
|
// Trace: very detailed
|
|
log::trace!("Entering function process_message");
|
|
```
|
|
|
|
### 4. Keep Vector Config Simple
|
|
|
|
Start with basic collection, add transforms as needed.
|
|
|
|
## Summary
|
|
|
|
- **No code changes needed** - Vector collects from log files
|
|
- **Keep using log macros** - `log::info!()`, `log::error!()`, etc.
|
|
- **Vector handles routing** - Errors to alerts, all to storage
|
|
- **InfluxDB for metrics** - Time-series storage and queries
|
|
- **Grafana for dashboards** - Visualize everything
|
|
|
|
## Next Steps
|
|
|
|
- [Scaling and Load Balancing](./scaling.md) - Scale observability with your cluster
|
|
- [Infrastructure Design](./infrastructure.md) - Full architecture overview
|
|
- [Monitoring Dashboard](../chapter-04-gbui/monitoring.md) - Built-in monitoring UI |