botserver/docs/multimodal-config.md

# Multimodal Configuration Guide

This document describes how to configure botserver to use the botmodels service for image, video, and audio generation, as well as vision (captioning) capabilities.

## Overview

The multimodal feature connects botserver to botmodels, a Python-based service similar to llama.cpp but for multimodal AI tasks. This lets BASIC scripts generate images, videos, and audio, and analyze visual content.

## Configuration Keys

Add the following configuration to your bot's `config.csv` file:

### Image Generator Settings

| Key | Default | Description |
|-----|---------|-------------|
| `image-generator-model` | - | Path to the image generation model (e.g., `../../../../data/diffusion/sd_turbo_f16.gguf`) |
| `image-generator-steps` | `4` | Number of inference steps for image generation |
| `image-generator-width` | `512` | Output image width in pixels |
| `image-generator-height` | `512` | Output image height in pixels |
| `image-generator-gpu-layers` | `20` | Number of layers to offload to GPU |
| `image-generator-batch-size` | `1` | Batch size for generation |

### Video Generator Settings

| Key | Default | Description |
|-----|---------|-------------|
| `video-generator-model` | - | Path to the video generation model (e.g., `../../../../data/diffusion/zeroscope_v2_576w`) |
| `video-generator-frames` | `24` | Number of frames to generate |
| `video-generator-fps` | `8` | Frames per second for output video |
| `video-generator-width` | `320` | Output video width in pixels |
| `video-generator-height` | `576` | Output video height in pixels |
| `video-generator-gpu-layers` | `15` | Number of layers to offload to GPU |
| `video-generator-batch-size` | `1` | Batch size for generation |

### BotModels Service Settings

| Key | Default | Description |
|-----|---------|-------------|
| `botmodels-enabled` | `false` | Enable/disable botmodels integration |
| `botmodels-host` | `0.0.0.0` | Host address for botmodels service |
| `botmodels-port` | `8085` | Port for botmodels service |
| `botmodels-api-key` | - | API key for authentication with botmodels |
| `botmodels-https` | `false` | Use HTTPS for connection to botmodels |
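
As an illustration of how the three connection keys fit together, the sketch below assembles a base URL from them. The function name `botmodels_base_url` is hypothetical, and the way botserver actually combines these values internally is not documented here; the defaults simply mirror the table above.

```python
from typing import Mapping


def botmodels_base_url(config: Mapping[str, str]) -> str:
    """Build a base URL from config.csv-style keys (hypothetical helper)."""
    # Defaults mirror the BotModels Service Settings table.
    host = config.get("botmodels-host", "0.0.0.0")
    port = int(config.get("botmodels-port", "8085"))
    # Assumption: only the literal string "true" (any case) enables HTTPS.
    use_https = config.get("botmodels-https", "false").strip().lower() == "true"
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}"
```

Note that `0.0.0.0` is conventionally a listen-on-all-interfaces address, so when botmodels runs on another machine you would normally set `botmodels-host` to a concrete hostname or IP address.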

## Example config.csv

```csv
key,value
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
image-generator-gpu-layers,20
image-generator-batch-size,1
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
video-generator-width,320
video-generator-height,576
video-generator-gpu-layers,15
video-generator-batch-size,1
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
```
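
To sanity-check a `config.csv` before starting the bot, a small script can parse the `key,value` rows and overlay them on the documented defaults. This is a minimal sketch, not botserver's own loader: `load_config` and `DEFAULTS` are hypothetical names, and only a few of the documented keys are listed for brevity.

```python
import csv
import io

# A subset of the documented defaults; extend with the remaining keys as needed.
DEFAULTS = {
    "image-generator-steps": "4",
    "image-generator-width": "512",
    "image-generator-height": "512",
    "botmodels-enabled": "false",
    "botmodels-host": "0.0.0.0",
    "botmodels-port": "8085",
    "botmodels-https": "false",
}


def load_config(text: str) -> dict:
    """Parse key,value CSV text and overlay it on the defaults above."""
    settings = dict(DEFAULTS)
    for row in csv.reader(io.StringIO(text)):
        # Skip blank lines, malformed rows, and the "key,value" header row.
        if len(row) < 2 or row[0].strip() == "key":
            continue
        settings[row[0].strip()] = row[1].strip()
    return settings
```

For example, `load_config(open("config.csv").read())["botmodels-enabled"]` yields `"true"` for the file shown above.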

## BASIC Keywords

Once configured, the following keywords become available in BASIC scripts:

### IMAGE

Generate an image from a text prompt.

```basic
file = IMAGE "a cute cat playing with yarn"
SEND FILE TO user, file
```

### VIDEO

Generate a video from a text prompt.

```basic
file = VIDEO "a rocket launching into space"
SEND FILE TO user, file
```

### AUDIO

Generate speech audio from text.

```basic
file = AUDIO "Hello, welcome to our service!"
SEND FILE TO user, file
```

### SEE

Get a caption/description of an image or video file.

```basic
caption = SEE "/path/to/image.jpg"
TALK caption

// Also works with video files
description = SEE "/path/to/video.mp4"
TALK description
```

## Starting BotModels Service

Before using multimodal features, start the botmodels service:

```bash
cd botmodels
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085
```

Or with HTTPS:

```bash
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem
```

## API Endpoints (BotModels)

The botmodels service exposes these REST endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/image/generate` | POST | Generate image from prompt |
| `/api/video/generate` | POST | Generate video from prompt |
| `/api/speech/generate` | POST | Generate speech from text |
| `/api/speech/totext` | POST | Convert audio to text |
| `/api/vision/describe` | POST | Get description of an image |
| `/api/vision/describe_video` | POST | Get description of a video |
| `/api/vision/vqa` | POST | Visual question answering |
| `/api/health` | GET | Health check |

All endpoints require the `X-API-Key` header for authentication.
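
The sketch below shows one way to attach the `X-API-Key` header when calling an endpoint from Python's standard library. It only builds the request; nothing is sent. The helper name `build_generate_request` is hypothetical, and the JSON body shape (`{"prompt": ...}`) is an assumption, since the request schema is not specified in this guide.

```python
import json
import urllib.request


def build_generate_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/image/generate."""
    # Assumption: the service accepts a JSON object with a "prompt" field.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/image/generate",
        data=body,
        headers={
            "X-API-Key": api_key,  # required by all endpoints
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the service is running, the request can be sent with `urllib.request.urlopen(req)`; the response format depends on the botmodels implementation.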

## Architecture

```
┌─────────────┐    HTTP(S)     ┌─────────────┐
│  botserver  │ ────────────▶  │  botmodels  │
│   (Rust)    │                │  (Python)   │
└─────────────┘                └─────────────┘
      │                              │
      │ BASIC Keywords               │ AI Models
      │ - IMAGE                      │ - Stable Diffusion
      │ - VIDEO                      │ - Zeroscope
      │ - AUDIO                      │ - TTS/Whisper
      │ - SEE                        │ - BLIP2
      ▼                              ▼
┌─────────────┐                ┌─────────────┐
│   config    │                │   outputs   │
│   .csv      │                │  (files)    │
└─────────────┘                └─────────────┘
```

## Troubleshooting

### "BotModels is not enabled"

Set `botmodels-enabled` to `true` in your `config.csv` (that is, add the row `botmodels-enabled,true`).

### Connection refused

1. Ensure the botmodels service is running.
2. Check that `botmodels-host` and `botmodels-port` match the address the service listens on.
3. Verify firewall settings.
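
The first two checks can be partly automated with a plain TCP probe. This is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and a successful TCP connection only shows that something is listening on that port, not that it is botmodels.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

For example, `can_connect("127.0.0.1", 8085)` returning `False` on the botmodels machine itself suggests the service is not running or is bound to a different port.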

### Authentication failed

Ensure that `botmodels-api-key` in `config.csv` matches the `API_KEY` environment variable set for the botmodels service.

### Model not found

Verify model paths are correct and models are downloaded to the expected locations.
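
A quick way to check the configured model paths is to resolve each one and test whether it exists. In this sketch, `missing_models` is a hypothetical helper, and `base_dir` is an assumption: this guide does not state which directory relative paths such as `../../../../data/diffusion/...` are resolved against, so pass whichever directory applies to your deployment.

```python
from pathlib import Path

# Config keys that hold model paths, per the tables in this guide.
MODEL_KEYS = ("image-generator-model", "video-generator-model")


def missing_models(config: dict, base_dir: str) -> list:
    """Return the config keys whose model path does not exist under base_dir."""
    missing = []
    for key in MODEL_KEYS:
        value = config.get(key)
        if not value:
            continue  # key not set; nothing to check
        # Absolute paths in `value` override base_dir (pathlib join semantics).
        if not (Path(base_dir) / value).exists():
            missing.append(key)
    return missing
```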

## Security Notes

1. Always use HTTPS in production (set `botmodels-https` to `true`).
2. Use strong, unique API keys.
3. Restrict network access to the botmodels service.
4. Consider running botmodels on a separate GPU server.