botserver/docs/multimodal-config.md
Rodrigo Rodriguez (Pragmatismo) a21292daa3 Add multimodal module for botmodels integration
Introduces IMAGE, VIDEO, AUDIO, and SEE keywords for BASIC scripts that
connect to the botmodels service for AI-powered media generation and
vision/captioning capabilities.

- Add BotModelsClient for HTTP communication with botmodels service
- Implement BASIC keywords: IMAGE, VIDEO, AUDIO (generation), SEE
  (captioning)
- Support configuration via config.csv for models
2025-11-29 20:40:08 -03:00

Multimodal Configuration Guide

This document describes how to configure botserver to use the botmodels service for image, video, and audio generation, and for vision/captioning.

Overview

The multimodal feature connects botserver to botmodels, a Python-based service that plays a role similar to llama.cpp but for multimodal AI tasks. With it, BASIC scripts can generate images, video, and audio, and can analyze visual content.

Configuration Keys

Add the following configuration to your bot's config.csv file:

Image Generator Settings

| Key | Default | Description |
| --- | --- | --- |
| image-generator-model | - | Path to the image generation model (e.g., ../../../../data/diffusion/sd_turbo_f16.gguf) |
| image-generator-steps | 4 | Number of inference steps for image generation |
| image-generator-width | 512 | Output image width in pixels |
| image-generator-height | 512 | Output image height in pixels |
| image-generator-gpu-layers | 20 | Number of layers to offload to GPU |
| image-generator-batch-size | 1 | Batch size for generation |

Video Generator Settings

| Key | Default | Description |
| --- | --- | --- |
| video-generator-model | - | Path to the video generation model (e.g., ../../../../data/diffusion/zeroscope_v2_576w) |
| video-generator-frames | 24 | Number of frames to generate |
| video-generator-fps | 8 | Frames per second for output video |
| video-generator-width | 320 | Output video width in pixels |
| video-generator-height | 576 | Output video height in pixels |
| video-generator-gpu-layers | 15 | Number of layers to offload to GPU |
| video-generator-batch-size | 1 | Batch size for generation |
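
The frame count and frame rate together determine how long a generated clip plays. The short sketch below works this out for the defaults in the table; the function name `clip_duration_seconds` is illustrative and not part of botserver or botmodels.

```python
def clip_duration_seconds(frames, fps):
    """Playback length of a clip with `frames` frames shown at `fps` frames per second."""
    return frames / fps


# Defaults from the table: video-generator-frames=24, video-generator-fps=8
print(clip_duration_seconds(24, 8))  # prints 3.0
```

With the default values, a generated video is about three seconds long. Raising video-generator-frames while leaving video-generator-fps unchanged lengthens the clip.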

BotModels Service Settings

| Key | Default | Description |
| --- | --- | --- |
| botmodels-enabled | false | Enable/disable botmodels integration |
| botmodels-host | 0.0.0.0 | Host address for botmodels service |
| botmodels-port | 8085 | Port for botmodels service |
| botmodels-api-key | - | API key for authentication with botmodels |
| botmodels-https | false | Use HTTPS for connection to botmodels |

Example config.csv

key,value
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
image-generator-gpu-layers,20
image-generator-batch-size,1
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
video-generator-width,320
video-generator-height,576
video-generator-gpu-layers,15
video-generator-batch-size,1
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
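
To illustrate how the botmodels-* keys combine, the sketch below parses a config.csv in the format shown above and derives the base URL of the service. It is not botserver's actual loader (botserver is written in Rust); `load_config`, `botmodels_base_url`, and the `DEFAULTS` dict are hypothetical names, with the default values copied from the tables in this guide. The sample uses 127.0.0.1 as the host, which assumes botmodels runs on the same machine.

```python
import csv
import io

# Defaults copied from the tables in this guide; the dict itself is illustrative.
DEFAULTS = {
    "botmodels-enabled": "false",
    "botmodels-host": "0.0.0.0",
    "botmodels-port": "8085",
    "botmodels-https": "false",
}


def load_config(text):
    """Parse key,value rows (with a header row) into a dict layered over DEFAULTS."""
    config = dict(DEFAULTS)
    for row in csv.DictReader(io.StringIO(text)):
        key, value = row.get("key"), row.get("value")
        if key is None or value is None:
            continue  # skip malformed rows
        config[key.strip()] = value.strip()
    return config


def botmodels_base_url(config):
    """Build scheme://host:port from the botmodels-* keys."""
    scheme = "https" if config["botmodels-https"].lower() == "true" else "http"
    return f"{scheme}://{config['botmodels-host']}:{config['botmodels-port']}"


sample = """key,value
botmodels-enabled,true
botmodels-host,127.0.0.1
botmodels-https,true
"""
config = load_config(sample)
print(botmodels_base_url(config))  # prints https://127.0.0.1:8085
```

Keys missing from the file fall back to the defaults, so the port in this example stays at 8085.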

BASIC Keywords

Once configured, the following keywords become available in BASIC scripts:

IMAGE

Generate an image from a text prompt.

file = IMAGE "a cute cat playing with yarn"
SEND FILE TO user, file

VIDEO

Generate a video from a text prompt.

file = VIDEO "a rocket launching into space"
SEND FILE TO user, file

AUDIO

Generate speech audio from text.

file = AUDIO "Hello, welcome to our service!"
SEND FILE TO user, file

SEE

Get a caption/description of an image or video file.

caption = SEE "/path/to/image.jpg"
TALK caption

// Also works with video files
description = SEE "/path/to/video.mp4"
TALK description
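
SEE accepts both image and video files, while the botmodels API (see API Endpoints below) has separate endpoints for describing images and videos. One plausible way for a client to choose between them is by file extension. The sketch below shows that idea only; it is an assumption about the dispatch, not a description of how botserver implements SEE, and the extension list is hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical extension list; the real service may detect media types differently.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv"}


def describe_endpoint(path):
    """Return the vision endpoint suited to the file at `path`."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in VIDEO_EXTENSIONS:
        return "/api/vision/describe_video"
    return "/api/vision/describe"


print(describe_endpoint("/path/to/image.jpg"))  # prints /api/vision/describe
print(describe_endpoint("/path/to/video.mp4"))  # prints /api/vision/describe_video
```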

Starting BotModels Service

Before using multimodal features, start the botmodels service:

cd botmodels
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085

Or with HTTPS:

python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem
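
The service may need some time after startup before it accepts requests. A small helper can poll the /api/health endpoint (listed under API Endpoints) until it answers. The sketch below assumes that a ready service returns HTTP 200 on that endpoint, which this guide does not state explicitly; `fetch_status` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.request


def fetch_status(url, api_key):
    """Return the HTTP status of a GET to `url`, or None if the request fails."""
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    try:
        with urllib.request.urlopen(request, timeout=2) as response:
            return response.status
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None


def wait_until_healthy(base_url, api_key, attempts=10, delay=1.0, fetch=fetch_status):
    """Poll /api/health until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(f"{base_url}/api/health", api_key) == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next try
    return False
```

A call such as `wait_until_healthy("http://127.0.0.1:8085", "your-secret-key")` returns True once the service responds, or False after ten failed attempts. The `fetch` parameter exists so the polling logic can be exercised without a running service.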

API Endpoints (BotModels)

The botmodels service exposes these REST endpoints:

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/image/generate | POST | Generate image from prompt |
| /api/video/generate | POST | Generate video from prompt |
| /api/speech/generate | POST | Generate speech from text |
| /api/speech/totext | POST | Convert audio to text |
| /api/vision/describe | POST | Get description of an image |
| /api/vision/describe_video | POST | Get description of a video |
| /api/vision/vqa | POST | Visual question answering |
| /api/health | GET | Health check |

All endpoints require the X-API-Key header for authentication.
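
As an illustration of the header requirement, the sketch below prepares a POST to /api/image/generate with the X-API-Key header set. The request body is an assumption: this guide does not document the payload schema, so the JSON object with a single `prompt` field is hypothetical, as is the helper name `build_generate_request`. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_generate_request(base_url, api_key, prompt):
    """Prepare (but do not send) a POST to the image generation endpoint."""
    # Assumption: the endpoint accepts a JSON body with a "prompt" field.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/image/generate",
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


request = build_generate_request(
    "http://127.0.0.1:8085", "your-secret-key", "a cute cat playing with yarn"
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request, timeout=...)` would send it. The response format is not specified in this guide, so handling it is left out.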

Architecture

┌─────────────┐     HTTP(S)    ┌─────────────┐
│  botserver  │ ────────────▶  │  botmodels  │
│   (Rust)    │                │  (Python)   │
└─────────────┘                └─────────────┘
      │                              │
      │ BASIC Keywords               │ AI Models
      │ - IMAGE                      │ - Stable Diffusion
      │ - VIDEO                      │ - Zeroscope
      │ - AUDIO                      │ - TTS/Whisper
      │ - SEE                        │ - BLIP2
      ▼                              ▼
┌─────────────┐                ┌─────────────┐
│   config    │                │   outputs   │
│   .csv      │                │  (files)    │
└─────────────┘                └─────────────┘

Troubleshooting

"BotModels is not enabled"

Set botmodels-enabled to true in your config.csv (the row botmodels-enabled,true).

Connection refused

  1. Ensure botmodels service is running
  2. Check host/port configuration
  3. Verify firewall settings
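
A quick way to work through the checks above is to test whether a plain TCP connection to the configured host and port succeeds at all. This is a generic network probe, not a botserver feature, and `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, `can_connect("127.0.0.1", 8085)` returning False while botmodels is running on the same machine suggests the service is bound to a different address or port than the one in config.csv.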

Authentication failed

Ensure that the botmodels-api-key value in config.csv matches the API_KEY environment variable set for the botmodels service.
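
Mismatches often come from stray whitespace or a key that was never exported. The sketch below compares the two values after trimming; it only helps when run in the same environment as botmodels, where API_KEY is visible. The helper name `api_keys_match` is hypothetical.

```python
import hmac
import os


def api_keys_match(config_key, env=os.environ):
    """Compare the config.csv key with the API_KEY variable seen by botmodels."""
    expected = env.get("API_KEY", "").strip()
    if not expected:
        return False  # API_KEY is unset or empty, so nothing can match
    # compare_digest avoids leaking key contents through timing differences
    return hmac.compare_digest(config_key.strip().encode(), expected.encode())
```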

Model not found

Verify model paths are correct and models are downloaded to the expected locations.
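
Because the example paths are relative, it helps to see what they resolve to. The sketch below reports configured model paths that do not exist when resolved against a given base directory. Which directory relative paths are actually resolved against is not stated in this guide, so `base_dir` is an assumption you need to supply; `missing_models` is a hypothetical helper name.

```python
from pathlib import Path

MODEL_KEYS = ("image-generator-model", "video-generator-model")


def missing_models(config, base_dir):
    """Return {key: resolved_path} for configured model paths that do not exist."""
    missing = {}
    for key in MODEL_KEYS:
        value = config.get(key)
        if not value:
            continue  # key not configured, nothing to check
        resolved = (Path(base_dir) / value).resolve()
        if not resolved.exists():
            missing[key] = resolved
    return missing
```

An empty result means every configured model path points at an existing file or directory.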

Security Notes

  1. Always use HTTPS in production (set botmodels-https,true in config.csv)
  2. Use strong, unique API keys
  3. Restrict network access to botmodels service
  4. Consider running botmodels on a separate GPU server
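
For the second point, one way to produce a strong key is Python's standard `secrets` module. This is a minimal sketch; the guide does not specify any format requirements for the key, so the URL-safe encoding and the 32-byte length are assumptions, and `generate_api_key` is a hypothetical helper name.

```python
import secrets


def generate_api_key(num_bytes=32):
    """Return a URL-safe random key built from `num_bytes` random bytes."""
    return secrets.token_urlsafe(num_bytes)


print(generate_api_key())
```

Use the generated value both as botmodels-api-key in config.csv and as the API_KEY environment variable for botmodels.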