Rodrigo Rodriguez (Pragmatismo) a21292daa3 Add multimodal module for botmodels integration

Introduces IMAGE, VIDEO, AUDIO, and SEE keywords for BASIC scripts that
connect to the botmodels service for AI-powered media generation and
vision/captioning capabilities.

- Add BotModelsClient for HTTP communication with botmodels service
- Implement BASIC keywords: IMAGE, VIDEO, AUDIO (generation), SEE
  (captioning)
- Support configuration via config.csv for models

2025-11-29 20:40:08 -03:00

6.1 KiB


Multimodal Configuration Guide

This document describes how to configure botserver to use the botmodels service for image, video, and audio generation and for vision/captioning.

Overview

The multimodal feature connects botserver to botmodels, a Python-based service that is similar to llama.cpp but handles multimodal AI tasks. This lets BASIC scripts generate images, videos, and audio and analyze visual content.

Configuration Keys

Add the following configuration to your bot's config.csv file:

Image Generator Settings

Key	Default	Description
`image-generator-model`	-	Path to the image generation model (e.g., `../../../../data/diffusion/sd_turbo_f16.gguf`)
`image-generator-steps`	`4`	Number of inference steps for image generation
`image-generator-width`	`512`	Output image width in pixels
`image-generator-height`	`512`	Output image height in pixels
`image-generator-gpu-layers`	`20`	Number of layers to offload to GPU
`image-generator-batch-size`	`1`	Batch size for generation

Video Generator Settings

Key	Default	Description
`video-generator-model`	-	Path to the video generation model (e.g., `../../../../data/diffusion/zeroscope_v2_576w`)
`video-generator-frames`	`24`	Number of frames to generate
`video-generator-fps`	`8`	Frames per second for output video
`video-generator-width`	`320`	Output video width in pixels
`video-generator-height`	`576`	Output video height in pixels
`video-generator-gpu-layers`	`15`	Number of layers to offload to GPU
`video-generator-batch-size`	`1`	Batch size for generation
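
As a quick sanity check on the frame settings above, clip length is the number of frames divided by the frame rate. The sketch below is plain arithmetic in Python; the function name is illustrative and not part of botserver or botmodels.

```python
def clip_duration_seconds(frames: int, fps: int) -> float:
    """Length of the generated clip: frame count divided by playback rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps

# Defaults from the table: 24 frames at 8 fps.
print(clip_duration_seconds(24, 8))  # 3.0
```

With the defaults, each generated video is therefore about three seconds long; raising `video-generator-frames` lengthens the clip, while raising `video-generator-fps` shortens it.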

BotModels Service Settings

Key	Default	Description
`botmodels-enabled`	`false`	Enable/disable botmodels integration
`botmodels-host`	`0.0.0.0`	Host address for botmodels service
`botmodels-port`	`8085`	Port for botmodels service
`botmodels-api-key`	-	API key for authentication with botmodels
`botmodels-https`	`false`	Use HTTPS for connection to botmodels

Example config.csv

key,value
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
image-generator-gpu-layers,20
image-generator-batch-size,1
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
video-generator-width,320
video-generator-height,576
video-generator-gpu-layers,15
video-generator-batch-size,1
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
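
To illustrate how the service settings combine into a connection URL, here is a minimal Python sketch that reads key,value rows and applies the defaults from the table above. It is not how botserver itself parses config.csv; the function names and the header-row handling are assumptions made for the example.

```python
import csv
import io

# Defaults copied from the BotModels Service Settings table.
DEFAULTS = {
    "botmodels-enabled": "false",
    "botmodels-host": "0.0.0.0",
    "botmodels-port": "8085",
    "botmodels-https": "false",
}

def load_config(text: str) -> dict:
    """Parse key,value rows into a dict, skipping the header row and blank lines."""
    settings = dict(DEFAULTS)
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2 or row[0].strip() == "key":
            continue
        settings[row[0].strip()] = row[1].strip()
    return settings

def base_url(settings: dict) -> str:
    """Build the service URL from the host, port, and HTTPS settings."""
    scheme = "https" if settings["botmodels-https"].lower() == "true" else "http"
    return f"{scheme}://{settings['botmodels-host']}:{settings['botmodels-port']}"

sample = """key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-https,false
"""
settings = load_config(sample)
print(base_url(settings))  # http://0.0.0.0:8085
```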

BASIC Keywords

Once configured, the following keywords become available in BASIC scripts:

IMAGE

Generate an image from a text prompt.

file = IMAGE "a cute cat playing with yarn"
SEND FILE TO user, file

VIDEO

Generate a video from a text prompt.

file = VIDEO "a rocket launching into space"
SEND FILE TO user, file

AUDIO

Generate speech audio from text.

file = AUDIO "Hello, welcome to our service!"
SEND FILE TO user, file

SEE

Get a caption/description of an image or video file.

caption = SEE "/path/to/image.jpg"
TALK caption

// Also works with video files
description = SEE "/path/to/video.mp4"
TALK description
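
The keywords above presumably map onto the REST endpoints that the botmodels service exposes (listed under API Endpoints). The Python sketch below shows one plausible mapping. The endpoint paths come from this document, but the pairing of keyword to endpoint, the extension list, and the function name are assumptions, not confirmed botserver behavior.

```python
from pathlib import PurePosixPath

# Assumed keyword-to-endpoint pairing; the paths are from the API Endpoints
# table, the pairing itself is a guess.
GENERATION_ENDPOINTS = {
    "IMAGE": "/api/image/generate",
    "VIDEO": "/api/video/generate",
    "AUDIO": "/api/speech/generate",
}

# Hypothetical set of extensions treated as video by SEE.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv"}

def endpoint_for(keyword: str, argument: str) -> str:
    """Return the endpoint a keyword would plausibly call for the given argument."""
    keyword = keyword.upper()
    if keyword == "SEE":
        suffix = PurePosixPath(argument).suffix.lower()
        if suffix in VIDEO_EXTENSIONS:
            return "/api/vision/describe_video"
        return "/api/vision/describe"
    return GENERATION_ENDPOINTS[keyword]

print(endpoint_for("SEE", "/path/to/video.mp4"))
print(endpoint_for("IMAGE", "a cute cat playing with yarn"))
```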

Starting BotModels Service

Before using multimodal features, start the botmodels service:

cd botmodels
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085

Or with HTTPS:

python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem

API Endpoints (BotModels)

The botmodels service exposes these REST endpoints:

Endpoint	Method	Description
`/api/image/generate`	POST	Generate image from prompt
`/api/video/generate`	POST	Generate video from prompt
`/api/speech/generate`	POST	Generate speech from text
`/api/speech/totext`	POST	Convert audio to text
`/api/vision/describe`	POST	Get description of an image
`/api/vision/describe_video`	POST	Get description of a video
`/api/vision/vqa`	POST	Visual question answering
`/api/health`	GET	Health check

All endpoints require the X-API-Key header for authentication.
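
A minimal Python sketch of an authenticated call, using only the standard library. The endpoint paths and the X-API-Key header come from this document; the `localhost` address, the `build_request` helper, and the `prompt` field in the JSON body are assumptions, since the request schema is not documented here.

```python
import json
import urllib.request

def build_request(base_url, path, api_key, payload=None):
    """Create a Request carrying X-API-Key; POST with a JSON body when payload is given."""
    headers = {"X-API-Key": api_key}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(base_url + path, data=data, headers=headers, method=method)

health = build_request("http://localhost:8085", "/api/health", "your-secret-key")
# The "prompt" field is a hypothetical body shape, not a documented schema.
image = build_request("http://localhost:8085", "/api/image/generate",
                      "your-secret-key", {"prompt": "a cute cat playing with yarn"})

print(health.get_method(), health.full_url)
print(image.get_method(), image.full_url)

# To send a request (requires a running service):
# with urllib.request.urlopen(health, timeout=5) as resp:
#     print(resp.status)
```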

Architecture

┌─────────────┐    HTTP(S)     ┌─────────────┐
│  botserver  │ ────────────▶  │  botmodels  │
│   (Rust)    │                │  (Python)   │
└─────────────┘                └─────────────┘
      │                              │
      │ BASIC Keywords               │ AI Models
      │ - IMAGE                      │ - Stable Diffusion
      │ - VIDEO                      │ - Zeroscope
      │ - AUDIO                      │ - TTS/Whisper
      │ - SEE                        │ - BLIP2
      ▼                              ▼
┌─────────────┐                ┌─────────────┐
│   config    │                │   outputs   │
│   .csv      │                │  (files)    │
└─────────────┘                └─────────────┘

Troubleshooting

"BotModels is not enabled"

Set botmodels-enabled to true in your config.csv (the row botmodels-enabled,true).

Connection refused

- Ensure the botmodels service is running.
- Check the host/port configuration.
- Verify firewall settings.
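
The checks above can be narrowed down with a plain TCP probe before involving botserver at all. This is a generic Python sketch, not a botmodels tool; 8085 is the default port from the settings table, and 127.0.0.1 assumes the service runs on the same machine.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Adjust host and port to match botmodels-host and botmodels-port.
if can_connect("127.0.0.1", 8085):
    print("botmodels port is reachable")
else:
    print("nothing is listening, or a firewall is blocking the port")
```

If the probe succeeds locally but fails from the botserver machine, the cause is more likely the bind address or a firewall than the service itself.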

Authentication failed

Ensure the botmodels-api-key value in config.csv matches the API_KEY environment variable set for the botmodels service.

Model not found

Verify model paths are correct and models are downloaded to the expected locations.
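
A short Python sketch for checking the configured model paths. The two keys come from the settings tables; which directory relative paths are resolved against is not stated in this document, so the `base_dir` parameter is an assumption you will need to adapt.

```python
import csv
import io
from pathlib import Path

MODEL_KEYS = ("image-generator-model", "video-generator-model")

def missing_models(config_text: str, base_dir: Path) -> list:
    """Return (key, resolved_path) pairs whose model path does not exist."""
    missing = []
    for row in csv.reader(io.StringIO(config_text)):
        if len(row) >= 2 and row[0].strip() in MODEL_KEYS:
            # Assumption: relative paths are resolved against base_dir.
            path = (base_dir / row[1].strip()).resolve()
            if not path.exists():
                missing.append((row[0].strip(), path))
    return missing

config_file = Path("config.csv")
config_text = config_file.read_text() if config_file.exists() else ""
for key, path in missing_models(config_text, Path.cwd()):
    print(f"{key}: {path} not found")
```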

Security Notes

- Always use HTTPS in production (set botmodels-https to true).
- Use strong, unique API keys.
- Restrict network access to the botmodels service.
- Consider running botmodels on a separate GPU server.
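
One way to produce a strong key is Python's standard `secrets` module. This is a generic sketch rather than a botmodels utility; the printed line is in config.csv form, and the same value would also need to be set as the API_KEY environment variable for botmodels.

```python
import secrets

# 32 random bytes, URL-safe base64 encoded (43 characters).
api_key = secrets.token_urlsafe(32)
print(f"botmodels-api-key,{api_key}")
```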
