# BotModels - AI Inference Service

**Version:** 1.0.0
**Purpose:** Multimodal AI inference service for General Bots

---

## Overview

BotModels is a Python-based AI inference service that provides multimodal capabilities to the General Bots platform. It serves as a companion to botserver (Rust), specializing in cutting-edge AI/ML models from the Python ecosystem, including image generation, video creation, speech synthesis, and vision/captioning.

While botserver handles business logic, networking, and systems-level operations, BotModels exists solely to leverage the extensive Python AI/ML ecosystem for inference tasks that are impractical to implement in Rust.

For comprehensive documentation, see **[docs.pragmatismo.com.br](https://docs.pragmatismo.com.br)** or the **[BotBook](../botbook)** for detailed guides, API references, and tutorials.

---

## Features

- **Image Generation**: Generate images from text prompts using Stable Diffusion
- **Video Generation**: Create short videos from text descriptions using Zeroscope
- **Speech Synthesis**: Text-to-speech using Coqui TTS
- **Speech Recognition**: Audio transcription using OpenAI Whisper
- **Vision/Captioning**: Image and video description using BLIP2

---

## Quick Start

### Installation

```bash
# Enter the repository directory
cd botmodels

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
.\venv\Scripts\activate    # Windows

# Install dependencies
pip install -r requirements.txt
```

### Configuration

Copy the example environment file and configure:

```bash
cp .env.example .env
```

Edit `.env` with your settings:

```env
HOST=0.0.0.0
PORT=8085
API_KEY=your-secret-key
DEVICE=cuda
IMAGE_MODEL_PATH=./models/stable-diffusion-v1-5
VIDEO_MODEL_PATH=./models/zeroscope-v2
VISION_MODEL_PATH=./models/blip2
```

### Running the Server

```bash
# Development mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --reload

# Production mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 \
  --workers 4

# With HTTPS (production)
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 \
  --ssl-keyfile key.pem --ssl-certfile cert.pem
```

---

## Philosophy & Scope

### Why Python?

**Rust vs. Python rule:**

- If logic is deterministic, systems-level, or performance-critical: **do it in Rust (botserver)**
- If logic requires cutting-edge ML models, rapid experimentation with HuggingFace, or specific Python-only libraries: **do it here**

### Architecture Principles

- **Inference Only**: This service must NOT hold business state. It accepts inputs, runs inference, and returns predictions.
- **Stateless**: Treated as a sidecar to `botserver`.
- **API First**: Exposes strict HTTP/REST endpoints consumed by `botserver`.

---

## Technology Stack

- **Runtime**: Python 3.10+
- **Web Framework**: FastAPI (preferred over Flask for async support and performance)
- **ML Frameworks**: PyTorch, HuggingFace Transformers, Diffusers
- **Quality**: `ruff` (linting), `black` (formatting), `mypy` (typing)

---

## API Endpoints

All endpoints require the `X-API-Key` header for authentication.
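As a minimal client-side sketch (standard library only; `build_image_request` is an illustrative helper, not part of the service, and the base URL/key come from the example `.env` above), a consumer might assemble an authenticated request like this:

```python
import json
import urllib.request

API_BASE = "http://localhost:8085"  # HOST/PORT from the example .env
API_KEY = "your-secret-key"         # replace with your real API key


def build_image_request(prompt: str, steps: int = 30) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST to /api/image/generate.

    Hypothetical helper for illustration; pass the result to
    urllib.request.urlopen to actually call a running server.
    """
    body = json.dumps(
        {"prompt": prompt, "steps": steps, "width": 512, "height": 512}
    ).encode("utf-8")
    return urllib.request.Request(
        API_BASE + "/api/image/generate",
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": API_KEY},
        method="POST",
    )


req = build_image_request("a cute cat playing with yarn")
# with urllib.request.urlopen(req) as resp:  # uncomment against a running server
#     result = resp.read()
```

In practice `botserver` makes these calls for you via the BASIC keywords described below; the sketch is only meant to show the header and payload shape.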
### Image Generation

```http
POST /api/image/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "a cute cat playing with yarn",
  "steps": 30,
  "width": 512,
  "height": 512,
  "guidance_scale": 7.5,
  "seed": 42
}
```

### Video Generation

```http
POST /api/video/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "a rocket launching into space",
  "num_frames": 24,
  "fps": 8,
  "steps": 50
}
```

### Speech Generation (TTS)

```http
POST /api/speech/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "Hello, welcome to our service!",
  "voice": "default",
  "language": "en"
}
```

### Speech to Text

```http
POST /api/speech/totext
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <audio file>
```

### Image Description

```http
POST /api/vision/describe
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <image file>
prompt: "What is in this image?" (optional)
```

### Video Description

```http
POST /api/vision/describe_video
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <video file>
num_frames: 8 (optional)
```

### Visual Question Answering

```http
POST /api/vision/vqa
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <image file>
question: "How many people are in this image?"
```

### Health Check

```http
GET /api/health
```

Interactive API documentation:

- Swagger UI: `http://localhost:8085/api/docs`
- ReDoc: `http://localhost:8085/api/redoc`

---

## Integration with BotServer

### Configuration (config.csv)

```csv
key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
```

### BASIC Script Keywords

```basic
// Generate an image
file = IMAGE "a beautiful sunset over mountains"
SEND FILE TO user, file

// Generate a video
video = VIDEO "waves crashing on a beach"
SEND FILE TO user, video

// Generate speech
audio = AUDIO "Welcome to General Bots!"
SEND FILE TO user, audio

// Get image/video description
caption = SEE "/path/to/image.jpg"
TALK caption
```

---

## Architecture

```
┌─────────────┐    HTTPS     ┌─────────────┐
│  botserver  │ ───────────▶ │  botmodels  │
│   (Rust)    │              │  (Python)   │
└─────────────┘              └─────────────┘
      │                            │
      │ BASIC Keywords             │ AI Models
      │ - IMAGE                    │ - Stable Diffusion
      │ - VIDEO                    │ - Zeroscope
      │ - AUDIO                    │ - TTS/Whisper
      │ - SEE                      │ - BLIP2
      ▼                            ▼
┌─────────────┐              ┌─────────────┐
│   config    │              │   outputs   │
│    .csv     │              │   (files)   │
└─────────────┘              └─────────────┘
```

---

## Development Guidelines

### Modern Model Usage

- **Deprecate Legacy**: Move away from outdated libraries (e.g., old `allennlp`) in favor of **HuggingFace Transformers** and **Diffusers**
- **Quantization**: Always consider
  quantized models (bitsandbytes, GGUF) to reduce VRAM usage

### Performance & Loading

- **Lazy Loading**: Do NOT load 10GB models at module import time. Load them in the startup lifecycle, or on first request with locking
- **GPU Handling**: Robustly detect CUDA/MPS (Mac) and fall back to CPU gracefully

### Code Quality

- **Type Hints**: All functions MUST have type hints
- **Error Handling**: No bare `except:`. Catch precise exceptions and return structured JSON errors to `botserver`

### Project Structure

```
botmodels/
├── src/
│   ├── api/
│   │   ├── v1/
│   │   │   └── endpoints/
│   │   │       ├── image.py
│   │   │       ├── video.py
│   │   │       ├── speech.py
│   │   │       └── vision.py
│   │   └── dependencies.py
│   ├── core/
│   │   ├── config.py
│   │   └── logging.py
│   ├── schemas/
│   │   └── generation.py
│   ├── services/
│   │   ├── image_service.py
│   │   ├── video_service.py
│   │   ├── speech_service.py
│   │   └── vision_service.py
│   └── main.py
├── outputs/
├── models/
├── tests/
├── requirements.txt
└── README.md
```

---

## Testing

```bash
pytest tests/
```

---

## Security

1. **Always use HTTPS in production**
2. Use strong, unique API keys
3. Restrict network access to the service
4. Consider running on a separate GPU server
5.
   Monitor resource usage and set appropriate limits

---

## Documentation

For complete documentation, guides, and API references:

- **[docs.pragmatismo.com.br](https://docs.pragmatismo.com.br)** - Full online documentation
- **[BotBook](../botbook)** - Local comprehensive guide with tutorials and examples
- **[General Bots Repository](https://github.com/GeneralBots/BotServer)** - Main project repository

---

## Requirements

- Python 3.10+
- CUDA-capable GPU (recommended, 8GB+ VRAM)
- 16GB+ RAM

---

## Resources

### Education

- [Computer Vision Course](https://pjreddie.com/courses/computer-vision/)
- [Adversarial VQA Paper](https://arxiv.org/abs/2106.00245)
- [LLM Visualization](https://bbycroft.net/llm)

### References

- [VizWiz VQA PyTorch](https://github.com/DenisDsh/VizWiz-VQA-PyTorch)
- [Diffusers Library](https://github.com/huggingface/diffusers)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b)

### Community

- [AI for Mankind](https://github.com/aiformankind)
- [ManaAI](https://manaai.cn/)

---

## Remember

- **Inference Only**: No business state, just predictions
- **Modern Models**: Use HuggingFace Transformers, Diffusers
- **Type Safety**: All functions must have type hints
- **Lazy Loading**: Don't load models at import time
- **GPU Detection**: Graceful fallback to CPU
- **Version 1.0.0** - Do not change without approval
- **GIT WORKFLOW** - ALWAYS push to ALL repositories (github, pragmatismo)

---

## License

See the LICENSE file for details.