# BotModels - AI Inference Service

**Version:** 1.0.0
**Purpose:** Multimodal AI inference service for General Bots

---

## Overview

BotModels is a Python-based AI inference service that provides multimodal capabilities to the General Bots platform. It serves as a companion to botserver (Rust), specializing in cutting-edge AI/ML models from the Python ecosystem including image generation, video creation, speech synthesis, and vision/captioning.

While botserver handles business logic, networking, and systems-level operations, BotModels exists solely to leverage the extensive Python AI/ML ecosystem for inference tasks that are impractical to implement in Rust.

For comprehensive documentation, see **[docs.pragmatismo.com.br](https://docs.pragmatismo.com.br)** or the **[BotBook](../botbook)** for detailed guides, API references, and tutorials.

---

## Features

- **Image Generation**: Generate images from text prompts using Stable Diffusion
- **Video Generation**: Create short videos from text descriptions using Zeroscope
- **Speech Synthesis**: Text-to-speech using Coqui TTS
- **Speech Recognition**: Audio transcription using OpenAI Whisper
- **Vision/Captioning**: Image and video description using BLIP2

---

## Quick Start

### Installation

```bash
# From your clone of the repository
cd botmodels

# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or
.\venv\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt
```

### Configuration

Copy the example environment file and configure:

```bash
cp .env.example .env
```

Edit `.env` with your settings:

```env
HOST=0.0.0.0
PORT=8085
API_KEY=your-secret-key
DEVICE=cuda
IMAGE_MODEL_PATH=./models/stable-diffusion-v1-5
VIDEO_MODEL_PATH=./models/zeroscope-v2
VISION_MODEL_PATH=./models/blip2
```
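
As an illustration of how these variables might be consumed, the sketch below reads them from a mapping (the process environment by default) and falls back to the values shown above. The `Settings` dataclass and `load_settings` helper are hypothetical names, not part of the BotModels codebase; the real service may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in .env."""
    host: str
    port: int
    api_key: str
    device: str
    image_model_path: str
    video_model_path: str
    vision_model_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, falling back to the example values."""
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8085")),
        api_key=env.get("API_KEY", ""),
        device=env.get("DEVICE", "cuda"),
        image_model_path=env.get("IMAGE_MODEL_PATH", "./models/stable-diffusion-v1-5"),
        video_model_path=env.get("VIDEO_MODEL_PATH", "./models/zeroscope-v2"),
        vision_model_path=env.get("VISION_MODEL_PATH", "./models/blip2"),
    )


if __name__ == "__main__":
    # Override two values and keep the rest at their defaults.
    print(load_settings({"PORT": "9000", "DEVICE": "cpu"}))
```
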

### Running the Server

```bash
# Development mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --reload

# Production mode
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --workers 4

# With HTTPS (production)
python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem
```

---

## 🐍 Philosophy & Scope

### Why Python?

- **Rust vs. Python Rule**:
  - If logic is deterministic, systems-level, or performance-critical: **Do it in Rust (botserver)**
  - If logic requires cutting-edge ML models, rapid experimentation with HuggingFace, or specific Python-only libraries: **Do it here**

### Architecture Principles

- **Inference Only**: This service should NOT hold business state. It accepts inputs, runs inference, and returns predictions.
- **Stateless**: Treated as a sidecar to `botserver`.
- **API First**: Exposes strict HTTP/REST endpoints consumed by `botserver`.

---

## 🛠 Technology Stack

- **Runtime**: Python 3.10+
- **Web Framework**: FastAPI (preferred over Flask for async/performance)
- **ML Frameworks**: PyTorch, HuggingFace Transformers, Diffusers
- **Quality**: `ruff` (linting), `black` (formatting), `mypy` (typing)

---

## 📡 API Endpoints

All endpoints require the `X-API-Key` header for authentication.
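
A server-side check of this header could look like the sketch below. `is_authorized` is a hypothetical helper, and the constant-time comparison via `hmac.compare_digest` is a suggested practice rather than a description of how BotModels actually validates keys.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Return True only if the X-API-Key header matches the configured key."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    provided: Optional[str] = lowered.get("x-api-key")
    if not provided or not expected_key:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(provided.encode(), expected_key.encode())
```
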

### Image Generation

```http
POST /api/image/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "a cute cat playing with yarn",
  "steps": 30,
  "width": 512,
  "height": 512,
  "guidance_scale": 7.5,
  "seed": 42
}
```
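
From a Python client, the same request might be assembled as in the sketch below, using only the standard library. `BASE_URL` and `build_image_request` are illustrative names, the base URL assumes the local defaults from `.env`, and the response format is not specified here, so the actual send is left commented out.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8085"  # assumption: local default host and port


def build_image_request(prompt: str, api_key: str, **options: object) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/image/generate."""
    payload = {"prompt": prompt, **options}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/image/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


request = build_image_request(
    "a cute cat playing with yarn",
    api_key="your-api-key",
    steps=30, width=512, height=512, guidance_scale=7.5, seed=42,
)
# Sending requires a running server:
# with urllib.request.urlopen(request) as response:
#     body = response.read()
```
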

### Video Generation

```http
POST /api/video/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "a rocket launching into space",
  "num_frames": 24,
  "fps": 8,
  "steps": 50
}
```

### Speech Generation (TTS)

```http
POST /api/speech/generate
Content-Type: application/json
X-API-Key: your-api-key

{
  "prompt": "Hello, welcome to our service!",
  "voice": "default",
  "language": "en"
}
```

### Speech to Text

```http
POST /api/speech/totext
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <audio_file>
```
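
The upload endpoints take `multipart/form-data` rather than JSON. The sketch below shows one way to encode a single file field with the standard library; `encode_multipart_file` is a hypothetical helper, the field name `file` comes from the listing above, and the filename, bytes, and audio type are placeholders.

```python
import uuid
from typing import Tuple


def encode_multipart_file(field: str, filename: str, content: bytes,
                          content_type: str = "application/octet-stream") -> Tuple[bytes, str]:
    """Encode one file field; return (body, value for the Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


# Placeholder bytes stand in for a real audio file.
body, content_type_header = encode_multipart_file("file", "sample.wav", b"RIFF....", "audio/wav")
```

The returned header value carries the generated boundary, so it must be sent as the request's `Content-Type` together with the `X-API-Key` header.
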

### Image Description

```http
POST /api/vision/describe
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <image_file>
prompt: "What is in this image?" (optional)
```

### Video Description

```http
POST /api/vision/describe_video
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <video_file>
num_frames: 8 (optional)
```

### Visual Question Answering

```http
POST /api/vision/vqa
Content-Type: multipart/form-data
X-API-Key: your-api-key

file: <image_file>
question: "How many people are in this image?"
```

### Health Check

```http
GET /api/health
```

Interactive API documentation:
- Swagger UI: `http://localhost:8085/api/docs`
- ReDoc: `http://localhost:8085/api/redoc`

---

## 🔗 Integration with BotServer

### Configuration (config.csv)

```csv
key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-api-key,your-secret-key
botmodels-https,false
image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
image-generator-steps,4
image-generator-width,512
image-generator-height,512
video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
video-generator-frames,24
video-generator-fps,8
```
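
To make the shape of this file concrete, the sketch below parses a few of the rows and derives a service URL from the `botmodels-*` keys. It is written in Python for illustration only: botserver itself is written in Rust and its actual parsing is not shown here, and `parse_config` and `botmodels_base_url` are hypothetical helpers.

```python
import csv
import io
from typing import Dict

SAMPLE = """key,value
botmodels-enabled,true
botmodels-host,0.0.0.0
botmodels-port,8085
botmodels-https,false
"""


def parse_config(text: str) -> Dict[str, str]:
    """Read key,value rows into a dictionary."""
    return {row["key"]: row["value"] for row in csv.DictReader(io.StringIO(text))}


def botmodels_base_url(config: Dict[str, str]) -> str:
    """Derive the service URL from the botmodels-* keys."""
    scheme = "https" if config.get("botmodels-https", "false").lower() == "true" else "http"
    return f"{scheme}://{config['botmodels-host']}:{config['botmodels-port']}"


config = parse_config(SAMPLE)
print(botmodels_base_url(config))  # http://0.0.0.0:8085
```
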

### BASIC Script Keywords

```basic
// Generate an image
file = IMAGE "a beautiful sunset over mountains"
SEND FILE TO user, file

// Generate a video
video = VIDEO "waves crashing on a beach"
SEND FILE TO user, video

// Generate speech
audio = AUDIO "Welcome to General Bots!"
SEND FILE TO user, audio

// Get image/video description
caption = SEE "/path/to/image.jpg"
TALK caption
```

---

## 🏗️ Architecture

```
┌─────────────┐     HTTPS     ┌─────────────┐
│  botserver  │ ────────────▶ │  botmodels  │
│   (Rust)    │               │  (Python)   │
└─────────────┘               └─────────────┘
       │                             │
       │ BASIC Keywords              │ AI Models
       │ - IMAGE                     │ - Stable Diffusion
       │ - VIDEO                     │ - Zeroscope
       │ - AUDIO                     │ - TTS/Whisper
       │ - SEE                       │ - BLIP2
       ▼                             ▼
┌─────────────┐               ┌─────────────┐
│   config    │               │   outputs   │
│    .csv     │               │   (files)   │
└─────────────┘               └─────────────┘
```

---

## ⚡️ Development Guidelines

### Modern Model Usage

- **Deprecate Legacy**: Move away from outdated libs (e.g., old `allennlp`) in favor of **HuggingFace Transformers** and **Diffusers**
- **Quantization**: Always consider quantized models (bitsandbytes, GGUF) to reduce VRAM usage

### Performance & Loading

- **Lazy Loading**: Do NOT load 10GB models at module import time. Load them during the startup lifecycle or on the first request, guarded by a lock
- **GPU Handling**: Robustly detect CUDA/MPS (Mac) and fall back to CPU gracefully
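
The two loading rules above can be sketched as follows. `LazyModel` and `detect_device` are hypothetical helpers rather than existing BotModels code: the first defers an expensive load until first use with double-checked locking, and the second assumes PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()` checks, returning `"cpu"` when neither backend (or PyTorch itself) is available.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Defer an expensive load until first use; safe under concurrent requests."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._model: Optional[T] = None

    def get(self) -> T:
        # Fast path: already loaded, no locking needed.
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so only one thread runs the loader.
                if self._model is None:
                    self._model = self._loader()
        return self._model


def detect_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU; also CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

A service module would create one `LazyModel` per pipeline at import time (cheap, since nothing is loaded yet) and call `.get()` inside the request handler or startup hook.
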

### Code Quality

- **Type Hints**: All functions MUST have type hints
- **Error Handling**: No bare `except:`. Catch precise exceptions and return structured JSON errors to `botserver`
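
A minimal sketch of both rules, fully type-hinted and catching only specific exceptions, is shown below. The error envelope (`success`, `error.code`, `error.message`) and the helper names are assumptions made for illustration; the schema that `botserver` actually expects is not documented here.

```python
import json
from typing import Any, Callable, Dict


def error_payload(code: str, message: str) -> Dict[str, Any]:
    """Hypothetical error envelope; the real schema may differ."""
    return {"success": False, "error": {"code": code, "message": message}}


def run_inference(task: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run a task and translate specific failures into structured errors."""
    try:
        return {"success": True, "result": task()}
    except ValueError as exc:          # bad input from the caller
        return error_payload("invalid_input", str(exc))
    except FileNotFoundError as exc:   # e.g. a missing model directory
        return error_payload("model_not_found", str(exc))
    except MemoryError:                # ran out of host memory
        return error_payload("out_of_memory", "not enough memory for inference")


def failing_task() -> Dict[str, Any]:
    raise ValueError("prompt must not be empty")


print(json.dumps(run_inference(failing_task)))
```
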

### Project Structure

```
botmodels/
├── src/
│   ├── api/
│   │   ├── v1/
│   │   │   └── endpoints/
│   │   │       ├── image.py
│   │   │       ├── video.py
│   │   │       ├── speech.py
│   │   │       └── vision.py
│   │   └── dependencies.py
│   ├── core/
│   │   ├── config.py
│   │   └── logging.py
│   ├── schemas/
│   │   └── generation.py
│   ├── services/
│   │   ├── image_service.py
│   │   ├── video_service.py
│   │   ├── speech_service.py
│   │   └── vision_service.py
│   └── main.py
├── outputs/
├── models/
├── tests/
├── requirements.txt
└── README.md
```

---

## 🧪 Testing

```bash
pytest tests/
```

---

## 🔒 Security

1. **Always use HTTPS in production**
2. Use strong, unique API keys
3. Restrict network access to the service
4. Consider running on a separate GPU server
5. Monitor resource usage and set appropriate limits

---

## 📚 Documentation

For complete documentation, guides, and API references:

- **[docs.pragmatismo.com.br](https://docs.pragmatismo.com.br)** - Full online documentation
- **[BotBook](../botbook)** - Local comprehensive guide with tutorials and examples
- **[General Bots Repository](https://github.com/GeneralBots/BotServer)** - Main project repository

---

## 📦 Requirements

- Python 3.10+
- CUDA-capable GPU (recommended, 8GB+ VRAM)
- 16GB+ RAM

---

## 🔗 Resources

### Education

- [Computer Vision Course](https://pjreddie.com/courses/computer-vision/)
- [Adversarial VQA Paper](https://arxiv.org/abs/2106.00245)
- [LLM Visualization](https://bbycroft.net/llm)

### References

- [VizWiz VQA PyTorch](https://github.com/DenisDsh/VizWiz-VQA-PyTorch)
- [Diffusers Library](https://github.com/huggingface/diffusers)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b)

### Community

- [AI for Mankind](https://github.com/aiformankind)
- [ManaAI](https://manaai.cn/)

---

## 🔑 Remember

- **Inference Only**: No business state, just predictions
- **Modern Models**: Use HuggingFace Transformers, Diffusers
- **Type Safety**: All functions must have type hints
- **Lazy Loading**: Don't load models at import time
- **GPU Detection**: Graceful fallback to CPU
- **Version 1.0.0** - Do not change without approval
- **GIT WORKFLOW** - ALWAYS push to ALL repositories (github, pragmatismo)

---

## 📄 License

See LICENSE file for details.