Rewrite BotModels as FastAPI multimodal AI service
Replace the Azure Functions architecture with a modern FastAPI-based REST API providing image, video, speech, and vision capabilities for General Bots. Key changes:

- Add FastAPI app with versioned API endpoints and OpenAPI docs
- Implement services for Stable Diffusion, Zeroscope, TTS/Whisper, BLIP2
- Add pydantic schemas for request/response validation
- Configure structured logging with structlog
- Support lazy model loading and GPU acceleration
- Update dependencies from the Azure/TensorFlow stack to PyTorch/diffusers
This commit is contained in: parent 7cb64ca0c4, commit 5a43dc81c7
24 changed files with 1577 additions and 39 deletions
38  .env.example  Normal file
@@ -0,0 +1,38 @@
# Server Configuration
ENV=development
HOST=0.0.0.0
PORT=8085
LOG_LEVEL=INFO

# Security - IMPORTANT: Change this in production!
API_KEY=change-me-in-production

# Model Paths
# These can be local paths or model identifiers for HuggingFace Hub
IMAGE_MODEL_PATH=./models/stable-diffusion-v1-5
VIDEO_MODEL_PATH=./models/zeroscope-v2
SPEECH_MODEL_PATH=./models/tts
VISION_MODEL_PATH=./models/blip2
WHISPER_MODEL_PATH=./models/whisper

# Device Configuration
# Options: cuda, cpu, mps (for Apple Silicon)
DEVICE=cuda

# Image Generation Defaults
IMAGE_STEPS=4
IMAGE_WIDTH=512
IMAGE_HEIGHT=512
IMAGE_GPU_LAYERS=20
IMAGE_BATCH_SIZE=1

# Video Generation Defaults
VIDEO_FRAMES=24
VIDEO_FPS=8
VIDEO_WIDTH=320
VIDEO_HEIGHT=576
VIDEO_GPU_LAYERS=15
VIDEO_BATCH_SIZE=1

# Storage
OUTPUT_DIR=./outputs
341  README.md
@@ -1,41 +1,326 @@
 # BotModels

-Models in Python for General Bots AI demands.
-
-# Environment
-
-1. Install Visual Studio Code (VSCode);
-2. Install VSCode Extension: Azure Functions;
-3. Install VSCode Extension: Azure Machine Learning;
-4. Install NodeJS;
-5. Run npm install -g azure-functions-core-tools@3 --unsafe-perm true.
-
-# Libraries
-
-- TensorFlow;
-- SciKit-Learn;
-- Pandas;
-- NumPy.
+A multimodal AI service for General Bots providing image, video, audio generation, and vision/captioning capabilities. Works as a companion service to botserver, similar to how llama.cpp provides LLM capabilities.

 

-# Tools
-
-1. LLM Visualization https://bbycroft.net/llm
-2.
-
-# Education
-
-1. https://pjreddie.com/courses/computer-vision/
-2. https://arxiv.org/abs/2106.00245 (Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models)
-
-# References
-
-1. https://github.com/DenisDsh/VizWiz-VQA-PyTorch (VQA, Visual Question Answering)
-
-# Community
-
-1. https://github.com/aiformankind
-
-# Resources
-
-1. https://manaai.cn/
+## Features
+
+- **Image Generation**: Generate images from text prompts using Stable Diffusion
+- **Video Generation**: Create short videos from text descriptions using Zeroscope
+- **Speech Synthesis**: Text-to-speech using Coqui TTS
+- **Speech Recognition**: Audio transcription using OpenAI Whisper
+- **Vision/Captioning**: Image and video description using BLIP2
+
+## Quick Start
+
+### Installation
+
+```bash
+# Clone the repository
+cd botmodels
+
+# Create virtual environment
+python -m venv venv
+source venv/bin/activate  # Linux/Mac
+# or
+.\venv\Scripts\activate  # Windows
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### Configuration
+
+Copy the example environment file and configure:
+
+```bash
+cp .env.example .env
+```
+
+Edit `.env` with your settings:
+
+```env
+HOST=0.0.0.0
+PORT=8085
+API_KEY=your-secret-key
+DEVICE=cuda
+IMAGE_MODEL_PATH=./models/stable-diffusion-v1-5
+VIDEO_MODEL_PATH=./models/zeroscope-v2
+VISION_MODEL_PATH=./models/blip2
+```
+
+### Running the Server
+
+```bash
+# Development mode
+python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --reload
+
+# Production mode
+python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --workers 4
+
+# With HTTPS (production)
+python -m uvicorn src.main:app --host 0.0.0.0 --port 8085 --ssl-keyfile key.pem --ssl-certfile cert.pem
+```
+
+## API Endpoints
+
+All endpoints require the `X-API-Key` header for authentication.
+
+### Image Generation
+
+```http
+POST /api/image/generate
+Content-Type: application/json
+X-API-Key: your-api-key
+
+{
+  "prompt": "a cute cat playing with yarn",
+  "steps": 30,
+  "width": 512,
+  "height": 512,
+  "guidance_scale": 7.5,
+  "seed": 42
+}
+```
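For reference, the same request from Python might look like this (a sketch using the `httpx` dependency pinned in requirements.txt; the localhost URL and API key are placeholders):

```python
import httpx

# Sketch: POST the generation request; the body mirrors GenerationResponse.
response = httpx.post(
    "http://localhost:8085/api/image/generate",
    headers={"X-API-Key": "your-api-key"},
    json={"prompt": "a cute cat playing with yarn", "steps": 30, "seed": 42},
    timeout=300.0,  # diffusion can take minutes, especially on CPU
)
response.raise_for_status()
print(response.json())  # e.g. {"status": "completed", "file_path": "/outputs/images/...", ...}
```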
+### Video Generation
+
+```http
+POST /api/video/generate
+Content-Type: application/json
+X-API-Key: your-api-key
+
+{
+  "prompt": "a rocket launching into space",
+  "num_frames": 24,
+  "fps": 8,
+  "steps": 50
+}
+```
+
+### Speech Generation (TTS)
+
+```http
+POST /api/speech/generate
+Content-Type: application/json
+X-API-Key: your-api-key
+
+{
+  "prompt": "Hello, welcome to our service!",
+  "voice": "default",
+  "language": "en"
+}
+```
+
+### Speech to Text
+
+```http
+POST /api/speech/totext
+Content-Type: multipart/form-data
+X-API-Key: your-api-key
+
+file: <audio_file>
+```
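Uploads use multipart form data; from Python (again a sketch with `httpx`; `sample.wav` is a placeholder path):

```python
import httpx

# Sketch: multipart upload of a local audio file for transcription.
with open("sample.wav", "rb") as f:
    response = httpx.post(
        "http://localhost:8085/api/speech/totext",
        headers={"X-API-Key": "your-api-key"},
        files={"file": ("sample.wav", f, "audio/wav")},
        timeout=120.0,
    )
print(response.json())  # e.g. {"text": "...", "language": "en", "confidence": 0.95}
```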
+### Image Description
+
+```http
+POST /api/vision/describe
+Content-Type: multipart/form-data
+X-API-Key: your-api-key
+
+file: <image_file>
+prompt: "What is in this image?" (optional)
+```
+
+### Video Description
+
+```http
+POST /api/vision/describe_video
+Content-Type: multipart/form-data
+X-API-Key: your-api-key
+
+file: <video_file>
+num_frames: 8 (optional)
+```
+
+### Visual Question Answering
+
+```http
+POST /api/vision/vqa
+Content-Type: multipart/form-data
+X-API-Key: your-api-key
+
+file: <image_file>
+question: "How many people are in this image?"
+```
+
+### Health Check
+
+```http
+GET /api/health
+```
+
+## Integration with BotServer
+
+BotModels integrates with botserver through HTTPS, providing multimodal capabilities to BASIC scripts.
+
+### BotServer Configuration (config.csv)
+
+```csv
+key,value
+botmodels-enabled,true
+botmodels-host,0.0.0.0
+botmodels-port,8085
+botmodels-api-key,your-secret-key
+botmodels-https,false
+image-generator-model,../../../../data/diffusion/sd_turbo_f16.gguf
+image-generator-steps,4
+image-generator-width,512
+image-generator-height,512
+video-generator-model,../../../../data/diffusion/zeroscope_v2_576w
+video-generator-frames,24
+video-generator-fps,8
+```
+
+### BASIC Script Keywords
+
+Once configured, these keywords are available in BASIC:
+
+```basic
+// Generate an image
+file = IMAGE "a beautiful sunset over mountains"
+SEND FILE TO user, file
+
+// Generate a video
+video = VIDEO "waves crashing on a beach"
+SEND FILE TO user, video
+
+// Generate speech
+audio = AUDIO "Welcome to General Bots!"
+SEND FILE TO user, audio
+
+// Get image/video description
+caption = SEE "/path/to/image.jpg"
+TALK caption
+```
+
+## Architecture
+
+```
+┌─────────────┐     HTTPS      ┌─────────────┐
+│  botserver  │ ────────────▶  │  botmodels  │
+│   (Rust)    │                │  (Python)   │
+└─────────────┘                └─────────────┘
+       │                              │
+       │ BASIC Keywords               │ AI Models
+       │ - IMAGE                      │ - Stable Diffusion
+       │ - VIDEO                      │ - Zeroscope
+       │ - AUDIO                      │ - TTS/Whisper
+       │ - SEE                        │ - BLIP2
+       ▼                              ▼
+┌─────────────┐                ┌─────────────┐
+│   config    │                │   outputs   │
+│    .csv     │                │   (files)   │
+└─────────────┘                └─────────────┘
+```
+
+## Model Downloads
+
+Models are downloaded automatically on first use, or you can pre-download them:
+
+```bash
+# Stable Diffusion
+python -c "from diffusers import StableDiffusionPipeline; StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5')"
+
+# BLIP2 (Vision)
+python -c "from transformers import Blip2Processor, Blip2ForConditionalGeneration; Blip2Processor.from_pretrained('Salesforce/blip2-opt-2.7b'); Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-opt-2.7b')"
+
+# Whisper (Speech-to-Text)
+python -c "import whisper; whisper.load_model('base')"
+```
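To place weights under the configured `*_MODEL_PATH` directories ahead of time, one option (a sketch; assumes `huggingface_hub`, which is installed as a dependency of diffusers/transformers) is a direct snapshot download:

```python
from huggingface_hub import snapshot_download

# Mirror the repo into the directory IMAGE_MODEL_PATH points at,
# so nothing needs downloading on first use.
snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir="./models/stable-diffusion-v1-5",
)
```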
+## API Documentation
+
+Interactive API documentation is available at:
+
+- Swagger UI: `http://localhost:8085/api/docs`
+- ReDoc: `http://localhost:8085/api/redoc`
+
+## Development
+
+### Project Structure
+
+```
+botmodels/
+├── src/
+│   ├── api/
+│   │   ├── v1/
+│   │   │   └── endpoints/
+│   │   │       ├── image.py
+│   │   │       ├── video.py
+│   │   │       ├── speech.py
+│   │   │       └── vision.py
+│   │   └── dependencies.py
+│   ├── core/
+│   │   ├── config.py
+│   │   └── logging.py
+│   ├── schemas/
+│   │   └── generation.py
+│   ├── services/
+│   │   ├── image_service.py
+│   │   ├── video_service.py
+│   │   ├── speech_service.py
+│   │   └── vision_service.py
+│   └── main.py
+├── outputs/
+├── models/
+├── tests/
+├── requirements.txt
+└── README.md
+```
+
+### Running Tests
+
+```bash
+pytest tests/
+```
+
+## Security Notes
+
+1. **Always use HTTPS in production**
+2. Use strong, unique API keys
+3. Restrict network access to the service
+4. Consider running on a separate GPU server
+5. Monitor resource usage and set appropriate limits
+
+## Requirements
+
+- Python 3.10+
+- CUDA-capable GPU (recommended, 8GB+ VRAM)
+- 16GB+ RAM
+
+## Resources
+
+### Education
+
+- [Computer Vision Course](https://pjreddie.com/courses/computer-vision/)
+- [Adversarial VQA Paper](https://arxiv.org/abs/2106.00245)
+- [LLM Visualization](https://bbycroft.net/llm)
+
+### References
+
+- [VizWiz VQA PyTorch](https://github.com/DenisDsh/VizWiz-VQA-PyTorch)
+- [Diffusers Library](https://github.com/huggingface/diffusers)
+- [OpenAI Whisper](https://github.com/openai/whisper)
+- [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b)
+
+### Community
+
+- [AI for Mankind](https://github.com/aiformankind)
+- [ManaAI](https://manaai.cn/)
+
+## License
+
+See LICENSE file for details.
requirements.txt

@@ -1,11 +1,44 @@
-azure-functions
-azure-storage-blob
-azure-identity
-tensorflow
-scikit-learn
-pandas
-numpy
-allennlp
-allennlp-models
-nltk
-Flask>=1.0,<=1.1.2
+# Core Framework
+fastapi==0.115.0
+uvicorn[standard]==0.30.6
+pydantic==2.9.0
+pydantic-settings==2.5.2
+
+# Logging
+structlog==25.5.0
+python-json-logger==2.0.7
+
+# Generation Libraries
+diffusers==0.30.3
+torch==2.5.1
+torchaudio==2.5.1
+torchvision==0.20.1
+transformers==4.46.0
+accelerate==1.1.1
+safetensors==0.4.5
+Pillow==11.0.0
+
+# Audio Generation & Processing
+openai-whisper==20231117
+TTS==0.22.0
+scipy==1.14.1
+
+# Video Processing
+imageio==2.36.0
+imageio-ffmpeg==0.5.1
+opencv-python==4.10.0.84
+
+# Vision & Multimodal
+timm==1.0.12
+
+# HTTP & API
+httpx==0.27.2
+aiofiles==24.1.0
+python-multipart==0.0.12
+
+# Monitoring
+prometheus-client==0.21.0
+
+# Utils
+python-dotenv==1.0.1
+typing-extensions==4.12.2
0  src/__init__.py  Normal file
0  src/api/__init__.py  Normal file
7  src/api/dependencies.py  Normal file
@@ -0,0 +1,7 @@
from fastapi import Header, HTTPException

from ..core.config import settings


async def verify_api_key(x_api_key: str = Header(...)):
    if x_api_key != settings.api_key:
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key
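A possible hardening for this check (an assumption on my part, not part of this commit): compare with `secrets.compare_digest` so the comparison time does not leak how much of the key matched:

```python
import secrets

from fastapi import Header, HTTPException

from ..core.config import settings


async def verify_api_key(x_api_key: str = Header(...)):
    # compare_digest runs in constant time for equal-length ASCII strings,
    # blunting timing side channels against the API key comparison.
    if not secrets.compare_digest(x_api_key, settings.api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return x_api_key
```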
0  src/api/v1/__init__.py  Normal file
3  src/api/v1/endpoints/__init__.py  Normal file
@@ -0,0 +1,3 @@
from . import image, speech, video, vision

__all__ = ["image", "video", "speech", "vision"]
64  src/api/v1/endpoints/image.py  Normal file
@@ -0,0 +1,64 @@
from fastapi import APIRouter, Depends, File, UploadFile

from ....schemas.generation import (
    GenerationResponse,
    ImageDescribeResponse,
    ImageGenerateRequest,
)
from ....services.image_service import get_image_service
from ...dependencies import verify_api_key

router = APIRouter(prefix="/image", tags=["Image"])


@router.post("/generate", response_model=GenerationResponse)
async def generate_image(
    request: ImageGenerateRequest,
    api_key: str = Depends(verify_api_key),
    service=Depends(get_image_service),
):
    """
    Generate an image from a text prompt.

    Args:
        request: Image generation parameters including prompt, steps, dimensions, etc.
        api_key: API key for authentication
        service: Image service instance

    Returns:
        GenerationResponse with file path and generation time
    """
    result = await service.generate(
        prompt=request.prompt,
        steps=request.steps,
        width=request.width,
        height=request.height,
        guidance_scale=request.guidance_scale,
        seed=request.seed,
    )
    return GenerationResponse(**result)


@router.post("/describe", response_model=ImageDescribeResponse)
async def describe_image(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_image_service),
):
    """
    Get a description of an uploaded image.

    Note: This endpoint is deprecated. Use /api/vision/describe instead
    for full captioning capabilities.

    Args:
        file: Image file to describe
        api_key: API key for authentication
        service: Image service instance

    Returns:
        ImageDescribeResponse with description
    """
    image_data = await file.read()
    result = await service.describe(image_data)
    return ImageDescribeResponse(**result)
85  src/api/v1/endpoints/speech.py  Normal file
@@ -0,0 +1,85 @@
from fastapi import APIRouter, Depends, File, UploadFile

from ....schemas.generation import (
    GenerationResponse,
    SpeechGenerateRequest,
    SpeechToTextResponse,
)
from ....services.speech_service import get_speech_service
from ...dependencies import verify_api_key

router = APIRouter(prefix="/speech", tags=["Speech"])


@router.post("/generate", response_model=GenerationResponse)
async def generate_speech(
    request: SpeechGenerateRequest,
    api_key: str = Depends(verify_api_key),
    service=Depends(get_speech_service),
):
    """
    Generate speech audio from text (Text-to-Speech).

    Args:
        request: Speech generation parameters including:
            - prompt: Text to convert to speech
            - voice: Voice model to use (optional, default: "default")
            - language: Language code (optional, default: "en")
        api_key: API key for authentication
        service: Speech service instance

    Returns:
        GenerationResponse with file path to generated audio and generation time
    """
    result = await service.generate(
        prompt=request.prompt,
        voice=request.voice,
        language=request.language,
    )
    return GenerationResponse(**result)


@router.post("/totext", response_model=SpeechToTextResponse)
async def speech_to_text(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_speech_service),
):
    """
    Convert speech audio to text (Speech-to-Text) using Whisper.

    Supported audio formats: wav, mp3, m4a, flac, ogg

    Args:
        file: Audio file to transcribe
        api_key: API key for authentication
        service: Speech service instance

    Returns:
        SpeechToTextResponse with transcribed text, detected language, and confidence
    """
    audio_data = await file.read()
    result = await service.to_text(audio_data)
    return SpeechToTextResponse(**result)


@router.post("/detect_language")
async def detect_language(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_speech_service),
):
    """
    Detect the language of spoken audio using Whisper.

    Args:
        file: Audio file to analyze
        api_key: API key for authentication
        service: Speech service instance

    Returns:
        dict with detected language code and confidence score
    """
    audio_data = await file.read()
    result = await service.detect_language(audio_data)
    return result
63  src/api/v1/endpoints/video.py  Normal file
@@ -0,0 +1,63 @@
from fastapi import APIRouter, Depends, File, UploadFile

from ....schemas.generation import (
    GenerationResponse,
    VideoDescribeResponse,
    VideoGenerateRequest,
)
from ....services.video_service import get_video_service
from ...dependencies import verify_api_key

router = APIRouter(prefix="/video", tags=["Video"])


@router.post("/generate", response_model=GenerationResponse)
async def generate_video(
    request: VideoGenerateRequest,
    api_key: str = Depends(verify_api_key),
    service=Depends(get_video_service),
):
    """
    Generate a video from a text prompt.

    Args:
        request: Video generation parameters including prompt, frames, fps, etc.
        api_key: API key for authentication
        service: Video service instance

    Returns:
        GenerationResponse with file path and generation time
    """
    result = await service.generate(
        prompt=request.prompt,
        num_frames=request.num_frames,
        fps=request.fps,
        steps=request.steps,
        seed=request.seed,
    )
    return GenerationResponse(**result)


@router.post("/describe", response_model=VideoDescribeResponse)
async def describe_video(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_video_service),
):
    """
    Get a description of an uploaded video.

    Note: This endpoint is deprecated. Use /api/vision/describe_video instead
    for full video captioning capabilities.

    Args:
        file: Video file to describe
        api_key: API key for authentication
        service: Video service instance

    Returns:
        VideoDescribeResponse with description and frame count
    """
    video_data = await file.read()
    result = await service.describe(video_data)
    return VideoDescribeResponse(**result)
63  src/api/v1/endpoints/vision.py  Normal file
@@ -0,0 +1,63 @@
from typing import Optional

from fastapi import APIRouter, Depends, File, Form, UploadFile

from ....schemas.generation import ImageDescribeResponse, VideoDescribeResponse
from ....services.vision_service import get_vision_service
from ...dependencies import verify_api_key

router = APIRouter(prefix="/vision", tags=["Vision"])


@router.post("/describe", response_model=ImageDescribeResponse)
async def describe_image(
    file: UploadFile = File(...),
    prompt: Optional[str] = Form(None),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_vision_service),
):
    """
    Get a caption/description for an image.
    Optionally provide a prompt to guide the description.
    """
    image_data = await file.read()
    result = await service.describe_image(image_data, prompt)
    return ImageDescribeResponse(**result)


@router.post("/describe_video", response_model=VideoDescribeResponse)
async def describe_video(
    file: UploadFile = File(...),
    num_frames: int = Form(8),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_vision_service),
):
    """
    Get a description for a video by sampling and analyzing frames.

    Args:
        file: Video file (mp4, avi, mov, webm, mkv)
        num_frames: Number of frames to sample for analysis (default: 8)
    """
    video_data = await file.read()
    result = await service.describe_video(video_data, num_frames)
    return VideoDescribeResponse(**result)


@router.post("/vqa")
async def visual_question_answering(
    file: UploadFile = File(...),
    question: str = Form(...),
    api_key: str = Depends(verify_api_key),
    service=Depends(get_vision_service),
):
    """
    Visual Question Answering - ask a question about an image.

    Args:
        file: Image file
        question: Question to ask about the image
    """
    image_data = await file.read()
    result = await service.answer_question(image_data, question)
    return ImageDescribeResponse(**result)
0  src/core/__init__.py  Normal file
64  src/core/config.py  Normal file
@@ -0,0 +1,64 @@
from pathlib import Path

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=False,
        extra="ignore",
    )

    env: str = "development"
    host: str = "0.0.0.0"
    port: int = 8085
    log_level: str = "INFO"
    api_v1_prefix: str = "/api"
    project_name: str = "BotModels API"
    version: str = "2.0.0"
    api_key: str = "change-me"

    # Image generation model
    image_model_path: str = "./models/stable-diffusion-v1-5"
    image_steps: int = 4
    image_width: int = 512
    image_height: int = 512
    image_gpu_layers: int = 20
    image_batch_size: int = 1

    # Video generation model
    video_model_path: str = "./models/zeroscope-v2"
    video_frames: int = 24
    video_fps: int = 8
    video_width: int = 320
    video_height: int = 576
    video_gpu_layers: int = 15
    video_batch_size: int = 1

    # Speech/TTS model
    speech_model_path: str = "./models/tts"

    # Vision model (BLIP2 for captioning)
    vision_model_path: str = "./models/blip2"

    # Whisper model for speech-to-text
    whisper_model_path: str = "./models/whisper"

    # Device configuration
    device: str = "cuda"

    # Output directory for generated files
    output_dir: Path = Path("./outputs")

    @property
    def is_production(self) -> bool:
        return self.env == "production"


settings = Settings()
settings.output_dir.mkdir(parents=True, exist_ok=True)
(settings.output_dir / "images").mkdir(exist_ok=True)
(settings.output_dir / "videos").mkdir(exist_ok=True)
(settings.output_dir / "audio").mkdir(exist_ok=True)
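Because these settings come from pydantic-settings, process environment variables override `.env` values; a sketch of a per-deployment override:

```python
import os

# Env vars take precedence over .env in pydantic-settings, so the port
# can be changed per deployment without editing any file.
os.environ["PORT"] = "9090"

from src.core.config import Settings  # imported after the override is in place

print(Settings().port)  # 9090
```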
33  src/core/logging.py  Normal file
@@ -0,0 +1,33 @@
import logging

import structlog

from .config import settings


def setup_logging():
    if settings.is_production:
        structlog.configure(
            processors=[
                structlog.contextvars.merge_contextvars,
                structlog.stdlib.add_log_level,
                structlog.processors.TimeStamper(fmt="iso"),
                structlog.processors.JSONRenderer(),
            ],
            wrapper_class=structlog.make_filtering_bound_logger(
                # Resolve the LOG_LEVEL name (e.g. "INFO") to its numeric
                # value from the stdlib logging module.
                getattr(logging, settings.log_level.upper())
            ),
        )
    else:
        structlog.configure(
            processors=[
                structlog.contextvars.merge_contextvars,
                structlog.stdlib.add_log_level,
                structlog.processors.TimeStamper(fmt="iso"),
                structlog.dev.ConsoleRenderer(colors=True),
            ],
        )


def get_logger(name: str = None):
    logger = structlog.get_logger()
    if name:
        logger = logger.bind(service=name)
    return logger


setup_logging()
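Callers then get a bound logger in one import (a short usage sketch):

```python
from src.core.logging import get_logger

logger = get_logger("example")
# structlog accepts arbitrary key-value context alongside the event name.
logger.info("job_started", job_id=42, device="cuda")
```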
78  src/main.py  Normal file
@@ -0,0 +1,78 @@
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from fastapi.staticfiles import StaticFiles

from .api.v1.endpoints import image, speech, video, vision
from .core.config import settings
from .core.logging import get_logger
from .services.image_service import get_image_service
from .services.speech_service import get_speech_service
from .services.video_service import get_video_service
from .services.vision_service import get_vision_service

logger = get_logger("main")


@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("Starting BotModels API", version=settings.version)
    try:
        get_image_service().initialize()
        get_video_service().initialize()
        get_speech_service().initialize()
        get_vision_service().initialize()
        logger.info("All services initialized")
    except Exception as e:
        logger.error("Failed to initialize services", error=str(e))
    yield
    logger.info("Shutting down BotModels API")


app = FastAPI(
    title=settings.project_name,
    version=settings.version,
    lifespan=lifespan,
    docs_url="/api/docs",
    redoc_url="/api/redoc",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

app.include_router(image.router, prefix=settings.api_v1_prefix)
app.include_router(video.router, prefix=settings.api_v1_prefix)
app.include_router(speech.router, prefix=settings.api_v1_prefix)
app.include_router(vision.router, prefix=settings.api_v1_prefix)

app.mount("/outputs", StaticFiles(directory="outputs"), name="outputs")


@app.get("/")
async def root():
    return JSONResponse(
        {
            "service": settings.project_name,
            "version": settings.version,
            "status": "running",
            "docs": "/api/docs",
        }
    )


@app.get("/api/health")
async def health():
    return {"status": "healthy", "version": settings.version, "device": settings.device}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run("src.main:app", host=settings.host, port=settings.port, reload=True)
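A quick smoke test of the assembled app (a sketch using FastAPI's TestClient, backed by the pinned httpx; note that entering the client runs the lifespan hook, so the services will attempt to load their models):

```python
from fastapi.testclient import TestClient

from src.main import app

# Entering the context manager triggers the lifespan startup above.
with TestClient(app) as client:
    response = client.get("/api/health")
    print(response.json())  # e.g. {"status": "healthy", "version": "2.0.0", "device": "cuda"}
```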
0  src/schemas/__init__.py  Normal file
57  src/schemas/generation.py  Normal file
@@ -0,0 +1,57 @@
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field


class GenerationRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=2000)
    seed: Optional[int] = None


class ImageGenerateRequest(GenerationRequest):
    steps: Optional[int] = Field(30, ge=1, le=150)
    width: Optional[int] = Field(512, ge=64, le=2048)
    height: Optional[int] = Field(512, ge=64, le=2048)
    guidance_scale: Optional[float] = Field(7.5, ge=1.0, le=20.0)


class VideoGenerateRequest(GenerationRequest):
    num_frames: Optional[int] = Field(24, ge=8, le=128)
    fps: Optional[int] = Field(8, ge=1, le=60)
    steps: Optional[int] = Field(50, ge=10, le=100)


class SpeechGenerateRequest(GenerationRequest):
    voice: Optional[str] = Field("default", description="Voice model")
    language: Optional[str] = Field("en", description="Language code")


class GenerationResponse(BaseModel):
    status: str
    file_path: Optional[str] = None
    generation_time: Optional[float] = None
    error: Optional[str] = None
    timestamp: datetime = Field(default_factory=datetime.utcnow)


class DescribeRequest(BaseModel):
    file_data: bytes


class ImageDescribeResponse(BaseModel):
    description: str
    confidence: Optional[float] = None
    generation_time: Optional[float] = None


class VideoDescribeResponse(BaseModel):
    description: str
    frame_count: int
    generation_time: Optional[float] = None


class SpeechToTextResponse(BaseModel):
    text: str
    language: Optional[str] = None
    confidence: Optional[float] = None
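To illustrate the request validation these schemas buy (a sketch; pydantic rejects out-of-range values before any model runs):

```python
from pydantic import ValidationError

from src.schemas.generation import ImageGenerateRequest

# In-range values are accepted; omitted fields take their declared defaults.
req = ImageGenerateRequest(prompt="a beautiful sunset", steps=30)
print(req.width, req.height, req.guidance_scale)  # 512 512 7.5

# Out-of-range values fail fast: steps is constrained to 1..150.
try:
    ImageGenerateRequest(prompt="x", steps=500)
except ValidationError as exc:
    print(exc.error_count(), "validation error(s)")
```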
15  src/services/__init__.py  Normal file
@@ -0,0 +1,15 @@
from .image_service import ImageService, get_image_service
from .speech_service import SpeechService, get_speech_service
from .video_service import VideoService, get_video_service
from .vision_service import VisionService, get_vision_service

__all__ = [
    "ImageService",
    "get_image_service",
    "VideoService",
    "get_video_service",
    "SpeechService",
    "get_speech_service",
    "VisionService",
    "get_vision_service",
]
111  src/services/image_service.py  Normal file
@@ -0,0 +1,111 @@
import time
from datetime import datetime
from typing import Optional

import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline
from PIL import Image

from ..core.config import settings
from ..core.logging import get_logger

logger = get_logger("image_service")


class ImageService:
    def __init__(self):
        self.pipeline: Optional[StableDiffusionPipeline] = None
        self.device = settings.device
        self._initialized = False

    def initialize(self):
        if self._initialized:
            return
        logger.info("Loading Stable Diffusion model", path=settings.image_model_path)
        try:
            self.pipeline = StableDiffusionPipeline.from_pretrained(
                settings.image_model_path,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
                safety_checker=None,
            )
            self.pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
                self.pipeline.scheduler.config
            )
            self.pipeline = self.pipeline.to(self.device)
            if self.device == "cuda":
                self.pipeline.enable_attention_slicing()
            self._initialized = True
            logger.info("Stable Diffusion loaded successfully")
        except Exception as e:
            logger.error("Failed to load model", error=str(e))
            raise

    async def generate(
        self,
        prompt: str,
        steps: Optional[int] = None,
        width: Optional[int] = None,
        height: Optional[int] = None,
        guidance_scale: Optional[float] = None,
        seed: Optional[int] = None,
    ) -> dict:
        if not self._initialized:
            self.initialize()

        # Use config defaults if not specified
        actual_steps = steps if steps is not None else settings.image_steps
        actual_width = width if width is not None else settings.image_width
        actual_height = height if height is not None else settings.image_height
        actual_guidance = guidance_scale if guidance_scale is not None else 7.5

        start = time.time()
        # Check against None so that an explicit seed of 0 still seeds the generator.
        generator = (
            torch.Generator(device=self.device).manual_seed(seed)
            if seed is not None
            else None
        )

        logger.info(
            "Generating image",
            prompt=prompt[:50],
            steps=actual_steps,
            width=actual_width,
            height=actual_height,
        )

        output = self.pipeline(
            prompt=prompt,
            num_inference_steps=actual_steps,
            guidance_scale=actual_guidance,
            width=actual_width,
            height=actual_height,
            generator=generator,
        )

        image: Image.Image = output.images[0]
        timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_{hash(prompt) & 0xFFFFFF:06x}.png"
        output_path = settings.output_dir / "images" / filename
        image.save(output_path)

        generation_time = time.time() - start
        logger.info("Image generated", file=filename, time=generation_time)

        return {
            "status": "completed",
            "file_path": f"/outputs/images/{filename}",
            "generation_time": generation_time,
        }

    async def describe(self, image_data: bytes) -> dict:
        # Placeholder for backward compatibility
        # Use vision_service for actual image description
        return {"description": "Use /api/vision/describe endpoint", "confidence": 0.0}


_service = None


def get_image_service():
    global _service
    if _service is None:
        _service = ImageService()
    return _service
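The singleton can also be driven outside FastAPI, e.g. for a one-off batch job (a sketch; assumes weights are resolvable at IMAGE_MODEL_PATH, and the first call pays the lazy-loading cost):

```python
import asyncio

from src.services.image_service import get_image_service


async def main():
    # initialize() runs implicitly on the first generate() call.
    result = await get_image_service().generate(prompt="a lighthouse at dusk", steps=4)
    print(result["status"], result["file_path"])


asyncio.run(main())
```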
229  src/services/speech_service.py  Normal file
@@ -0,0 +1,229 @@
import io
import tempfile
import time
from datetime import datetime
from pathlib import Path
from typing import Optional

from ..core.config import settings
from ..core.logging import get_logger

logger = get_logger("speech_service")


class SpeechService:
    def __init__(self):
        self.tts_model = None
        self.whisper_model = None
        self.device = settings.device
        self._initialized = False

    def initialize(self):
        if self._initialized:
            return
        logger.info("Loading speech models")
        try:
            # Load TTS model (Coqui TTS)
            self._load_tts_model()

            # Load Whisper model for speech-to-text
            self._load_whisper_model()

            self._initialized = True
            logger.info("Speech models loaded successfully")
        except Exception as e:
            logger.error("Failed to load speech models", error=str(e))
            # Don't raise - allow service to run with partial functionality
            logger.warning("Speech service will have limited functionality")

    def _load_tts_model(self):
        """Load TTS model for text-to-speech generation"""
        try:
            from TTS.api import TTS

            # Use a fast, high-quality model
            self.tts_model = TTS(
                model_name="tts_models/en/ljspeech/tacotron2-DDC",
                progress_bar=False,
                gpu=(self.device == "cuda"),
            )
            logger.info("TTS model loaded")
        except Exception as e:
            logger.warning("TTS model not available", error=str(e))
            self.tts_model = None

    def _load_whisper_model(self):
        """Load Whisper model for speech-to-text"""
        try:
            import whisper

            # Use base model for balance of speed and accuracy
            model_size = "base"
            if Path(settings.whisper_model_path).exists():
                self.whisper_model = whisper.load_model(
                    model_size, download_root=settings.whisper_model_path
                )
            else:
                self.whisper_model = whisper.load_model(model_size)
            logger.info("Whisper model loaded", model=model_size)
        except Exception as e:
            logger.warning("Whisper model not available", error=str(e))
            self.whisper_model = None

    async def generate(
        self,
        prompt: str,
        voice: Optional[str] = None,
        language: Optional[str] = None,
    ) -> dict:
        """Generate speech audio from text"""
        if not self._initialized:
            self.initialize()

        start = time.time()
        timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_{hash(prompt) & 0xFFFFFF:06x}.wav"
        output_path = settings.output_dir / "audio" / filename

        if self.tts_model is None:
            logger.error("TTS model not available")
            return {
                "status": "error",
                "error": "TTS model not initialized",
                "file_path": None,
                "generation_time": time.time() - start,
            }

        try:
            logger.info(
                "Generating speech",
                text_length=len(prompt),
                voice=voice,
                language=language,
            )

            # Generate speech
            self.tts_model.tts_to_file(
                text=prompt,
                file_path=str(output_path),
            )

            generation_time = time.time() - start
            logger.info("Speech generated", file=filename, time=generation_time)

            return {
                "status": "completed",
                "file_path": f"/outputs/audio/{filename}",
                "generation_time": generation_time,
            }

        except Exception as e:
            logger.error("Speech generation failed", error=str(e))
            return {
                "status": "error",
                "error": str(e),
                "file_path": None,
                "generation_time": time.time() - start,
            }

    async def to_text(self, audio_data: bytes) -> dict:
        """Convert speech audio to text using Whisper"""
        if not self._initialized:
            self.initialize()

        start = time.time()

        if self.whisper_model is None:
            logger.error("Whisper model not available")
            return {
                "text": "",
                "language": None,
                "confidence": 0.0,
                "error": "Whisper model not initialized",
            }

        try:
            # Save audio to temporary file
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
                tmp.write(audio_data)
                tmp_path = tmp.name

            logger.info("Transcribing audio", file_size=len(audio_data))

            # Transcribe
            result = self.whisper_model.transcribe(tmp_path)

            # Clean up temp file
            import os

            os.unlink(tmp_path)

            transcription_time = time.time() - start
            logger.info(
                "Audio transcribed",
                text_length=len(result["text"]),
                language=result.get("language"),
                time=transcription_time,
            )

            return {
                "text": result["text"].strip(),
                "language": result.get("language", "en"),
                "confidence": 0.95,  # Whisper doesn't provide confidence directly
            }

        except Exception as e:
            logger.error("Speech-to-text failed", error=str(e))
            return {
                "text": "",
                "language": None,
                "confidence": 0.0,
                "error": str(e),
            }

    async def detect_language(self, audio_data: bytes) -> dict:
        """Detect the language of spoken audio"""
        if not self._initialized:
            self.initialize()

        if self.whisper_model is None:
            return {"language": None, "error": "Whisper model not initialized"}

        try:
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
                tmp.write(audio_data)
                tmp_path = tmp.name

            import whisper

            # Load audio and detect language
            audio = whisper.load_audio(tmp_path)
            audio = whisper.pad_or_trim(audio)
            mel = whisper.log_mel_spectrogram(audio).to(self.whisper_model.device)
            _, probs = self.whisper_model.detect_language(mel)

            import os

            os.unlink(tmp_path)

            detected_lang = max(probs, key=probs.get)
            confidence = probs[detected_lang]

            return {
                "language": detected_lang,
                "confidence": confidence,
            }

        except Exception as e:
            logger.error("Language detection failed", error=str(e))
            return {"language": None, "error": str(e)}


_service = None


def get_speech_service():
    global _service
    if _service is None:
        _service = SpeechService()
    return _service
106  src/services/video_service.py  Normal file
@@ -0,0 +1,106 @@
import time
from datetime import datetime
from typing import Optional

import imageio
import torch

from ..core.config import settings
from ..core.logging import get_logger

logger = get_logger("video_service")


class VideoService:
    def __init__(self):
        self.pipeline = None
        self.device = settings.device
        self._initialized = False

    def initialize(self):
        if self._initialized:
            return
        logger.info("Loading video model", path=settings.video_model_path)
        try:
            from diffusers import DiffusionPipeline

            self.pipeline = DiffusionPipeline.from_pretrained(
                settings.video_model_path,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            )
            self.pipeline = self.pipeline.to(self.device)
            self._initialized = True
            logger.info("Video model loaded successfully")
        except Exception as e:
            logger.error("Failed to load video model", error=str(e))
            raise

    async def generate(
        self,
        prompt: str,
        num_frames: Optional[int] = None,
        fps: Optional[int] = None,
        steps: Optional[int] = None,
        seed: Optional[int] = None,
    ) -> dict:
        if not self._initialized:
            self.initialize()

        # Use config defaults if not specified
        actual_frames = num_frames if num_frames is not None else settings.video_frames
        actual_fps = fps if fps is not None else settings.video_fps
        actual_steps = steps if steps is not None else 50

        start = time.time()
        # Check against None so that an explicit seed of 0 still seeds the generator.
        generator = (
            torch.Generator(device=self.device).manual_seed(seed)
            if seed is not None
            else None
        )

        logger.info(
            "Generating video",
            prompt=prompt[:50],
            frames=actual_frames,
            fps=actual_fps,
            steps=actual_steps,
        )

        output = self.pipeline(
            prompt=prompt,
            num_frames=actual_frames,
            num_inference_steps=actual_steps,
            generator=generator,
        )

        frames = output.frames[0]
        timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
        filename = f"{timestamp}_{hash(prompt) & 0xFFFFFF:06x}.mp4"
        output_path = settings.output_dir / "videos" / filename

        imageio.mimsave(output_path, frames, fps=actual_fps, codec="libx264")

        generation_time = time.time() - start
        logger.info("Video generated", file=filename, time=generation_time)

        return {
            "status": "completed",
            "file_path": f"/outputs/videos/{filename}",
            "generation_time": generation_time,
        }

    async def describe(self, video_data: bytes) -> dict:
        # Placeholder for backward compatibility
        # Use vision_service for actual video description
        return {
            "description": "Use /api/vision/describe_video endpoint",
            "frame_count": 0,
        }


_service = None


def get_video_service():
    global _service
    if _service is None:
        _service = VideoService()
    return _service
204  src/services/vision_service.py  Normal file
@@ -0,0 +1,204 @@
import io
import time
from datetime import datetime
from typing import Optional

import torch
from PIL import Image

from ..core.config import settings
from ..core.logging import get_logger

logger = get_logger("vision_service")


class VisionService:
    def __init__(self):
        self.model = None
        self.processor = None
        self.device = settings.device
        self._initialized = False

    def initialize(self):
        if self._initialized:
            return
        logger.info("Loading vision model (BLIP2)")
        try:
            from transformers import Blip2ForConditionalGeneration, Blip2Processor

            self.processor = Blip2Processor.from_pretrained(settings.vision_model_path)
            self.model = Blip2ForConditionalGeneration.from_pretrained(
                settings.vision_model_path,
                torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            )
            self.model = self.model.to(self.device)
            self._initialized = True
            logger.info("Vision model loaded")
        except Exception as e:
            logger.error("Failed to load vision model", error=str(e))
            # Don't raise - allow service to run without vision
            logger.warning("Vision service will return placeholder responses")

    async def describe_image(
        self, image_data: bytes, prompt: Optional[str] = None
    ) -> dict:
        """Generate a caption/description for an image"""
        start = time.time()

        if not self._initialized or self.model is None:
            # Return placeholder if model not loaded
            return {
                "description": "Vision model not initialized. Please check model path configuration.",
                "confidence": 0.0,
                "generation_time": time.time() - start,
            }

        try:
            # Load image from bytes
            image = Image.open(io.BytesIO(image_data)).convert("RGB")

            # Prepare inputs
            if prompt:
                inputs = self.processor(image, text=prompt, return_tensors="pt").to(
                    self.device
                )
            else:
                inputs = self.processor(image, return_tensors="pt").to(self.device)

            # Generate caption
            with torch.no_grad():
                generated_ids = self.model.generate(
                    **inputs, max_new_tokens=100, num_beams=5, early_stopping=True
                )

            # Decode the generated text
            description = self.processor.decode(
                generated_ids[0], skip_special_tokens=True
            )

            return {
                "description": description.strip(),
                "confidence": 0.85,  # BLIP2 doesn't provide confidence scores directly
                "generation_time": time.time() - start,
            }

        except Exception as e:
            logger.error("Image description failed", error=str(e))
            return {
                "description": f"Error describing image: {str(e)}",
                "confidence": 0.0,
                "generation_time": time.time() - start,
            }

    async def describe_video(self, video_data: bytes, num_frames: int = 8) -> dict:
        """Generate a description for a video by sampling frames"""
        start = time.time()

        if not self._initialized or self.model is None:
            return {
                "description": "Vision model not initialized. Please check model path configuration.",
                "frame_count": 0,
                "generation_time": time.time() - start,
            }

        try:
            import tempfile

            import cv2
            import numpy as np

            # Save video to temp file
            with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
                tmp.write(video_data)
                tmp_path = tmp.name

            # Open video and extract frames
            cap = cv2.VideoCapture(tmp_path)
            total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

            if total_frames == 0:
                cap.release()
                return {
                    "description": "Could not read video frames",
                    "frame_count": 0,
                    "generation_time": time.time() - start,
                }

            # Sample frames evenly throughout the video
            frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
            frames = []

            for idx in frame_indices:
                cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
                ret, frame = cap.read()
                if ret:
                    # Convert BGR to RGB
                    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    frames.append(Image.fromarray(frame_rgb))

            cap.release()

            # Clean up temp file
            import os

            os.unlink(tmp_path)

            if not frames:
                return {
                    "description": "No frames could be extracted from video",
                    "frame_count": 0,
                    "generation_time": time.time() - start,
                }

            # Generate descriptions for each sampled frame
            descriptions = []
            for frame in frames:
                inputs = self.processor(frame, return_tensors="pt").to(self.device)

                with torch.no_grad():
                    generated_ids = self.model.generate(
                        **inputs, max_new_tokens=50, num_beams=3, early_stopping=True
                    )

                desc = self.processor.decode(generated_ids[0], skip_special_tokens=True)
                descriptions.append(desc.strip())

            # Combine descriptions into a coherent summary
            # Use the most common elements or create a timeline
            unique_descriptions = list(
                dict.fromkeys(descriptions)
            )  # Remove duplicates preserving order

            if len(unique_descriptions) == 1:
                combined = unique_descriptions[0]
            else:
                combined = "Video shows: " + "; ".join(unique_descriptions[:4])

            return {
                "description": combined,
                "frame_count": len(frames),
                "generation_time": time.time() - start,
            }

        except Exception as e:
            logger.error("Video description failed", error=str(e))
            return {
                "description": f"Error describing video: {str(e)}",
                "frame_count": 0,
                "generation_time": time.time() - start,
            }

    async def answer_question(self, image_data: bytes, question: str) -> dict:
        """Visual question answering - ask a question about an image"""
        # Use describe_image with the question as a prompt
        return await self.describe_image(image_data, prompt=question)


_service = None


def get_vision_service():
    global _service
    if _service is None:
        _service = VisionService()
    return _service
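As with the other services, VQA can be exercised directly (a sketch; `photo.jpg` is a placeholder path, and BLIP2 weights must be available at VISION_MODEL_PATH):

```python
import asyncio

from src.services.vision_service import get_vision_service


async def main():
    with open("photo.jpg", "rb") as f:  # placeholder image path
        image_data = f.read()
    service = get_vision_service()
    service.initialize()  # describe_image does not lazy-load on its own
    # VQA routes through describe_image with the question as the prompt.
    result = await service.answer_question(image_data, "How many people are in this image?")
    print(result["description"])


asyncio.run(main())
```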
0  src/utils/__init__.py  Normal file