Add Chapter 20: Embedded & Offline Deployment - Complete guide for Raspberry Pi, Orange Pi with local LLM

Rodrigo Rodriguez (Pragmatismo) 2025-12-12 13:51:39 -03:00
parent 3fdeeedf73
commit ff5d2ac12c
5 changed files with 835 additions and 0 deletions


@@ -0,0 +1,47 @@
# Chapter 20: Embedded & Offline Deployment
Deploy General Bots to any device - from Raspberry Pi to industrial kiosks - with local LLM inference for fully offline AI capabilities.
## Overview
General Bots can run on minimal hardware with displays as small as 16x2 character LCDs, enabling AI-powered interactions anywhere:
- **Kiosks** - Self-service terminals in stores, airports, hospitals
- **Industrial IoT** - Factory floor assistants, machine interfaces
- **Smart Home** - Wall panels, kitchen displays, door intercoms
- **Retail** - Point-of-sale systems, product information terminals
- **Education** - Classroom assistants, lab equipment interfaces
- **Healthcare** - Patient check-in, medication reminders
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Embedded GB Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Display │ │ botserver │ │ llama.cpp │ │
│ │ LCD/OLED │────▶│ (Rust) │────▶│ (Local) │ │
│ │ TFT/HDMI │ │ Port 8088 │ │ Port 8080 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐ │
│ │ Keyboard │ │ SQLite │ │ TinyLlama │ │
│ │ Buttons │ │ (Data) │ │ GGUF │ │
│ │ Touch │ │ │ │ (~700MB) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## What's in This Chapter
- [Supported Hardware](./hardware.md) - Boards, displays, and peripherals
- [Quick Start](./quick-start.md) - Deploy in 5 minutes
- [Embedded UI](./embedded-ui.md) - Interface for small displays
- [Local LLM](./local-llm.md) - Offline AI with llama.cpp
- [Display Modes](./display-modes.md) - LCD, OLED, TFT, E-ink configurations
- [Kiosk Mode](./kiosk-mode.md) - Locked-down production deployments
- [Performance Tuning](./performance.md) - Optimize for limited resources
- [Offline Operation](./offline.md) - No internet required
- [Use Cases](./use-cases.md) - Real-world deployment examples


@@ -0,0 +1,190 @@
# Supported Hardware
## Single Board Computers (SBCs)
### Recommended Boards
| Board | CPU | RAM | Best For | Price |
|-------|-----|-----|----------|-------|
| **Orange Pi 5** | RK3588S | 4-16GB | Full LLM, NPU accel | $89-149 |
| **Raspberry Pi 5** | BCM2712 | 4-8GB | General purpose | $60-80 |
| **Orange Pi Zero 3** | H618 | 1-4GB | Minimal deployments | $20-35 |
| **Raspberry Pi 4** | BCM2711 | 2-8GB | Established ecosystem | $45-75 |
| **Raspberry Pi Zero 2W** | RP3A0 | 512MB | Ultra-compact | $15 |
| **Rock Pi 4** | RK3399 | 4GB | NPU available | $75 |
| **NVIDIA Jetson Nano** | Tegra X1 | 4GB | GPU inference | $149 |
| **BeagleBone Black** | AM3358 | 512MB | Industrial | $55 |
| **LattePanda 3 Delta** | N100 | 8GB | x86 compatibility | $269 |
| **ODROID-N2+** | S922X | 4GB | High performance | $79 |
### Minimum Requirements
**For UI only (connect to remote botserver):**
- Any ARM/x86 Linux board
- 256MB RAM
- Network connection
- Display output
**For local botserver:**
- ARM64 or x86_64
- 1GB RAM minimum
- 4GB storage
**For local LLM (llama.cpp):**
- ARM64 or x86_64
- 2GB+ RAM (4GB recommended)
- 2GB+ storage for model
### Orange Pi 5 (Recommended for LLM)
The Orange Pi 5 with RK3588S is ideal for embedded LLM:
```
┌─────────────────────────────────────────────────────────────┐
│ Orange Pi 5 - Best for Offline AI │
├─────────────────────────────────────────────────────────────┤
│ CPU: Rockchip RK3588S (4x A76 + 4x A55) │
│ NPU: 6 TOPS (Neural Processing Unit) │
│ GPU: Mali-G610 MP4 │
│ RAM: 4GB / 8GB / 16GB LPDDR4X │
│ Storage: M.2 NVMe + eMMC + microSD │
│ │
│ LLM Performance: │
│ ├─ TinyLlama 1.1B Q4: ~8-12 tokens/sec │
│ ├─ Phi-2 2.7B Q4: ~4-6 tokens/sec │
│ └─ With NPU (rkllm): ~20-30 tokens/sec │
└─────────────────────────────────────────────────────────────┘
```
## Displays
### Character LCDs (Minimal)
For text-only interfaces:
| Display | Resolution | Interface | Use Case |
|---------|------------|-----------|----------|
| HD44780 16x2 | 16 chars × 2 lines | I2C/GPIO | Status, simple Q&A |
| HD44780 20x4 | 20 chars × 4 lines | I2C/GPIO | More context |
| LCD2004 | 20 chars × 4 lines | I2C | Industrial |
**Example output on 16x2:**
```
┌────────────────┐
│> How can I help│
│< Processing... │
└────────────────┘
```
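Before writing any application code for a character LCD, it helps to confirm the I2C backpack is visible on the bus. A minimal check with `i2c-tools` (this sketch assumes bus 1 and the common PCF8574 backpack; bus number and address vary by board and module):
```bash
# Install I2C utilities (Debian/Armbian/Raspberry Pi OS)
sudo apt install -y i2c-tools

# Scan bus 1; a PCF8574 LCD backpack usually appears at 0x27 or 0x3f
sudo i2cdetect -y 1
```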
### OLED Displays
For graphical monochrome interfaces:
| Display | Resolution | Interface | Size |
|---------|------------|-----------|------|
| SSD1306 | 128×64 | I2C/SPI | 0.96" |
| SSD1309 | 128×64 | I2C/SPI | 2.42" |
| SH1106 | 128×64 | I2C/SPI | 1.3" |
| SSD1322 | 256×64 | SPI | 3.12" |
### TFT/IPS Color Displays
For full graphical interface:
| Display | Resolution | Interface | Notes |
|---------|------------|-----------|-------|
| ILI9341 | 320×240 | SPI | Common, cheap |
| ST7789 | 240×320 | SPI | Fast refresh |
| ILI9488 | 480×320 | SPI | Larger |
| Waveshare 5" | 800×480 | HDMI | Touch optional |
| Waveshare 7" | 1024×600 | HDMI | Touch, IPS |
| Official Pi 7" | 800×480 | DSI | Best for Pi |
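For the HDMI panels above, a Raspberry Pi sometimes needs an explicit mode in `/boot/config.txt` when the panel does not report EDID correctly. A sketch for an 800×480 panel such as the Waveshare 5" (values reflect typical vendor instructions; confirm against your panel's documentation):
```bash
# Append a fixed 800x480 HDMI mode to the Pi boot config
# (values commonly documented for 800x480 panels; verify with your vendor)
sudo tee -a /boot/config.txt <<'EOF'
hdmi_group=2
hdmi_mode=87
hdmi_cvt=800 480 60 6 0 0 0
hdmi_drive=1
EOF
```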
### E-Ink/E-Paper
For low-power, readable in sunlight:
| Display | Resolution | Colors | Refresh |
|---------|------------|--------|---------|
| Waveshare 2.13" | 250×122 | B/W | 2s |
| Waveshare 4.2" | 400×300 | B/W | 4s |
| Waveshare 7.5" | 800×480 | B/W | 5s |
| Good Display 9.7" | 1200×825 | B/W | 6s |
**Best for:** Menu displays, signs, low-update applications
### Industrial Displays
| Display | Resolution | Features |
|---------|------------|----------|
| Advantech | Various | Wide temp, sunlight |
| Winstar | Various | Industrial grade |
| Newhaven | Various | Long availability |
## Input Devices
### Keyboards
- **USB Keyboard** - Standard, any USB keyboard works
- **PS/2 Keyboard** - Via adapter, lower latency
- **Matrix Keypad** - 4x4 or 3x4, GPIO connected
- **I2C Keypad** - Fewer GPIO pins needed
### Touch Input
- **Capacitive Touch** - Better response, needs driver
- **Resistive Touch** - Works with gloves, pressure-based
- **IR Touch Frame** - Large displays, vandal-resistant
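Whichever touch technology you choose, first confirm the kernel registers it as an input device. A quick check, assuming `libinput-tools` and `evtest` are available via apt (device names and event numbers vary):
```bash
sudo apt install -y libinput-tools evtest

# List input devices recognised by the kernel
sudo libinput list-devices

# Watch raw touch events (replace event0 with your device)
sudo evtest /dev/input/event0
```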
### Buttons & GPIO
```
┌─────────────────────────────────────────────┐
│ Simple 4-Button Interface │
├─────────────────────────────────────────────┤
│ │
│ [◄ PREV] [▲ UP] [▼ DOWN] [► SELECT] │
│ │
│ GPIO 17 GPIO 27 GPIO 22 GPIO 23 │
│ │
└─────────────────────────────────────────────┘
```
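You can sanity-check the button wiring above from the shell with the `libgpiod` tools before writing application code. A sketch assuming the buttons sit on `gpiochip0` at the offsets shown and libgpiod v1 syntax (chip names and offsets differ between boards):
```bash
# Install libgpiod command-line tools
sudo apt install -y gpiod

# Read the current level of the SELECT button (GPIO 23)
gpioget gpiochip0 23

# Block until one edge event is seen on the SELECT button
gpiomon --num-events=1 gpiochip0 23
```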
## Enclosures
### Commercial Options
- **Hammond Manufacturing** - Industrial metal enclosures
- **Polycase** - Plastic, IP65 rated
- **Bud Industries** - Various sizes
- **Pi-specific cases** - Argon, Flirc, etc.
### DIY Options
- **3D Printed** - Custom fit, PLA/PETG
- **Laser Cut** - Acrylic, wood
- **Metal Fabrication** - Professional look
## Power
### Power Requirements
| Configuration | Power | Recommended PSU |
|---------------|-------|-----------------|
| Pi Zero + LCD | 1-2W | 5V 1A |
| Pi 4 + Display | 5-10W | 5V 3A |
| Orange Pi 5 | 8-15W | 5V 4A or 12V 2A |
| With NVMe SSD | +2-3W | Add 1A headroom |
### Power Options
- **USB-C PD** - Modern, efficient
- **PoE HAT** - Power over Ethernet
- **12V Barrel** - Industrial standard
- **Battery** - UPS, solar applications
### UPS Solutions
- **PiJuice** - Pi-specific UPS HAT
- **UPS PIco** - Small form factor
- **Powerboost** - Adafruit, lithium battery


@@ -0,0 +1,382 @@
# Local LLM - Offline AI with llama.cpp
Run AI inference completely offline on embedded devices. No internet, no API costs, full privacy.
## Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Local LLM Architecture │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ User Input ──▶ botserver ──▶ llama.cpp ──▶ Response │
│ │ │ │
│ │ ┌────┴────┐ │
│ │ │ Model │ │
│ │ │ GGUF │ │
│ │ │ (Q4_K) │ │
│ │ └─────────┘ │
│ │ │
│ SQLite DB │
│ (sessions) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
## Recommended Models
### By Device RAM
| RAM | Model | Size | Speed | Quality |
|-----|-------|------|-------|---------|
| **2GB** | TinyLlama 1.1B Q4_K_M | 670MB | ~5 tok/s | Basic |
| **4GB** | Phi-2 2.7B Q4_K_M | 1.6GB | ~3-4 tok/s | Good |
| **4GB** | Gemma 2B Q4_K_M | 1.4GB | ~4 tok/s | Good |
| **8GB** | Llama 3.2 3B Q4_K_M | 2GB | ~3 tok/s | Better |
| **8GB** | Mistral 7B Q4_K_M | 4.1GB | ~2 tok/s | Great |
| **16GB** | Llama 3.1 8B Q4_K_M | 4.7GB | ~2 tok/s | Excellent |
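If you script your provisioning, the table above can be turned into a rough model picker based on detected memory. A minimal sketch; the names and thresholds simply mirror the table, so adjust them to taste:
```bash
#!/usr/bin/env bash
# Suggest a GGUF model tier from total system RAM (in MB)
ram_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)

if   [ "$ram_mb" -lt 3000 ]; then echo "Suggested: TinyLlama 1.1B Q4_K_M"
elif [ "$ram_mb" -lt 6000 ]; then echo "Suggested: Phi-2 2.7B or Gemma 2B Q4_K_M"
elif [ "$ram_mb" -lt 12000 ]; then echo "Suggested: Llama 3.2 3B or Mistral 7B Q4_K_M"
else echo "Suggested: Llama 3.1 8B Q4_K_M"
fi
```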
### By Use Case
**Simple Q&A, Commands:**
```
TinyLlama 1.1B - Fast, basic understanding
```
**Customer Service, FAQ:**
```
Phi-2 or Gemma 2B - Good comprehension, reasonable speed
```
**Complex Reasoning:**
```
Llama 3.2 3B or Mistral 7B - Better accuracy, slower
```
## Installation
### Automatic (via deploy script)
```bash
./scripts/deploy-embedded.sh pi@device --with-llama
```
### Manual Installation
```bash
# SSH to device
ssh pi@raspberrypi.local
# Install dependencies
sudo apt update
sudo apt install -y build-essential cmake git wget
# Clone llama.cpp
cd /opt
sudo git clone https://github.com/ggerganov/llama.cpp
sudo chown -R $(whoami):$(whoami) llama.cpp
cd llama.cpp
# Build for ARM (auto-optimizes)
mkdir build && cd build
cmake .. -DLLAMA_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Download model
mkdir -p /opt/llama.cpp/models
cd /opt/llama.cpp/models
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```
### Start Server
```bash
# Test run
/opt/llama.cpp/build/bin/llama-server \
-m /opt/llama.cpp/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 2048 \
--threads 4
# Verify
curl http://localhost:8080/v1/models
```
### Systemd Service
Create `/etc/systemd/system/llama-server.service`:
```ini
[Unit]
Description=llama.cpp Server - Local LLM
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt/llama.cpp
ExecStart=/opt/llama.cpp/build/bin/llama-server \
-m /opt/llama.cpp/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 2048 \
-ngl 0 \
--threads 4
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
```
Enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
```
## Configuration
### botserver .env
```env
# Use local llama.cpp
LLM_PROVIDER=llamacpp
LLM_API_URL=http://127.0.0.1:8080
LLM_MODEL=tinyllama
# Memory limits
MAX_CONTEXT_TOKENS=2048
MAX_RESPONSE_TOKENS=512
STREAMING_ENABLED=true
```
### llama.cpp Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `-c` | 2048 | Context size (tokens) |
| `--threads` | 4 | CPU threads |
| `-ngl` | 0 | GPU layers (0 for CPU only) |
| `--host` | 127.0.0.1 | Bind address |
| `--port` | 8080 | Server port |
| `-b` | 512 | Batch size |
| `--mlock` | off | Lock model in RAM |
### Memory vs Context Size
```
Context 512: ~400MB RAM, fast, limited conversation
Context 1024: ~600MB RAM, moderate
Context 2048: ~900MB RAM, good for most uses
Context 4096: ~1.5GB RAM, long conversations
```
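Putting the parameters together, a low-memory invocation for a 2GB board might look like the sketch below. The model path matches the earlier TinyLlama download; tune `-c`, `-b`, and `--threads` for your hardware:
```bash
# Smaller context (-c) and batch (-b) keep peak RAM down;
# -ngl 0 forces CPU-only inference
/opt/llama.cpp/build/bin/llama-server \
  -m /opt/llama.cpp/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 1024 \
  -b 256 \
  -ngl 0 \
  --threads 4
```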
## Performance Optimization
### CPU Optimization
```bash
# Check CPU features
cat /proc/cpuinfo | grep -E "(model name|Features)"
# Build with specific optimizations
cmake .. -DLLAMA_NATIVE=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_ARM_FMA=ON \
-DLLAMA_ARM_DOTPROD=ON
```
### Memory Optimization
```bash
# For 2GB RAM devices
# Use smaller context
-c 1024
# Keep memory mapping enabled (the default) so the model is
# paged from disk instead of loaded fully into RAM; avoid --no-mmap

# Leave mlock disabled (the default) so the model is not pinned in RAM
```
### Swap Configuration
For devices with limited RAM:
```bash
# Create 2GB swap
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Optimize swap usage
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
```
## NPU Acceleration (Orange Pi 5)
Orange Pi 5 has a 6 TOPS NPU that can accelerate inference:
### Using rkllm (Rockchip NPU)
```bash
# Install rkllm runtime
git clone https://github.com/airockchip/rknn-llm
cd rknn-llm
./install.sh
# Convert model to RKNN format
python3 convert_model.py \
--model tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--output tinyllama.rkllm
# Run with NPU
rkllm-server \
--model tinyllama.rkllm \
--port 8080
```
Expected speedup: **3-5x faster** than CPU only.
## Model Download URLs
### TinyLlama 1.1B (Recommended for 2GB)
```bash
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
```
### Phi-2 2.7B (Recommended for 4GB)
```bash
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf
```
### Gemma 2 2B
```bash
wget https://huggingface.co/bartowski/gemma-2-2b-it-GGUF/resolve/main/gemma-2-2b-it-Q4_K_M.gguf
```
### Llama 3.2 3B (Recommended for 8GB)
```bash
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
```
### Mistral 7B
```bash
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```
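As an alternative to `wget`, the Hugging Face CLI resumes interrupted downloads, which helps on slow embedded links. This sketch assumes `huggingface_hub` is installed via pip; the repository and file names match the TinyLlama example above:
```bash
pip install -U "huggingface_hub[cli]"

# Resumable download straight into the models directory
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --local-dir /opt/llama.cpp/models
```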
## API Usage
llama.cpp exposes an OpenAI-compatible API:
### Chat Completion
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama",
"messages": [
{"role": "user", "content": "What is 2+2?"}
],
"max_tokens": 100
}'
```
### Streaming
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tinyllama",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}'
```
### Health Check
```bash
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```
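The health endpoint also makes a simple watchdog possible: restart the service whenever it stops answering. A sketch, assuming the `llama-server` systemd unit from earlier; run it as root from cron or a systemd timer:
```bash
#!/usr/bin/env bash
# Restart llama-server if the health endpoint stops responding
if ! curl -sf --max-time 5 http://localhost:8080/health > /dev/null; then
  echo "$(date -Is) llama-server unhealthy, restarting" >> /var/log/llama-watchdog.log
  systemctl restart llama-server
fi
```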
## Monitoring
### Check Performance
```bash
# Watch resource usage
htop
# Check inference speed in logs
sudo journalctl -u llama-server -f | grep "tokens/s"
# Memory usage
free -h
```
### Benchmarking
```bash
# Run llama.cpp benchmark
/opt/llama.cpp/build/bin/llama-bench \
-m /opt/llama.cpp/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-p 512 -n 128 -t 4
```
## Troubleshooting
### Model Loading Fails
```bash
# Check available RAM
free -h
# Try smaller context
-c 512
# Make sure memory mapping is enabled (the default); avoid --no-mmap
```
### Slow Inference
```bash
# Increase threads (up to CPU cores)
--threads $(nproc)
# Use optimized build
cmake .. -DLLAMA_NATIVE=ON
# Consider smaller model
```
### Out of Memory Killer
```bash
# Check if OOM killed the process
dmesg | grep -i "killed process"
# Increase swap
# Use smaller model
# Reduce context size
```
## Best Practices
1. **Start small** - Begin with TinyLlama, upgrade if needed
2. **Monitor memory** - Use `htop` during initial tests
3. **Set appropriate context** - 1024-2048 for most embedded use
4. **Use quantized models** - Q4_K_M is a good balance
5. **Enable streaming** - Better UX on slow inference
6. **Test offline** - Verify it works without internet before deployment (see the sketch below)
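One way to run the offline check without unplugging anything is to drop the uplink temporarily and exercise the local endpoints. This assumes the interface is `eth0` (use `wlan0` for Wi-Fi) and must be run from the device's own console, since SSH will drop:
```bash
# Take the uplink down (run on the device console, not over SSH)
sudo ip link set eth0 down

# Everything should still answer locally
curl http://localhost:8080/v1/models
curl http://localhost:8088/health

# Restore the network
sudo ip link set eth0 up
```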


@@ -0,0 +1,209 @@
# Quick Start - Deploy in 5 Minutes
Get General Bots running on your embedded device with local AI in just a few commands.
## Prerequisites
- An SBC (Raspberry Pi, Orange Pi, etc.) with Armbian/Raspbian
- SSH access to the device
- Internet connection (for initial setup only)
## One-Line Deploy
From your development machine:
```bash
# Clone and run the deployment script
git clone https://github.com/GeneralBots/botserver.git
cd botserver
# Deploy to Orange Pi (replace with your device IP)
./scripts/deploy-embedded.sh orangepi@192.168.1.100 --with-ui --with-llama
```
That's it. The hands-on part takes only a few commands; after roughly 10-15 minutes of automated setup:
- BotServer runs on port 8088
- llama.cpp runs on port 8080 with TinyLlama
- Embedded UI available at `http://your-device:8088/embedded/`
## Step-by-Step Guide
### Step 1: Prepare Your Device
Flash your SBC with a compatible OS:
**Raspberry Pi:**
```bash
# Download Raspberry Pi Imager
# Select: Raspberry Pi OS Lite (64-bit)
# Enable SSH in settings
```
**Orange Pi:**
```bash
# Download Armbian from armbian.com
# Flash with balenaEtcher
```
### Step 2: First Boot Configuration
```bash
# SSH into your device
ssh pi@raspberrypi.local # or orangepi@orangepi.local
# Update system
sudo apt update && sudo apt upgrade -y
# Set timezone
sudo timedatectl set-timezone America/Sao_Paulo
# Enable I2C/SPI if using GPIO displays
sudo raspi-config # or armbian-config
```
### Step 3: Run Deployment Script
From your development PC:
```bash
# Basic deployment (botserver only)
./scripts/deploy-embedded.sh pi@raspberrypi.local
# With embedded UI
./scripts/deploy-embedded.sh pi@raspberrypi.local --with-ui
# With local LLM (requires 4GB+ RAM)
./scripts/deploy-embedded.sh pi@raspberrypi.local --with-ui --with-llama
# Specify a different model
./scripts/deploy-embedded.sh pi@raspberrypi.local --with-llama --model phi-2.Q4_K_M.gguf
```
### Step 4: Verify Installation
```bash
# Check services
ssh pi@raspberrypi.local 'sudo systemctl status botserver'
ssh pi@raspberrypi.local 'sudo systemctl status llama-server'
# Test botserver
curl http://raspberrypi.local:8088/health
# Test llama.cpp
curl http://raspberrypi.local:8080/v1/models
```
### Step 5: Access the Interface
Open in your browser:
```
http://raspberrypi.local:8088/embedded/
```
Or set up kiosk mode (auto-starts on boot):
```bash
# Already configured if you used --with-ui
# Just reboot:
ssh pi@raspberrypi.local 'sudo reboot'
```
## Local Installation (On the Device)
If you prefer to install directly on the device:
```bash
# SSH into the device
ssh pi@raspberrypi.local
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
# Clone and build
git clone https://github.com/GeneralBots/botserver.git
cd botserver
# Run local deployment
./scripts/deploy-embedded.sh --local --with-ui --with-llama
```
⚠️ **Note:** Building on ARM devices is slow (1-2 hours). Cross-compilation is faster.
## Configuration
After deployment, edit the config file:
```bash
ssh pi@raspberrypi.local
sudo nano /opt/botserver/.env
```
Key settings:
```env
# Server
HOST=0.0.0.0
PORT=8088
# Local LLM
LLM_PROVIDER=llamacpp
LLM_API_URL=http://127.0.0.1:8080
LLM_MODEL=tinyllama
# Memory limits for small devices
MAX_CONTEXT_TOKENS=2048
MAX_RESPONSE_TOKENS=512
```
Restart after changes:
```bash
sudo systemctl restart botserver
```
## Troubleshooting
### Out of Memory
```bash
# Check memory usage
free -h
# Reduce llama.cpp context
sudo nano /etc/systemd/system/llama-server.service
# Change -c 2048 to -c 1024
# Or use a smaller model
# TinyLlama uses ~700MB, Phi-2 uses ~1.6GB
```
### Service Won't Start
```bash
# Check logs
sudo journalctl -u botserver -f
sudo journalctl -u llama-server -f
# Common issues:
# - Port already in use
# - Missing model file
# - Database permissions
```
### Display Not Working
```bash
# Check if display is detected
ls /dev/fb* # HDMI/DSI
ls /dev/i2c* # I2C displays
ls /dev/spidev* # SPI displays
# For HDMI, check config
sudo nano /boot/config.txt # Raspberry Pi
sudo nano /boot/armbianEnv.txt # Orange Pi
```
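If a framebuffer device exists but the screen stays blank, a quick sanity test is to query its mode and paint it with noise. This assumes the panel is on `/dev/fb0`; `fbset` comes from the `fbset` package:
```bash
# Show the framebuffer's current resolution and depth
sudo apt install -y fbset
fbset -fb /dev/fb0

# Fill the screen with random noise, then clear it
sudo sh -c 'cat /dev/urandom > /dev/fb0'
sudo sh -c 'cat /dev/zero > /dev/fb0'
```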
## Next Steps
- [Embedded UI Guide](./embedded-ui.md) - Customize the interface
- [Local LLM Configuration](./local-llm.md) - Optimize AI performance
- [Kiosk Mode](./kiosk-mode.md) - Production deployment
- [Offline Operation](./offline.md) - Disconnected environments


@@ -390,5 +390,12 @@
- [Appendix D: Documentation Style](./16-appendix-docs-style/conversation-examples.md)
- [SVG and Conversation Standards](./16-appendix-docs-style/svg.md)
# Part XV - Embedded & Offline
- [Chapter 20: Embedded Deployment](./20-embedding/README.md)
- [Supported Hardware](./20-embedding/hardware.md)
- [Quick Start](./20-embedding/quick-start.md)
- [Local LLM with llama.cpp](./20-embedding/local-llm.md)
[Glossary](./glossary.md)
[Contact](./contact/README.md)