NVIDIA GPU Setup for LXC Containers

This guide covers setting up NVIDIA GPU passthrough for botserver running in LXC containers, enabling hardware acceleration for local LLM inference.

Prerequisites

  • NVIDIA CUDA-capable GPU (RTX 3060 or better with 12GB+ VRAM recommended)
  • NVIDIA drivers installed on the host system
  • LXD/LXC installed
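Before touching LXD, confirm the host itself can see the card and that the host driver is working:

# On the host: the GPU should appear on the PCI bus and nvidia-smi should report it
lspci | grep -i nvidia
nvidia-smi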

LXD Configuration (Interactive Setup)

When initializing LXD, use these settings:

sudo lxd init

Answer the prompts as follows:

  • Would you like to use LXD clustering? no
  • Do you want to configure a new storage pool? no (will create /generalbots later)
  • Would you like to connect to a MAAS server? no
  • Would you like to create a new local network bridge? yes
  • What should the new bridge be called? lxdbr0
  • What IPv4 address should be used? auto
  • What IPv6 address should be used? auto
  • Would you like the LXD server to be available over the network? no
  • Would you like stale cached images to be updated automatically? no
  • Would you like a YAML "lxd init" preseed to be printed? no

Storage Configuration

  • Storage backend name: default
  • Storage backend driver: zfs
  • Create a new ZFS pool? yes
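If you prefer a repeatable, non-interactive setup, the same answers can be fed to lxd init as a preseed. The sketch below mirrors the prompts above; treat the exact YAML as a starting point to adapt to your host rather than a drop-in file:

# Non-interactive equivalent of the interactive answers above
cat <<'EOF' | lxd init --preseed
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: auto
storage_pools:
- name: default
  driver: zfs
profiles:
- name: default
  devices:
    root:
      path: /
      pool: default
      type: disk
    eth0:
      name: eth0
      network: lxdbr0
      type: nic
EOF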

NVIDIA GPU Configuration

On the Host System

Create a GPU profile and attach it to your container:

# Create GPU profile
lxc profile create gpu

# Add a physical GPU device to the profile (arguments: profile, device name, device type)
lxc profile device add gpu gpu gpu gputype=physical

# Apply GPU profile to your container
lxc profile add gb-system gpu
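After attaching the profile, verify it from the host and restart the container so the device nodes show up inside it:

# Confirm the GPU device exists in the profile
lxc profile show gpu

# Restart the container so /dev/nvidia* device nodes appear inside it
lxc restart gb-system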

Inside the Container

Configure NVIDIA driver version pinning and install drivers:

  1. Pin NVIDIA driver versions to ensure stability:
cat > /etc/apt/preferences.d/nvidia-drivers << 'EOF'
Package: *nvidia*
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: cuda-drivers*
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: libcuda*
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: libxnvctrl* 
Pin: version 560.35.05-1
Pin-Priority: 1001

Package: libnv*
Pin: version 560.35.05-1
Pin-Priority: 1001
EOF
  2. Install NVIDIA drivers and CUDA toolkit:
# Update package lists
apt update

# Install NVIDIA driver and nvidia-smi
apt install -y nvidia-driver nvidia-smi

# Add CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb

# Install CUDA toolkit
apt-get update
apt-get -y install cuda-toolkit-12-8
apt-get install -y cuda-drivers
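Before moving on, confirm that APT honored the version pin from step 1:

# Candidate and installed versions should match the pinned 560.35.05-1 series
apt-cache policy nvidia-driver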

Verify GPU Access

After installation, verify GPU is accessible:

# Check GPU is visible
nvidia-smi

# Should show your GPU with driver version 560.35.05
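If the CUDA toolkit was installed as well, check it is reachable; note that nvcc lives under /usr/local/cuda/bin, which may need to be added to PATH (see CUDA Library Issues below):

# List GPUs known to the driver
nvidia-smi -L

# CUDA compiler from cuda-toolkit-12-8 (requires /usr/local/cuda/bin on PATH)
nvcc --version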

Configure botserver for GPU

Update your bot's config.csv to use GPU acceleration:

name,value
llm-server-gpu-layers,35

The number of layers you can offload depends on your GPU memory (a tuning sketch follows the list):

  • RTX 3060 (12GB): 20-35 layers
  • RTX 3070 (8GB): 15-25 layers
  • RTX 4070 (12GB): 30-40 layers
  • RTX 4090 (24GB): 50-99 layers
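One practical way to pick a value is to launch the llama.cpp server by hand with a candidate layer count and watch VRAM usage, lowering the count if memory is nearly exhausted. A minimal sketch, with the binary location and model path as illustrative placeholders:

# Start the server with 35 layers offloaded to the GPU
./llama-server -m /path/to/model.gguf --n-gpu-layers 35 --ctx-size 4096 &

# In a second shell, watch VRAM usage while the model loads and answers a prompt
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1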

Troubleshooting

GPU Not Detected

If nvidia-smi doesn't show the GPU:

  1. Check host GPU drivers:

    # On host
    nvidia-smi
    lxc config device list gb-system
    
  2. Verify GPU passthrough:

    # Inside container
    ls -la /dev/nvidia*
    
  3. Check kernel modules:

    lsmod | grep nvidia
    

Driver Version Mismatch

If you encounter driver version conflicts:

  1. Ensure host and container use the same driver version
  2. Remove the version pinning file and install matching drivers:
    rm /etc/apt/preferences.d/nvidia-drivers
    apt update
    apt install nvidia-driver-560
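The version to match is whatever the host driver reports:

    # On the host: print the exact driver version the container packages must match
    nvidia-smi --query-gpu=driver_version --format=csv,noheader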
    

CUDA Library Issues

If CUDA libraries aren't found:

# Add CUDA to library path
echo '/usr/local/cuda/lib64' >> /etc/ld.so.conf.d/cuda.conf
ldconfig

# Add to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Custom llama.cpp Compilation

If you need custom CPU/GPU optimizations or specific hardware support, compile llama.cpp from source:

Prerequisites

sudo apt update
sudo apt install build-essential cmake git

Compilation Steps

# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Create build directory
mkdir build
cd build

# Configure with CUDA support
cmake .. -DLLAMA_CUDA=ON -DLLAMA_CURL=OFF

# Compile using all available cores
make -j$(nproc)

Compilation Options

For different hardware configurations:

# CPU-only build (no GPU)
cmake .. -DLLAMA_CURL=OFF

# CUDA targeting a specific compute capability (e.g. 7.5)
cmake .. -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=75

# ROCm for AMD GPUs
cmake .. -DLLAMA_HIPBLAS=ON

# Metal for Apple Silicon
cmake .. -DLLAMA_METAL=ON

# AVX2 optimizations for modern CPUs
cmake .. -DLLAMA_AVX2=ON

# F16C for half-precision support
cmake .. -DLLAMA_F16C=ON

After Compilation

# Copy compiled binary to botserver
cp bin/llama-server /path/to/botserver-stack/bin/llm/

Then point config.csv at the custom build:

name,value
llm-server-path,/path/to/botserver-stack/bin/llm/
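It is also worth smoke-testing the binary from the build directory to catch missing CUDA libraries before botserver tries to launch it (the flag shown is llama.cpp's own):

# Should print build information without complaining about missing shared libraries
./bin/llama-server --version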

Benefits of Custom Compilation

  • Hardware-specific optimizations for your exact CPU/GPU
  • Custom CUDA compute capabilities for newer GPUs
  • AVX/AVX2/AVX512 instructions for faster CPU inference
  • Reduced binary size by excluding unused features
  • Support for experimental features not in releases
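To see which of the CPU options apply to your machine, check the instruction sets the host CPU advertises:

# SIMD features reported by the CPU (look for avx, avx2, avx512, f16c)
grep -owE 'avx512[a-z]*|avx2|avx|f16c' /proc/cpuinfo | sort -u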

Performance Optimization

Memory Settings

For optimal LLM performance with GPU:

name,value
llm-server-gpu-layers,35
llm-server-mlock,true
llm-server-no-mmap,false
llm-server-ctx-size,4096
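These keys correspond to standard llama.cpp server options; the exact forwarding is an assumption about botserver's behavior, but the flags themselves are llama.cpp's:

# Assumed mapping from config.csv keys to llama.cpp server flags
#   llm-server-gpu-layers,35  ->  --n-gpu-layers 35
#   llm-server-mlock,true     ->  --mlock            (keep model weights locked in RAM)
#   llm-server-no-mmap,false  ->  (mmap left enabled; --no-mmap not passed)
#   llm-server-ctx-size,4096  ->  --ctx-size 4096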

Multiple GPUs

For systems with multiple GPUs, specify which GPU to use:

# Add each GPU to the profile as a separate device, selected by ID
lxc profile device add gpu gpu0 gpu gputype=physical id=0
lxc profile device add gpu gpu1 gpu gputype=physical id=1
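Inside the container, the LLM server can then be pinned to one card with the standard CUDA environment variable, for example:

# Run on the second GPU only (device numbering follows CUDA, not LXD)
CUDA_VISIBLE_DEVICES=1 ./llama-server -m /path/to/model.gguf --n-gpu-layers 35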

Benefits of GPU Acceleration

With GPU acceleration enabled:

  • 5-10x faster inference compared to CPU
  • Higher context sizes possible (8K-32K tokens)
  • Real-time responses even with large models
  • Lower CPU usage for other tasks
  • Support for larger models (13B, 30B parameters)

Next Steps