The `llm-model` parameter accepts a relative path such as `../../../../data/llm/model.gguf`, an absolute path such as `/opt/models/model.gguf`, or, when using an external API, a model name such as `gpt-5`.
BotServer supports GGUF quantized models for CPU and GPU inference. Quantization levels include Q3_K_M, Q4_K_M, and Q5_K_M for reduced memory usage with acceptable quality trade-offs, while F16 and F32 provide full precision for maximum quality.
## LLM Server Configuration
### Running Embedded Server
BotServer can run its own LLM server for local inference:
```csv
llm-server,true
llm-server-path,botserver-stack/bin/llm/build/bin
llm-server-host,0.0.0.0
llm-server-port,8081
```
### Server Performance Parameters
Fine-tune server performance based on your hardware capabilities:
```csv
llm-server-gpu-layers,0
llm-server-ctx-size,4096
llm-server-n-predict,1024
llm-server-parallel,6
llm-server-cont-batching,true
```
| Parameter | Description | Impact |
|-----------|-------------|--------|
| `llm-server-gpu-layers` | Layers to offload to GPU | 0 = CPU only, higher = more GPU |
| `llm-server-ctx-size` | Context window size | More context = more memory |
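| `llm-server-n-predict` | Max tokens to generate | Limits response length |
| `llm-server-parallel` | Requests handled simultaneously | Higher = more concurrency, more memory |
| `llm-server-cont-batching` | Continuous batching of concurrent requests | Improves throughput with multiple users |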
Memory settings control how the model interacts with system RAM:
```csv
llm-server-mlock,false
llm-server-no-mmap,false
```
The `llm-server-mlock` option locks the model in RAM to prevent swapping, which improves performance but requires enough free memory to hold the whole model. The `llm-server-no-mmap` option disables memory mapping and loads the entire model into RAM up front, which uses more memory but may improve access patterns.
## Cache Configuration
### Basic Cache Settings
Caching reduces repeated LLM calls for identical inputs, significantly improving response times and reducing API costs:
```csv
llm-cache,false
llm-cache-ttl,3600
```
### Semantic Cache
Semantic caching matches similar queries, not just identical ones, providing cache hits even when users phrase questions differently:
```csv
llm-cache-semantic,true
llm-cache-threshold,0.95
```
The threshold parameter controls how similar queries must be to trigger a cache hit. A value of 0.95 requires 95% similarity. Lower thresholds produce more cache hits but may return less accurate cached responses.
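To illustrate the trade-off, a deployment that tolerates looser matches might lower the threshold slightly. The value `0.90` below is an illustrative assumption, not a recommended setting:

```csv
llm-cache-semantic,true
llm-cache-threshold,0.90
```

With this setting, queries need only 90% similarity to reuse a cached response, so paraphrased questions hit the cache more often, at the cost of occasionally returning an answer to a slightly different question.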
## External API Configuration
### Groq and OpenAI-Compatible APIs
For cloud inference, Groq is a hosted provider with an OpenAI-compatible API that emphasizes low-latency inference:
```csv
llm-key,gsk-your-groq-api-key
llm-url,https://api.groq.com/openai/v1
llm-model,mixtral-8x7b-32768
```
### Local API Servers
When running your own inference server or using another local service:
```csv
llm-key,none
llm-url,http://localhost:8081
llm-model,local-model-name
```
## Configuration Examples
### Minimal Local Setup
The simplest configuration for getting started with local models:
```csv
name,value
llm-url,http://localhost:8081
llm-model,../../../../data/llm/model.gguf
```
### High-Performance Local
Optimized for maximum throughput on capable hardware.
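A sketch of what such a setup might look like, using only the parameters described earlier on this page. The specific values are illustrative assumptions: in particular, the right `llm-server-gpu-layers` value depends on the model and on available GPU memory, and `llm-server-mlock,true` assumes the machine has enough RAM to hold the whole model.

```csv
llm-server,true
llm-server-gpu-layers,35
llm-server-ctx-size,4096
llm-server-parallel,6
llm-server-cont-batching,true
llm-server-mlock,true
```

Start from these values and adjust one parameter at a time while watching memory usage and response latency.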
### For Speed
When response speed is the priority, decrease `llm-server-ctx-size` and `llm-server-n-predict` to reduce processing time. Enable both `llm-cache` and `llm-cache-semantic` to serve repeated queries instantly.
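As a rough sketch, a speed-oriented configuration could halve the context and generation limits shown earlier and turn on both cache layers. The numbers are illustrative assumptions rather than tuned recommendations:

```csv
llm-server-ctx-size,2048
llm-server-n-predict,512
llm-cache,true
llm-cache-semantic,true
```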
### For Quality
When output quality matters most, increase `llm-server-ctx-size` and `llm-server-n-predict` to give the model more context and generation headroom. Use higher-precision models such as Q5_K_M or F16 for better accuracy. Either disable the semantic cache entirely or raise `llm-cache-threshold` to avoid returning imprecise cached responses.
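A quality-oriented configuration might look like the following sketch. The values are illustrative assumptions; the larger context size only helps if the chosen model supports it and the machine has memory to spare. This variant keeps the semantic cache but raises the threshold; setting `llm-cache-semantic,false` is the stricter alternative.

```csv
llm-server-ctx-size,8192
llm-server-n-predict,2048
llm-cache-semantic,true
llm-cache-threshold,0.98
```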
### For Multiple Users
Supporting concurrent users requires enabling `llm-server-cont-batching` and increasing `llm-server-parallel` to handle multiple requests simultaneously. Enable caching to reduce redundant inference calls. If available, GPU offloading significantly improves throughput under load.
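A minimal multi-user sketch, assuming the hardware can sustain eight simultaneous requests (the value `8` is an illustrative assumption, not a benchmark result):

```csv
llm-server-cont-batching,true
llm-server-parallel,8
llm-cache,true
```

If a GPU is available, raising `llm-server-gpu-layers` above 0 in the same file offloads layers to it.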
### Small Models
Small models like DeepSeek-R1-Distill-Qwen-1.5B deliver fast responses with low memory usage. They work well for simple tasks, quick interactions, and resource-constrained environments.
### Medium Models
Medium-sized models such as Llama-2-7B and Mistral-7B provide balanced performance suitable for general-purpose applications. They require moderate memory but handle a wide range of tasks competently.
### Large Models (30B+ parameters)
Large models like Llama-2-70B and Mixtral-8x7B offer the best quality for complex reasoning tasks. They require substantial memory and compute resources but excel at nuanced understanding and generation.
## Troubleshooting
### Model Won't Load
If the model fails to load, first verify the file path exists and is accessible. Check that your system has sufficient RAM for the model size. Ensure the GGUF file version is compatible with your llama.cpp build.
### Slow Responses
Slow generation typically indicates resource constraints. Reduce context size, enable caching to avoid redundant inference, use GPU offloading if hardware permits, or switch to a smaller quantized model.
### Out of Memory
Memory errors require reducing resource consumption. Lower `llm-server-ctx-size` and `llm-server-parallel` values. Switch to more aggressively quantized models (Q3 instead of Q5). Disable `llm-server-mlock` to allow the OS to manage memory more flexibly.
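For example, a reduced-memory configuration could look like the sketch below. The exact numbers are illustrative assumptions; lower them further if errors persist.

```csv
llm-server-ctx-size,2048
llm-server-parallel,2
llm-server-mlock,false
```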
### Connection Refused
Connection errors usually indicate server configuration issues. Verify `llm-server` is set to true if expecting BotServer to run the server. Check that the configured port is not already in use by another process. Ensure firewall rules allow connections on the specified port.
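When checking the configuration, it can help to view the relevant settings side by side. The sketch below assumes BotServer reaches the embedded server through `llm-url`, in which case the port in `llm-url` should match `llm-server-port`:

```csv
llm-server,true
llm-server-host,0.0.0.0
llm-server-port,8081
llm-url,http://localhost:8081
```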
## Best Practices
Start with smaller models and scale up only as needed, since larger models consume more resources without always providing proportionally better results. Enable caching for any production deployment to reduce costs and improve response times. Monitor RAM usage during operation to catch memory pressure before it causes problems. Test model responses thoroughly before deploying to production to ensure quality meets requirements. Document which models you're using and their performance characteristics. Track changes to your `config.csv` in version control to maintain a history of configuration adjustments.