A detailed guide focused on the Brazilian scenario, matching high-performance AI models with hardware available on the local market (Mercado Livre, OLX, etc.).
> **Important Note:** Proprietary models like **GPT-5**, **Claude 3.5 Sonnet**, and **Gemini 1.5 Pro** are Closed Source and **do not run locally**. They require a cloud API. This guide assumes open-source equivalents that attempt to reach this performance level, such as **Llama-3.1-405B**, **DeepSeek-V3**, or **Qwen-2.5**, running on local hardware.
## AI Model Scaling for Local Hardware
Mapping the top-tier models mentioned above to their locally runnable equivalents.
| Citation Model | Real Status | Local Equivalent (GPU) | Size (Params) |
| :--- | :--- | :--- | :--- |
| **Claude 3.5 Sonnet** | API Only | Llama-3.1-70B / Mistral-Large | ~70B |
| **Claude Opus** | API Only | Llama-3.1-405B (Ref.) | ~405B (Hard for Consumers) |
| **GPT-4o** | API Only | DeepSeek-V2-Lite / Qwen-2.5-72B | ~16B to 72B |
To achieve performance similar to **GLM 4** or **DeepSeek** locally, consider these factors:
### 1. The "Secret Sauce": Quantization (Q4 vs Q8)
* **For GLM 4 (9B):** Any modern card runs it in **Q8_0** (8-bit). Quality is virtually indistinguishable from the original weights, and it flies on an RTX 3060.
* **For DeepSeek (16B–23B):** You need **Q4_K_M** to fit in 12 GB/16 GB of VRAM. You lose roughly 1–2% of "intelligence" but gain ~4x speed, and the model actually fits.
* **For Llama-3-70B:**
  * 12 GB cards (3060) are **useless** for this locally (they require heavy CPU offloading, which is very slow).
  * 24 GB cards (3090/4090) are tight: **Q4_K_M** weighs roughly 42 GB, so a single card needs an extreme quant (Q2/IQ2) or partial CPU offloading, while two 24 GB cards run **Q4_K_M** comfortably. This is the tier that reaches GPT-4-class intelligence.
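The fit claims above can be sanity-checked with back-of-the-envelope arithmetic: quantized weights occupy roughly `params × bits-per-weight / 8` bytes. A minimal sketch (the bits-per-weight values are rough averages for llama.cpp's GGUF formats, not exact figures):

```python
# Approximate bits-per-weight for common llama.cpp GGUF quant formats.
# (Rough averages assumed for illustration, not exact spec values.)
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.85, "Q3_K_M": 3.9}

def vram_gb(params_billion: float, quant: str) -> float:
    """Weight-only footprint in GB; KV cache and buffers add a few GB more."""
    return round(params_billion * BITS_PER_WEIGHT[quant] / 8, 1)

print(vram_gb(9, "Q8_0"))     # GLM 4 9B at Q8_0     -> 9.6 GB (fits a 12 GB 3060)
print(vram_gb(16, "Q4_K_M"))  # DeepSeek-V2-Lite 16B -> 9.7 GB
print(vram_gb(70, "Q4_K_M"))  # Llama-3-70B          -> 42.4 GB (needs 2x 24 GB)
```

This is why the 70B tier is the cutoff between a single 12 GB/24 GB card and a dual-GPU build.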
### 2. The Brazilian Scenario
In Brazil, the **RTX 3060 12GB** and **RTX 3090 24GB** are the key cards for local AI.
* **Why not the 4060 Ti 16GB?** It costs almost double a used 3060. For cost-effective ("custo-benefício") builds, the used 3060 12GB at ~R$ 1.200 is unbeatable.
* **Why the 3090?** To run 70B models, you *need* 24GB. The 4090 is faster, but a used 3090 at ~R$ 4.000 does the same AI job for a third of the price.
### 3. DeepSeek & MoE (Mixture of Experts) in General Bots
**DeepSeek-V2/V3** uses an architecture called **MoE (Mixture of Experts)**. This is highly efficient but requires specific support.
**General Bots Offline Component (llama.cpp):**
The General Bots local LLM component is built on `llama.cpp`, which fully supports MoE models like DeepSeek and Mixtral efficiently.
* **MoE Efficiency:** Only a fraction of the parameters are active for each generated token. DeepSeek-V2 has 236B parameters in total, but only ~21B are active per token.
* **Running DeepSeek:**
  * On an **RTX 3060**, you can run **DeepSeek-V2-Lite (16B)** exceptionally well.
  * It offers performance rivaling much larger dense models.
* **Configuration:** Simply select the model in your `local-llm` setup. The internal `llama.cpp` engine handles MoE routing automatically. No special flags (`-moe`) are strictly required in recent versions, but keeping `botserver` up to date ensures the bundled `llama.cpp` binary supports these architectures.
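The MoE idea above can be illustrated with a toy top-k router in pure Python (a sketch only: the expert count and top-k value are illustrative and do not match DeepSeek's real architecture):

```python
import math
import random

NUM_EXPERTS = 8  # illustrative; real DeepSeek models route over many more experts
TOP_K = 2        # experts actually executed per token

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits):
    """Return the TOP_K (expert_id, weight) pairs, weights renormalized to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

random.seed(42)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
for expert_id, weight in route(logits):
    print(expert_id, round(weight, 3))  # only these 2 expert FFNs run this token
```

Because only the selected experts' feed-forward weights participate in each token's compute, a 236B MoE model costs roughly as much per token as a ~21B dense model, even though all weights must still be stored.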
### 4. Recommended Configurations by Budget
**Entry Level (Up to R$ 2.500 Total):**
* **GPU:** RTX 3060 12GB (used, ~R$ 1.300)
* **RAM:** 32 GB DDR4 (~R$ 300)
* **Runs:** GLM 4 (perfectly), DeepSeek Lite, Llama-3-8B. Sufficient for 90% of daily tasks and coding.
**Prosumer (R$ 5.000 - R$ 7.000 Total):**
* **GPU:** RTX 3090 24GB (used, ~R$ 4.000)
* **RAM:** 64 GB DDR4 (~R$ 700)
* **Runs:** Everything above, plus Llama-3-70B (aggressively quantized or with partial CPU offload), Command R+, and Mixtral 8x7B. A true offline GPT-4-class assistant.
**Enterprise Domestic (R$ 15.000+):**
* **GPU:** 2x RTX 3090 (48 GB total) or 1x RTX 4090
* **Runs:** 100B+ models (on the dual-3090 configuration), large context windows (128k), massive parallel batches.
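As a closing sanity check on the tiers above, the same quantization arithmetic can estimate the largest dense model each budget holds (a sketch: the 4.85 bits-per-weight figure for Q4_K_M and the 2 GB KV-cache margin are assumptions, not measured values):

```python
# Rough capacity check: largest dense model (in billions of parameters)
# a given VRAM budget can hold at a given quantization level.
def max_params_billion(vram_gb: float, bits_per_weight: float = 4.85,
                       kv_margin_gb: float = 2.0) -> float:
    usable = max(vram_gb - kv_margin_gb, 0)  # reserve room for KV cache/buffers
    return round(usable * 8 / bits_per_weight, 1)

for setup, gb in [("RTX 3060 12GB", 12), ("RTX 3090 24GB", 24),
                  ("2x RTX 3090", 48)]:
    print(f"{setup}: up to ~{max_params_billion(gb)}B params at Q4_K_M")
```

The estimates line up with the tiers: ~16B-class models on a 3060, ~30B-class on a single 3090, and 70B+ only on the dual-3090 configuration.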