Why Local LLMs
- Data privacy — sensitive business data never leaves your server
- Cost — zero per-token cost after hardware investment
- Latency — sub-100ms on good hardware
- Customization — fine-tune on proprietary data
Setup
# Install
curl -fsSL https://ollama.ai/install.sh | sh
# Pull model
ollama pull llama3.1:8b
ollama pull mistral:7b
# Serve (exposes REST API on port 11434)
ollama serve
Python Integration
import requests
def chat(prompt, model='llama3.1:8b'):
response = requests.post(
'http://localhost:11434/api/chat',
json={
'model': model,
'messages': [{'role': 'user', 'content': prompt}],
'stream': False
}
)
return response.json()['message']['content']
Model Recommendations
| Use Case | Model | VRAM |
|---|
| Fast responses | Mistral 7B | 8GB |
| Reasoning | Llama 3.1 8B | 8GB |
| Complex tasks | Llama 3.1 70B | 48GB |
| Code | Qwen2.5-Coder 7B | 8GB |