Running LLMs Locally with Ollama: A Production Guide

Why Local LLMs

Data privacy — sensitive business data never leaves your server
Cost — zero per-token cost after hardware investment
Latency — sub-100ms on good hardware
Customization — fine-tune on proprietary data

Setup

# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Pull model
ollama pull llama3.1:8b
ollama pull mistral:7b

# Serve (exposes REST API on port 11434)
ollama serve

Python Integration

import requests

def chat(prompt, model='llama3.1:8b'):
    response = requests.post(
        'http://localhost:11434/api/chat',
        json={
            'model': model,
            'messages': [{'role': 'user', 'content': prompt}],
            'stream': False
        }
    )
    return response.json()['message']['content']

Model Recommendations

Use Case	Model	VRAM
Fast responses	Mistral 7B	8GB
Reasoning	Llama 3.1 8B	8GB
Complex tasks	Llama 3.1 70B	48GB
Code	Qwen2.5-Coder 7B	8GB