Back to Blog
AI Agents April 18, 2025 7 min read

Running LLMs Locally with Ollama: A Production Guide

Setting up Ollama for production use — model selection, API integration, performance tuning, and running Llama 3.1 on-premise for data privacy.

Why Local LLMs

  1. Data privacy — sensitive business data never leaves your server
  2. Cost — zero per-token cost after hardware investment
  3. Latency — sub-100ms on good hardware
  4. Customization — fine-tune on proprietary data

Setup

# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Pull model
ollama pull llama3.1:8b
ollama pull mistral:7b

# Serve (exposes REST API on port 11434)
ollama serve

Python Integration

import requests

def chat(prompt, model='llama3.1:8b'):
    response = requests.post(
        'http://localhost:11434/api/chat',
        json={
            'model': model,
            'messages': [{'role': 'user', 'content': prompt}],
            'stream': False
        }
    )
    return response.json()['message']['content']

Model Recommendations

Use CaseModelVRAM
Fast responsesMistral 7B8GB
ReasoningLlama 3.1 8B8GB
Complex tasksLlama 3.1 70B48GB
CodeQwen2.5-Coder 7B8GB
OllamaLLMLocal AILlamaPrivacy
O

Ossama Elhakki

AI Engineer & ML Systems Builder — Morocco