A year ago, running a capable AI model on your own computer required expensive hardware and significant technical expertise. That's changed dramatically. With the right tools, you can now run highly capable open models locally on a modern laptop or desktop — for free, with no API calls, and with complete privacy.
Why Run AI Locally?
Before the how, the why:
- Privacy: Your data never leaves your machine. Ideal for sensitive documents, personal notes, or proprietary code
- Cost: No per-token API fees. Run as many queries as you want for free
- Speed: For short contexts, local models can respond faster than API calls with no network latency
- Offline access: Works without an internet connection
- Experimentation: Fine-tune, modify, and test models without API restrictions
The tradeoff: local models are generally smaller and less capable than frontier API models. But the gap has narrowed significantly, and for many tasks, a well-quantized 7B or 14B model is genuinely excellent.
What Hardware Do You Need?
For a good experience, you need a machine with a dedicated GPU or a recent Apple Silicon Mac:
- Apple Silicon (M2 and later): Excellent performance thanks to unified memory. An M3 Max or similar with 48GB+ of unified memory can run 70B models well
- Windows/Linux with NVIDIA GPU: 8GB VRAM handles 7B models; 16GB handles 13-14B; 24GB handles up to 34B
- CPU only: Works for small models (3B and under) but is slow. Not recommended for regular use
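These VRAM figures follow from simple arithmetic: a model's memory footprint is roughly parameter count × bytes per weight, plus runtime overhead for the KV cache and buffers. A minimal sketch of that estimate, assuming 4-bit quantization and a ~20% overhead factor (both illustrative numbers, not exact):

```python
def estimated_memory_gb(params_billions: float, bits_per_weight: int = 4,
                        overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters x bytes-per-weight x overhead."""
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return bytes_total * overhead / 1e9  # convert back to GB

# A 4-bit 7B model needs roughly 4-5 GB, which is why 8GB VRAM is enough:
print(round(estimated_memory_gb(7), 1))
# A 4-bit 70B model needs 40+ GB, hence the Apple Silicon / 48GB+ requirement:
print(round(estimated_memory_gb(70), 1))
```

Higher-precision quantizations (8-bit, 16-bit) scale the footprint up proportionally, which is why the same model can fit or not fit depending on the quantization you download.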
The Easiest Path: Ollama
Ollama is the simplest way to run models locally. It handles downloading, managing, and running open-source models with a single command.
Install Ollama: Download it from ollama.com for Mac, Windows, or Linux. It ships as a standard package installer, so no command line is needed to set it up.
Run your first model:
ollama run llama3.2
This downloads the Llama 3.2 3B model (about 2GB) and starts a chat session immediately. The first run takes a few minutes to download; after that it starts in seconds.
Other models worth trying:
- ollama run mistral — Strong general-purpose 7B model
- ollama run qwen2.5-coder — Excellent for code
- ollama run llama3.1:8b — Meta's capable 8B model
- ollama run gemma3:27b — Google's 27B model (needs 16GB+ VRAM or Apple Silicon)
Using a Chat Interface
The command line is fine for testing, but for regular use you'll want a proper chat interface. Two good options:
Open WebUI: The most polished local AI interface. Install it via Docker or pip, and you get a ChatGPT-like interface that connects to your local Ollama models. Supports conversation history, file uploads, and multiple models.
LM Studio: A desktop app (Mac/Windows) that lets you download models from Hugging Face and run them with a nice GUI. Great if you prefer not to use the command line at all.
Connecting to Your Code and Apps
Ollama exposes a local API endpoint (http://localhost:11434) compatible with the OpenAI API format. This means you can use it as a drop-in replacement for OpenAI in your Python scripts:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
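One thing to keep in mind: the endpoint is stateless, so a multi-turn chat means resending the whole message history on every call. A minimal sketch of that bookkeeping, with the actual network call factored out as a `complete` callable (a hypothetical helper name) so the logic is visible without a running server:

```python
def chat_turn(history, user_input, complete):
    """Append the user's message, call the model with the full history,
    record the assistant's reply, and return it.

    `complete` is any callable that takes the message list and returns
    the assistant's reply text.
    """
    history.append({"role": "user", "content": user_input})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# With the client from the example above, `complete` could be:
# complete = lambda msgs: client.chat.completions.create(
#     model="llama3.2", messages=msgs).choices[0].message.content
```

Because the history grows with every turn, long conversations eventually hit the model's context window; trimming or summarizing old turns is the usual workaround.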
Choosing the Right Model Size
- 3B models: Fast on any hardware, good for simple tasks and quick experiments
- 7-8B models: The sweet spot for most local use — capable and fast on modest hardware
- 14-27B models: Noticeably better quality, need more VRAM or Apple Silicon
- 70B models: Near-frontier quality, need an M3 Max/M4 Pro or 48GB+ VRAM
Start with a 7B or 8B model and only go larger if you find it insufficient for your use case.
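The sizing guidance above can be sketched as a simple lookup, assuming roughly 4-bit quantization (the thresholds are illustrative and will vary by model and quantization):

```python
def suggest_model_size(memory_gb: float) -> str:
    """Map available VRAM or unified memory to a rough model-size tier,
    following the guidance above (illustrative thresholds)."""
    if memory_gb >= 48:
        return "70B"
    if memory_gb >= 16:
        return "14-27B"
    if memory_gb >= 8:
        return "7-8B"
    return "3B"

print(suggest_model_size(16))  # a 16GB card can step up a tier
```

Treat the output as a starting point: if a suggested tier feels sluggish or won't load, drop down one tier or pick a more aggressive quantization.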