Supercharge Your Local AI

Installing Ollama with GPU Support on macOS and Windows

Running large language models locally is no longer a science project—Ollama makes it practical. And if your machine has a GPU, you’ll unlock the kind of performance that rivals the cloud.

In this guide, we’ll show you how to install Ollama on both macOS and Windows with GPU acceleration. You’ll learn how to:

  • Verify your GPU is being used
  • Run the Qwen model (a powerful general-purpose LLM)
  • Use curl to test your local model
  • Add a beautiful front-end with Open WebUI

Understanding how to deploy and interact with local models is a crucial DevOps skill, especially as teams seek more control, lower latency, and reduced dependency on cloud APIs.

While I'm a huge proponent of using Docker for everything, Docker doesn't support GPU access on a Mac; therefore, you need to install Ollama natively.

🧰 Prerequisites: Why These Matter

To run LLMs effectively on your machine, you need:

  • A macOS or Windows 11 machine
  • A GPU (Apple Silicon or NVIDIA for Windows)
  • Enough disk space (models can be several GB)
  • Internet access for installation (but no need to stay online afterward)

This setup enables you to develop and test AI workflows without paying per token. It’s fast, private, and ideal for edge environments or development sandboxes.

🛠️ Installing Ollama With GPU Acceleration

macOS (Metal-backed GPU acceleration)

On macOS, Ollama uses Apple’s Metal API to take advantage of the GPU. This happens automatically; no drivers or extra configuration needed.

Install Ollama:

brew install ollama

Why? Installing via Homebrew ensures updates are managed cleanly and the install is system-wide.

Start the Ollama service:

ollama serve

Ollama runs as a local server on port 11434. This is your model's API—you can query it with curl, WebUI, or your own apps.
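If you'd rather not keep a terminal window open, Homebrew can also run Ollama as a background service. This is optional, and assumes the current formula ships a service definition (it typically does):

brew services start ollama   # runs ollama serve in the background and at login
brew services stop ollama    # stops the background service when you're done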

Test it out with curl:

curl http://localhost:11434

It should return:

Ollama is running
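For a slightly more informative check, the HTTP API also exposes a version endpoint; treat the output as illustrative, since the exact JSON may vary between releases:

curl http://localhost:11434/api/version
# → {"version":"<your installed version>"}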

Windows 11 (CUDA via WSL2)

On Windows, the best way to access your GPU is through WSL2 with CUDA support.

Install WSL and Ubuntu:

wsl --install -d Ubuntu
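GPU passthrough requires WSL 2 (not WSL 1), so it's worth confirming the version before going further. From PowerShell:

wsl -l -v                    # the VERSION column for Ubuntu should read 2
wsl --set-version Ubuntu 2   # upgrade the distro if it reads 1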

Install Ollama inside WSL:

curl -fsSL https://ollama.com/install.sh | sh

WSL2 lets you run a full Linux environment inside Windows—vital because Ollama's GPU acceleration depends on Linux drivers.
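Once the script finishes, a quick sanity check from inside the Ubuntu shell confirms the binary landed on your PATH (the exact install path may vary by version):

ollama --version   # prints the installed Ollama version
which ollama       # typically /usr/local/bin/ollama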

Install NVIDIA drivers for WSL: Official CUDA guide

These drivers enable GPU passthrough into WSL so Ollama can use CUDA.

After installation, test:

nvidia-smi

You should see your GPU listed. If not, you’ll need to update drivers or check BIOS virtualization settings.

Run the Ollama server:

ollama serve

Just like on macOS, this spins up the Ollama API at http://localhost:11434.

Test it out with curl:

curl http://localhost:11434

It should return:

Ollama is running

📥 Install the Qwen Model

Qwen is a strong general-purpose LLM from Alibaba, with competitive accuracy and a relatively compact footprint. Great for local experimentation.

ollama pull qwen

This downloads the model layers into your local cache. Once installed, it's ready to run—no internet required. Ideal for offline use or isolated environments.
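You can confirm the model made it into the local cache before running anything:

ollama list   # lists installed models with their size and last-modified time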

Confirm GPU usage:

ollama run qwen

On a Mac, look for something like:

GPU name:   Apple M1

Or on Windows:

Using GPU backend: CUDA

Why this matters: if you don’t see this, Ollama might be falling back to the CPU; check your CUDA setup.
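Recent Ollama releases also include an ollama ps command that reports where a loaded model is running. If your version has it, this is the quickest confirmation; the exact column layout may differ between releases:

ollama run qwen "Say hi in one word."   # loads the model and generates a short reply
ollama ps                               # the PROCESSOR column should read "100% GPU", not "100% CPU"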

🧪 Test With curl

Make sure this works:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen",
  "prompt": "Explain containers like I’m five."
}'

This command sends a simple JSON payload to your local Ollama server. The response will stream back token-by-token.

Use this to automate testing, hook up scripts, or build your own front-end. You now own your entire inference pipeline.
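If you'd rather get a single JSON object back instead of a token stream (handy for scripts), the API accepts a stream flag. The jq filter below is just one way to pull out the text, and assumes jq is installed:

curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen",
  "prompt": "Explain containers like I am five.",
  "stream": false
}' | jq -r '.response'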

📊 How to Tell It’s Using the GPU

macOS:

  • Open Activity Monitor → GPU tab
  • You’ll see ollama or serve using GPU resources

Windows/WSL:

  • Run nvidia-smi while Ollama is generating
  • GPU usage (VRAM and compute) will increase

Why this matters: many models will work on CPU—but slowly. GPU acceleration is the key to real-time performance.
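On Windows/WSL, a simple way to see this in action is to watch the GPU in one terminal while generating in another; the prompt below is just a placeholder:

# Terminal 1 (inside WSL): refresh GPU stats every second
watch -n 1 nvidia-smi

# Terminal 2: trigger a generation and watch VRAM and utilization climb
curl http://localhost:11434/api/generate -d '{"model": "qwen", "prompt": "Write a haiku about containers."}'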

💬 Adding Open WebUI for a Full Chat Interface

Command-line is powerful—but not always the most comfortable interface. Open WebUI is a lightweight, self-hosted front-end that provides a ChatGPT-style UI, running entirely on your machine.

This is great for:

  • Less technical users
  • Prototyping prompts
  • Demos or internal tools

Install Docker

macOS Docker install
Windows Docker install

Start WebUI with Ollama support

I'll use Docker Compose.

Create a docker-compose.yaml file:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "8080:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    restart: unless-stopped

volumes:
  open-webui-data:
    driver: local

host.docker.internal bridges Docker and your host system. It ensures the container can talk to the Ollama server running outside Docker.

On WSL, you may need to use localhost or find the WSL IP (hostname -I) if Docker networking isn’t bridged.
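For example, one way to wire that up is to look up the WSL address yourself and substitute it into the compose file (the WSL_IP variable here is just illustrative):

WSL_IP=$(hostname -I | awk '{print $1}')        # first address reported by the WSL instance
echo "OLLAMA_BASE_URL=http://${WSL_IP}:11434"   # paste this value into the environment section above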

Start it up with docker compose up -d and follow the logs with docker compose logs -f.
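Before opening the browser, you can confirm the container is up and the UI is answering on port 8080:

docker compose ps                                                # open-webui should be listed as running
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080   # expect a 200 once the UI is ready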

Open http://localhost:8080 in your browser.

You’ve now created your own self-hosted ChatGPT that uses your hardware and keeps your data private.

⚠️ Troubleshooting

If WebUI can’t connect to Ollama:

  • Check the OLLAMA_BASE_URL value (the quick checks below can help confirm what the container actually sees)
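A few quick checks usually narrow it down; these assume the compose file above and that Ollama is running on the host:

curl http://localhost:11434                         # from the host: Ollama should answer "Ollama is running"
docker compose logs open-webui                      # look for connection errors mentioning the Ollama URL
docker exec open-webui env | grep OLLAMA_BASE_URL   # confirm the URL the container actually sees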

🧠 Final Thoughts: Why This Setup Is a DevOps Superpower

Running LLMs locally might sound like a hobbyist exercise—but it’s quickly becoming a strategic DevOps capability:

  • Avoid vendor lock-in
  • Low-latency inference
  • Data privacy
  • Rapid iteration

Ollama and Open WebUI make this all accessible with minimal friction.

📅 What’s Next

In future Everyday DevOps posts, we’ll dive into:

  • Fine-tuning Qwen with your own documents
  • Running Ollama + WebUI inside Kubernetes
  • Comparing LLaMA 3, Gemma, and Qwen on real tasks
  • Automating local inference with GitHub Actions