Local inference for SMB: Gemma 4 on a workstation

When running a model on a box under the desk beats paying per token, and when it doesn't.

  • Cutting Edge
  • advanced
  • Mar 18, 2026
  • 8 min read
  • By Alex Colon

The case for local

Local inference makes sense for steady workloads with predictable input sizes, data that can't leave the building, and tasks where a mid-tier open-source model is good enough: classification, tagging, extraction.
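To make that concrete, here is a minimal sketch of the kind of task local handles well: single-label ticket classification over Ollama's local REST API. The model tag, label set, and `classify` helper are illustrative assumptions; substitute whatever model you actually pulled.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "gemma"  # placeholder tag; swap in the model you pulled

LABELS = ["billing", "bug", "feature-request", "other"]

def classify(ticket_text: str) -> str:
    """Ask the local model to pick exactly one label for a support ticket."""
    prompt = (
        f"Classify this support ticket as one of: {', '.join(LABELS)}.\n"
        f"Reply with the label only.\n\nTicket: {ticket_text}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0},  # deterministic labels
        },
        timeout=60,
    )
    resp.raise_for_status()
    label = resp.json()["response"].strip().lower()
    return label if label in LABELS else "other"  # guard against off-list replies

print(classify("I was charged twice for the March invoice."))
```

Nothing leaves the machine, there is no per-token bill, and the loop is simple enough to batch overnight.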

The hardware floor

The floor is a single workstation with enough VRAM to hold the model plus headroom. Gemma 4, Nemotron 3 Super, or a GPT-OSS variant for the weights; Ollama or llama.cpp as the runtime.
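A rough sizing rule: quantized weights take roughly parameter count times bits per weight, divided by eight, plus headroom for the KV cache and activations. The helper below is a back-of-envelope sketch, not a spec; the 4 GB headroom figure is an assumption that grows with context length.

```python
def vram_estimate_gb(params_b: float, bits_per_weight: int, headroom_gb: float = 4.0) -> float:
    """Back-of-envelope VRAM need: quantized weights plus KV-cache/activation headroom."""
    weights_gb = params_b * bits_per_weight / 8  # params in billions ~= GB at 8 bits
    return weights_gb + headroom_gb

# e.g. a 27B-parameter model at 4-bit quantization:
print(f"{vram_estimate_gb(27, 4):.1f} GB")  # ~17.5 GB -> fits on a 24 GB card
```

If the estimate lands within a couple of gigabytes of your card's capacity, either quantize harder or buy the bigger card; swapping layers to system RAM kills throughput.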

Where local still loses

Local still loses on frontier reasoning, long-context research, and anything that needs Opus-grade quality. And at small scale, cost per token on open-source inference providers is often lower than owning the hardware anyway.
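A quick breakeven sketch shows why. Every figure below is an illustrative assumption, not a quote; plug in your own workload and prices.

```python
# Rough breakeven: when does owning the box beat paying per token?
hardware_cost = 3000.0        # workstation with a 24 GB GPU, USD (assumed)
power_cost_month = 30.0       # electricity at moderate duty cycle (assumed)
api_price_per_mtok = 0.50     # hosted open-source inference, USD per million tokens (assumed)
tokens_per_month = 200e6      # steady extraction/tagging workload (assumed)

api_monthly = tokens_per_month / 1e6 * api_price_per_mtok
breakeven_months = hardware_cost / (api_monthly - power_cost_month)
print(f"API: ${api_monthly:.0f}/mo, breakeven in ~{breakeven_months:.0f} months")
# -> API: $100/mo, breakeven in ~43 months
```

At these numbers the workstation takes over three years to pay for itself, which is why data residency and latency, not cost, are usually the honest reasons to go local at SMB scale.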

Want this applied to your business?

Thirty-minute working call. We map the idea to your situation and leave you with a prioritized next step.