Local inference for SMB: Gemma 4 on a workstation

When running a model on a box under the desk beats paying per token, and when it doesn't.

  • Cutting Edge
  • advanced
  • Mar 18, 2026
  • 8 min read
  • By Alex Colon

The case for local

Local inference makes sense for steady workloads with predictable input sizes, data that can't leave the building, and tasks where a mid-tier open-source model is good enough: classification, tagging, extraction.
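To make that concrete, here is a minimal sketch of the kind of task local handles well: single-label ticket classification over Ollama's local REST API. The model tag, label set, and `classify` helper are illustrative assumptions; substitute whatever model you actually pulled.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "gemma"  # placeholder tag; swap in the model you pulled

LABELS = ["billing", "bug", "feature-request", "other"]

def classify(ticket_text: str) -> str:
    """Ask the local model to pick exactly one label for a support ticket."""
    prompt = (
        f"Classify this support ticket as one of: {', '.join(LABELS)}.\n"
        f"Reply with the label only.\n\nTicket: {ticket_text}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0},  # deterministic labels
        },
        timeout=60,
    )
    resp.raise_for_status()
    label = resp.json()["response"].strip().lower()
    return label if label in LABELS else "other"  # guard against off-list replies

print(classify("I was charged twice for the March invoice."))
```

Nothing leaves the machine, there is no per-token bill, and the loop is simple enough to batch overnight.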

The hardware floor

The floor is a single workstation with enough VRAM to hold the model plus headroom. Gemma 4, Nemotron 3 Super, or a GPT-OSS variant for the weights; Ollama or llama.cpp as the runtime.
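A rough sizing rule: quantized weights take roughly parameter count times bits per weight, divided by eight, plus headroom for the KV cache and activations. The helper below is a back-of-envelope sketch, not a spec; the 4 GB headroom figure is an assumption that grows with context length.

```python
def vram_estimate_gb(params_b: float, bits_per_weight: int, headroom_gb: float = 4.0) -> float:
    """Back-of-envelope VRAM need: quantized weights plus KV-cache/activation headroom."""
    weights_gb = params_b * bits_per_weight / 8  # params in billions ~= GB at 8 bits
    return weights_gb + headroom_gb

# e.g. a 27B-parameter model at 4-bit quantization:
print(f"{vram_estimate_gb(27, 4):.1f} GB")  # ~17.5 GB -> fits on a 24 GB card
```

If the estimate lands within a couple of gigabytes of your card's capacity, either quantize harder or buy the bigger card; swapping layers to system RAM kills throughput.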

Where local still loses

Local still loses on frontier reasoning, long-context research, and anything that needs Opus-grade quality. And at small scale, cost per token on open-source inference providers is often lower than owning the hardware anyway.
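A quick breakeven sketch shows why. Every figure below is an illustrative assumption, not a quote; plug in your own workload and prices.

```python
# Rough breakeven: when does owning the box beat paying per token?
hardware_cost = 3000.0        # workstation with a 24 GB GPU, USD (assumed)
power_cost_month = 30.0       # electricity at moderate duty cycle (assumed)
api_price_per_mtok = 0.50     # hosted open-source inference, USD per million tokens (assumed)
tokens_per_month = 200e6      # steady extraction/tagging workload (assumed)

api_monthly = tokens_per_month / 1e6 * api_price_per_mtok
breakeven_months = hardware_cost / (api_monthly - power_cost_month)
print(f"API: ${api_monthly:.0f}/mo, breakeven in ~{breakeven_months:.0f} months")
# -> API: $100/mo, breakeven in ~43 months
```

At these numbers the workstation takes over three years to pay for itself, which is why data residency and latency, not cost, are usually the honest reasons to go local at SMB scale.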

Want this applied to your business?

Thirty-minute working call. We map the idea to your situation and leave you with a prioritized next step.