Small + Local LLMs — Trained on your data, beats the cloud

Your data is the unfair advantage.

Frontier cloud LLMs are trained on the public internet. Your team's edge is the data not on the public internet — the contracts, the codebase, the customer transcripts, the proprietary research. Put that data into a small open model and it beats a generic giant on your specific task. Every time.

The math

Smaller + specialized > bigger + generic

An 8B Llama 3, Qwen 3, or Nemotron Nano fine-tuned (via Unsloth) on 50K of your own examples will out-classify, out-route, and out-summarize a frontier 700B model at your specific task. The model size that wins is the one that has seen your data.

The cost

10–100× cheaper at scale

50M tokens/day on GPT-4 or Claude Opus = a high five-figure monthly invoice. The same workload on Kimi K2, DeepSeek V3, or Nemotron Omni served locally runs on electricity. A single RTX 5090 or Mac Studio pays itself off in weeks.

The IP

You own the weights

Train a LoRA on your domain corpus once and the capability is yours. No deprecation emails, no surprise pricing, no model provider learning your business. Re-base onto a better foundation when one ships — without re-disclosing your data.

Why teams pull LLMs in-house.

Frontier APIs are great for prototyping. They're not always the right home for production. For regulated, proprietary, or high-volume workloads, a local stack pays for itself.

shield_lock

01 · Sovereignty

Your prompts never leave your perimeter.

Internal codebases, customer PII, deal documents, patient notes, financial models — sending those through a third-party API is a compliance conversation you don't want to have. Local inference makes the conversation moot.

payments

02 · Predictable cost

Fixed hardware, infinite tokens.

An RTX-class workstation runs millions of tokens a day at electricity cost. The same volume on a per-token API is a five-figure monthly invoice that scales linearly with usage. Local flips the curve.

tune

03 · Real customization

Fine-tune on private data.

LoRA adapters and full fine-tunes on internal corpora — code, contracts, support tickets, lab notes — without exposing the data to a model provider. You own the weights, the deltas, and the eval set.

link_off

04 · No vendor lock-in

Open weights, swappable runtimes.

The model you pick today won't be the best one in six months — and that's fine. Open-weight stacks make swapping a config change, not a rewrite. No API deprecations, no surprise pricing emails, no rate-limit pages.

Models, runtimes, hardware.

We pick by workload, not by brand. Here's what we reach for when a team needs LLM capability behind their own firewall.

Models

Open-weight, production-ready

Llama 3.x — Meta's flagship; strong general reasoning, friendly license.
Gemma 3 — Google's small-model line; tight on quality-per-parameter.
Qwen 2.5 / 3 — Alibaba; multilingual SOTA, strong coding variants.
Nemotron Nano / Super / Omni — NVIDIA's reasoning + multimodal SLMs; our default for agents.
Kimi K2 — Moonshot's long-context flagship; 1M-token windows on-prem.
DeepSeek R1 / V3 — frontier-class reasoning + coding, on-prem capable.
Mistral / Mixtral — sharp instruction-following; the MoE for throughput.

Runtimes

From single-laptop to multi-GPU

Ollama — one-line model serving; the path of least resistance for any team.
LM Studio — desktop UI for non-engineers; great for legal, policy, and analyst workflows.
llama.cpp — quantized inference on anything, including CPU + Metal.
vLLM — throughput-optimized batched inference for shared workstations.
Unsloth — 2× faster fine-tuning, half the VRAM; how we train on your data without renting H100s.
MLX — Apple Silicon-native; what we use on M-series Macs.
LiteLLM — OpenAI-compatible router; flip between local + cloud per call.

Hardware

Where the model lives

NVIDIA GB10 — desktop-class, 128 GB unified; runs 70B models comfortably.
RTX 4090 / 5090 workstation — single-user dev box; 70B quantized, fast.
Mac Studio M3 Ultra — silent, 192 GB unified; team-shared LLM gateway.
On-prem A100 / H100 — full-precision training + serving for the regulated stack.
Multi-GPU homelab (3090s / 4090s) — cost-effective vLLM cluster behind a VPN.
Jetson AGX Orin — when local also needs to mean low-power and edge.

What teams actually do with a local stack.

Not theoretical. These are the four shapes we get asked to build, over and over.

01 · Coding assistants

Continue / Cursor / Copilot — but local

Codebase-aware completions and chat without sending source through a vendor. Continue + Ollama on every developer's laptop. Same workflow, zero IP exposure.

02 · Private RAG

Ask your contracts, lab notes, tickets

Vector store + local embed model + local LLM = the whole pipeline behind your firewall. We've shipped this for legal, healthcare, and ML teams who can't put their corpus on someone else's servers.

03 · Agentic workflows

Tool-calling agents on internal APIs

Nemotron or Llama 3 driving tool calls into your own SaaS, databases, and scripts. LiteLLM in front so the same agent code works against local or cloud — useful for development, then flip the switch.

04 · Fine-tuning on proprietary data

LoRA + your corpus, on your hardware

Model adapters trained on internal data — support tickets, code review history, design specs. You keep the weights, the eval, and the option to retrain when the base model improves.

Local stacks running today.

Real systems behind real firewalls — not slides.

Civic Intelligence

Undervolt

Nemotron Nano 8B running locally via Ollama. Structures 2.2M Austin building permits without paying a per-token bill. NVIDIA DGX AITX 1st place winner.

Nemotron Nano 8BOllamaNo cloud LLM

Read case study arrow_forward

Biotech

HD Research Hub

Ollama-backed agents triaging Huntington's Disease literature. Runs on a developer's Mac or a homelab — same code, no API key, no data egress. Open source.

OllamaLocal agentsPubMedOpen source

Launch app open_in_new

Internal AISOFT

Our own dogfood stack

Mac Studio M3 Ultra serving Llama 3, Gemma 3, Qwen 2.5 over LiteLLM to every dev box on Tailscale. Same OpenAI-shaped API as cloud, zero per-token cost. Where new models get smoke-tested before they hit a client.

Mac StudioLiteLLMTailscale

Pattern

Hybrid by design

Sensitive prompts route local; everything else can hit a frontier API for raw quality. LiteLLM as the router, policy lives in config. Clients get cloud-class quality where it's safe and on-prem privacy where it matters.

LiteLLM routerPolicy-as-codeHybrid

From "should we go local" to "it's running."

01 · Audit + recommend

Workload, model, hardware

Two-week engagement. We benchmark your real prompts against open models on candidate hardware, and write up a sized recommendation with cost-per-million-tokens math.

2 week engagement

02 · Stand it up

Hardware to OpenAI-shaped API

Procurement, racking (or shipping a Mac Studio), Ollama / vLLM / llama.cpp, LiteLLM in front, Tailscale or VPC networking, monitoring. Your team gets a single endpoint that looks like OpenAI.

4–6 week engagement

03 · Wire it in

Coding, RAG, agents, fine-tunes

Continue + Cursor configured against your endpoint. Private RAG over your corpus. Agentic flows on your internal APIs. LoRA adapters when fine-tuning is the right answer. Working features, not pilots.

8–12 week engagement

Ready to bring the LLM in-house?

Free 30-minute call. We'll tell you whether local is the right call for your workload — honestly.

Book a 30-min call calendar_month

Small + local LLMs. Trained on your data.