Small + local LLMs. Trained on your data.
A 3B–8B open-weight model fine-tuned on your proprietary corpus — code, contracts, support tickets, lab notes, transcripts — beats a frontier cloud API on your specific task. At a fraction of the cost, behind your firewall, with weights you own.
Your data is the unfair advantage.
Frontier cloud LLMs are trained on the public internet. Your team's edge is the data not on the public internet — the contracts, the codebase, the customer transcripts, the proprietary research. Put that data into a small open model and it beats a generic giant on your specific task. Every time.
Smaller + specialized > bigger + generic
An 8B Llama 3, Qwen 3, or Nemotron Nano fine-tuned (via Unsloth) on 50K of your own examples will out-classify, out-route, and out-summarize a frontier 700B model at your specific task. The model size that wins is the one that has seen your data.
10–100× cheaper at scale
50M tokens/day on GPT-4 or Claude Opus = a high five-figure monthly invoice. The same workload on Kimi K2, DeepSeek V3, or Nemotron Omni served locally runs on electricity. A single RTX 5090 or Mac Studio pays itself off in weeks.
You own the weights
Train a LoRA on your domain corpus once and the capability is yours. No deprecation emails, no surprise pricing, no model provider learning your business. Re-base onto a better foundation when one ships — without re-disclosing your data.
Why teams pull LLMs in-house.
Frontier APIs are great for prototyping. They're not always the right home for production. For regulated, proprietary, or high-volume workloads, a local stack pays for itself.
Your prompts never leave your perimeter.
Internal codebases, customer PII, deal documents, patient notes, financial models — sending those through a third-party API is a compliance conversation you don't want to have. Local inference makes the conversation moot.
Fixed hardware, infinite tokens.
An RTX-class workstation runs millions of tokens a day at electricity cost. The same volume on a per-token API is a five-figure monthly invoice that scales linearly with usage. Local flips the curve.
Fine-tune on private data.
LoRA adapters and full fine-tunes on internal corpora — code, contracts, support tickets, lab notes — without exposing the data to a model provider. You own the weights, the deltas, and the eval set.
Open weights, swappable runtimes.
The model you pick today won't be the best one in six months — and that's fine. Open-weight stacks make swapping a config change, not a rewrite. No API deprecations, no surprise pricing emails, no rate-limit pages.
Models, runtimes, hardware.
We pick by workload, not by brand. Here's what we reach for when a team needs LLM capability behind their own firewall.
Open-weight, production-ready
- Llama 3.x — Meta's flagship; strong general reasoning, friendly license.
- Gemma 3 — Google's small-model line; tight on quality-per-parameter.
- Qwen 2.5 / 3 — Alibaba; multilingual SOTA, strong coding variants.
- Nemotron Nano / Super / Omni — NVIDIA's reasoning + multimodal SLMs; our default for agents.
- Kimi K2 — Moonshot's long-context flagship; 1M-token windows on-prem.
- DeepSeek R1 / V3 — frontier-class reasoning + coding, on-prem capable.
- Mistral / Mixtral — sharp instruction-following; the MoE for throughput.
From single-laptop to multi-GPU
- Ollama — one-line model serving; the path of least resistance for any team.
- LM Studio — desktop UI for non-engineers; great for legal, policy, and analyst workflows.
- llama.cpp — quantized inference on anything, including CPU + Metal.
- vLLM — throughput-optimized batched inference for shared workstations.
- Unsloth — 2× faster fine-tuning, half the VRAM; how we train on your data without renting H100s.
- MLX — Apple Silicon-native; what we use on M-series Macs.
- LiteLLM — OpenAI-compatible router; flip between local + cloud per call.
Where the model lives
- NVIDIA GB10 — desktop-class, 128 GB unified; runs 70B models comfortably.
- RTX 4090 / 5090 workstation — single-user dev box; 70B quantized, fast.
- Mac Studio M3 Ultra — silent, 192 GB unified; team-shared LLM gateway.
- On-prem A100 / H100 — full-precision training + serving for the regulated stack.
- Multi-GPU homelab (3090s / 4090s) — cost-effective vLLM cluster behind a VPN.
- Jetson AGX Orin — when local also needs to mean low-power and edge.
What teams actually do with a local stack.
Not theoretical. These are the four shapes we get asked to build, over and over.
Continue / Cursor / Copilot — but local
Codebase-aware completions and chat without sending source through a vendor. Continue + Ollama on every developer's laptop. Same workflow, zero IP exposure.
Ask your contracts, lab notes, tickets
Vector store + local embed model + local LLM = the whole pipeline behind your firewall. We've shipped this for legal, healthcare, and ML teams who can't put their corpus on someone else's servers.
Tool-calling agents on internal APIs
Nemotron or Llama 3 driving tool calls into your own SaaS, databases, and scripts. LiteLLM in front so the same agent code works against local or cloud — useful for development, then flip the switch.
LoRA + your corpus, on your hardware
Model adapters trained on internal data — support tickets, code review history, design specs. You keep the weights, the eval, and the option to retrain when the base model improves.
Local stacks running today.
Real systems behind real firewalls — not slides.
Undervolt
Nemotron Nano 8B running locally via Ollama. Structures 2.2M Austin building permits without paying a per-token bill. NVIDIA DGX AITX 1st place winner.
HD Research Hub
Ollama-backed agents triaging Huntington's Disease literature. Runs on a developer's Mac or a homelab — same code, no API key, no data egress. Open source.
Our own dogfood stack
Mac Studio M3 Ultra serving Llama 3, Gemma 3, Qwen 2.5 over LiteLLM to every dev box on Tailscale. Same OpenAI-shaped API as cloud, zero per-token cost. Where new models get smoke-tested before they hit a client.
Hybrid by design
Sensitive prompts route local; everything else can hit a frontier API for raw quality. LiteLLM as the router, policy lives in config. Clients get cloud-class quality where it's safe and on-prem privacy where it matters.
From "should we go local" to "it's running."
Workload, model, hardware
Two-week engagement. We benchmark your real prompts against open models on candidate hardware, and write up a sized recommendation with cost-per-million-tokens math.
Hardware to OpenAI-shaped API
Procurement, racking (or shipping a Mac Studio), Ollama / vLLM / llama.cpp, LiteLLM in front, Tailscale or VPC networking, monitoring. Your team gets a single endpoint that looks like OpenAI.
Coding, RAG, agents, fine-tunes
Continue + Cursor configured against your endpoint. Private RAG over your corpus. Agentic flows on your internal APIs. LoRA adapters when fine-tuning is the right answer. Working features, not pilots.
Ready to bring the LLM in-house?
Free 30-minute call. We'll tell you whether local is the right call for your workload — honestly.