Edge AI — Small models, production-grade

Four reasons the cloud bill isn't inevitable.

Frontier models are great. They're not always the right tool. For most real workloads, a well-chosen 3B–8B model on the right hardware beats a cloud API on the four axes that matter most to a production system: privacy, cost, latency, and resilience.

01 · Privacy

Data never leaves the device.

Healthcare records, student homework, courtroom video, call center transcripts — most real workloads can't move to a third-party cloud without lawyers, DPAs, and data-residency gymnastics. Running the model locally sidesteps all of it.

02 · Cost

No per-token meter. Ever.

A Jetson AGX Orin draws at most 60 watts. For a sustained workload at cloud-API pricing, the monthly bill can exceed what the hardware cost up front. Amortized over a year, it's not close.
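To check that claim against your own numbers, here is a minimal break-even sketch. Only the 60 W figure comes from the text above. The hardware price, electricity rate, token volume, and per-token price are placeholders you would supply yourself, not quotes from any vendor.

```python
from typing import Optional

HOURS_PER_MONTH = 730.0  # average month, assuming the device runs 24/7


def monthly_power_cost(watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of running the device flat out for a month."""
    return watts / 1000.0 * HOURS_PER_MONTH * usd_per_kwh


def monthly_cloud_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Metered API cost for the same token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_months(
    hardware_usd: float, edge_monthly_usd: float, cloud_monthly_usd: float
) -> Optional[float]:
    """Months until the device pays for itself; None if the cloud is cheaper."""
    monthly_savings = cloud_monthly_usd - edge_monthly_usd
    if monthly_savings <= 0:
        return None
    return hardware_usd / monthly_savings


# All inputs below are assumed values for illustration only.
power = monthly_power_cost(watts=60, usd_per_kwh=0.15)
cloud = monthly_cloud_cost(tokens_per_month=2_000_000_000, usd_per_million_tokens=1.0)
months = break_even_months(hardware_usd=2000, edge_monthly_usd=power, cloud_monthly_usd=cloud)
print(f"power ${power:.2f}/mo, cloud ${cloud:.2f}/mo, break-even {months:.1f} months")
```

The sketch ignores engineering time, spares, and utilization below 100%, so treat the result as a first-pass estimate rather than a quote.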

03 · Latency

No round trip.

Real-time video analysis, voice tutoring, robotics — anything interactive — dies on a 400ms network round trip. Local inference runs in the same process as the sensor, in milliseconds.
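A quick way to see why a round trip hurts is to count how many frames arrive while you wait for it. The sketch below does that arithmetic. The 400 ms round trip and the 672 FPS pipeline rate are figures from this page; the 30 FPS camera rate is an assumption.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def frames_elapsed(latency_ms: float, fps: float) -> int:
    """Whole frames that arrive while one request is still in flight."""
    return int(latency_ms * fps // 1000)


# 30 FPS is an assumed camera rate; 400 ms is the round trip cited above.
print(f"frame interval at 30 FPS: {frame_interval_ms(30):.1f} ms")
print(f"frames missed during a 400 ms round trip: {frames_elapsed(400, 30)}")
```

If the answer to a frame comes back many frames later, the system is reacting to the past, which is the practical meaning of "dies on a round trip" for interactive workloads.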

04 · Resilience

Works when the internet doesn't.

A tennis court in a dead zone. A field site with a flaky 4G uplink. A factory floor behind an air gap. If your product only works with cloud connectivity, your market is smaller than you think.

Hardware + model + inference runtime.

We don't pick by brand. We pick by thermal envelope, memory bandwidth, and the specific workload. Here's the stack we reach for most.

Hardware

Where the model runs

Jetson AGX Orin — 60W, 275 TOPS, 64 GB unified. Our default for edge inference.
NVIDIA GB10 — desktop-class for latency-critical workloads; we used it for CoachClaw.
Mac M-series — developer machines + small-team deployments; Ollama-native.
Jetson Nano / Orin Nano — sub-20W for truly embedded use.
x86 with NVIDIA RTX — when you need Windows compatibility + 24 GB VRAM.

Models

What we reach for

Nemotron Nano 8B — NVIDIA's reasoning SLM; Undervolt runs on this.
Cosmos Reason 2 — physical reasoning VLM; RefereAI runs on this.
Qwen 2.5 VL — strong open VLM; Sideline defaults here.
Gemma 3n — runs in-browser via WebGPU; real option for client-side AI.
OpenCLIP — embedding model for Studio Copilot's photo search.
llama.cpp-quantized SLMs — when you need 3B params on a 12 GB box.
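Whether a model fits a given box is mostly arithmetic on parameter count and bits per weight. The sketch below is a rough estimate under stated assumptions: it counts weights only, and the 30% headroom reserved for KV cache, activations, and the OS is our placeholder, not a measured figure. Real quantization formats also carry some per-block overhead, so effective bits per weight run a little above the nominal number.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone."""
    return n_params * bits_per_weight / 8


def fits_in_memory(
    n_params: float,
    bits_per_weight: float,
    memory_gib: float,
    headroom_fraction: float = 0.3,  # assumed reserve for KV cache, activations, OS
) -> bool:
    """True if the weights fit in what is left after reserving headroom."""
    budget = memory_gib * GIB * (1 - headroom_fraction)
    return weight_bytes(n_params, bits_per_weight) <= budget


# A 3B-parameter model at a nominal 4 bits per weight on a 12 GB box.
size_gib = weight_bytes(3e9, 4) / GIB
print(f"3B @ 4-bit: about {size_gib:.2f} GiB of weights")
print("fits on 12 GB:", fits_in_memory(3e9, 4, 12))
```

Long context windows grow the KV cache well past this estimate, so measure on the target device before committing to a quantization level.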

Runtime

How it stays fast

DeepStream — NVIDIA's video inference pipeline; 672 FPS on Jetson in our testing.
RAPIDS — GPU-accelerated dataframes; how Undervolt processes 2.2M permits.
Ollama — one-line local model serving; HD Research agents.
llama.cpp — CPU/Metal quantized inference; CoachClaw on GB10.
vLLM — throughput-optimized batch inference when we need it.
WebGPU — browser-native inference, no server.
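As an illustration of how thin the serving layer can be, here is a minimal sketch of calling a locally running Ollama server over its REST API using only the Python standard library. It assumes Ollama is listening on its default port (11434) and that you have already pulled a model; the function names are ours, and the model tag you pass is whatever you have installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(
    model: str, prompt: str, url: str = OLLAMA_URL
) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout_s: float = 60.0) -> str:
    """Send the prompt and return the generated text. Needs a running server."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["response"]
```

Calling `generate(model_tag, "Summarize this abstract: ...")` never touches the public internet: the request goes to localhost and the text stays on the device.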

Four live systems. Zero cloud inference bills.

Every flagship product below runs its AI workload on-device. These aren't slides — they're running with real users right now.

Civic Intelligence

Undervolt

Nemotron Nano 8B on Jetson AGX Orin, 60W, structuring 2.2M Austin building permits. RAPIDS for GPU-accelerated joins. 1st place NVIDIA DGX AITX.

Nemotron Nano 8B · Jetson AGX Orin · 60W · RAPIDS

Read case study

Sports AI

RefereAI

Cosmos Reason 2 VLM on Jetson at the court edge. DeepStream 7.1 pipeline at 672 FPS. Chain-of-thought line-call reasoning without cloud. Video never leaves the site.

Cosmos Reason 2 · DeepStream 7.1 · 672 FPS · Jetson

Read case study

Youth Sports

CoachClaw

Nemotron-grade SLM quantized via llama.cpp on NVIDIA GB10. Rules lookups, concussion protocol, and practice plans, all from Telegram. Zero cloud inference dependency by design.

Nemotron · NVIDIA GB10 · llama.cpp · Telegram

Read case study

Biotech · Drug Discovery

HD Research Hub

Ollama-backed agents for Huntington's Disease literature triage. Runs on Jetson or Mac — no API key, no data egress. Open source.

Ollama · Jetson / Mac · PubMed agents · AlphaFold

Launch app

From "what model should we use" to "it's deployed."

01 · Model Selection

The right-sized model

We benchmark open small models against your real workload — not a synthetic eval. Output: a model + quantization + hardware recommendation with numbers.
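To make "benchmark against your real workload" concrete, here is a minimal harness sketch. It is not our production tooling: `model_fn` stands in for whatever callable wraps a candidate model, samples are (prompt, expected answer) pairs drawn from your own data, and exact-match scoring is a simplifying assumption that many workloads would replace with a task-specific metric.

```python
import math
import time
from typing import Callable, Dict, List, Sequence, Tuple


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def benchmark(
    model_fn: Callable[[str], str], samples: Sequence[Tuple[str, str]]
) -> Dict[str, float]:
    """Run every sample through model_fn and report accuracy and latency."""
    if not samples:
        raise ValueError("need at least one sample")
    latencies_ms: List[float] = []
    correct = 0
    for prompt, expected in samples:
        start = time.perf_counter()
        output = model_fn(prompt)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if output.strip() == expected.strip():  # exact match: an assumption
            correct += 1
    latencies_ms.sort()
    return {
        "accuracy": correct / len(samples),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
    }
```

Running the same samples through each model and quantization candidate on the target hardware gives a like-for-like table of accuracy against tail latency, which is the shape of recommendation described above.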

1–2 week engagement

02 · Deployment

Inference stack, end-to-end

Hardware provisioning, OS + drivers, inference runtime (Ollama / llama.cpp / vLLM / DeepStream), monitoring, OTA update flow. Working device, not a notebook.

4–8 week engagement

03 · Product Integration

Wire it into the thing that ships

REST / WebSocket / gRPC / on-device SDK — we meet your product where it is. Agentic pipelines, tool-use, streaming, structured output. You get a product feature, not an inference endpoint.
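Structured output is only useful if the product can trust it, so a common pattern is to validate the model's JSON before anything downstream sees it. The sketch below shows that pattern with a made-up line-call schema. The field names (`verdict`, `confidence`) and the `LineCall` type are hypothetical and do not describe any shipped system's actual format.

```python
import json
from dataclasses import dataclass

ALLOWED_VERDICTS = {"in", "out"}  # hypothetical label set


@dataclass(frozen=True)
class LineCall:
    verdict: str       # "in" or "out"
    confidence: float  # 0.0 to 1.0


def parse_line_call(raw: str) -> LineCall:
    """Parse and validate model output; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return LineCall(verdict=verdict, confidence=float(confidence))
```

A caller would typically catch the `ValueError`, retry the model once, and fall back to a safe default, so a malformed generation never reaches the user-facing feature.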

8–12 week engagement

Have a workload that shouldn't be in the cloud?

Free 30-minute call. We'll tell you honestly whether edge AI is the right call.

Book a 30-min call
