Three Robots, One Brain: Building Physical AI in Twelve Hours
I flew to San Francisco the day before NVIDIA GTC for a hackathon I almost didn't sign up for. Nebius.Build SF — twelve hours, in person, prizes that included GTC passes. I had a small idea I wanted to test: what happens when one inference loop drives multiple physical bodies?
We called the project Sideline. The team was four: me, Vivek, Vissh, and Kruthik. The hardware — provided by the venue — was three robots and a Jetson AGX Orin. The brain was Cosmos Reason 2, the same physics-aware model I'd been using for RefereAI. We finished forty-five minutes before the deadline. By that point, the same AI had picked up an object with a robotic arm, navigated a course on a wheeled rover, and made a referee gesture from the chest of a humanoid.
This is the first time I've felt that "physical AI" is real and not a deck.
What "physical AI" actually means
The phrase has been around a while. NVIDIA leans on it. Most product talks gloss it as "AI in robots." That's underselling what's changed.
Until recently, every physical-AI deployment had a tightly coupled stack: a perception model trained for a specific robot's cameras, a controller written for that robot's actuators, a behavior layer that knew exactly what hardware it was driving. Swap the robot, rewrite the stack. The model and the body were inseparable.
What we wanted to show is that you can decouple them. Run one perception-and-reasoning model. Translate its output into different actuation commands depending on which body is downstream. Same brain, different limbs.
That's a small architectural shift. It's also the part that makes general-purpose robotics start to feel possible.
The three bodies
SO-101 robotic arm via LeRobot. A research-grade six-DOF arm, position-controlled, driven through Hugging Face's LeRobot framework. The arm sees a sports object (in our demo, a tennis ball), and on the AI's call, picks it up and places it in a target zone. LeRobot abstracts the kinematics: we send a target Cartesian position, it solves the joint angles.
MentorPi rover on ROS2. A wheeled differential-drive robot with a depth camera, running a stripped-down ROS2 stack. It receives navigation goals — "go to where the ball landed" — and runs its own local planner to get there. Cosmos doesn't know about wheels or motors. It just knows about positions.
Unitree G1 humanoid. The dramatic one. The G1 has a small set of pre-trained gesture primitives (point, wave, hold, signal). We wired Cosmos Reason 2's call output to fire the corresponding primitive. When the AI calls "out" on a tennis serve, the humanoid raises its arm in the standard ref signal.
The brain layer
One inference loop, running on a Jetson AGX Orin. Cosmos Reason 2 was served via NIM rather than run on the device (we didn't have time to pull it locally during the hackathon; that's the obvious follow-up). Each pass through the loop:
- Grab the most recent frame from whichever camera is active.
- Send it to Cosmos Reason 2 with a structured prompt asking three questions: what objects are in the scene, where are they (with position estimates), and what action should be taken given the current task.
- Parse the structured response — JSON with object positions, target action, confidence, and a chain-of-thought trace.
- Dispatch the action to whichever robot is currently in scope.
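The parse step is where a loop like this tends to break, so here is a minimal sketch of it. The field names (`objects`, `action`, `confidence`, `trace`) and the 2D position shape are assumptions for illustration, not the schema we actually used. The idea is only that a malformed response should be rejected before anything reaches a robot.

```typescript
// Hypothetical response shape; field names and units are assumptions.
interface SceneObject {
  label: string;
  position: { x: number; y: number };
}

interface ReasoningResult {
  objects: SceneObject[];
  action: string;
  confidence: number; // expected in [0, 1]
  trace: string;      // chain-of-thought text, kept for later inspection
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isSceneObject(v: unknown): v is SceneObject {
  if (!isRecord(v)) return false;
  const label = v["label"];
  const pos = v["position"];
  if (typeof label !== "string" || !isRecord(pos)) return false;
  return typeof pos["x"] === "number" && typeof pos["y"] === "number";
}

// Parse the model's raw text. Returns null instead of throwing so the
// loop can skip a bad frame and try the next one.
function parseReasoning(raw: string): ReasoningResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(data)) return null;
  const objects = data["objects"];
  const action = data["action"];
  const confidence = data["confidence"];
  const trace = data["trace"];
  if (!Array.isArray(objects) || !objects.every(isSceneObject)) return null;
  if (typeof action !== "string" || typeof trace !== "string") return null;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  return { objects, action, confidence, trace };
}
```

Returning null on anything unexpected keeps the failure on the software side: a bad response costs one frame, not a robot motion.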
The dispatcher was the most architecturally interesting piece. It was a small TypeScript service running on the Jetson with three handlers (one per body) and a routing rule that knew which body was active for the current step in the demo. Same model output, different action translation, no rewiring of the model side.
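A stripped-down sketch of that pattern, to make the shape concrete. Everything here is illustrative: the type names, the action strings, the gesture mapping, and the command strings are assumptions, and the handlers return strings where real ones would call LeRobot, send a ROS2 navigation goal, or trigger a G1 gesture primitive.

```typescript
// All names below are illustrative assumptions, not the hackathon code.
type Body = "arm" | "rover" | "humanoid";

interface ModelAction {
  action: string;                     // e.g. "pick", "goto", "call_out"
  target?: { x: number; y: number };  // position estimate, when the action has one
}

// A handler translates one model output into a body-specific command.
type Handler = (a: ModelAction) => string;

function requireTarget(a: ModelAction): { x: number; y: number } {
  if (a.target === undefined) {
    throw new Error(`action "${a.action}" needs a target`);
  }
  return a.target;
}

// Hypothetical mapping from model calls to gesture primitives.
const gestures: Record<string, string> = { call_out: "signal", call_in: "point" };

const handlers: Record<Body, Handler> = {
  arm: (a) => {
    const t = requireTarget(a);
    return `arm:move_to(${t.x},${t.y})`;
  },
  rover: (a) => {
    const t = requireTarget(a);
    return `rover:nav_goal(${t.x},${t.y})`;
  },
  // Unknown calls fall back to the neutral "hold" primitive.
  humanoid: (a) => `humanoid:gesture(${gestures[a.action] ?? "hold"})`,
};

// Routing rule: the demo script fixes which body is active at each step.
const activeBodyByStep: Body[] = ["arm", "rover", "humanoid"];

function dispatch(step: number, a: ModelAction): string {
  const body = activeBodyByStep[step];
  if (body === undefined) throw new Error(`no body scripted for step ${step}`);
  return handlers[body](a);
}
```

The model side never sees `Body`. Adding a fourth robot in this sketch means one more handler and one more entry in the routing table, nothing else.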
Total custom code: about 800 lines across the four of us. The orchestration was the work; the AI calls were a single API client.
What worked
The handoff between bodies was smoother than I expected. The arm picks up the ball, drops it on the court. The rover sees it land, navigates to the bounce location. The humanoid, watching from the sideline, makes the call about whether the bounce was in. Three different bodies, one continuous narrative, one perception-and-reasoning loop.
The chain-of-thought trace from Cosmos was the connective tissue. Each step's reasoning was visible. When the rover picked the wrong landing spot during one demo run, we could read the trace and see exactly why: the model had estimated the bounce location from a frame captured two frames before the actual bounce, and the ball had moved by the time it landed. Fixing that was a prompt change, not a model retrain.
What didn't
Latency was real. Cosmos Reason 2 via NIM takes seconds for a reasoning step, not milliseconds. For action selection ("call this serve") that's fine. For closed-loop actuation ("track the moving ball with the arm") it's nowhere near fast enough. We had to script the demo so the AI's reasoning steps happened between physical actions, not during them. Real-time closed-loop control with a reasoning model is still a research problem.
Coordinate frames are unforgiving. Each robot has its own coordinate system. The arm's frame is local to its base. The rover's frame is local to its wheels. The court has its own real-world coordinate system. Translating Cosmos's "the ball landed at position X" into commands that mean the same thing to all three bodies took longer than the AI integration. About four of our twelve hours went here.
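For the flat-court case, the core of that translation is a 2D rigid transform. The sketch below assumes each robot's base pose is known in the court frame as (x, y, theta), with theta in radians measured counter-clockwise and the robot's local x axis pointing forward. The names `Pose2`, `courtToLocal`, and `localToCourt` are mine, and a real setup also has to deal with height, camera extrinsics, and calibration error.

```typescript
interface Point2 { x: number; y: number }

// Pose of a robot's base expressed in the court frame (assumed known).
interface Pose2 { x: number; y: number; theta: number }

// Court frame -> robot-local frame: subtract the base position,
// then rotate by -theta.
function courtToLocal(p: Point2, base: Pose2): Point2 {
  const dx = p.x - base.x;
  const dy = p.y - base.y;
  const c = Math.cos(base.theta);
  const s = Math.sin(base.theta);
  return { x: c * dx + s * dy, y: -s * dx + c * dy };
}

// Robot-local frame -> court frame: rotate by +theta,
// then add the base position.
function localToCourt(p: Point2, base: Pose2): Point2 {
  const c = Math.cos(base.theta);
  const s = Math.sin(base.theta);
  return { x: base.x + c * p.x - s * p.y, y: base.y + s * p.x + c * p.y };
}

// Example: a rover sitting at (2, 1) on the court, facing along court +y.
const roverBase: Pose2 = { x: 2, y: 1, theta: Math.PI / 2 };
const ballOnCourt: Point2 = { x: 2, y: 3 };
const ballForRover = courtToLocal(ballOnCourt, roverBase);
// ballForRover is approximately { x: 2, y: 0 }: two units straight ahead.
```

With one court-frame position coming out of the model and one `Pose2` per body, every robot gets the same point in its own terms. Measuring those poses accurately is the slow part, not the math.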
What this proves and what it doesn't
It proves that the model layer is increasingly portable across bodies. That's a real architectural change. A general-purpose perception-and-reasoning model that you can point at any body with the right translation layer is qualitatively different from the bespoke stacks that came before.
It does not prove that physical AI is solved. Closed-loop real-time control with reasoning models is still hard. Generalizing across genuinely new tasks (not just new bodies executing the same task) is still hard. Safety is still hard. The hackathon demo was choreographed enough that we knew the action space in advance.
But for the practical question of "can I build a useful physical-AI product right now, with the components shipping today" — the answer is yes. You build perception with Cosmos. You translate to actions with a thin per-body layer. You let the chain-of-thought give you the audit log. You don't try to close the control loop at reasoning speeds; you sequence the AI's role appropriately.
The longer-term picture
NVIDIA's bet — and Cosmos's whole thesis — is that physical AI follows the trajectory that language AI did. A small number of base models that understand the physical world, surrounded by application-layer code that points them at specific bodies and tasks. The same model in a warehouse, in a clinic, on a sideline.
Twelve hours and one hackathon don't prove that thesis. But they don't disprove it either. Three robots, one brain, one continuous demo loop, working before midnight. That's a long way from where this was eighteen months ago.
We didn't place at Nebius — 73 entries, the bar was high. But the thing kept running after the hackathon, and the architecture is the part I'm keeping.