Three Simulated Bodies, One Brain: Building Toward Physical AI
I flew to San Francisco the day before NVIDIA GTC for a hackathon I almost didn't sign up for. Nebius.Build SF was twelve hours, in person, with prizes that included GTC passes. I had a small idea I wanted to test: what happens when one reasoning loop coordinates multiple simulated bodies?
The project was Sideline. For the hackathon, I worked with Vivek, Vissh, and Kruthik on a narrow simulation demo. The demo used simulated bodies, not hardware. The brain was Cosmos Reason 2, the same physics-aware model I had been using for RefereAI. By the deadline, one loop could reason about a scene, choose an action, and route that action to the simulated body responsible for the next step.
This was the first time simulation made physical AI feel less like a deck phrase and more like an architecture I could reason about.
What "physical AI" actually means
The phrase has been around a while. NVIDIA leans on it. Most product talks gloss it as "AI controlling machines." That undersells what has changed.
Most physical-AI stacks are tightly coupled: a perception model for a specific camera setup, a controller for a specific actuator set, and a behavior layer that knows exactly what body it is driving. Swap the body, rewrite the stack. The model and the body become hard to separate.
What I wanted to test is whether you can decouple them. Run one perception-and-reasoning model. Translate its output into different action commands depending on which simulated body is downstream. Same brain, different environments.
That is a small architectural shift. It is also the part that makes real robotics work feel easier to approach later.
The three simulated bodies
Arm simulation. A simulated six-DOF arm receives an object location and target zone. The interesting part is not torque control; it is whether the reasoning layer can decide when the arm should act and hand off a structured command.
Navigation simulation. A simulated mobile body receives goals such as "go to where the ball landed." Cosmos does not need to know about motors. It only needs to reason about positions and produce a target.
Gesture simulation. The third body has a small set of gesture primitives: point, wave, hold, signal. Cosmos Reason 2 chooses the call, and the simulation fires the matching gesture.
The brain layer
One inference loop, running against Cosmos Reason 2. The loop:
- Read the current simulated scene state.
- Send the state to Cosmos Reason 2 with a structured prompt asking three questions: what objects are in the scene, where are they, and what action should be taken given the current task.
- Parse the structured response: JSON with object positions, a target action, a confidence score, and a chain-of-thought trace.
- Dispatch the action to whichever simulated body is currently in scope.
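Steps two and three of that loop can be sketched roughly as follows. This is a minimal illustration, not the code from the hackathon: the field names (`objects`, `action`, `confidence`, `trace`), the prompt wording, and the helper names `buildPrompt` and `parseReasoning` are all assumptions, and the model call itself is left out.

```typescript
// Hypothetical types: the actual response schema is not given in this post.
interface Vec2 { x: number; y: number }
interface SceneObject { name: string; position: Vec2 }
interface ReasoningResult {
  objects: SceneObject[];
  action: string;
  confidence: number; // assumed to be in [0, 1]
  trace: string;      // chain-of-thought text, kept for auditing
}

// Build a structured prompt that asks the three questions.
function buildPrompt(sceneState: string, task: string): string {
  return [
    `Task: ${task}`,
    `Scene: ${sceneState}`,
    "1. What objects are in the scene?",
    "2. Where are they?",
    "3. What action should be taken given the current task?",
    'Answer as JSON with keys "objects", "action", "confidence", "trace".',
  ].join("\n");
}

function isVec2(v: unknown): v is Vec2 {
  if (typeof v !== "object" || v === null) return false;
  const p = v as Record<string, unknown>;
  return typeof p["x"] === "number" && typeof p["y"] === "number";
}

function isSceneObject(v: unknown): v is SceneObject {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  return typeof o["name"] === "string" && isVec2(o["position"]);
}

// Parse and validate the raw model text before anything is dispatched.
function parseReasoning(raw: string): ReasoningResult {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("model response is not a JSON object");
  }
  const r = data as Record<string, unknown>;
  const objects = r["objects"];
  const action = r["action"];
  const confidence = r["confidence"];
  const trace = r["trace"];
  if (!Array.isArray(objects) || !objects.every(isSceneObject)) {
    throw new Error("objects must be a list of {name, position}");
  }
  if (typeof action !== "string") throw new Error("action must be a string");
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number in [0, 1]");
  }
  if (typeof trace !== "string") throw new Error("trace must be a string");
  return { objects, action, confidence, trace };
}
```

Validating at this boundary means a malformed response fails loudly before any simulated body receives a command.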
The dispatcher was the most architecturally interesting piece. It is a small TypeScript service with three handlers, one per simulated body, and a routing rule that knows which body is active for the current step in the demo. Same model output, different action translation, no rewiring of the model side.
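To make the shape of that dispatcher concrete, here is a minimal sketch under stated assumptions. The command formats, the `ModelOutput` fields, the fallback to `hold`, and the fixed three-step `DEMO_SCRIPT` are all hypothetical; only the idea of three handlers plus a routing rule comes from the demo.

```typescript
interface Vec2 { x: number; y: number }

// Hypothetical shape of one parsed model output.
interface ModelOutput { action: string; target: Vec2 }

type BodyId = "arm" | "navigation" | "gesture";

const GESTURES = ["point", "wave", "hold", "signal"] as const;
type Gesture = (typeof GESTURES)[number];

// Each simulated body gets its own command format (illustrative only).
type BodyCommand =
  | { body: "arm"; op: "place"; x: number; y: number }
  | { body: "navigation"; op: "goto"; x: number; y: number }
  | { body: "gesture"; op: Gesture };

function isGesture(a: string): a is Gesture {
  return GESTURES.some((g) => g === a);
}

// One handler per body: same model output, different action translation.
const handlers: Record<BodyId, (out: ModelOutput) => BodyCommand> = {
  arm: (out) => ({ body: "arm", op: "place", x: out.target.x, y: out.target.y }),
  navigation: (out) => ({ body: "navigation", op: "goto", x: out.target.x, y: out.target.y }),
  // Unknown actions fall back to "hold" rather than firing an arbitrary gesture.
  gesture: (out) => ({ body: "gesture", op: isGesture(out.action) ? out.action : "hold" }),
};

// Routing rule: the demo script fixes which body is active at each step.
const DEMO_SCRIPT: BodyId[] = ["arm", "navigation", "gesture"];

function dispatch(step: number, out: ModelOutput): BodyCommand {
  const body = DEMO_SCRIPT[step];
  if (body === undefined) throw new Error(`no body scripted for step ${step}`);
  return handlers[body](out);
}
```

In this sketch, adding a fourth body means adding one handler and one entry in the script. Nothing on the model side changes, which is the decoupling the demo was testing.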
Total custom code: about 800 lines across the hackathon group. The orchestration was the work; the AI calls were a single API client.
What worked
The handoff between simulated bodies was smoother than I expected. The manipulation body acts on the object. The navigation body moves to the landing point. The gesture body makes the sideline call. Three different simulated bodies, one continuous narrative, one perception-and-reasoning loop.
The chain-of-thought trace from Cosmos was the connective tissue. Each step's reasoning was visible. When the navigation simulation picked the wrong landing spot during one demo run, I could read the trace and see exactly why. The model had estimated the bounce location from a frame captured two frames before the actual bounce, and the ball had moved in the meantime. Fixing that was a prompt change, not a model retrain.
What didn't
Latency was real. Cosmos Reason 2, served through NVIDIA NIM, takes seconds for a reasoning step, not milliseconds. For action selection ("call this serve") that is fine. For closed-loop control ("track the moving ball continuously") it is nowhere near fast enough. The demo had to sequence reasoning steps between simulated actions, not during them. Real-time closed-loop control with a reasoning model is still a research problem.
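Sequencing reasoning between actions can be sketched as a plain sequential loop. The `Reason` and `Act` signatures and the `runSequenced` helper below are hypothetical stand-ins for the model client and the simulated bodies. The point is only that each reasoning call finishes before its action starts, and each action finishes before the next reasoning call.

```typescript
// Hypothetical signatures: the real model client and body APIs are not shown here.
type Reason = (task: string) => Promise<string>; // slow: seconds per call
type Act = (action: string) => Promise<void>;    // resolves when the simulated body is done

// Run each demo step as reason -> act, never overlapping the two.
async function runSequenced(
  tasks: string[],
  reason: Reason,
  act: Act,
): Promise<string[]> {
  const log: string[] = [];
  for (const task of tasks) {
    log.push(`reason:${task}`);
    const action = await reason(task); // wait for the full reasoning step
    log.push(`act:${action}`);
    await act(action);                 // let the body finish before reasoning again
  }
  return log;
}
```

The returned log records the strict reason, act, reason, act ordering. A closed-loop controller would need the opposite shape, with perception and actuation overlapping at a fixed rate, which is the part multi-second reasoning latency rules out.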
Coordinate frames are unforgiving even in simulation. Each simulated body has its own coordinate system. The manipulation frame, navigation frame, gesture frame, and court frame all need the same event to mean the same thing. Translating Cosmos's "the ball landed at position X" into commands that made sense to all three simulated bodies took longer than the AI integration.
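The translation itself is ordinary rigid-transform arithmetic. As a simplified sketch, assume everything is 2D and each body's frame is described by an origin and a heading expressed in the court frame. The `Frame2D` type and both function names are illustrative assumptions, and a real setup would more likely use full 3D transforms.

```typescript
interface Vec2 { x: number; y: number }

// Pose of a body's frame expressed in the court frame (assumed 2D for simplicity).
interface Frame2D {
  origin: Vec2;  // where the body frame's origin sits on the court
  theta: number; // heading of the body's x-axis, in radians, counterclockwise
}

// Convert a court-frame point into a body's local frame.
function courtToBody(p: Vec2, f: Frame2D): Vec2 {
  const dx = p.x - f.origin.x;
  const dy = p.y - f.origin.y;
  const c = Math.cos(f.theta);
  const s = Math.sin(f.theta);
  // Project the offset onto the body's x-axis (c, s) and y-axis (-s, c).
  return { x: c * dx + s * dy, y: -s * dx + c * dy };
}

// Convert a body-frame point back into the court frame.
function bodyToCourt(p: Vec2, f: Frame2D): Vec2 {
  const c = Math.cos(f.theta);
  const s = Math.sin(f.theta);
  return {
    x: f.origin.x + c * p.x - s * p.y,
    y: f.origin.y + s * p.x + c * p.y,
  };
}
```

With one shared court frame as the reference, "the ball landed at position X" is stated once and each body converts it locally, so no pair of bodies has to agree with each other directly.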
What this proves and what it doesn't
It suggests that the model layer can become more portable across bodies. That is the architectural idea worth keeping. A general-purpose perception-and-reasoning model, paired with the right translation layer, is qualitatively different from a bespoke stack for every body.
It does not prove that physical AI is solved. It does not prove a hardware stack. Closed-loop real-time control with reasoning models is still hard. Generalizing across genuinely new tasks is still hard. Safety is still hard. The hackathon demo was bounded enough that the action space was known in advance.
But on the practical question of whether simulation can help design a useful physical-AI product right now, the answer is yes. You reason with Cosmos. You translate to actions with a thin per-body layer. You let the chain-of-thought give you the audit log. You do not try to close the control loop at reasoning speeds; you sequence the AI's role appropriately.
The longer-term picture
NVIDIA's bet, and Cosmos's whole thesis, is that physical AI follows the trajectory that language AI did. A small number of base models that understand the physical world, surrounded by application-layer code that points them at specific bodies and tasks. Simulation is one place to learn that interface before hardware makes every mistake more expensive.
Twelve hours and one hackathon do not prove that thesis. But they gave me a useful pattern: three simulated bodies, one brain, one continuous demo loop, working before midnight.
The project did not place at Nebius. There were 73 entries and the bar was high. But the architecture is the part I am keeping.