February 27, 2026 · Ravinder Jilkapally

When Cosmos Reason 2 Watches a Tennis Match

NVIDIA quietly shipped Cosmos Reason 2 last fall. It's an 8B-parameter physics-aware reasoning model trained to understand what's physically happening in a scene — not just what's in the frame, but how the objects in it relate, how they're moving, and what would have to happen next given physics. I'd been chewing on a sports-analytics idea for months. The model showed up at exactly the right time.

I put it on a Jetson AGX Orin and pointed it at amateur tennis. I called the project RefereAI and submitted it to NVIDIA's Cosmos Cookoff. Here's what the model can actually do, where it falls short, and what that means for shipping physical-AI products this year.

The setup

A standard NVIDIA Jetson AGX Orin. Cosmos Reason 2 served via the OpenAI-compatible NIM API for cloud testing, then pulled locally for on-device inference at the venue. A camera angle on a tennis court. Twelve-second clips of rallies. The system was asked to do three things, in order:

1. Track the ball. Where is it in each frame, and what is its trajectory?
2. Make a call. Specifically: did the ball cross the line, was the serve in or out, did the ball bounce inside the court?
3. Explain the reasoning. A chain-of-thought trace explaining why the call is what it is, in language a human ref or coach could check.
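
For the cloud-testing path, a request to an OpenAI-compatible endpoint might be assembled roughly like this. It is a sketch under assumptions: MODEL_ID is a placeholder, buildLineCallRequest is a hypothetical helper, and passing frames as image_url data URLs follows the general chat-completions message format rather than anything confirmed for this particular model.

```typescript
// Sketch only: MODEL_ID and the multi-frame message layout are assumptions,
// not values taken from the post or from NVIDIA's documentation.
const MODEL_ID = "cosmos-reason2-placeholder"; // hypothetical model id

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string | ContentPart[] }[];
  max_tokens: number;
}

// Build a chat-completions request body from base64-encoded JPEG frames.
function buildLineCallRequest(framesBase64: string[], question: string): ChatRequest {
  const frameParts = framesBase64.map((b64): ContentPart => ({
    type: "image_url",
    image_url: { url: `data:image/jpeg;base64,${b64}` },
  }));
  return {
    model: MODEL_ID,
    messages: [
      // Ask for the reasoning trace first so the call can be audited later.
      { role: "system", content: "Explain your reasoning step by step, then give the call." },
      { role: "user", content: [...frameParts, { type: "text", text: question }] },
    ],
    max_tokens: 1024,
  };
}
```

The body would then be sent as JSON in a POST to the server's /v1/chat/completions path. Keeping the builder a pure function makes it easy to point the same request at the cloud endpoint during testing and at the local server at the venue.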

The third step is the part most "AI for sports" projects skip. Most of them stop at "the model says out." That's not enough. A coach needs to know whether the call was based on a clear bounce inside the line, a partial occlusion the model guessed through, or a signal the model wasn't actually confident about. The chain-of-thought is the audit trail.
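
One way to make that audit trail concrete is a small record type that stores the call together with what it was based on. This is a minimal sketch, not RefereAI's actual schema: CallRecord, CallBasis, formatAuditEntry and needsHumanReview are all hypothetical names.

```typescript
// Hypothetical audit record: what the call was, and what it rested on.
type CallBasis = "clear_bounce" | "inferred_through_occlusion" | "low_confidence";

interface CallRecord {
  clipId: string;
  call: "in" | "out";
  basis: CallBasis;
  confidence: number; // 0..1, as reported upstream
  reasoning: string[]; // chain-of-thought steps, in order
}

// Render one record as a plain-text log entry a coach can read.
function formatAuditEntry(r: CallRecord): string {
  const header =
    `[${r.clipId}] call=${r.call} basis=${r.basis} confidence=${r.confidence.toFixed(2)}`;
  const steps = r.reasoning.map((s, i) => `  ${i + 1}. ${s}`);
  return [header, ...steps].join("\n");
}

// Anything not based on a clearly visible bounce goes to a human.
function needsHumanReview(r: CallRecord): boolean {
  return r.basis !== "clear_bounce";
}
```

Recording the basis as its own field means a coach can filter for guessed-through-occlusion calls without reading every trace.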

What Cosmos Reason 2 actually does well

Three things it nails that I didn't expect from an 8B model:

1. Physical continuity across frames. It tracks the ball as a moving object with momentum, not as an isolated detection per frame. When the ball is occluded for two frames behind a player's body, it doesn't lose track. It infers that the trajectory continues, and picks the ball back up when it reappears. A standard per-frame YOLO pipeline would lose the object, and its labels would flicker.
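
To show what "the trajectory continues" means numerically, here is a deliberately simple stand-in: constant-velocity interpolation across frames where the detector missed the ball. It is not how Cosmos Reason 2 works internally, and fillOcclusionGaps is a hypothetical helper, closer in spirit to a trajectory smoother on the detector side.

```typescript
interface Point { x: number; y: number }

// null marks a frame where the detector did not see the ball.
// Gaps between two sightings are filled by linear interpolation, which
// assumes roughly constant velocity over the gap. Leading and trailing
// gaps are left as null because there is nothing to interpolate toward.
function fillOcclusionGaps(track: (Point | null)[]): (Point | null)[] {
  const out = track.slice();
  let lastSeen = -1;
  for (let i = 0; i < track.length; i++) {
    const p = track[i];
    if (p === null) continue;
    if (lastSeen >= 0 && i - lastSeen > 1) {
      const a = track[lastSeen] as Point;
      const span = i - lastSeen;
      for (let k = lastSeen + 1; k < i; k++) {
        const t = (k - lastSeen) / span;
        out[k] = { x: a.x + (p.x - a.x) * t, y: a.y + (p.y - a.y) * t };
      }
    }
    lastSeen = i;
  }
  return out;
}
```

Linear interpolation ignores gravity and spin, so it is only reasonable for gaps of a frame or two, which is the case described above.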

2. Geometric reasoning about the court. It understands "the line." Not as pixels, but as a planar boundary. It can answer "did the ball bounce inside or outside the line" in a way that a 2D image classifier cannot, because the question is fundamentally about the 3D position of the ball at the moment of bounce, projected onto the court plane.
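
The "projected onto the court plane" idea can be written out explicitly with a homography. The sketch below assumes a calibrated 3x3 matrix H that maps image pixels to court-plane meters, with the origin at the center of the net and y running along the length of the court. imageToCourt and distancePastBaseline are hypothetical helpers, and this is the classical formulation rather than a description of what the model computes.

```typescript
// Row-major 3x3 homography from image pixels to court-plane meters (assumed calibrated).
type Mat3 = [number, number, number, number, number, number, number, number, number];

// Apply the homography to the pixel where the ball touches the ground.
function imageToCourt(H: Mat3, u: number, v: number): { x: number; y: number } {
  const w = H[6] * u + H[7] * v + H[8];
  if (Math.abs(w) < 1e-12) throw new Error("pixel maps to a point at infinity");
  return {
    x: (H[0] * u + H[1] * v + H[2]) / w,
    y: (H[3] * u + H[4] * v + H[5]) / w,
  };
}

// A tennis court is 23.77 m long, so each baseline's outer edge is 11.885 m from the net.
const HALF_COURT_LENGTH_M = 23.77 / 2;

// Signed distance in meters; positive means the bounce is beyond the baseline.
function distancePastBaseline(courtY: number): number {
  return Math.abs(courtY) - HALF_COURT_LENGTH_M;
}
```

The mapping is only valid for points on the court surface, which is why the bounce frame, when the ball is actually touching the ground, is the one that matters.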

3. Physics-grounded uncertainty. When the bounce is genuinely close, the model says so. It returns a confidence and a reason: "the ball appears to bounce within 5 cm of the baseline; due to camera angle and motion blur, sub-centimeter accuracy is not possible from this footage." That kind of calibrated uncertainty is rare in vision models. It matters enormously for downstream trust.
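
Downstream code can respect that uncertainty instead of forcing a binary answer. A minimal sketch, assuming the reported distance and an error margin have already been parsed out of the model's response (the parsing is not shown, and verdictWithUncertainty is a hypothetical helper):

```typescript
type Verdict = "in" | "out" | "too_close_to_call";

// distancePastLineCm: positive means beyond the outer edge of the line.
// uncertaintyCm: half-width of the error band around that estimate.
function verdictWithUncertainty(distancePastLineCm: number, uncertaintyCm: number): Verdict {
  if (uncertaintyCm < 0) throw new RangeError("uncertaintyCm must be non-negative");
  // Out only if the whole error band lies beyond the line.
  if (distancePastLineCm - uncertaintyCm > 0) return "out";
  // In only if the whole error band is on or inside the line.
  if (distancePastLineCm + uncertaintyCm <= 0) return "in";
  return "too_close_to_call";
}
```

With a 5 cm margin, a bounce estimated 2 cm long comes back as too_close_to_call, which is the honest answer for that footage.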

Where it falls down

Three honest limitations I hit:

1. Frame rate is the ceiling. Cosmos Reason 2 is a reasoning model, not a real-time perceiver. On a Jetson AGX Orin, the reasoning step ran at well under 1 fps, while detection and tracking ran at 600+ fps via DeepStream. The architectural pattern that worked: use the fast detector to identify interesting moments (a serve, a baseline rally ending, a possible line call), then call Cosmos Reason 2 only on those candidate clips. That works out to about 5 frames of inference per actual call. Spending Cosmos cycles on every frame is wasteful and unnecessary.
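
The gating step can be sketched as a function that turns a detector event into the handful of frame indices worth sending to the reasoning model. CandidateEvent and framesForReasoning are hypothetical names, and choosing consecutive frames centered on the event is an assumption, since the exact frame selection isn't specified here.

```typescript
// An interesting moment flagged by the fast detector (hypothetical shape).
interface CandidateEvent {
  kind: "serve" | "rally_end" | "line_call";
  frameIndex: number; // frame where the detector flagged the event
}

// Pick a small window of consecutive frames centered on the event,
// clamped so the window never runs off either end of the clip.
function framesForReasoning(event: CandidateEvent, totalFrames: number, count = 5): number[] {
  const n = Math.max(0, Math.min(count, totalFrames));
  if (n === 0) return [];
  const half = Math.floor(n / 2);
  const start = Math.max(0, Math.min(event.frameIndex - half, totalFrames - n));
  return Array.from({ length: n }, (_, i) => start + i);
}
```

A strided window (every second or third frame) would cover more of the bounce at the same inference cost, which may be worth trying on slower rallies.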

2. Camera angle matters more than the model. When the camera is roughly perpendicular to the line in question, the model is excellent. When the camera is at an oblique angle, the geometric reasoning starts hedging — correctly, because the geometric question is genuinely harder. The model is not a substitute for a good camera angle. This sounds obvious; in practice it means productizing a sports-AI tool requires opinions about where users place their phones.

3. Sport-specific priors are still up to you. Cosmos Reason 2 doesn't know the rules of tennis. It knows physics and geometry. To turn "the ball bounced 2 cm beyond the service line" into "the serve was out," you still need a rule-encoder layer that maps physics observations to sport-specific calls. I built mine in maybe 200 lines of TypeScript. Other sports (pickleball, badminton, table tennis) get their own encoders. That's the application-layer work that doesn't go away just because the underlying vision model got smarter.
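
A rule encoder along these lines can be very small. The sketch below is illustrative only and is not the RefereAI implementation: BounceObservation, RuleEncoder, tennisEncoder and makeCall are hypothetical names, and the only tennis rule encoded is that a ball touching any part of the line counts as in.

```typescript
// A physics-level observation, with no sport knowledge in it.
interface BounceObservation {
  shot: "serve" | "rally";
  // Distance from the outer edge of the relevant boundary line, in cm.
  // Positive = landed beyond the line; zero or negative = touching or inside.
  distancePastLineCm: number;
}

// One encoder per sport maps observations to that sport's vocabulary.
interface RuleEncoder {
  sport: string;
  encode(obs: BounceObservation): string;
}

const tennisEncoder: RuleEncoder = {
  sport: "tennis",
  encode(obs) {
    // In tennis, a ball that touches any part of the line is in.
    const landedOut = obs.distancePastLineCm > 0;
    if (obs.shot === "serve") return landedOut ? "fault: serve out" : "serve in";
    return landedOut ? "out" : "in";
  },
};

// Registry so other sports can be added without touching the pipeline.
const encoders = new Map<string, RuleEncoder>([[tennisEncoder.sport, tennisEncoder]]);

function makeCall(sport: string, obs: BounceObservation): string {
  const encoder = encoders.get(sport);
  if (!encoder) throw new Error(`no rule encoder registered for ${sport}`);
  return encoder.encode(obs);
}
```

Adding another sport is then a matter of registering one more encoder, while the detection and reasoning stages stay unchanged.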

Why this is meaningful for product builders

A year ago, you couldn't build RefereAI without a custom training pipeline, a labeled dataset of tennis line-call moments, and a small ML team. Today you can do it with a base model, a Jetson, a couple hundred lines of orchestration, and a weekend. That's the change.

What's scarce now isn't model capability. It's orchestration and integration. Picking which frames to send to which model. Wiring physics-grounded reasoning to sport-specific rules. Designing the chain-of-thought log so a human can audit it. Choosing the camera angle. Running everything locally so that latency is acceptable courtside instead of depending on a round trip to a cloud region.

This is also why the agentic engineering muscle matters more than people think. The hard work in RefereAI wasn't model selection. It was assembling six things — DeepStream, Cosmos Reason 2 via NIM, a rule encoder, a trajectory smoother, a chain-of-thought log, a TypeScript front end — into a system where each component does its job cleanly. That assembly work is exactly what agentic engineering accelerates.

What's next

RefereAI is submitted to the Cosmos Cookoff. I'm extending it to pickleball and badminton with the same core stack. The bigger move is taking the pattern and pointing it at non-sports physical-AI use cases: warehouse safety, autonomous-vehicle ride-along audits, robotic manipulation verification. Anywhere you need a model that reasons about what's physically happening and tells you why.

Cosmos Reason 2 isn't the last model in this lineage. NVIDIA will ship a Reason 3 and the field will move fast. The pattern — physics-aware reasoning gated by a fast detector, deployed locally, audited by chain-of-thought — is the part that matters and the part that will keep applying.

Ravinder Jilkapally

Founder, AISOFT — agentic engineering, edge AI, local LLMs.