What's on the Bench

Five machines, no cloud, and what owned hardware actually costs.

Jul 02, 2026

In the first Open Garage post I said the door is up and you can see what is on the bench. Fair enough. Here is the bench.

Five machines. Two NVIDIA DGX Sparks that carry the scoring work for the diagnostic pipeline. An M2 Mac Studio that runs embeddings, text inference, and a small resident agent that never sleeps. An M3 Mac Studio that handles vision models and the largest things I run, because 192GB of unified memory will hold what a GPU will not. A NAS in the corner holding it all together with shared storage. Across the five nodes there are roughly 750 billion model parameters hosted and ready. The biggest single model is a 235B mixture-of-experts that lives on the M3. Nothing in the pipeline calls a cloud API to think. Cloud services collect source material. The reasoning happens here.

That is a choice, and it has reasons. The diagnostic work processes evidence about real companies, and that evidence stays on machines I own. The cost shape matters too. When inference is metered, every experiment carries a small tax, and the tax makes you hesitant. When the metal is yours, the marginal experiment is free, and you run more of them. The lab got better because trying things stopped costing anything but time.

Time is the honest price. Owned hardware bills you in attention.

A vision model spent a week this spring working through 31,000 RAW files from my photo archive at about eighteen seconds a file. A seven-day run has to survive whatever happens during seven days, so every pass in every pipeline writes checkpoints and resumes from the last one. That discipline is not optional at this duration. You also become your own IT department. When a node drops, there is no ticket to file. There is a tunnel chain instead: laptop to the M2 over Tailscale, M2 to the Spark over the LAN, so I can check on a run from anywhere with a phone signal. And each morning the resident agent reads the pipeline state and the cluster health and posts me a briefing, which is the closest thing the garage has to a shift report.
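For readers who want the checkpoint-and-resume idea in concrete form, here is a minimal sketch. It is not the pipeline's actual code. The function names (`load_done`, `save_done`, `run_pass`), the JSON checkpoint format, and the `every` interval are all assumptions made for illustration. The one design point it tries to show is that the checkpoint is written to a temporary file and then swapped into place, so a crash mid-write cannot corrupt the record of what is finished.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set:
    """Return the set of items already processed, or an empty set on a first run."""
    if not checkpoint.exists():
        return set()
    return set(json.loads(checkpoint.read_text()))


def save_done(checkpoint: Path, done: set) -> None:
    """Write the checkpoint to a temp file, then atomically swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=checkpoint.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, checkpoint)  # atomic rename on the same filesystem


def run_pass(items, process, checkpoint: Path, every: int = 100) -> int:
    """Process items, skipping finished ones. Returns how many were processed this run."""
    done = load_done(checkpoint)
    count = 0
    try:
        for item in items:
            if item in done:
                continue  # finished in an earlier run
            process(item)  # hypothetical per-file work, e.g. one model call
            done.add(item)
            count += 1
            if count % every == 0:
                save_done(checkpoint, done)
    finally:
        # Persist progress even if process() raised partway through.
        save_done(checkpoint, done)
    return count
```

Calling `run_pass` again with the same checkpoint path picks up where the last run stopped. At eighteen seconds a file, 31,000 files is about 558,000 seconds, or roughly six and a half days, so an interval of 100 files would risk at most about half an hour of repeated work after a hard crash.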

What the attention buys: 189 diagnostic pipeline runs so far, 178 of them clean production passes. Thirty companies on the fleet, from defense primes to sneaker brands, with every finding challenged by an adversarial skeptic pass before it earns a score. And one project that is pure garage: a twenty-year photo archive, 102,630 files, 3.7 terabytes, indexed by a seven-pass pipeline into a single 60MB index I can search by concept.

The strangest thing on the bench sits at the intersection of those two workloads. The failure-mode taxonomy is encoded as text embeddings, and the photo index can be searched with them, which means I can ask twenty years of photographs for images that look like Responsibility Compression. That query should not work. It does, and some of the matches are unsettling.
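One plausible reading of how such a query can work at all, sketched under assumptions the post does not confirm: if the photo index stores image embeddings from a model whose text and image vectors share one space, then a text embedding of a taxonomy label can be ranked against every photo by cosine similarity. The vectors, file names, and function names below are toy placeholders, not the real index or model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, index, k=3):
    """Return the k file names whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy two-dimensional stand-ins for real embeddings (hypothetical data).
photo_index = {
    "a.raw": [1.0, 0.0],
    "b.raw": [0.0, 1.0],
    "c.raw": [0.9, 0.1],
}
concept_vec = [1.0, 0.0]  # stand-in for the embedded text of a taxonomy label
print(top_matches(concept_vec, photo_index, k=2))
```

Nothing in the ranking step knows whether the query came from a photo caption or a failure-mode definition, which is why an abstract concept can return photographs at all.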

The lab is not the practice. The practice is diagnosis, and it would exist on rented compute if it had to. But one person can run a fleet because the fleet is downstairs, checkpointed, and reporting for duty every morning. The door stays up.

JG
