The Coherence Record, Edition 1
An Executive Builds Diagnostic Infrastructure on Owned Hardware Using Public Data, and Maintains a Public Build Record
Justin Greenbaum
Greenbaum Labs
February 2026
This is not a product announcement or a finished framework. It is a build record. It documents what it takes for an executive to design, test, and trust diagnostic infrastructure using public data and owned hardware. The technical and the strategic are not separated here, because in the work itself they were never separate.
Forty Hours and Counting
There is a moment in every build where you question everything. Not the architecture. Not the strategy. Not the market. Everything.
It is the kind of doubt that sits in your chest at 2 a.m. when the GPU has been running for ten hours, every request is timing out, and you cannot determine why.
I reached that moment more than once.
Over forty hours across six days, I debugged the same pipeline. A parameter in the wrong nesting level. A prompt that was too permissive. A working process I terminated because I could not observe progress. Each failure different. Each one invisible until it was not.
This is what building infrastructure actually looks like. Not a highlight reel. Not an architecture diagram. The real sequence of decisions, mistakes, and corrections that separates an idea from a system that can be trusted.
What Coherence Means in This Work
This work does not begin with a framework. It begins with an observation.
Organizations routinely say one thing and produce another.
The distance between those two is not abstract. It appears in public records, regulatory actions, consumer complaints, job postings, and narrative shifts over time. That distance can be observed, compared, and measured.
I use the term coherence to describe the structural integrity of that relationship.
Not alignment, which implies a static state.
Not integrity, which carries moral weight.
Coherence is descriptive. It reflects whether an organization’s internal narrative holds when tested against external reality.
The diagnostic model used here evaluates three dimensions, each derived exclusively from public sources.
Truth
The relationship between an organization’s stated claims and what outside observers experience. A gap does not require falsehood. Omission is sufficient.
Authority
The relationship between responsibility and decision-making power as expressed through role design, escalation patterns, and ownership signals.
Continuity
Stability of narrative over time. Repeated shifts without acknowledgment indicate structural drift rather than adaptation. This dimension requires multiple collection periods and emerges longitudinally.
Together, these form the Coherence Triangle. The question was never whether coherence could be measured. The question was whether I could build the infrastructure to measure it independently, on hardware I control, using models I understand, and produce diagnostics I trust.
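As a data structure, the triangle is small. The field names and the averaging rule below are illustrative assumptions, not the project's actual schema; the point is that Continuity stays unscored until multiple collection periods exist.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoherenceScore:
    """One diagnostic result across the three dimensions.

    Field names and the averaging rule are illustrative, not the
    project's actual schema.
    """
    truth: float                 # stated claims vs. external observations
    authority: float             # responsibility vs. decision-making power
    continuity: Optional[float]  # None until multiple collection periods exist

    def overall(self) -> float:
        # Average only the dimensions that have been scored.
        scored = [d for d in (self.truth, self.authority, self.continuity)
                  if d is not None]
        return sum(scored) / len(scored)

run = CoherenceScore(truth=0.64, authority=0.58, continuity=None)
print(round(run.overall(), 3))  # averages truth and authority only
```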
Why the System Runs on Owned Hardware
This project could have been implemented using cloud inference and commercial APIs. That approach is faster, cheaper in the short term, and easier to scale.
I chose not to use it.
If a diagnostic claims to measure the gap between narrative and reality, the diagnostic itself cannot depend on infrastructure I do not control. A system designed to surface fragility cannot be built on rented dependencies that can change without notice.
The environment consists of:
NVIDIA DGX Spark for inference
Synology NAS for raw data storage
Mac Studio for orchestration and development
All models run locally. All data is stored locally. All processing occurs on a private network.
The trade-offs are explicit. Inference takes minutes instead of seconds. Full diagnostic runs take hours. Throughput is constrained.
The benefit is traceability.
When a score is produced, I can identify exactly which data sources, model weights, prompts, and parameters contributed to it. That is not a performance feature. It is the foundation of trust in the measurement.
The economics reinforce the decision. Run 005 processed approximately 1.4 million tokens. On commercial APIs, that volume would cost anywhere from tens of cents to double-digit dollars, depending on provider and configuration. At scale, those costs compound quickly. On owned hardware, the marginal cost is power draw. Thirteen hours at sustained utilization produced no invoice.
That is not optimization. It is independence.
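The arithmetic behind that cost range is straightforward. The per-million-token prices below are illustrative placeholders, not quotes from any specific provider.

```python
# Back-of-envelope cost for Run 005's ~1.4 million tokens on commercial APIs.
# The prices are assumed bounds, not quotes from any specific provider.
TOKENS = 1_400_000

price_per_million_usd = {
    "budget tier": 0.15,     # assumed low-end price per 1M tokens
    "frontier tier": 15.00,  # assumed high-end price per 1M tokens
}

for tier, price in price_per_million_usd.items():
    cost = TOKENS / 1_000_000 * price
    print(f"{tier}: ${cost:.2f}")
```

Even the high-end estimate is modest for one run; the text's point is that repeated runs compound, while a local GPU's marginal cost stays at power draw.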
Center and Edge Data
A diagnostic reflects only the data it examines.
This system separates sources into two categories.
Center data
Materials an organization publishes intentionally. Press releases. Job postings. Regulatory filings. Earnings transcripts. Official communications.
Edge data
Public responses to those claims. Consumer complaints. Regulatory actions. Employee reviews. Other externally observable signals.
For the first diagnostic subject, Coinbase, all data was collected from publicly accessible sources, including:
job postings retrieved via the Greenhouse API
SEC EDGAR filings
consumer complaints filed with the CFPB
No internal systems, non-public documents, or privileged access were used.
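The center/edge split can be expressed as a small source registry. The endpoint paths below reflect my understanding of the public APIs named above and should be verified before use; they are not the project's actual configuration.

```python
# Illustrative registry separating center (organization-authored) from edge
# (externally observed) sources. Endpoint paths are assumptions based on the
# public APIs named in the text -- verify before relying on them.
SOURCES = {
    "center": {
        "job_postings": "https://boards-api.greenhouse.io/v1/boards/{board}/jobs",
        "sec_filings": "https://data.sec.gov/submissions/CIK{cik}.json",
    },
    "edge": {
        "cfpb_complaints": (
            "https://www.consumerfinance.gov/data-research/"
            "consumer-complaints/search/api/v1/?company={company}"
        ),
    },
}

for category, endpoints in SOURCES.items():
    print(category, sorted(endpoints))
```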
The dataset is incomplete by design. Employee review platforms, earnings transcripts, and social media signals were not included in this run. The diagnostic explicitly records those omissions.
A system that claims to measure coherence must be able to state what it does not know.
How the Pipeline Operates
The pipeline runs in four stages, each validated before proceeding.
Collect
Documents are retrieved from configured public sources. File counts, formats, and accessibility are validated.
Extract
Each document is processed by an extraction agent that identifies diagnostically relevant claims and observations and returns structured JSON. In the agent-powered run, this stage processed 844 documents across 844 consecutive agent calls with near-zero failure.
Score
Extracted content is evaluated across Truth, Authority, and Continuity. In agent mode, this includes structured debate between specialized agents and a Skeptic that can sustain or reject findings.
Synthesize
The system produces a diagnostic summary, supporting field notes, a watch list, an overall coherence score, and an explicit data quality grade.
The pipeline runs in two modes:
rule-based, which completes in under a second using pattern matching
agent-powered, which takes hours using multi-agent inference
Both produce results. The purpose of this build was to determine whether the agent-powered architecture materially improves diagnostic quality.
It does.
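The four stages can be sketched as a chain of functions with a validation gate after each. The bodies below are placeholders with invented values, since the real system's interfaces are not public.

```python
# Minimal sketch of the four-stage pipeline. Function bodies and all values
# are placeholders; the real system's interfaces are not public.
def collect():
    docs = [{"id": "doc-001", "text": "..."}]  # retrieve from public sources
    assert all("text" in d for d in docs)      # validate format/accessibility
    return docs

def extract(docs):
    # Each document goes to an extraction agent returning structured JSON.
    return [{"doc": d["id"], "claims": []} for d in docs]

def score(extracted):
    # Truth and Authority per run; Continuity needs multiple periods.
    return {"truth": 0.64, "authority": 0.58, "continuity": None}

def synthesize(scores):
    # Summary, field notes, watch list, overall score, data quality grade.
    return {"scores": scores, "data_quality": "C", "watch_list": []}

report = synthesize(score(extract(collect())))
print(report["data_quality"])
```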
What Forty Hours Teaches You
Synthetic data validated the mechanics. Real data exposed reality.
The rule-based extractor classified only 31.7 percent of real content. Most material fell outside predefined patterns. This was expected.
The agent pipeline was intended to read context and apply judgment. Initially, it returned nothing.
The cause was a single misplaced parameter. A model behavior flag was passed in the wrong location. The API accepted the request. The model ran. The output buffer was consumed internally. No usable output returned.
One parameter. Wrong nesting level. No error. No warning.
After correcting that, the model responded exhaustively. Each document produced more than ten thousand characters of structured JSON. Perfectly formatted. Completely unusable.
The problem was not infrastructure or model capability. It was the prompt. The instruction asked for everything, and the model complied.
Constraining the request to the top five diagnostically important items per document stabilized output immediately.
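The stabilizing constraint is a single line in the prompt. The wording and JSON shape below are reconstructions for illustration, not the system's actual prompt.

```python
# Reconstruction of the constrained extraction prompt. The wording and the
# JSON shape are illustrative, not the system's actual prompt.
EXTRACTION_PROMPT = """\
Read the document below and return ONLY valid JSON in this shape:
{{"items": [{{"type": "claim" or "observation", "text": "...", "why": "..."}}]}}

Return AT MOST the 5 most diagnostically important items.
Do not include anything else.

Document:
{document}
"""

def build_prompt(document: str) -> str:
    return EXTRACTION_PROMPT.format(document=document)

print(build_prompt("We put customers first.").splitlines()[0])
```

The cap does the work: without it, the model dutifully enumerated everything it could find; with it, output length and relevance stabilized.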
The lesson is not about prompt technique. It is that failure can live at any layer of the system, and it does not announce which one.
A later run appeared to hang. No logs. No output. No visible progress. The pipeline was working the entire time. Logging was not configured. Progress was invisible. I terminated a process that was nearly halfway complete.
Two print statements resolved it.
This is the kind of failure that does not appear in summaries. Progress you cannot observe is progress you will eventually destroy.
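The two print statements amount to a visible heartbeat: emit progress at a fixed interval so a long-running pipeline is distinguishable from a hung one. A minimal sketch, with the interval and names as assumptions:

```python
# Observability fix, not an architecture fix: print progress at a fixed
# interval. The interval and names are illustrative.
def process_all(docs):
    done = []
    for i, doc in enumerate(docs, start=1):
        done.append(doc)  # stand-in for the per-document agent call
        if i % 200 == 0 or i == len(docs):
            print(f"progress: {i}/{len(docs)} documents", flush=True)
    return done

result = process_all(list(range(844)))
```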
What the Diagnostic Found
The agent-powered diagnostic processed 844 documents over thirteen hours and classified 95.1 percent of all content.
The overall coherence score was 0.609, down from 0.621 in the rule-based baseline.
This is not regression. It is honesty.
Truth improved slightly as agents found more evidence on both sides of the narrative. The dominant pattern remained omission rather than contradiction.
Authority decreased materially. The agents identified concentration and diffusion patterns that rule-based logic could not detect.
Continuity was not scored. It requires longitudinal measurement.
Data quality was graded C due to incomplete source coverage. The diagnostic states this explicitly.
Inside the data, 645 observations clustered around the product experience. That signal emerged only because agents could read context that patterns could not.
The Evidence Chain Failure
One critical subsystem failed.
The scoring agents produced substantive findings, but could not reliably cite the specific claim and observation identifiers that supported them. The Skeptic rejected every finding.
The scores are valid. The evidence ledger is incomplete.
This is a wiring problem, not a capability problem. Shorter identifier aliases are being introduced to restore provenance integrity. The failure is documented here because the framework requires it.
If a system measures gaps, it must disclose its own.
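The alias fix is a simple mapping: give each long provenance identifier a short token the scoring agents can reproduce verbatim, then expand it back when assembling the evidence ledger. The identifier formats below are invented for illustration.

```python
# Sketch of the identifier-alias fix. Map long provenance IDs to short tokens
# agents can cite reliably, and keep the reverse map for the evidence ledger.
# Identifier formats are invented for illustration.
def build_aliases(ids):
    forward = {f"C{i}": full for i, full in enumerate(ids, start=1)}
    reverse = {full: short for short, full in forward.items()}
    return forward, reverse

claims = [
    "claim-2026-02-subject-press-0041",
    "claim-2026-02-subject-10k-0197",
]
forward, reverse = build_aliases(claims)
print(reverse["claim-2026-02-subject-10k-0197"])  # → C2
print(forward["C1"])
```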
The Executive Who Builds
I am not an engineer.
I am an executive who decided that understanding infrastructure is now a leadership capability.
As AI compresses the distance between intent and consequence, leadership that operates only through delegation loses resolution. The value is no longer in deciding. It is in understanding what decisions actually require.
No vendor briefing explains where reality resists abstraction. You learn that by building.
Why This Record Is Public
There is no established reference for this work.
This document exists as a record, not a guide.
It preserves decision context and holds the work accountable to its own standards. Coherence Diagnostics measures the gap between what organizations say and what they do. The build itself must be coherent.
This record is the edge data for the project’s own narrative.
The Record Begins
This is Edition 1.
What exists now:
an extraction engine that comprehends over 95 percent of content
a multi-agent scoring system with an active Skeptic
a synthesis layer that grades its own data quality
owned infrastructure with zero cloud dependency
And an evidence chain that is not yet complete.
Future editions will document what changes, why, and whether those changes improve coherence measurement over time.
If coherence matters, it must be observable.
If diagnostics matter, they must be accountable.
If an executive claims to understand the infrastructure, there must be evidence.
This is that evidence.
Justin Greenbaum
Founder, Greenbaum Labs
Building diagnostic infrastructure to measure the gap between what organizations say and what they do.
This is The Coherence Record, Edition 1, published at writing.justingreenbaum.com.


