Justin Greenbaum: The Coherence Record

The Coherence Record, Edition 6

Justin R. Greenbaum — Thu, 16 Apr 2026 18:53:53 GMT

Justin R. Greenbaum
Greenbaum Labs
April 2026

This edition is different from the ones that came before.

Editions 1 through 5 were build logs. They documented the construction of an instrument: scoring architectures, reproducibility testing, prompt hardening, fleet operations. The question behind every edition was whether the system could measure what it claimed to measure. Those questions are not finished. But this edition pauses the build to follow a different thread.

On March 27, Dr. Nick van der Meulen, a research scientist at MIT CISR, presented his work on digital business transformation at the MIT AI Executive Academy. One week earlier, he had published a research briefing titled “Minimum Viable Governance for Generative AI” (MIT CISR Research Briefing, Vol. XXVI, No. 3, March 2026). It was his newest piece, and it sets out four characteristics of governance designed for a world where the technology transforms every eighteen months: structurally agile, trustworthy by design, integrated end-to-end, and opportunity-sensitive.

I was in that room. During the session, I asked about a pattern I have seen repeatedly: authority that exists on paper but requires so much lateral alignment to execute that nobody actually owns the decision. He called it “very recognizable.” He connected it to what he calls organizational scar tissue: rules put in place because one person made one mistake, now applied to everyone forever. The conversation continued over lunch, and I described the diagnostic framework, the seventeen failure modes, the scoring pipeline, the center-edge documentation methodology.

I am not writing this to claim validation. I am writing it because his research and this project’s taxonomy are looking at the same problem from two altitudes. He is mapping what good looks like: the characteristics of governance that works. The Coherence framework maps what broken looks like: the structural conditions that emerge when those characteristics are absent. The two are complements. And the space between them is where the language lives.

The Language Is the Contribution

The Coherence project is built on a single premise: before you can measure organizational coherence, you need language that names what you are measuring. Before you can diagnose failure modes, you need vocabulary that makes those modes recognizable. The seventeen failure modes are not a scoring system. They are a naming system. FM-01, Responsibility Compression, describes something every person who has worked inside a scaling organization has felt. They felt it, adjusted to it, compensated for it, and could not name it. Without a name, it is invisible. You cannot address what you cannot see, and you cannot see what you cannot name.

The same insight surfaces in van der Meulen’s MVG work. He opened his session with a claim that landed harder than any framework or quadrant: shared vocabulary and strategic focus are the two prerequisites for transformation progress. Without shared vocabulary, “AI” means something different to every person in the room. “Transformation” is a word people nod at and define privately. “Governance” is either a reassurance or a threat, depending on who hears it.

This is what van der Meulen’s research and this project share as a foundational commitment: the belief that structural conditions must be named before they can be changed. His vocabulary (MVG, organizational explosions, silos and spaghetti, Future Ready) gives organizations language for where they are and where they need to go. The Coherence framework’s vocabulary (failure modes, field notes, the Triangle) gives organizations language for what is preventing them from getting there. The research describes the destination. The diagnostic names the obstacles. Both require language first.

But the instrument’s deepest contribution may not be the scores it produces. It may be the vocabulary it gives people for naming what they already observe. Edition 5 ended with that question: “whether someone needs a pipeline to see these patterns, or just the right questions.” This edition is the answer. The pipeline validates the language. The language is what scales.

A diagnostic score requires infrastructure, compute, methodology, a practitioner. A name requires only recognition. Someone reads “Responsibility Without Authority” and thinks: that is what I have been living inside for two years. That recognition is the beginning of the diagnostic, whether or not the pipeline ever runs on their organization. The language is the instrument’s gift to the people who will never buy the service. And it is the entry point for the people who will.

What Breaks When Governance Isn’t Structurally Agile

Van der Meulen’s first MVG characteristic is structural agility: governance that can adapt its own structure as the environment changes. Not flexibility in the casual sense. The ability to change the rules about who decides what, and how quickly those rules take effect, without convening a senate each time.

When this characteristic is absent, the Coherence framework names what appears.

FM-01, Responsibility Compression, is the most persistent signal in the diagnostic pipeline. It is one of three foundational failure modes the taxonomy identifies as tier 1: structurally universal in organizations past a certain scale. The instrument detected FM-01 above threshold in fourteen of fifteen fleet entities. In practice, the tier 1 modes are present in every large organization the pipeline has measured. What varies is severity, not presence. Responsibility concentrates where authority does not. Senior roles hold decision power. Frontline teams absorb the consequences without the ability to change outcomes. In a structurally agile governance model, decision rights redistribute as conditions change. Without that agility, they calcify. The people closest to the problem lack the authority to act on it. The people with authority are too far from the problem to see it clearly. Compression is the predictable result.

FM-03, Responsibility Without Authority, is the sharper version of the same condition. Someone is explicitly accountable. Their name is on the RACI chart. Their performance review includes the outcome. But they lack the organizational authority to influence that outcome. Van der Meulen had a line in his session that named this precisely: “You can have the most beautiful RACI chart in the world, but it’s not going to change anything fundamentally.” He is right. The chart assigns responsibility. It does not transfer power. When governance cannot restructure authority in response to shifting conditions, RACI becomes a documentation of servitude, not a mechanism of alignment.

FM-06, Exception Inflation, completes the picture. Every exception that gets hard-coded into process rather than resolved structurally is a governance system losing agility. A VP approves one off-cycle purchase because the timeline demands it. Next quarter, an exception form exists. The quarter after that, the form requires three signatures. A year later, forty percent of purchases route through the exception path, and the exception path is now the slow one. The organization layered governance on top of governance instead of fixing the structural condition that generated exceptions in the first place. Van der Meulen calls these layers “organizational scar tissue.” The Coherence framework counts them. They accumulate. They slow the organization down. And they are structurally invisible to the people living inside them because each individual scar feels reasonable.

What Breaks When Governance Isn’t Trustworthy by Design

The second MVG characteristic is that governance must be trustworthy by design: not trust bolted on after the fact, but trust embedded in the structure. People comply with governance they believe is fair, useful, and responsive. They route around governance they believe is theater.

When this characteristic is absent, the first thing that surfaces is FM-04, Metric Shadowing. The official metrics still get reported. They look fine. But the people producing those metrics know they do not reflect what is actually happening. A customer satisfaction score stays high because the survey only reaches customers who completed a transaction, not the ones who abandoned. A project status is green because the definition of green was quietly redefined two quarters ago. The governance mechanism is technically functioning. The trust is gone. The numbers are correct and meaningless.

FM-02, Escalation Inversion, follows. The escalation paths exist on paper. People know where to route a problem, who to flag, what to file. But when escalating is costly, slow, or reputationally risky, people stop doing it. They absorb problems at the edge instead. The issue gets quietly resolved, or it doesn’t, and the organization only learns of it when something public breaks. In a trustworthy system, escalation is a signal. In one where trust has eroded, escalation is treated as failure: the act of raising a problem carries more cost than the problem itself. Issues get absorbed rather than surfaced. The structural conditions that produced them remain.

This is the gap that the Coherence framework measures: the distance between what the governance system reports about itself and what the people inside it (and the customers outside it) actually experience. Trust by design means the governance system’s self-report is reliable. When it is not, the diagnostic finds the specific failure modes that explain why.

What Breaks When Governance Isn’t Integrated End-to-End

The third MVG characteristic is integration: governance that operates across organizational boundaries, not within them. Not integration as an IT project. Integration as a structural condition where governance mechanisms talk to each other, where a decision made in one unit is visible and actionable in another.

When this is absent, what appears is the condition van der Meulen’s research calls “silos and spaghetti.” When leaders in the room self-selected into quadrants, the poll aligned with his survey data: the majority placed themselves there. The majority condition of large organizations is dysfunction as normal. People surviving on heroics, compensating for fragmentation with personal effort, navigating workarounds that everyone knows about and nobody addresses.

The Coherence framework names this FM-05, Normalized Workarounds. It is the operational texture of silos and spaghetti. The workaround that started as a temporary bridge becomes the permanent road. The manual handoff between two systems that should be integrated. The spreadsheet that exists because the platform cannot do what the team needs. The person who holds the institutional knowledge of how things actually work, and whose departure would break the process.

FM-07, Coordination Decay, is the structural driver underneath. As governance fragments across organizational boundaries, the coordination cost between units rises silently. More meetings. More alignment documents. More “quick syncs” that are not quick and do not sync. The governance technically exists in each unit. The space between units is ungoverned. The coordination decay is invisible in any single unit’s reporting. It is visible only to the people absorbing the cost of bridging the gap and to the customers at the far end of it.

Van der Meulen made the amplification point explicitly in his afternoon session: AI does not create these conditions. AI amplifies whatever it is pointed at. Good operational backbone, clean data, skilled people with decision rights: AI accelerates that. Silos and spaghetti with overworked heroics and messy data: AI pours gasoline on it. The governance integration question is structural. AI does not create it. AI only makes it urgent. The Coherence framework measures the structural conditions. MVG describes the governance response. The sequence matters: diagnose first, govern second.

What Breaks When Governance Isn’t Opportunity-Sensitive

The fourth MVG characteristic is opportunity-sensitivity: governance that does not just prevent bad outcomes but actively creates conditions for good ones. This is the hardest characteristic to measure because its absence looks like stability. Nothing goes wrong. Nothing remarkable happens. The organization operates within its constraints and does not notice that the constraints have become the strategy.

The Coherence framework approaches this through the Truth vertex. Truth measures the distance between what an organization says about itself and what is observable at the edges: customer experience, employee experience, market reality. An organization that is not opportunity-sensitive tells a story about innovation, growth, and transformation that does not match the observable reality. The center narrative describes ambition. The edge data describes maintenance.

This is compression, operating at the narrative level in the same way FM-01 operates structurally. The center compresses complexity into a story it can tell the board, the market, the workforce. The edge lives the uncompressed version. The gap between the two is measurable, and the instrument measures it. A high Truth score means the center-edge gap is narrow, and the story matches the experience. A low Truth score means the story and the experience have diverged. Neither score tells you what to do. Both tell you where to look.

An opportunity-sensitive governance model keeps the gap narrow by design. The governance mechanisms surface edge reality into center decision-making. Customer complaints reach product strategy. Employee experience data reaches organizational design. Market signals reach resource allocation. When the governance model is not opportunity-sensitive, those feedback loops degrade. The center narrative drifts from edge reality. The Truth score declines. The organization becomes, in van der Meulen’s quadrant framing, a candidate for the Integrated Experience trap, the “dopamine trail” where customer-facing metrics improve while the underlying structure deteriorates. Everything looks better. Nothing has changed.

What’s Next

The MVG paper and the Coherence framework are adjacent layers. His research maps the governance characteristics that make organizations adaptive. The Coherence framework maps the structural conditions that emerge when those characteristics are absent. The mapping between them is specific: each MVG characteristic, when missing, produces identifiable, nameable failure modes.

This edition is the first attempt to bridge the two explicitly. The citations are a practitioner’s way of showing where a research tradition and twenty years of operational experience land on the same problems.

The language distribution work begins now. Each failure mode is a standalone piece of content. Each one names something people recognize but have not had a word for. The taxonomy (seventeen failure modes, twenty-one field notes, the Coherence Triangle) came first. It came from twenty years inside organizations where these patterns had no names. The pipeline was built afterward, to prove these conditions exist in the wild at scale, across sectors, without needing a client engagement or putting a former employer on the line. The language was always the point. The pipeline is the evidence.

Getting the vocabulary into circulation is the next phase. The pipeline will continue to run. The fleet will grow. The instrument will sharpen. But the vocabulary does not need the pipeline to travel. It needs only to be placed in front of people who have been waiting for it without knowing they were waiting.

Van der Meulen said something in his session that I keep coming back to: “The hard, unglamorous work of getting the conditions right for AI to actually help accelerate and transform the organization… that is not paid enough attention to.” He is right. And the first step in that work is naming the conditions. Not the aspirational conditions. The current ones. The ones that have been invisible because nobody had words for them.

Now they have names. Seventeen of them.

The images in this edition are from my own library, shot on Leica. Everything in this project is built or sourced firsthand. The visuals are no exception.

References

Van der Meulen, N., Jewer, J., and Levallet, N. “Minimum Viable Governance for Generative AI.” MIT CISR Research Briefing, Vol. XXVI, No. 3, March 2026.

Van der Meulen, N. and Ross, J.W. “Realizing Decentralized Economies of Scale.” MIT CISR, January 2023.

Van der Meulen, N. “Managing the Two Faces of Generative AI.” MIT CISR, September 2024.

Van der Meulen, N. “Bring Your Own AI: How to Balance Risks and Innovation.” MIT Sloan Management Review, October 2024.

Ross, J.W., Beath, C.M., and Mocker, M. Designed for Digital: How to Architect Your Business for Sustained Success. MIT Press, 2019.

Greenbaum, J. “The Coherence Record, Editions 1–5.” Greenbaum Labs, 2026.

The Coherence Record, Edition 5

Justin R. Greenbaum — Tue, 17 Mar 2026 20:55:26 GMT

Justin R. Greenbaum | Founder, Greenbaum Labs
March 2026

What’s Happened

Edition 4 ended with the strongest claim in this project’s history: the instrument is reproducible. Zero standard deviation. Same entity, same score, every time. Finding-derived scoring replaced the LLM’s opinion with deterministic computation. The pipeline was grounded.

That was published on March 5. By March 13, eight days later, the project changed shape again.

134 runs in the ledger now. This is what the last eight days produced.

The Prompts Weren’t Good Enough

Edition 4 proved the scoring architecture was sound. It did not prove the prompts were.

Con-Hotel’s original run, run 079, scored with 33% skeptic throughput. Two of six findings survived debate. The other four were rejected. The rejected findings followed a consistent pattern: the agent had decided what score felt right, then went looking for evidence to justify it. The findings read like conclusions wearing an evidence costume.

This is the same failure pattern twenty-seven runs had eliminated from the scoring architecture: the model generating an opinion instead of computing from evidence. Fixed in the formula. Not fixed in the prompt.

Two changes:

First, Evidence Discipline blocks were added to the truth and authority scorer prompts. These are structural constraints, not suggestions. The prompt now explicitly names the failure pattern, “starting from a conclusion and working backward,” and forbids it. It requires each finding to be built from cited evidence: specific claims, specific observations, specific scope. The finding follows the evidence. Not the other way around.

Second, the authority few-shot examples were rewritten. The old examples were abstract. They led the model to produce vague, general findings that sounded analytical but said nothing specific enough to survive the Skeptic. The new examples follow a progression: BAD (vague, unsupported), STILL BAD (specific but backward, conclusion first), GOOD (evidence first, finding emerges from the data). Each example is led by a specific customer quote, not a category label.

Con-Hotel rescore with the hardened prompts: 83% skeptic throughput. Five of six findings sustained. Triple-blind validation confirmed deterministic, 0.000 standard deviation.
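
The throughput figures quoted here are ratios of findings sustained to findings proposed. A minimal sketch of that arithmetic follows. The helper name `skeptic_throughput` is hypothetical and is not taken from the pipeline repo.

```python
def skeptic_throughput(sustained: int, proposed: int) -> float:
    """Fraction of proposed findings that survived the Skeptic's debate."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    if not 0 <= sustained <= proposed:
        raise ValueError("sustained must be between 0 and proposed")
    return sustained / proposed


# Con-Hotel figures quoted in this edition.
run_079 = skeptic_throughput(2, 6)  # original prompts: two of six sustained
rescore = skeptic_throughput(5, 6)  # hardened prompts: five of six sustained
print(f"{run_079:.0%} -> {rescore:.0%}")  # prints "33% -> 83%"
```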

The prompts are now committed to the pipeline repo. The same codebase that runs the fleet.

Autoresearch

The prompt changes that fixed Con-Hotel were designed by hand. The Skeptic’s rejections were analyzed, the failure pattern was identified, and the prompts were rewritten to prevent it. That worked. But it does not scale.

So the lab built the machine that does it.

Andrej Karpathy recently open-sourced a similar concept: an agent that iterates on ML training code autonomously, running experiments while the operator sleeps. Different domain, same principle: structured experimentation at a pace no human can match. The Greenbaum Labs version optimizes diagnostic prompts against an adversarial debate mechanism.

Autoresearch is a harness that runs prompt experiments automatically. It takes a frozen extraction (same claims, same observations) and tests prompt variations against it, measuring skeptic throughput, scoring correctness, and reproducibility. Each experiment produces a structured log: what changed, what the scores were, whether the findings survived debate.
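
One way to picture a structured experiment log is as a small record serialized to one JSON line per experiment. This is a sketch under assumptions: the class name, field names, and every value below are hypothetical placeholders, not the harness's actual schema or results from the ledger.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical log entry: what changed, what it scored, what survived."""
    experiment_id: str
    track: str               # "scoring" or "extraction"
    change: str              # the single variable this experiment varied
    metric: float            # optimization metric for this variant
    findings_proposed: int
    findings_sustained: int  # findings that survived the Skeptic


def to_log_line(record: ExperimentRecord) -> str:
    """Serialize with sorted keys so identical records give identical lines."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values for illustration only.
record = ExperimentRecord(
    experiment_id="exp-placeholder-001",
    track="scoring",
    change="placeholder description of the prompt variation",
    metric=0.0,
    findings_proposed=6,
    findings_sustained=0,
)
line = to_log_line(record)
print(line)
```

Sorting the keys keeps the log line stable across runs, which makes two logs easy to diff when checking reproducibility.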

Between March 10 and 12, the Sparks ran 60 experiments across two tracks.

The scoring track ran 41 experiments. The baseline, Edition 4’s prompts before the Evidence Discipline changes, scored 0.058 on the optimization metric. The best variant scored 0.667. An 11.5x improvement. The key discovery wasn’t a single brilliant prompt. It was that asymmetric extraction limits, pulling 3 items from center sources and 5 from edge sources, outperformed symmetric limits. The edge is where the signal lives. Give the model more of it. The scoring track converged. Later experiments showed diminishing returns. The prompt space for scoring is largely explored. That means the current prompts are near the ceiling for what prompt engineering alone can achieve.
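
The asymmetric limit can be sketched as a per-source-type cap. The limits (3 from center, 5 from edge) come from the text. The function name, the item strings, and the assumption that items arrive ranked best-first are hypothetical.

```python
from typing import Dict, List

# Limits quoted in the text: 3 items from center sources, 5 from edge sources.
EXTRACTION_LIMITS: Dict[str, int] = {"center": 3, "edge": 5}


def cap_items(
    ranked: Dict[str, List[str]],
    limits: Dict[str, int] = EXTRACTION_LIMITS,
) -> Dict[str, List[str]]:
    """Keep at most limits[source_type] items for each source type.

    Assumes each list is already ranked best-first (an assumption of this
    sketch, not a documented property of the pipeline).
    """
    return {
        source_type: items[: limits[source_type]]
        for source_type, items in ranked.items()
    }


# Placeholder items, not real extractions.
candidates = {
    "center": [f"center-item-{i}" for i in range(1, 7)],
    "edge": [f"edge-item-{i}" for i in range(1, 7)],
}
capped = cap_items(candidates)
print({source_type: len(items) for source_type, items in capped.items()})
```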

The extraction track ran 19 experiments. Baseline 0.117, best 0.450. A 3.8x improvement, with more room to run. Extraction is upstream of everything: the quality of claims and observations determines what the scorer has to work with. This track matters more than the scoring track in the long run. No convergence yet. The runway is open.
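
The improvement multiples for both tracks follow directly from the baseline and best values quoted above. A short check, with a hypothetical helper name:

```python
def improvement_factor(baseline: float, best: float) -> float:
    """How many times larger the best metric is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return best / baseline


# (baseline, best) on the optimization metric, as quoted in this edition.
tracks = {
    "scoring": (0.058, 0.667),
    "extraction": (0.117, 0.450),
}
factors = {name: improvement_factor(b, best) for name, (b, best) in tracks.items()}
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```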

The harness is 7,165 lines of code. It runs unsupervised. It produces structured, reproducible experiment logs. And it confirmed something previously suspected but never measured: the pipeline’s reproducibility is near-perfect even as correctness varies. Fleet average reproducibility across cross-entity validation: 0.998. The instrument produces the same answer every time, even when the answer is wrong. That is the foundation. You fix correctness once and it stays fixed.

Prompt optimization is not craft anymore. It is experimental science. Hypothesis, test, measure, iterate. The machine questions the machine.

Three Machines, One Night

On the night of March 12, three jobs were launched across three machines.

Spark 2 rescored runs 070 through 084, the March 6 collection, all fifteen fleet entities, with the hardened prompts. M2 Studio rescored runs 049 through 064, the February 20 collection, the same fifteen entities, with identical prompts. Spark 1 ran Con-Hotel end-to-end, run 085, full extraction and scoring with the hardened prompts.

Everything completed overnight. Thirty rescores and one full pipeline run, across three machines, without intervention.

Six weeks ago the operational workflow required manual SSH checks on each machine, NAS mount debugging, and hand-verification of every flag in every launch command. Three failed Con-Hotel launches in a single session (wrong environment, wrong mode, missing flags) forced the construction of proper pre-flight checks.

The overnight run confirmed that the operational infrastructure caught up to the analytical infrastructure. The pipeline was reproducible weeks ago. The operations around it were not. Now they are.

The Numbers

Fleet rescore v4 results, March 6 collection (fifteen entities, hardened prompts):

Fleet average overall: 0.452. Range: 0.370 (Tech-Oscar) to 0.496 (Tech-Mike, Fin-Foxtrot). Truth average: 0.491. Authority average: 0.398.

For comparison, the original scores on this collection averaged 0.455 overall. The fleet moved down by 0.003. Effectively unchanged. But what moved underneath matters.

The biggest individual shifts:

Fin-Delta: Truth rose from 0.455 to 0.554 (+0.099). The original scoring had suppressed a real signal: center-edge alignment on financial performance that the Evidence Discipline prompts now surface properly.

Tech-Oscar: Overall dropped from 0.450 to 0.370 (−0.080). The original scoring had been generous. The hardened prompts derived a lower score from the specific findings that survived debate. The old scorer gave Tech-Oscar credit the evidence didn’t support.

The pattern is the same one from Edition 4’s rescore: the system corrects in both directions. Upward where signal was suppressed. Downward where opinion had inflated. Calibration, not drift.
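
The shifts quoted above can be recomputed from the before and after scores. A minimal sketch using only figures from this edition. The helper name and the dictionary labels are hypothetical.

```python
def score_delta(before: float, after: float) -> float:
    """Signed change in score, rounded to the three decimals the record uses."""
    return round(after - before, 3)


# (original score, v4 rescore) on the March 6 collection, as quoted above.
shifts = {
    "Fin-Delta (Truth)": (0.455, 0.554),
    "Tech-Oscar (Overall)": (0.450, 0.370),
    "Fleet average (Overall)": (0.455, 0.452),
}
deltas = {label: score_delta(before, after) for label, (before, after) in shifts.items()}
for label, delta in deltas.items():
    print(f"{label}: {delta:+.3f}")
```

The signed output makes the two-directional correction visible at a glance: one entity moves up, one moves down, and the fleet average barely moves.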

Skeptic throughput across the v4 fleet: 48% (44 of 90 findings sustained). Tighter than the original runs’ 57%. The hardened prompts produce fewer findings overall, but the ones that survive are better grounded. Quality over quantity. That is the design intent.

February 20 collection rescored with identical prompts: fleet average overall 0.443. Range: 0.386 (Fin-Echo) to 0.496 (Aero-Charlie). Comparable distribution, different collection date, same methodology. The scores are in the same band because the instrument is calibrated, not because the entities haven’t changed.

Run 085, Con-Hotel full end-to-end with hardened prompts: overall 0.460 (finding-derived). Truth 0.500, Authority 0.410. The extraction pulled 201 claims and 2,559 observations from the collection. Skeptic throughput was low: 20%, one finding sustained out of five. The Skeptic was harsh on this run, and the surviving finding was strong. That is the system working correctly. A low throughput rate with strong surviving findings is a more honest result than a high throughput rate with weak ones.

Against Con-Hotel’s original run 079 (overall 0.427), run 085 gained 0.033. A modest improvement. The real difference is in the evidence quality. The finding that survived debate in 085 is grounded in specific claims and observations. The findings that survived in 079 were vaguer. The score is similar. The confidence behind it is not.

Continuity

Every edition of this record has contained the same line: “Continuity remains unscorable. One collection period.”

That line is retired.

The February 20 and March 6 collections, both scored under finding-derived v1 with hardened prompts, provide the two temporal points needed to compute Continuity. For every entity in the fleet, there are now two readings on the same instrument, separated by two weeks.

Two weeks is not much. But it is infinitely more than zero. And the structure is in place for the next collection, and the one after that.

What Continuity measures is trajectory. Truth and Authority are snapshots: where is this organization right now? Continuity asks: is it getting better, getting worse, or holding steady? Is the compression increasing? Is the center-edge gap widening or closing? Are the same failure modes persisting, or are new ones emerging?

The diagnostic becomes most valuable here. Not “here is your coherence” but “here is where your coherence is heading.” A snapshot tells you what to investigate. A trajectory tells you what is urgent.

The entity-level deltas between February 20 and March 6 are the next computation. The data exists. The methodology is identical. The analysis is coming.

Findings

Edition 4 appeared to be a conclusion: reproducibility proved, architecture locked, fleet scored. It was not a conclusion. It was the foundation for a harder set of questions.

The prompt deficiency was discovered by using the instrument, not by theorizing about it. The Skeptic’s 33% throughput on Con-Hotel meant four findings were rejected, and the rejection rationale pointed to the prompt, not the scorer. The instrument diagnosed its own inputs. That is a real feedback loop.

Automating prompt optimization appeared to be a shortcut. It is not. It is the only way to explore a space this large with any rigor. The autoresearch harness ran 60 structured experiments in three days, each one isolating a single variable. Combined with the 13 hand-tuned experiments that preceded it, 73 total experiments shaped the current prompts. No manual process achieves that. Not in three days, not in thirty. The machine is better at questioning itself than the operator is at questioning it.

A structural shift occurred in the last eight days. Editions 1 through 4 were construction: designing the architecture, fixing the scorers, debugging the pipeline. The relationship was builder to tool. With autoresearch, the instrument improved itself. The operator set the constraints, defined the metrics, launched the harness, and read the results. The machine ran the experiments independently. Builder to observer. That is the shift underneath the numbers. The discipline is in the constraints, not the keystrokes.

The authority data constraint, carried as a cap through Editions 3 and 4, was lifted without ceremony. Employee reviews appeared in the March 6 collection for all fifteen entities. The internal voice that was entirely absent from the edge data now exists. The authority scores did not move much. That raises a harder question than the cap did: the constraint was clear and honest. Now the data is present and the scores are similar, and the next step is determining whether the instrument is surfacing what the employee reviews contain or whether the extraction and scoring prompts need to be tuned to this new source type.

Continuity changes what the project is. The instrument has been taking snapshots. Snapshots are useful. They show where compression lives, where the center-edge gap is widest, where authority is concentrated or diffused. But snapshots are inherently limited. One reading on a patient. No indication of trajectory. Continuity adds the temporal dimension. The vital sign over time. Lighting it does not just add a third vertex to the Triangle. It transforms the diagnostic from a static assessment to a dynamic one. That transformation is larger than any scoring architecture change or prompt improvement.

The overnight run is the operational milestone. Not because the computation was impressive (it is commodity inference on consumer hardware), but because the infrastructure held without the operator. The pipeline ran. The pre-flight checks caught errors before launch. The scoring was deterministic. The results landed on the NAS. Morning review confirmed completion. That is operations, not engineering. The project crossed that line sometime in the last eight days.

What’s Capped

The structural constraints from Edition 4 remain, with two significant changes.

The authority cap has been partially lifted. The March 6 collection includes employee reviews for all fifteen entities, approximately 100 Indeed reviews per entity with ratings, positions, and locations. This is the first time the pipeline has had internal voice data in the edge sources. The February 20 collection still has no employee reviews; that cap remains.

The v4 rescore of the March 6 collection had access to this data, and run 085 (Con-Hotel, full end-to-end) confirmed that claims were extracted from employee reviews. Authority scores on the March 6 collection still cluster between 0.375 and 0.500. Whether that clustering reflects a genuine measurement or whether the extraction and scoring prompts are not yet surfacing the employee review signal effectively is an open question. The data constraint is lifted. Whether the instrument is fully using that data is the next thing to verify.

Overall confidence remains capped at 0.60. Same reasoning. The data supports measurement within a range, and the system reports that range rather than inventing precision it doesn’t have.

Continuity is no longer dark. Two collection points exist. The computation is next. The cap here is temporal; two weeks of separation limits what the trajectory can reveal. More collection points, more widely spaced, will deepen the signal. But the vertex is lit. The infrastructure is in place.

What’s Next

The immediate work is the Continuity analysis. Fifteen entities, two collection dates, identical scoring. The deltas will show which entities shifted and in which direction. Some of those shifts will be real: a company changed its messaging, launched a product, faced a crisis. Some will be noise: collection variance, source availability differences. Distinguishing signal from noise in the Continuity vertex is the next methodological challenge.
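
A first pass at the Continuity deltas could look like the sketch below: subtract each entity's earlier score from its later one, then label anything inside a noise floor as unchanged. Everything here is an assumption for illustration. The entity names and scores are placeholders, and the 0.02 noise floor is a hypothetical value, not a threshold the project has set.

```python
from typing import Dict


def continuity_deltas(earlier: Dict[str, float], later: Dict[str, float]) -> Dict[str, float]:
    """Per-entity change between two collections, for entities present in both."""
    common = sorted(earlier.keys() & later.keys())
    return {entity: round(later[entity] - earlier[entity], 3) for entity in common}


def classify(delta: float, noise_floor: float = 0.02) -> str:
    """Label a delta; the noise floor is a hypothetical placeholder."""
    if abs(delta) < noise_floor:
        return "within noise"
    return "improving" if delta > 0 else "declining"


# Placeholder scores, not fleet data.
february = {"Entity-A": 0.440, "Entity-B": 0.470, "Entity-C": 0.400}
march = {"Entity-A": 0.450, "Entity-B": 0.420, "Entity-C": 0.460}

labels = {
    entity: classify(delta)
    for entity, delta in continuity_deltas(february, march).items()
}
for entity, label in labels.items():
    print(f"{entity}: {label}")
```

A fixed noise floor is the simplest possible choice. With more collection points, a per-entity estimate of collection variance would be a more defensible way to separate signal from noise.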

The fleet needs to grow. Fifteen entities across five sectors gives trios in most industries. Enough to detect variation. Not enough to establish baselines. The vital-signs framing (coherence as an organizational health metric) requires enough data points per sector to define what normal looks like. That work continues.

The autoresearch extraction track has room to run. Nineteen experiments, 3.8x improvement, no convergence yet. Extraction quality is upstream of everything. Better claims and observations mean better findings, which means better scores. The scoring prompts are near their ceiling. The extraction prompts are not.

And something else is taking shape. The taxonomy (seventeen failure modes, twenty-one field notes, the Coherence Triangle) was built for the pipeline. It was designed to be computed by machines against public data. But the patterns it describes are recognizable to anyone who has worked inside an organization. Immediately recognizable. An early external review produced this reaction: “You’re making the invisible, visible.”

The question forming is whether someone needs a pipeline to see these patterns, or just the right questions. Whether the instrument’s real contribution is not the scores it produces but the vocabulary it gives people for naming what they already observe. The next edition will follow that question.

The images in this edition are from my own library, shot on Leica over the last twenty years. Everything in this project is built or sourced firsthand. The visuals are no exception.

The Coherence Record, Edition 4

Justin R. Greenbaum — Thu, 05 Mar 2026 14:38:32 GMT

Greenbaum Labs

March 2026

What’s Happened

Edition 3 ended with a line I believed when I wrote it: “the instruments are getting sharper.”

They were not. They were producing numbers that looked like measurements but behaved like opinions. This edition is about discovering that, fixing it, and what became possible once the fix held.

Between February 26 and March 3, the pipeline ran twenty-seven hardening diagnostics on a single entity, rescored all fifteen fleet entities under a new scoring architecture, shipped two public websites, and defined consulting engagements. Six days. The most consequential week in the project’s history.

It started because I turned the instrument on itself.

The Variance Problem

Edition 3 flagged a specific concern: five entities landed at exactly 0.33 on Truth. I described this as a floor, the scorer compressing within the low range, unable to differentiate between moderately misaligned and severely misaligned. I proposed a wider aperture. That was the wrong diagnosis.

The problem was not the range of the scorer. The problem was that the scores were not measurements.

I discovered this by rescoring the same entity’s extraction eight times using the same model. Same claims. Same observations. Same scorer. Eight runs. Truth scores: 0.57, 0.43, 0.43, 0.62, 0.33, 0.62, 0.33, 0.33. Standard deviation: 0.114. Range: 0.33 to 0.62.

Authority, scored by the same process: standard deviation 0.021.

The truth scorer was not measuring coherence. It was sampling from a distribution of plausible-sounding numbers and returning whichever one the model generated on that particular inference pass. The five entities clustered at 0.33 in Edition 3 didn’t share a structural condition. They shared a scoring artifact. The model’s most common low-range output happened to be 0.33, the way a person asked to estimate something uncertain might repeatedly say “about a third.”

Authority was stable because authority findings are structurally constrained. Compression, diffusion, and misalignment are observable in the data. Truth is harder to pin down. The distance between what an organization says and what observers experience admits more interpretive latitude. The model used that latitude differently each time.

A diagnostic instrument with 0.114 standard deviation on its primary vertex is not an instrument. It is a random number generator with a plausible output range.
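The spread of the eight Truth rescores can be checked with the standard library. This is a sketch over the two-decimal values as printed in this edition, so the result need not reproduce the reported 0.114 exactly; the underlying unrounded scores and the estimator used (population or sample) are not stated here.

```python
import statistics

# The eight Truth rescores as listed in this edition (two-decimal values).
truth_runs = [0.57, 0.43, 0.43, 0.62, 0.33, 0.62, 0.33, 0.33]

score_range = (min(truth_runs), max(truth_runs))
population_sd = statistics.pstdev(truth_runs)  # divides by n
sample_sd = statistics.stdev(truth_runs)       # divides by n - 1

print(f"range: {score_range[0]:.2f} to {score_range[1]:.2f}")
print(f"population sd: {population_sd:.3f}, sample sd: {sample_sd:.3f}")
```

Either estimator puts the spread above 0.1 on a 0-to-1 scale, which is the point of the paragraph above: identical inputs produced readings almost 0.3 apart.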

Twenty-Seven Runs

The hardening campaign was designed to isolate the source of variance systematically. Twenty-seven runs, all on a single entity, Fin-Delta, using the same collection date, the same pipeline version, varying one parameter at a time.

Phase 1: Model comparison. Six runs, six different extraction models ranging from 8 billion to 72 billion parameters, all scored by the same 32-billion-parameter model. The question: does extraction quality predict score quality?

It does not. The 8-billion-parameter model produced an overall score of 0.478. The 72-billion-parameter model produced 0.371. The smallest model outscored the largest. The scoring noise was louder than the model signal. This eliminated model capability as the explanation for variance and pointed directly at the scoring mechanism itself.

Phase 2: Reproducibility. Eight rescores of a single extraction, testing whether the same claims and observations produce the same scores when rescored by the same model. They did not. Truth ranged from 0.33 to 0.62. Authority held at 0.40 to 0.45.

The diagnosis was now specific: the Truth and Authority agents were returning a floating-point number, a single scalar that the model generated alongside its textual analysis. That number was an LLM opinion. It reflected the model’s general sense of where the score should land, not a computation grounded in specific evidence. Run the same prompt twice, get a different number. The textual findings were substantive. The numerical scores were not.

Phase 3: The architectural change. The solution was to stop asking the model for a number.

The agents already produced structured findings as part of their analysis. Each finding identifies a specific dimension (alignment, omission, or contradiction for Truth; compression, diffusion, or misalignment for Authority), cites specific claims and observations, and characterizes the strength of the evidence. These findings then pass through the Skeptic debate, where weak or unsupported findings are rejected.

The change: instead of using the model’s self-reported score, compute the score deterministically from the findings that survive the Skeptic. Each dimension has a calibrated base weight. Each strength level maps to a multiplier. The formula is fixed. The model produces findings. The math produces scores.

The calibrated bases, frozen after testing against the fleet’s existing data:

Truth: alignment shifts the score upward by 0.45, omission shifts it downward by 0.30, contradiction shifts it downward by 0.50. Authority: compression shifts downward by 0.25, diffusion by 0.18, misalignment by 0.22. A sparse-finding dampener prevents a single finding from saturating the score. If only one finding survives debate, its influence is scaled by one-third.

The agent’s original floating-point score is preserved in the metadata as an audit field. It no longer determines the production score.
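A minimal sketch of what finding-derived scoring could look like. The six base weights are the ones stated above; everything else is an assumption made for illustration: the 0.5 baseline, the strength multipliers, the clamping to the 0-to-1 range, and the reading of the dampener as "multiply the lone finding's shift by one-third". It is not the pipeline's actual formula.

```python
from dataclasses import dataclass

# Base weights as stated in the text (positive raises the score, negative lowers it).
BASES = {
    "truth": {"alignment": 0.45, "omission": -0.30, "contradiction": -0.50},
    "authority": {"compression": -0.25, "diffusion": -0.18, "misalignment": -0.22},
}

# ASSUMPTIONS, not from the source: baseline, multipliers, dampener interpretation.
BASELINE = 0.5
STRENGTH_MULTIPLIER = {"weak": 0.5, "moderate": 0.75, "strong": 1.0}
SPARSE_SCALE = 1 / 3


@dataclass(frozen=True)
class Finding:
    vertex: str      # "truth" or "authority"
    dimension: str   # e.g. "omission"
    strength: str    # "weak", "moderate", or "strong"
    sustained: bool  # True if the finding survived the Skeptic debate


def derive_score(vertex, findings):
    """Compute a vertex score from sustained findings only; no model-reported number is used."""
    surviving = [f for f in findings if f.vertex == vertex and f.sustained]
    shift = sum(
        BASES[vertex][f.dimension] * STRENGTH_MULTIPLIER[f.strength]
        for f in surviving
    )
    # Sparse-finding dampener: a single surviving finding has reduced influence.
    if len(surviving) == 1:
        shift *= SPARSE_SCALE
    return round(min(1.0, max(0.0, BASELINE + shift)), 4)


example = [
    Finding("truth", "alignment", "strong", True),
    Finding("truth", "omission", "moderate", True),
    Finding("truth", "contradiction", "weak", False),  # rejected in debate, ignored
]
truth_score = derive_score("truth", example)
```

The design point is that the function is pure: the same sustained findings always yield the same score, so any remaining variance has to come from which findings the model produces, not from the arithmetic.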

Three validation runs under the new architecture showed immediate improvement. Authority standard deviation: 0.011. Truth still varied, not because the formula was unstable, but because the model was generating different findings each time. Same data, different emphasis, different findings, different derived scores.

Phase 4: Determinism. The remaining variance came from upstream. The stratified sampler that selects which claims and observations to present to each agent used unseeded random shuffling. Different samples meant different context, which meant different findings, which meant different scores.

Four changes eliminated this:

Seed the sampler. Each scope group gets a deterministic seed derived from a hash of its group key. The same entity always produces the same sample.
Sort claims and observations by identifier before sampling. Deterministic input order.
Constrain agents to exactly three findings per vertex, one per dimension. No more, no fewer. The model must produce one alignment finding, one omission finding, and one contradiction finding for Truth, each grounded in cited evidence.
Normalize finding phrasing with structural templates to eliminate stylistic drift between runs.
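The first two changes above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's sampler: the record shape (dictionaries with an "id" field), the function names, and the choice of SHA-256 are hypothetical. A cryptographic digest is used because Python's built-in hash() for strings is salted per process and would not give a stable seed across runs.

```python
import hashlib
import random


def seed_for_group(group_key):
    """Derive a stable integer seed from a scope group key."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_group(group_key, records, k):
    """Return the same k records for the same group, regardless of input order."""
    # Sort by identifier first so the sampler always sees the same sequence.
    ordered = sorted(records, key=lambda record: record["id"])
    # A private Random instance avoids sharing state with the global generator.
    rng = random.Random(seed_for_group(group_key))
    return rng.sample(ordered, min(k, len(ordered)))


# Hypothetical records; only the "id" field matters for this sketch.
claims = [{"id": f"claim-{n:03d}"} for n in range(20)]
first = sample_group("Fin-Delta/truth", claims, 5)
second = sample_group("Fin-Delta/truth", list(reversed(claims)), 5)
```

Here `first` and `second` are identical even though the input order differs, which is the property the seeding and sorting steps are meant to guarantee.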

Three final validation runs. Truth: 0.4551, 0.4551, 0.4551. Authority: 0.3889, 0.3889, 0.3889. Overall: 0.4253, 0.4253, 0.4253.

Standard deviation: 0.000. The token counts were identical.

The pipeline is fully deterministic. Run the same entity twice, get the same score. Not approximately. Exactly.

What Changed in the Scores

With the new scoring architecture locked, all fifteen fleet entities were rescored under finding-derived scoring. The same claims and observations from the original fleet run, scored by the new deterministic system.

The 0.33 truth floor is gone. Six entities that had clustered at exactly 0.33 now spread across 0.36 to 0.53. The scores differentiate. Aero-Alpha, which had been indistinguishable from Fin-Delta, Tech-Mike, Auto-Juliet, Fin-Foxtrot, and Aero-Charlie at the old floor, now scores 0.49 on Truth, a meaningfully different reading from Auto-Juliet’s 0.36 or Fin-Foxtrot’s 0.37.

Truth standard deviation across the fleet dropped from 0.096 to 0.067. Not because the scores compressed, but because the artificial clustering disappeared. The old scores had two modes: entities stuck at 0.33 and entities scattered above. The new scores form a continuous distribution. The instrument resolves the range where most readings land, which is exactly what Edition 3 said was needed.

Authority tightened further, from a standard deviation of 0.073 to 0.041. The old authority scores included an entity at 0.62 that was never supported by the evidence, an LLM opinion that happened to be generous. Under finding-derived scoring, authority clusters between 0.375 and 0.500, which reflects the structural reality that every entity in this fleet has the same constraint: no employee reviews in edge data. The scorer now acknowledges that constraint in its output rather than generating scores that imply resolution it doesn’t have.

Fleet average overall coherence moved from 0.458 to 0.445. A small downward shift. The new system is not more optimistic. It is more honest.

The rank order partially held, and in the places where it didn’t, the corrections were revealing. Tech-Mike moved from 0.33, indistinguishable at the floor, to 0.53, the highest Truth score in the fleet. A shift of +0.20, the largest in the rescore. The old architecture had suppressed a real signal. Tech-Mike’s center-edge narrative alignment was materially better than the rest of the fleet, and the scorer could not see it because it was generating a default low number instead of computing from evidence.

Auto-Lima moved the other direction: 0.50 to 0.40. The old scorer had been generous. The new one derived a lower score from the specific findings that survived debate. The system corrected in both directions: upward where signal was suppressed, downward where opinion had inflated. That is what an honest recalibration looks like.

Auto-Juliet and Fin-Foxtrot, which were invisible at the 0.33 floor, emerged as the fleet’s lowest Truth scores, a finding that was always there in the data but could not surface through the old scoring mechanism.

In Edition 3, I wrote that Aero-Alpha’s score was the fleet’s lowest, but cautioned that the data quality grade was the weakest and only one finding survived the Skeptic. Under finding-derived scoring, Aero-Alpha’s Truth rose from 0.33 to 0.49. The old score was the model’s default low output. The new score reflects the specific findings that survived debate. Aero-Alpha is still the weakest in the fleet on several dimensions. But the measurement now explains why, in terms that trace to evidence, rather than landing on a number the model reached for when it was uncertain.

What the Hardening Exposed

The twenty-seven runs answered the explicit questions they were designed to answer. They also revealed something I had not been looking for.

The scoring agents produce findings that are substantively valuable. They identify real patterns in the data. They cite specific claims and observations. The Skeptic debate correctly filters weak findings and sustains strong ones. This mechanism, the part that does the diagnostic thinking, was never broken.

What was broken was the translation layer. The agents did good analytical work and then generated a number that did not reflect it. The number was a separate act of inference, disconnected from the structured reasoning that preceded it. It was as if a physician conducted a thorough examination, identified specific clinical findings, and then reported a health score based on general impression rather than computing it from the findings.

Finding-derived scoring does not make the agents smarter. It makes their intelligence load-bearing. The structured findings that were always the most reliable part of the system now determine the output. The unreliable part, the scalar opinion, has been moved to an audit field where it can be studied without affecting the measurement.

This is a design principle, not just a bug fix. The principle: constrain the model to structured judgment, compute the measurement from the structure. Let the model do what it is good at: reading context, identifying patterns, evaluating evidence. Do not let it do what it is bad at: generating stable numerical outputs.

The Skeptic debate, for the third consecutive edition, proved itself the most reliable component. Across twenty-seven hardening runs and fourteen fleet rescores, findings were sustained when evidence was strong and rejected when evidence was weak. The adversarial mechanism’s judgment scales. Its reliability is the foundation that makes deterministic scoring possible. You can only derive scores from findings if the debate mechanism produces findings you can trust.

What’s Capped

The structural constraints from Edition 3 remain. But they sit on a different foundation.

Authority is still capped. No employee reviews in edge data. This affects all fifteen entities. The authority scores now cluster more tightly because the scorer is acknowledging the constraint rather than inventing resolution. That tighter clustering is honesty, not limitation.

Continuity remains unscorable. One collection period. One-third of the Triangle is dark. This has not changed.

Overall confidence remains capped at 0.60. The structural constraints are unchanged. What changed is that the scores within those constraints are now deterministic and evidence-grounded.

The difference matters. A capped score on a stable foundation can be incrementally uncapped as data improves. A capped score on an unstable foundation cannot be trusted even within its stated range. The fleet’s constraints have not changed. The trustworthiness of the measurement within those constraints has.

The Practice

With the scoring grounded, the infrastructure became a practice.

Engagement definitions, pricing, a diagnostic gift strategy with researched targets, and a brand and web presence across four sites, all built in forty-eight hours on March 2 and 3.

None of this would have been defensible with a 0.114 standard deviation on the primary vertex. You do not offer diagnostic services built on an instrument that generates different readings for the same patient. You do not describe a measurement system that cannot reproduce its own results.

The hardening campaign was not a prerequisite for the practice. It was the moment the practice became possible. The distance between “interesting prototype” and “field-grade instrument” is measured in reproducibility. Twenty-seven runs closed that distance.

What I Learned

The model’s opinion is not the measurement. This is the architectural lesson. LLMs produce text that reads like analysis and numbers that look like scores. The text is grounded in the prompt and the data. The numbers are generated by a different cognitive process: pattern completion in a latent space that has no concept of numerical precision. The solution is not to make the model better at generating numbers. It is to stop using generated numbers as measurements. Let the model analyze. Let the math measure. That boundary must be structural, not aspirational.
Reproducibility is not a feature. It is the minimum standard. Edition 3 reported scores without reproducibility testing. Those scores were published in good faith and are documented in the record. They were not wrong. The findings they were based on were real. But the numbers attached to those findings were unstable, and I did not know that because I had not tested it. The hardening campaign should have preceded the fleet run, not followed it. I built the fleet before I tested the instrument. That sequence was backwards.
Authority was always the stable vertex. Across twenty-seven runs with varying models, varying scorers, and varying sampling, authority standard deviation never exceeded 0.021. Truth varied by roughly five times that amount (0.114). This asymmetry was invisible until the reproducibility tests made it visible. Authority is stable because the patterns it measures (compression, diffusion, misalignment) are structurally legible in the data. Truth is harder because it requires comparing what organizations say against what is observed, and the interpretive latitude in that comparison is where the model exercises discretion. Constraining that discretion to structured findings was the right fix. But the fact that one vertex was stable and the other was not tells you something about the nature of the measurement, not just the quality of the scorer.
The Skeptic is the anchor. For the third edition running, the adversarial debate mechanism has been the most reliable component. It has now processed over a hundred runs across fifteen entities and two scoring architectures. Its behavior is consistent: challenge harder when evidence is thin, sustain findings when evidence is strong. The finding-derived scoring architecture is built on this reliability. If the Skeptic could not be trusted to correctly sustain and reject findings, computing scores from those findings would amplify errors rather than eliminate them. The fact that the Skeptic is reliable makes the entire downstream architecture viable.
You cannot sell what you cannot reproduce. This is the business lesson, and it is not about integrity in the abstract. A diagnostic practice requires that two runs on the same entity produce the same result. Not because clients demand reproducibility testing. Most will never ask. Because the practitioner must trust the instrument. Every recommendation, every finding, every conversation with a client flows from the diagnostic output. If that output is unstable, every downstream decision is built on sand. The hardening campaign was not a quality investment. It was the foundation of professional confidence. Without it, the practice would have been a performance.

What’s Next

The immediate work is building the second collection period for a subset of entities. Continuity, the third vertex of the Triangle, has been dark for every run in the project’s history. Lighting it requires temporal depth: at least two collection points, separated by enough time to observe narrative shifts, strategy changes, or structural drift. This is the next capability unlock, and it will change the shape of the diagnostic fundamentally. Truth and Authority are snapshots. Continuity is a trajectory. The first trajectory measurement will reveal whether the diagnostic framework can distinguish noise from trend, and whether the FM-01 vital-signs framing holds when you can measure not just whether compression is present but whether it is increasing.

The fleet needs more entities per sector. Fifteen entities across five sectors gives pairs and trios in most industries. That is enough to observe variation. It is not enough to establish baselines. The vital-signs framing (FM-01 as cholesterol, needing a resting rate to interpret) requires enough data points per sector to define what normal looks like. Twenty entities per sector is the threshold where baselines become defensible. The pipeline can run that volume. The collection infrastructure needs to scale to support it.

The fleet’s five-sector, three-entity-per-sector architecture was designed for falsification. It also produced something I had not planned for: the first competitive coherence benchmark. Within-sector comparison on identical instruments reveals which failure modes are structural conditions of an industry and which are specific to a single organization’s current state. That distinction, sector-wide versus company-specific, is where the diagnostic becomes most useful. Not just “here is your coherence,” but “here is how your coherence compares to direct competitors, measured the same way, on the same instruments.” The next edition will explore what that comparison reveals.

AR-001 Still Holds

Automation may observe, summarize, and suggest, but may not decide.

Finding-derived scoring did not change this principle. It reinforced it. The agents produce findings. The Skeptic evaluates them. The formula computes scores. Every step is observable, auditable, and deterministic.

But the diagnostic output is still a suggestion. It tells you where to look. It does not tell you what to do. A coherence score of 0.45 is not a verdict. It is an invitation to investigate what the findings describe. The human reviews the case summary, reads the evidence, and decides what it means in context.

One hundred seventeen runs. Fifteen entities. Five sectors. The pipeline does not decide. That is still by design.

What This Is Becoming

Edition 1 asked whether the infrastructure could exist. Edition 2 asked whether it could measure. Edition 3 asked whether it holds at scale.

This edition asked whether the measurement could be trusted.

The answer required rebuilding the scoring architecture, proving determinism, and rescoring every entity under the new standard. The instrument that produced Edition 3’s fleet scores was a prototype. It generated plausible numbers. The instrument that rescored that fleet is a calibrated tool. It computes grounded numbers. The difference is reproducibility, and reproducibility is not a technical property. It is the boundary between a demonstration and a practice.

The hardening campaign changed more than the scoring. It changed what the project is. A diagnostic prototype is interesting. A reproducible diagnostic instrument with a published methodology and a public build record is a practice. The scores are the same kind of object they were before: measurements of coherence across truth, authority, and continuity. But the confidence behind them is structurally different. Not confidence in the sense of a statistical interval. Confidence in the sense that a practitioner can stand behind the output.

You cannot sell what you cannot reproduce. And now the instrument reproduces.

The physics of business at scale and speed, accelerated by AI. That is what this work measures. One hundred seventeen runs in, the instrument is grounded.

Justin Greenbaum

Greenbaum Labs

March 2026

The Coherence Record, Edition 3

Justin R. Greenbaum — Mon, 02 Mar 2026 15:00:52 GMT

Edition 1 asked whether this infrastructure could exist. Edition 2 asked whether it could measure one company. This edition asks whether it holds at scale.

Justin R. Greenbaum

Greenbaum Labs

February 2026

What’s Happened

Edition 2 ended with a promise: the system needed to prove it measured coherence, not just one company.

This edition is that test, and the account of what happened when the test exposed a flaw in the system itself.

Between February 10 and February 25, the pipeline ran sixty-four diagnostics across fifteen entities in five sectors (fintech, defense, automotive, retail, and technology), with single representations in aerospace, sports betting, and apparel. The final fleet of fifteen completed clean. Every pipeline stage passed. Every validation check cleared. No manual intervention on synthesis.

But that clean fleet was the second attempt. The first attempt revealed something the pipeline wasn’t designed to catch. The way it was found, fixed, and re-run is as much a part of the record as the results.

From the start, The Coherence Record has been as much about instrument failure as subject failure. Edition 1 documented a misplaced parameter. Edition 2 documented premature optimization. Edition 3 adds a third class of error: a system that passes every check and is still wrong.

The Bug

After the first seven entities completed (the batch documented in the draft that preceded this edition), I expanded the fleet to fifteen. Fourteen ran. The fifteenth failed at scoring with an empty evidence ledger. When I investigated, the problem wasn’t in the scoring. It was in the extraction.

A JSONL expansion bug in the extractor had been silently duplicating and malforming observation records. The extractor reported healthy counts. The validator accepted the files. But the data feeding the scorer was structurally compromised: inflated observation counts masked thin actual evidence. Fourteen of fifteen entities were affected.

The discovery happened because one entity’s data was thin enough that the corruption left the scorer with nothing to work with. In the other fourteen, there was enough valid data mixed in with the corrupted records that the pipeline produced plausible-looking outputs. Plausible, but not trustworthy.

Every diagnostic from the affected runs was discarded. The bug was fixed. All fifteen entities were re-run from extraction forward. The fleet you see in this edition is the clean re-run.

I’m documenting this for the same reason I documented the misplaced parameter in Edition 1 and the premature optimization in Edition 2. Edition 1 was about incorrect configuration. Edition 2 was about incorrect prioritization. Edition 3 is about incorrect trust in “passing” checks. A system that measures the gap between narrative and reality must disclose its own gaps.

The lesson is not about JSONL parsing. It is about the distance between validation and verification. Every validation check passed. The data was structurally valid. It was not structurally sound. Those are different things, and the pipeline didn’t know the difference until it was forced to.
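The distinction can be made concrete with a small sketch. It assumes observation records are JSONL lines with hypothetical "id" and "text" fields, and it uses an arbitrary 0.9 uniqueness threshold; none of these details come from the actual extractor or validator.

```python
import json

REQUIRED_FIELDS = ("id", "text")  # hypothetical schema for illustration


def validate(lines):
    """Validation: every line parses as JSON and carries the required fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return False
        if not all(field in record for field in REQUIRED_FIELDS):
            return False
    return True


def verify(lines, min_unique_ratio=0.9):
    """Verification: the record count reflects distinct evidence, not duplicates."""
    texts = [json.loads(line)["text"].strip().lower() for line in lines]
    if not texts:
        return False
    return len(set(texts)) / len(texts) >= min_unique_ratio


# Four well-formed records, three of which repeat the same observation.
observations = [
    json.dumps({"id": "obs-1", "text": "Support wait times exceed an hour."}),
    json.dumps({"id": "obs-2", "text": "Support wait times exceed an hour."}),
    json.dumps({"id": "obs-3", "text": "Support wait times exceed an hour."}),
    json.dumps({"id": "obs-4", "text": "Refunds are processed quickly."}),
]
is_valid = validate(observations)   # structurally valid
is_sound = verify(observations)     # but the count overstates the evidence
```

The file passes validation and fails verification: four records, two distinct observations. A check of the second kind is what would have caught inflated counts masking thin evidence.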

Organizations make the same mistake. They validate that reports are complete. They rarely verify that those reports describe what is actually happening.

The system’s first real success in this fleet was proving it could be wrong.

The Fleet

Fifteen entities. Five sectors: fintech, defense, automotive, retail, and technology, with single representations in sports betting, aerospace, and apparel. Same pipeline version (0.1.0). Same model (Qwen 32B). Same collection date (February 20, 2026). Two NVIDIA DGX Spark nodes running in parallel, orchestrated by an automated queue runner that distributed work across both machines.

Fleet average coherence score: 0.458. Scores ranged from 0.36 to 0.54. Total inference time: approximately 35 hours across both nodes. Forty-one million tokens processed over sixty-four total runs. On commercial cloud APIs, that volume would have cost roughly $586. On owned hardware, the marginal cost was electricity.

Data quality grades ranged from B to D. The entities with the thinnest data produced the fewest sustained findings. Expected behavior, but it means the cleanest-looking diagnostics may also be the least examined. Evidence density and diagnostic confidence are not the same thing, and the fleet made that visible.

Every run produced a complete diagnostic with triangle scores, failure modes, field notes, and a watch list. The pipeline did what it was designed to do. The problems, and they are real, are in what the diagnostics reveal about both the entities and the system measuring them.

What the Fleet Shows

Truth is the most stressed vertex in ten of fifteen entities. The pattern from Edition 2 holds at scale: organizations say things publicly that don’t match what’s observed at the operational edges. Product claims contradicted by customer complaints. Culture narratives contradicted by employee experience signals. Financial performance framing contradicted by external analysis.

In Edition 2, that misalignment could have been a property of one company. In a cross-sector fleet, it reads as physics, not pathology.

Five entities showed Authority as their most stressed vertex instead. These cluster in interesting ways. A global retailer scored the highest Truth in the fleet. It says what it means, but its authority structure was the least clear. Two automotive companies were both stressed on Authority rather than Truth, suggesting that in fast-moving industries, the primary fracture isn’t narrative integrity but decision-making distribution.

Why Cross-Sector

This question matters enough to answer directly.

The Coherence framework emerged from twenty years inside one organization, one industry, one set of structural pressures. It would be reasonable to wonder whether the patterns are just artifacts of that context. Responsibility compression might be a telecom problem. Escalation inversion might be a regulated-industry problem. The entire failure mode taxonomy might describe one company’s dysfunction dressed up as universal physics.

Edition 2 made that risk visible: a single-company diagnostic could always be dismissed as idiosyncratic.

The fleet was designed to answer that question.

Fifteen entities across five sectors. Public companies and private ones. Pre-crisis, mid-crisis, and post-crisis organizations. Legacy incumbents and startups. Companies with two thousand employees and companies with two hundred thousand. The only things they share are scale and public signal.

If the patterns only appeared in one sector, the framework would be local. If they only appeared in crisis organizations, the framework would be reactive. What the fleet showed is that FM-01 appears in fourteen of fifteen entities. That truth stress is more common than authority stress. That the same structural forces that produce dysfunction in aerospace also produce it in retail, fintech, automotive, and technology.

Edition 1 proved the infrastructure could run on owned hardware against public data. The fleet proves the language it produces has signal beyond the company that trained my intuition.

Not for coverage. For falsification.

Failure Modes

FM-01, Responsibility Compression at the Edge, appeared in fourteen of fifteen entities. It is the most persistent structural signal in the fleet.

In Editions 1 and 2, FM-01 read like a problem to be fixed. At fleet scale, it behaves more like gravity: sometimes benign, sometimes lethal, always present.

This is not a defect to be eliminated. It is a structural force. Always present, always acting. The physics of business at scale and speed. The question is not whether FM-01 exists but what it means at different intensities.

A resting heart rate of 72 and a resting heart rate of 120 are both a heartbeat. One is baseline. One is a signal that something is producing strain. The same is true of responsibility compression. Elevated FM-01 is not a diagnosis. It is a vital sign.

FM-04 (Metric Shadowing) and FM-14 (Narrative Collapse) co-occurred in six entities. Where organizations optimize visible metrics while unmeasured costs accumulate, the public narrative eventually decouples from operational reality. The co-occurrence suggests a causal relationship the taxonomy doesn’t yet model.

The average entity triggered three to four distinct failure modes. The most structurally stressed triggered six. The cleanest entities each triggered one. But cleanliness correlates with evidence density: the entities with fewer failure modes also had fewer sustained findings. The system may be under-detecting rather than finding genuine structural health.

Field Notes

The most signal-dense entity produced thirteen distinct field notes, nearly the full set. Two others triggered eleven each. The leanest produced five. Field note density correlates loosely with evidence density and data quality grade, which means the pipeline produces more diagnostic signal when it has more to work with. That is the expected behavior, but it also means thin-data entities may be under-diagnosed rather than structurally healthy.

What the Pipeline Shows

Edition 1 proved the instrument could produce signal. Edition 2 proved it could debate itself. The fleet shows where that debate logic and its surrounding infrastructure still fail.

Truth scores cluster at the floor. Five entities landed at exactly 0.33 on Truth, spanning fintech, defense, automotive, aerospace, and technology. These are structurally diverse organizations. Either center-edge narrative misalignment really is that uniform across industries, or the scoring model compresses within the low range and can’t differentiate between moderately misaligned and severely misaligned. At five entities, the clustering is too consistent to ignore. The scorer needs a wider aperture in the lower range.
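A first check on the floor cluster can be sketched as a count of how many entities tie at the minimum score. This is a minimal illustration, not the pipeline's code: `floor_cluster` is a hypothetical helper, and every value in `truth_scores` other than the five 0.33 entries is invented for the example.

```python
def floor_cluster(scores, tolerance=1e-9):
    """Return (floor value, number of scores sitting at that floor)."""
    floor = min(scores)
    at_floor = sum(1 for s in scores if abs(s - floor) <= tolerance)
    return floor, at_floor

# Hypothetical fleet of fifteen Truth scores. Only the five 0.33 values
# mirror the text; the rest are placeholders for illustration.
truth_scores = [0.33, 0.33, 0.33, 0.33, 0.33,
                0.41, 0.47, 0.52, 0.55, 0.58,
                0.60, 0.62, 0.64, 0.66, 0.70]

floor, count = floor_cluster(truth_scores)
share_at_floor = count / len(truth_scores)
print(f"floor={floor}, entities at floor={count}, share={share_at_floor:.0%}")
```

A large share of exact ties at a single value is what a compressed low range would look like. It does not by itself rule out genuine uniformity across industries.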

The Skeptic works. Including when it shouldn’t. One entity’s initial run failed because the Skeptic rejected all six findings in the debate round. Every rejection followed the same pattern: insufficient specificity, lack of quantification, reasoning not grounded in evidence. The Skeptic had also failed schema validation on its first attempt, which forced a retry, and the retry ran in an overly critical mode. A re-run of the scoring step produced a healthy result: three sustained, four rejected. The debate mechanism is calibrated to evidence strength, but it’s not robust to its own retry state. That’s a design flaw.
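One way to make a retry robust to its own state is to rebuild the agent's context from a clean copy on every attempt, so nothing from the failed attempt leaks forward. The sketch below assumes the leak comes from mutated context, which the text does not confirm. `run_with_clean_retries`, `fake_skeptic`, and `validate` are hypothetical names used only for illustration.

```python
import copy

def run_with_clean_retries(call, base_context, validate, max_attempts=3):
    """Invoke `call` with a fresh deep copy of base_context on each attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        context = copy.deepcopy(base_context)  # nothing carried over from a failed attempt
        result = call(context)
        try:
            validate(result)
            return result, attempt
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error

# Toy stand-in for the Skeptic: it mutates its context, which would leak
# into the retry if the same object were reused.
seen_contexts = []

def fake_skeptic(context):
    seen_contexts.append(list(context["notes"]))
    context["notes"].append("previous output was invalid")
    if len(seen_contexts) > 1:
        return {"verdicts": ["sustained"]}
    return {"malformed": True}  # first attempt fails schema validation

def validate(result):
    if "verdicts" not in result:
        raise ValueError("schema validation failed")

result, attempts = run_with_clean_retries(fake_skeptic, {"notes": []}, validate)
print(attempts, seen_contexts)
```

In this toy run the second attempt sees the same empty notes as the first, which is the property the retry path needs.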

Output confidence must be constrained by evidence density. The fleet’s lowest-scoring entity’s diagnostic reads like a complete assessment. It is not. One sustained finding. One ledger entry. The weakest data quality grade in the fleet. The synthesizer doesn’t know how thin its support is. It produces full output regardless. A diagnostic built on one piece of evidence needs to say so. Not in a metadata field. In the output itself. The data quality grade flagged the problem. It didn’t constrain the output. That grade needs to be load-bearing.
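A load-bearing grade could be sketched as a rendering step that changes the body of the diagnostic, not just its metadata. The thresholds here (grade D or worse, or fewer than two sustained findings) are assumptions chosen for illustration, and `render_diagnostic` is a hypothetical function.

```python
def render_diagnostic(summary, grade, sustained_findings):
    """Prefix the diagnostic body with a caveat when its support is thin."""
    # Assumed rule: grades D and F, or fewer than two sustained findings,
    # count as thin evidence.
    thin = grade in {"D", "F"} or sustained_findings < 2
    if not thin:
        return summary
    caveat = (
        f"LOW-EVIDENCE DIAGNOSTIC: data quality grade {grade}, "
        f"{sustained_findings} sustained finding(s). Treat as provisional."
    )
    return caveat + "\n\n" + summary

print(render_diagnostic("Structural summary text.", "D", 1))
```

A stricter variant could also drop whole sections such as the watch list when evidence is thin, so that a Grade D output is visibly shorter than a Grade B one.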

The fleet automated, but the infrastructure didn’t. Running fifteen entities across two Spark nodes required a queue runner script built the same week. NAS mounts dropped mid-run. One node lost its mount entirely and couldn’t be used for the re-run. The pipeline code is stable. The infrastructure around it (mount management, node health checks, job recovery) is manual. At fifteen entities, that’s manageable. At fifty, it won’t be.
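The mount problem lends itself to a small preflight check that refuses to start a job while a required mount is missing. This is a sketch under assumptions: the path in the usage note is a placeholder, and `missing_mounts` and `preflight` are hypothetical helpers. Only `os.path.ismount` is a real standard-library call.

```python
import os

def missing_mounts(required_paths):
    """Return the required paths that are not currently mount points."""
    return [path for path in required_paths if not os.path.ismount(path)]

def preflight(required_paths):
    """Raise before any work starts if a required mount is absent."""
    missing = missing_mounts(required_paths)
    if missing:
        raise RuntimeError(f"mounts not available: {', '.join(missing)}")
    return True
```

A queue runner could call `preflight(["/mnt/nas/raw"])` (placeholder path) before each entity and again before a re-run, so a dropped mount stops the job early instead of surfacing mid-run.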

What’s Capped

The structural constraints from Edition 2 remain and are now systematic: Authority capped by lack of internal signal; Continuity dark because the system only sees a single time slice.

Authority is capped because the pipeline has no employee reviews in edge data. This was a single-entity problem in Edition 2. It now affects all fifteen entities. Customer complaints and news coverage show the outside. Employees see the inside. Without that signal, the Authority scorer can’t fully assess whether internal power structures match internal accountability. This is where client-invited work changes the equation. With internal access, the Authority vertex uncaps.

Continuity remains unscoreable across all entities. Every run is based on a single collection period. One-third of the Triangle is dark. Edition 2 accepted that darkness as a constraint. Edition 3 turns it into a design requirement.

What I Learned

Validation is not verification. Every check passed. The data was still corrupted. The difference between “the file is well-formed” and “the file contains trustworthy data” was a gap I didn’t build for. Now I have to.
Scale doesn’t just test the pipeline. It tests the framework. Scale didn’t just stress the GPUs. It stressed the assumptions baked into Editions 1 and 2. Edition 1 asked whether the pipeline could exist on owned hardware. Edition 2 asked whether it could measure coherence inside one company. The fleet changed the question again: are these failure modes properties of that context, or properties of large organizations as such? Fourteen of fifteen entities showing FM-01 is a different kind of answer than one entity showing it across thirteen runs.
Failure modes are vital signs, not verdicts. FM-01 at every entity doesn’t mean every entity is failing. It means responsibility compression is structural to organizations at scale. The diagnostic value isn’t detecting it. It’s measuring intensity. The pipeline doesn’t do that well enough yet.
Evidence density caps diagnostic confidence. A thin-data entity producing a clean diagnostic is not the same as a rich-data entity producing a clean diagnostic. The pipeline treats them the same. It shouldn’t.

AR-001 Still Holds

Automation may observe, summarize, and suggest, but may not decide.

AR-001 constrained Edition 1’s experiments and Edition 2’s single-company diagnostics. It constrains the fleet just as hard.

Every diagnostic in this fleet is a suggestion. Every finding requires human review. Fifteen entities didn’t change that. Fifty won’t.

The pipeline knows more than it did in Edition 1. It covers more ground than it did in Edition 2. It is not closer to deciding. That’s by design.

What’s Next

Light Continuity. A second collection period for a subset of the fleet will produce the first Continuity scores. The third vertex of the Triangle will light for the first time. Even a four-week gap between collections should show whether narratives hold, shift, or contradict prior positions.

Scorer calibration. The Truth floor at 0.33 needs investigation. Either it’s real and that’s the baseline for large organizations, or the scoring model compresses signal in the low range. A targeted scoring test with expanded rubrics should clarify.

Data quality as output constraint. The data quality grade needs to constrain what the synthesizer produces. A Grade D diagnostic should look visibly different from a Grade B diagnostic. Not just in metadata. In the output itself.

Infrastructure hardening. NAS mounts, node health, and job recovery need to be automated. The pipeline code is stable. The infrastructure running it is not.

A system that improves visibly, in public, with its failures documented alongside its progress. That’s the goal.

What This Is Becoming

Edition 1 proved the infrastructure could exist. Edition 2 proved it could measure. Edition 3 proves the measurements change how the framework itself is understood. FM-01, the failure mode that carries the most direct weight on humans, is no longer treated as a defect to eliminate but as a structural force to measure.

And where FM-01 goes, the other structural forces follow: authority diffuses, context decays, metrics drift from the reality they were built to measure. Some of those are persistent conditions. Some are acute. The diagnostic work is learning to tell the difference.

AI accelerates this physics. It doesn’t change the forces. It increases the speed at which they produce consequences. An organization with elevated responsibility compression and good human buffers can sustain that for years. The people closest to the work absorb the strain, compensate through judgment and relationships, and keep the system functioning. Add automation that removes those buffers or increases throughput without addressing the underlying compression, and the same physics produces symptoms in months instead of decades.

That is what coherence measurement is for. Not to judge organizations. Not to score them against each other. To make the structural forces visible before they become symptomatic. To give organizations a way to slow down just enough to see what’s actually happening inside their processes before they deploy automation on top of conditions they can’t see.

The physics of business at scale and speed, accelerated by AI. Sixty-four runs in, the instruments are getting sharper. That’s what this work measures.

Justin Greenbaum

Greenbaum Labs

February 2026

The Coherence Record, Edition 2

Justin R. Greenbaum — Tue, 17 Feb 2026 18:02:16 GMT

What’s Happened

Edition 1 introduced coherence as a structural property: truth, authority, and continuity reinforcing each other over time. This edition is about what happened when I tried to make it measurable.

Since the last update, I’ve been building a diagnostic pipeline. Software that takes what a company says about itself (center) and what the world says back (edge), then measures the gap. The system scores organizations on three vertices: Truth, Authority, and Continuity. The output is a structural diagnostic, not an opinion.

The target entity is Coinbase. Not because they’re broken. Because they’re public, data-rich, and operating at a scale where coherence failures become visible.

Thirteen runs later, the pipeline works.

The Pipeline

It runs on a local GPU. No cloud APIs for data processing. The models are open-weight (Qwen 32B). Every claim and observation traces back to a source document with a cryptographic hash. Provenance is not optional. It’s structural.
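Hash-based provenance can be sketched as follows. The text says only "cryptographic hash"; SHA-256 is an assumption here, and `source_hash`, `make_claim`, and `verify_claim` are hypothetical names, not the pipeline's.

```python
import hashlib

def source_hash(document_bytes):
    """Hex digest identifying the exact bytes of a source document."""
    return hashlib.sha256(document_bytes).hexdigest()

def make_claim(text, document_bytes):
    """Attach the source document's hash to an extracted claim."""
    return {"text": text, "source_sha256": source_hash(document_bytes)}

def verify_claim(claim, document_bytes):
    """True only if the claim still points at unmodified source bytes."""
    return claim["source_sha256"] == source_hash(document_bytes)

document = b"Example press release text."
claim = make_claim("Company states X.", document)
print(verify_claim(claim, document), verify_claim(claim, document + b" edited"))
```

Because the hash covers the raw bytes, any later change to the stored document breaks the link, which is what makes the trace auditable.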

The pipeline has four steps:

Collect — Gather what the company says (press releases, job postings) and what the world sees (customer reviews, social media, news coverage).
Extract — Pull structured claims and observations from raw documents.
Score — Evaluate Truth (does center match edge?) and Authority (does the entity’s voice carry weight?). An adversarial Skeptic challenges every finding. Only sustained findings enter the evidence ledger.
Synthesize — Produce a diagnostic summary with failure modes, field notes, and a watch list.
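The Score step's gate, where only sustained findings reach the ledger, can be sketched as a filter. The real Skeptic is a model-driven debate. The rule below (a finding must cite at least one claim and one observation) is a toy stand-in, and all names and identifiers are hypothetical.

```python
def skeptic_verdict(finding):
    """Toy Skeptic: sustain only findings citing both a claim and an observation."""
    has_support = bool(finding.get("claim_ids")) and bool(finding.get("observation_ids"))
    return "sustained" if has_support else "rejected"

def build_ledger(findings, skeptic=skeptic_verdict):
    """Keep only the findings the Skeptic sustains."""
    return [f for f in findings if skeptic(f) == "sustained"]

findings = [
    {"id": "F1", "claim_ids": ["C1"], "observation_ids": ["O7"]},
    {"id": "F2", "claim_ids": [], "observation_ids": ["O2"]},
    {"id": "F3", "claim_ids": ["C4"], "observation_ids": []},
]
ledger = build_ledger(findings)
print([f["id"] for f in ledger])
```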

Thirteen Runs

The run history tells the real story. Not the polished version. The actual one.

Runs 001-004 were scaffolding. Rule-based extraction, heuristic scoring. Establishing that the data moved through the system correctly. Run 004 produced the first real output: 296 claims, overall coherence 0.621. But 68% of the signal was unclassified. The pipeline was mechanically sound but structurally shallow.

Runs 005-007 introduced agent-powered extraction. The model reads the documents and produces structured claims and observations directly. Unclassified rate dropped from 68% to under 5%. Claims jumped to 1,335. Observations to 2,680. The system was seeing things the heuristics missed entirely.

Run 006 was the first fully unattended execution. Twelve hours, no manual intervention, every validation check passed. Run 007 confirmed repeatability.

Runs 008-011 were premature optimization. I tried to make it faster. Smaller model, larger batches. It worked mechanically (6x speedup) but broke structurally. Schema validation failed on 100% of first attempts. Four runs, three configurations, two killed early. The compounding lesson:

I was changing the engine while flying.

The root cause turned out to be a prompt conflict: two contradictory schemas in the same instruction set. Run 011 proved it wasn’t the model. The model was doing exactly what the prompt told it to do.

Run 012 was the reset. Back to the proven configuration, with three specific fixes. Zero schema failures across 276 batches. All validation checks passed. The largest single-run improvement in Authority (+0.13) came from adding center sources. Press releases and job postings made the organizational voice audible for the first time. Truth dropped because 1,319 center claims met only 23 edge observations. The pipeline correctly identified the imbalance.

Run 013 was the payoff. Full edge expansion. Social media and news coverage split into individual documents. The numbers:

1,377 claims. 4,736 observations.
Truth: 0.60. Up from 0.33. Edge depth restored.
Authority: 0.70. Capped — more on this below.
Overall coherence: 0.636.
13.7 hours. 73% of findings sustained by the Skeptic (11 of 15, 4 rejected). FM-01 detected again.

Score evolution across thirteen runs. The dip through runs 008–009b is the optimization regression. The recovery at 013 is edge depth.

FM-01: Responsibility Compression at the Edge

This failure mode has appeared in every agent-powered run. Confidence 0.4 to 0.6. Across every model configuration, every data mix, every prompt version.

The pattern: responsibility concentrates where authority does not. Coinbase’s center narrative claims ownership of security, efficiency, and user experience. The edge data shows those responsibilities dispersing. Accountability for operational performance lands downstream without corresponding decision-making power. The diagnostic doesn’t measure whether Coinbase is good or bad at customer support. It measures whether the structure that owns those outcomes has the authority to change them.

This is not a model artifact. When a signal persists across nine runs with different extraction methods, different model sizes, and different data compositions, you’re looking at structure, not noise.

What’s Capped

Two binding constraints remain:

Authority is capped at 0.70 because the pipeline has no employee reviews. Customers and journalists see the outside. Employees see the inside. Without that source, the Authority scorer can’t fully assess whether power and accountability are aligned within the organization. Employee reviews are the single highest-leverage data gap.

Overall confidence is capped at 0.50 because all data comes from a single collection period. Continuity, the third vertex, requires temporal depth. One snapshot tells you the current state. Three tell you whether it’s getting better or worse.

These caps are not limitations of the model. They’re structural constraints designed into the system. The pipeline knows what it doesn’t know.
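A designed-in cap can be sketched as a ceiling that also records when it was applied. Only the cap values 0.70 and 0.50 come from the text. The use of `min`, the `apply_cap` helper, and the raw score of 0.82 are assumptions for illustration.

```python
AUTHORITY_CAP = 0.70    # from the text: no employee reviews in the data
CONFIDENCE_CAP = 0.50   # from the text: single collection period

def apply_cap(raw_score, cap):
    """Clamp a score to its cap and record whether the cap was binding."""
    return {"score": min(raw_score, cap), "capped": raw_score > cap, "cap": cap}

# Hypothetical raw Authority score above the ceiling.
authority = apply_cap(0.82, AUTHORITY_CAP)
print(authority)
```

Keeping the `capped` flag next to the score lets a reader distinguish a score that is 0.70 on the evidence from one that was held at 0.70 by a missing source.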

Why This Matters Economically

The business case for coherence is not moral. It is mechanical. When truth, authority, and continuity fall out of alignment, the organization generates friction: in revenue (customers experience something different from what’s promised), in talent (employees absorb structural failures disguised as performance problems), and in capital (investors price in narratives that the edge doesn’t support). That friction has a cost, and the cost compounds. Coherence measurement makes the friction visible before it becomes a write-down or a reorg or an exodus.

What I Learned

Schema failures are prompt-driven, not model-driven. When extraction breaks, check the instructions before blaming the model. The model does what you tell it to do, including when you tell it two contradictory things.
Don’t optimize what isn’t stable. Runs 008-011 should have been one run. The detour cost four attempts and a week. The principle is simple: get it right, prove it works, then make it fast.
Edge splitting was the highest-leverage change in pipeline history. Splitting consolidated staging files into individual documents produced a 206x increase in observations. Not a model change. Not a prompt change. A data format change.
Center sources make Authority audible. Adding press releases and job postings didn’t just add data. It gave the scoring agent visibility into what the organization is actually claiming. You can’t score authority if you can’t hear the voice making claims.
The record exists for the truth. Every run has a review. Every review documents what happened, what broke, and what was learned. The run reviews are not retrospective polish. They’re written the same day, before the lessons have time to soften.
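The 206x figure and run 013's 73% sustain rate can be checked with simple arithmetic. This assumes the 206x compares run 012's 23 edge observations with run 013's 4,736 observations, which is the reading the numbers in this edition support.

```python
run_012_edge_observations = 23
run_013_observations = 4736
observation_ratio = run_013_observations / run_012_edge_observations

sustained, total_findings = 11, 15
sustain_rate = sustained / total_findings

print(f"observations: {observation_ratio:.1f}x")   # about 205.9x
print(f"sustained: {sustain_rate:.0%}")            # about 73%
```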

What’s Next

Run 014 is staged. Employee reviews are being added to the collection. A normalizer tool now converts manually-sourced Glassdoor data into pipeline-compatible format. When that source comes online, Authority uncaps from 0.70.

The scoring agents are being tuned. Few-shot examples now show the Authority agent what structural differentiation looks like. Not just what score to produce, but how to reason about power, accountability, and organizational voice as distinct signals.

After 014, the focus shifts to repeatability. A second entity. A second collection period. The system needs to prove it measures coherence, not just Coinbase.

Decision & Responsibility Infrastructure was filed with the USPTO on February 10, 2026. Serial number 99645812. It names the field, not a product.

Thirteen runs. Every score, every status, every lesson. The record is the proof.

AR-001 Still Holds

Edition 1 established the first Accountability Record: *Automation may observe, summarize, and suggest, but may not decide.*

The pipeline embodies this. Every diagnostic output is a suggestion. Every finding requires human review. The Skeptic debate mechanism is adversarial, but the final judgment is not automated.

Thirteen runs in, the governance hasn’t changed. The capability has grown around it without eroding it.

That’s coherence in practice.

Justin Greenbaum

Greenbaum Labs

February 2026

The Coherence Record, Edition 1

Justin R. Greenbaum — Tue, 10 Feb 2026 14:52:37 GMT

on Owned Hardware
and Maintains a Public Build Record

Justin Greenbaum
Greenbaum Labs
February 2026

Edition 1

This is not a product announcement or a finished framework. It is a build record. It documents what it takes for an executive to design, test, and trust diagnostic infrastructure using public data and owned hardware. The technical and the strategic are not separated here, because in the work itself they were never separate.

Forty Hours and Counting

There is a moment in every build where you question everything. Not the architecture. Not the strategy. Not the market. Everything.

It is the kind of doubt that sits in your chest at 2 a.m. when the GPU has been running for ten hours, every request is timing out, and you cannot determine why.

I reached that moment more than once.

Over forty hours across six days, I debugged the same pipeline. A parameter in the wrong nesting level. A prompt that was too permissive. A working process I terminated because I could not observe progress. Each failure different. Each one invisible until it was not.

This is what building infrastructure actually looks like. Not a highlight reel. Not an architecture diagram. The real sequence of decisions, mistakes, and corrections that separates an idea from a system that can be trusted.

What Coherence Means in This Work

This work does not begin with a framework. It begins with an observation.

Organizations routinely say one thing and produce another.

The distance between those two is not abstract. It appears in public records, regulatory actions, consumer complaints, job postings, and narrative shifts over time. That distance can be observed, compared, and measured.

I use the term coherence to describe the structural integrity of that relationship.

Not alignment, which implies a static state.
Not integrity, which carries moral weight.
Coherence is descriptive. It reflects whether an organization’s internal narrative holds when tested against external reality.

The diagnostic model used here evaluates three dimensions, each derived exclusively from public sources.

Truth
The relationship between an organization’s stated claims and what outside observers experience. A gap does not require falsehood. Omission is sufficient.

Authority
The relationship between responsibility and decision-making power as expressed through role design, escalation patterns, and ownership signals.

Continuity
Stability of narrative over time. Repeated shifts without acknowledgment indicate structural drift rather than adaptation. This dimension requires multiple collection periods and emerges longitudinally.

Together, these form the Coherence Triangle. The question was never whether coherence could be measured. The question was whether I could build the infrastructure to measure it independently, on hardware I control, using models I understand, and produce diagnostics I trust.

Why the System Runs on Owned Hardware

This project could have been implemented using cloud inference and commercial APIs. That approach is faster, cheaper in the short term, and easier to scale.

I chose not to use it.

If a diagnostic claims to measure the gap between narrative and reality, the diagnostic itself cannot depend on infrastructure I do not control. A system designed to surface fragility cannot be built on rented dependencies that can change without notice.

The environment consists of:

NVIDIA DGX Spark for inference
Synology NAS for raw data storage
Mac Studio for orchestration and development

All models run locally. All data is stored locally. All processing occurs on a private network.

The trade-offs are explicit. Inference takes minutes instead of seconds. Full diagnostic runs take hours. Throughput is constrained.

The benefit is traceability.

When a score is produced, I can identify exactly which data sources, model weights, prompts, and parameters contributed to it. That is not a performance feature. It is the foundation of trust in the measurement.

The economics reinforce the decision. Run 005 processed approximately 1.4 million tokens. On commercial APIs, that range spans from tens of cents to double-digit dollars depending on provider and configuration. At scale, those costs compound quickly. On owned hardware, the marginal cost is power draw. Thirteen hours at sustained utilization produced no invoice.

That is not optimization. It is independence.

Center and Edge Data

A diagnostic reflects only the data it examines.

This system separates sources into two categories.

Center data
Materials an organization publishes intentionally. Press releases. Job postings. Regulatory filings. Earnings transcripts. Official communications.

Edge data
Public responses to those claims. Consumer complaints. Regulatory actions. Employee reviews. Other externally observable signals.

For the first diagnostic subject, Coinbase, all data was collected from publicly accessible sources, including:

job postings retrieved via the Greenhouse API
SEC EDGAR filings
consumer complaints filed with the CFPB

No internal systems, non-public documents, or privileged access were used.

The dataset is incomplete by design. Employee review platforms, earnings transcripts, and social media signals were not included in this run. The diagnostic explicitly records those omissions.

A system that claims to measure coherence must be able to state what it does not know.

How the Pipeline Operates

The pipeline runs in four stages, each validated before proceeding.

Collect
Documents are retrieved from configured public sources. File counts, formats, and accessibility are validated.

Extract
Each document is processed by an extraction agent that identifies diagnostically relevant claims and observations and returns structured JSON. In the agent-powered run, this stage processed 844 documents across 844 consecutive agent calls with near-zero failure.

Score
Extracted content is evaluated across Truth, Authority, and Continuity. In agent mode, this includes structured debate between specialized agents and a Skeptic that can sustain or reject findings.

Synthesize
The system produces a diagnostic summary, supporting field notes, a watch list, an overall coherence score, and an explicit data quality grade.

The pipeline runs in two modes:

rule-based, which completes in under a second using pattern matching
agent-powered, which takes hours using multi-agent inference

Both produce results. The purpose of this build was to determine whether the agent-powered architecture materially improves diagnostic quality.

It does.

What Forty Hours Teaches You

Synthetic data validated the mechanics. Real data exposed reality.

The rule-based extractor classified only 31.7 percent of real content. Most material fell outside predefined patterns. This was expected.

The agent pipeline was intended to read context and apply judgment. Initially, it returned nothing.

The cause was a single misplaced parameter. A model behavior flag was passed in the wrong location. The API accepted the request. The model ran. The output buffer was consumed internally. No usable output returned.

One parameter. Wrong nesting level. No error. No warning.

After correcting that, the model responded exhaustively. Each document produced more than ten thousand characters of structured JSON. Perfectly formatted. Completely unusable.

The problem was not infrastructure or model capability. It was the prompt. The instruction asked for everything, and the model complied.

Constraining the request to the top five diagnostically important items per document stabilized output immediately.
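The fix described above was made in the prompt. A complementary guard on the parsed output can enforce the same limit even if the model over-produces. This sketch assumes a JSON shape with an `items` list and an `importance` field, neither of which is confirmed by the text. `top_items` is a hypothetical helper.

```python
import json

def top_items(raw_json, limit=5):
    """Keep only the `limit` highest-importance items from a model response."""
    items = json.loads(raw_json)["items"]
    ranked = sorted(items, key=lambda item: item.get("importance", 0), reverse=True)
    return ranked[:limit]

# Hypothetical over-long response with eight items.
response = json.dumps(
    {"items": [{"text": f"item {n}", "importance": n} for n in range(1, 9)]}
)
kept = top_items(response)
print([item["importance"] for item in kept])
```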

The lesson is not about prompt technique. It is that failure can live at any layer of the system, and it does not announce which one.

A later run appeared to hang. No logs. No output. No visible progress. The pipeline was working the entire time. Logging was not configured. Progress was invisible. I terminated a process that was nearly halfway complete.

Two print statements resolved it.
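The kind of progress reporting that would have prevented the early termination can be sketched in a few lines. The interval of 50 documents and the `process_all` helper are assumptions. The original fix is described only as two print statements.

```python
def process_all(documents, handle, report_every=50, report=print):
    """Process each document and report progress at a fixed interval."""
    total = len(documents)
    for index, doc in enumerate(documents, start=1):
        handle(doc)
        if index % report_every == 0 or index == total:
            report(f"[progress] {index}/{total} documents processed")
    return total

# Example with placeholder documents and a no-op handler.
process_all(list(range(120)), lambda doc: None)
```

Passing `report` as a parameter keeps the default as plain `print` while letting a later version swap in a logger without touching the loop.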

This is the kind of failure that does not appear in summaries. Progress you cannot observe is progress you will eventually destroy.

What the Diagnostic Found

The agent-powered diagnostic processed 844 documents over thirteen hours and classified 95.1 percent of all content.

The overall coherence score was 0.609, down from 0.621 in the rule-based baseline.

This is not regression. It is honesty.

Truth improved slightly as agents found more evidence on both sides of the narrative. The dominant pattern remained omission rather than contradiction.

Authority decreased materially. The agents identified concentration and diffusion patterns that rule-based logic could not detect.

Continuity was not scored. It requires longitudinal measurement.

Data quality was graded C due to incomplete source coverage. The diagnostic states this explicitly.

Inside the data, 645 observations clustered around the product experience. That signal emerged only because agents could read context that patterns could not.

The Evidence Chain Failure

One critical subsystem failed.

The scoring agents produced substantive findings, but could not reliably cite the specific claim and observation identifiers that supported them. The Skeptic rejected every finding.

The scores are valid. The evidence ledger is incomplete.

This is a wiring problem, not a capability problem. Shorter identifier aliases are being introduced to restore provenance integrity. The failure is documented here because the framework requires it.
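The alias idea can be sketched as a two-way mapping: agents cite short labels, and the ledger resolves them back to full identifiers. The identifier format, the `C` prefix, and both function names are hypothetical. The text says only that shorter aliases are being introduced.

```python
def build_aliases(long_ids, prefix):
    """Map short aliases such as C1, C2 to full identifiers."""
    return {f"{prefix}{n}": long_id for n, long_id in enumerate(long_ids, start=1)}

def resolve_citations(cited_aliases, alias_to_id):
    """Translate cited aliases back to full identifiers, failing on unknown ones."""
    unknown = [alias for alias in cited_aliases if alias not in alias_to_id]
    if unknown:
        raise KeyError(f"unknown aliases: {unknown}")
    return [alias_to_id[alias] for alias in cited_aliases]

# Hypothetical long claim identifiers.
claim_aliases = build_aliases(["claim-9f2c1e7a-0041", "claim-9f2c1e7a-0042"], "C")
print(resolve_citations(["C2"], claim_aliases))
```

Failing loudly on an unknown alias matters here: a silently dropped citation would recreate the incomplete ledger the aliases are meant to fix.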

If a system measures gaps, it must disclose its own.

The Executive Who Builds

I am not an engineer.

I am an executive who decided that understanding infrastructure is now a leadership capability.

As AI compresses the distance between intent and consequence, leadership that operates only through delegation loses resolution. The value is no longer in deciding. It is in understanding what decisions actually require.

No vendor briefing explains where reality resists abstraction. You learn that by building.

Why This Record Is Public

There is no established reference for this work.

This document exists as a record, not a guide.

It preserves decision context and holds the work accountable to its own standards. Coherence Diagnostics measures the gap between what organizations say and what they do. The build itself must be coherent.

This record is the edge data for the project’s own narrative.

The Record Begins

This is Edition 1.

What exists now:

an extraction engine that comprehends over 95 percent of content
a multi-agent scoring system with an active Skeptic
a synthesis layer that grades its own data quality
owned infrastructure with zero cloud dependency

And an evidence chain that is not yet complete.

Future editions will document what changes, why, and whether those changes improve coherence measurement over time.

If coherence matters, it must be observable.
If diagnostics matter, they must be accountable.
If an executive claims to understand the infrastructure, there must be evidence.

This is that evidence.

Justin Greenbaum
Founder, Greenbaum Labs

Building diagnostic infrastructure to measure the gap between what organizations say and what they do.

This is The Coherence Record, Edition 1, published at writing.justingreenbaum.com.