Discussion about this post

Adam Zachary Wasserman

Great essay. Everything you've named (spec as source of truth, MC/DC over line coverage, obligation checklists, traceability against rot) is necessary.

It's also insufficient. The instrument defines what's visible: coverage tooling, test methodology, and audit framework each create a "cone of illumination", and bugs outside that cone are invisible by construction, not by oversight. Even a perfect spec doesn't survive contact with code whose architecture forecloses the necessary verification itself.

"You cannot test for what you never described" is one version of the cone problem. The deeper version: tests cannot catch conditions that by construction can only be evaluated at runtime. The spec-vs-code verification step itself becomes impossible when "what does this do?" is a runtime question rather than a property of the inputs.

The architectural challenge is constructing code so that every condition within spec is definitively predictable by example bounds. That's a construction-time choice that cannot be retrofitted at test time. The studliest version of the NASA approach is exactly this: not exhaustive testing, but architecture that makes whole categories of test unnecessary. Honest Code, my proposed architectural discipline, operationalizes it, with the Honest Framework as its FOSS reference implementation. Anyone telling you runtime opacity is ineluctable, that some code just has to be runtime-opaque, is wrong.
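To make the construction-time point concrete, here is a minimal sketch of one rule written two ways. Everything in it (the function names, the every-third-call discount rule, the `_call_count` variable) is a hypothetical illustration, not code from Honest Code or the Honest Framework.

```python
# Hypothetical example: "every third call gets 10% off", written two ways.

_call_count = 0  # hidden module-level state, outside any parameter list


def discount_stateful(price_cents: int) -> int:
    """Result depends on how many times this function has already run."""
    global _call_count
    _call_count += 1
    if _call_count % 3 == 0:
        return price_cents * 9 // 10
    return price_cents


def discount_pure(price_cents: int, call_number: int) -> int:
    """Same rule, but everything the result depends on is a parameter."""
    if call_number % 3 == 0:
        return price_cents * 9 // 10
    return price_cents
```

A test of `discount_pure` can pin down any case directly, for example `discount_pure(1000, 3)`. A test of `discount_stateful(1000)` can only be interpreted if you also know the full call history of the process, which is the sense in which "what does this do?" becomes a runtime question.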

I also developed the Slop Audit to detect this type of failure: 18 architectural dimensions backed by 20 quantitative Layer-1 indicators, each with a layered procedure, designed so the auditor's judgment is bounded by the framework rather than by personal or professional opinion. Same instinct as MC/DC + FRET/FLIP, applied to architectural conditions rather than boolean ones.

Its most provocative indicator is L1.18, the Mutable State Ratio: the percentage of functions that read or write state outside their parameter list. The claim isn't soft. L1.18 measures the proportion of code that is mathematically untestable regardless of test budget, because mutable-state systems are subject to state-space explosion that no finite test suite can cover. It's also a defect-category predictor: race conditions, order-dependent test failures, stale-cache bugs, side-effect interference, and null-from-uninitialized-state are all confined to at most the L1.18 percentage of functions. At L1.18 = 0%, every function's behavior is a finite function of its parameters and prediction is exact. Your jsonparser had near-perfect line coverage and missed an unasked spec question; L1.18 catches an adjacent failure mode where the question can't be answered with any test, because the code's shape forecloses every test you could write.
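For readers who want a feel for the measurement, the sketch below approximates a ratio of this kind for Python source using the standard `ast` module. It is not the Slop Audit's implementation, which is not shown here. The heuristics are my assumptions: a function is flagged if it declares `global` or `nonlocal`, assigns to an attribute such as `self.n`, or reads a name assigned at module top level. The names `mutable_state_ratio` and `_touches_outside_state` are hypothetical.

```python
import ast


def _touches_outside_state(fn: ast.FunctionDef, module_vars: set) -> bool:
    """Crude heuristic (an assumption, not the Slop Audit definition)."""
    args = fn.args
    params = {a.arg for a in args.posonlyargs + args.args + args.kwonlyargs}
    if args.vararg:
        params.add(args.vararg.arg)
    if args.kwarg:
        params.add(args.kwarg.arg)
    # Names the function assigns itself are treated as locals.
    local_names = {
        n.id for n in ast.walk(fn)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    for node in ast.walk(fn):
        if isinstance(node, (ast.Global, ast.Nonlocal)):
            return True  # explicitly rebinding outer state
        if isinstance(node, ast.Attribute) and isinstance(node.ctx, ast.Store):
            return True  # writing object state, e.g. self.n = ...
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if (node.id in module_vars and node.id not in params
                    and node.id not in local_names):
                return True  # reading a module-level variable
    return False


def mutable_state_ratio(source: str) -> float:
    """Fraction of functions flagged by the heuristic above."""
    tree = ast.parse(source)
    module_vars = set()
    for stmt in tree.body:  # top-level assignments only
        if isinstance(stmt, ast.Assign):
            targets = stmt.targets
        elif isinstance(stmt, (ast.AnnAssign, ast.AugAssign)):
            targets = [stmt.target]
        else:
            continue
        module_vars.update(t.id for t in targets if isinstance(t, ast.Name))
    functions = [
        n for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    if not functions:
        return 0.0
    flagged = sum(_touches_outside_state(f, module_vars) for f in functions)
    return flagged / len(functions)


SAMPLE = '''
cache = {}

def add(a, b):
    return a + b

def lookup(key):
    return cache.get(key)

class Counter:
    def __init__(self):
        self.n = 0

    def bump(self):
        self.n += 1
        return self.n
'''

print(mutable_state_ratio(SAMPLE))  # 0.75: lookup, __init__, and bump are flagged
```

The heuristic over-counts reads of module-level constants and under-counts mutation through method calls such as `cache.update(...)`, so treat the number as a rough signal rather than a measurement of L1.18 as the audit defines it.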

An ongoing pre-registered experiment (DOI 10.17605/OSF.IO/DBSYG) tests whether AI code-generation accuracy depends on this same architectural property: three frontier models, twenty-five programming tasks, five paradigm conditions. The prediction is that construction-style paradigms (low L1.18) outperform class-heavy ones (high L1.18); the pre-registration is what makes that prediction falsifiable rather than retrofitted.

The economics shift you named (AI lowering the cost of MC/DC) applies in the other direction too: AI has also lowered the cost of construction methodologies. What once required a NASA-grade requirements process is now reachable for any disciplined team.

Happy to share the pre-release auditing code that calculates L1.18 if you want to try it on jsonparser. The code has already been run on itself, of course :)
