The Verification Gap

The second 100%

Leonid Bugaev — Wed, 08 Jul 2026 09:41:43 GMT

I spent years being proud of coverage numbers. I want that time back.

Here’s a session check, close to code I’ve actually shipped:

func (s *Store) CanUseSession(t Token, now time.Time) bool {
	return t.SignatureValid() &&
		now.Before(t.ExpiresAt) &&
		!s.nonces.Seen(t.Nonce)
}

And here’s the test I’ve written for it a hundred times:

func TestCanUseSession(t *testing.T) {
	store := NewStore()
	if !store.CanUseSession(validToken(), time.Now()) {
		t.Fatal("valid session rejected")
	}
}

100% line coverage. Every condition executed. The report is green.

Now delete the nonce check. Test passes. Delete the expiration check. Test passes. Replace the whole body with return true. Test passes. This test can’t tell the difference between my function and no function. All it proves is that good input produces a good answer — which is also what code with no security checks does.

Branch coverage is the same lie with more steps. Add the case everyone adds — a tampered token gets rejected — and both branches are hit, the report looks even better, and two of the three security checks can still be deleted without a single test noticing. That’s what our coverage metrics measure: whether code ran. Not whether any of it was ever forced to matter. And we hold serious meetings about whether the threshold should be 80% or 90%.
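To make the deletions above concrete, here is a minimal sketch. It models the three conditions as plain booleans rather than the real Token type, and every name in it (inputs, mutants, survivors) is hypothetical. It runs the happy-path suite, and then the happy-path-plus-tampered suite, against versions of the decision with checks removed, and reports which versions still pass.

```go
package main

import "fmt"

// inputs models the three conditions of the session check as booleans.
// Hypothetical simplification for illustration only.
type inputs struct {
	sigValid  bool
	expired   bool
	nonceSeen bool
}

// testCase pairs an input with the expected decision.
type testCase struct {
	in   inputs
	want bool
}

// mutant is a named variant of the decision with one or more checks removed.
type mutant struct {
	name   string
	decide func(inputs) bool
}

var mutants = []mutant{
	{"noNonce", func(in inputs) bool { return in.sigValid && !in.expired }},
	{"noExpiry", func(in inputs) bool { return in.sigValid && !in.nonceSeen }},
	{"alwaysTrue", func(inputs) bool { return true }},
}

// happyPath is the single-case suite: good input, good answer.
var happyPath = []testCase{
	{inputs{sigValid: true}, true},
}

// withTampered adds the rejected-signature case, which reaches both branches.
var withTampered = append([]testCase{
	{inputs{sigValid: false}, false},
}, happyPath...)

// survivors returns the mutants that pass every case in the suite,
// meaning the suite cannot tell them apart from the real function.
func survivors(suite []testCase) []string {
	var names []string
	for _, m := range mutants {
		passed := true
		for _, tc := range suite {
			if m.decide(tc.in) != tc.want {
				passed = false
				break
			}
		}
		if passed {
			names = append(names, m.name)
		}
	}
	return names
}

func main() {
	fmt.Println("happy path only:", survivors(happyPath))
	fmt.Println("plus tampered:", survivors(withTampered))
}
```

Under these assumptions all three mutants survive the happy-path suite, and adding the tampered case only kills alwaysTrue. The versions without the nonce check and without the expiry check still pass.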

Avionics solved this twice

In the 80s, avionics certification hit exactly this problem, and their answer had two parts. We took neither. I’ll get to the second part later, because it took watching this kind of failure to understand it.

Part one is MC/DC — Modified Condition/Decision Coverage. Ugly name, simple idea: every condition in a decision must independently flip the result, or you haven’t tested the decision.

For my session check, that means four cases:

signature valid   expired   nonce seen   result
yes               no        no           allow
no                no        no           deny    // signature matters
yes               yes       no           deny    // expiration matters
yes               no        yes          deny    // nonce matters
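The independence requirement in the table above can be checked mechanically. The sketch below is one way to do it, assuming the simplest reading of MC/DC (two cases that differ in exactly one condition and produce different results). The names vector, decide, and shownIndependent are hypothetical, and the decision works on plain booleans rather than real tokens.

```go
package main

import "fmt"

// vector holds one test case: signature valid, expired, nonce seen.
type vector [3]bool

// decide mirrors the shape of the session check on plain booleans.
func decide(v vector) bool { return v[0] && !v[1] && !v[2] }

// shownIndependent reports, per condition, whether the suite contains two
// vectors that differ only in that condition and produce different results.
func shownIndependent(suite []vector, d func(vector) bool) [3]bool {
	var shown [3]bool
	for i := 0; i < len(suite); i++ {
		for j := i + 1; j < len(suite); j++ {
			differing, which := 0, -1
			for c := range suite[i] {
				if suite[i][c] != suite[j][c] {
					differing++
					which = c
				}
			}
			if differing == 1 && d(suite[i]) != d(suite[j]) {
				shown[which] = true
			}
		}
	}
	return shown
}

// fourCases is the table above, row by row.
var fourCases = []vector{
	{true, false, false},  // allow
	{false, false, false}, // deny: signature matters
	{true, true, false},   // deny: expiration matters
	{true, false, true},   // deny: nonce matters
}

// branchOnly is the happy path plus the tampered token.
var branchOnly = []vector{
	{true, false, false},
	{false, false, false},
}

func main() {
	fmt.Println("four cases: ", shownIndependent(fourCases, decide))
	fmt.Println("branch only:", shownIndependent(branchOnly, decide))
}
```

With the four rows, all three conditions are shown to matter. With only the happy path and the tampered token, the signature is the only condition that is ever forced to change the result.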

So let’s do it properly this time:

func TestCanUseSession(t *testing.T) {
	store := NewStore()
	now := time.Now()

	if !store.CanUseSession(validToken(), now) {
		t.Fatal("valid session rejected")
	}
	if store.CanUseSession(tamperedToken(), now) {
		t.Fatal("bad signature accepted")
	}
	if store.CanUseSession(expiredToken(), now) {
		t.Fatal("expired token accepted")
	}

	replayed := validToken()
	store.CanUseSession(replayed, now) // first use: allowed
	if store.CanUseSession(replayed, now) {
		t.Fatal("replayed token accepted")
	}
}

100% MC/DC. Every condition forced to matter. Delete the expiration check and a test fails. Flip the nonce logic and a test fails. This is a good test — I’d approve it in review today. It felt like the ceiling of rigor.

Remember that feeling.

The bug that still ships

That test suite is good enough to make the function feel done. The code is simplified, but this is the exact class of bug I’ve watched ship: in-memory state standing in for a guarantee that needed to survive the process. Here’s the incident this design is one deploy away from.

s.nonces is an in-memory map. Every deploy restarts the process, the map comes up empty, and every token captured before the deploy becomes replayable after it. Deploy daily, and replay protection — the thing with a dedicated, passing, well-designed test — effectively doesn’t exist for a window after every single release.

Which test failed? None. Which test could fail? None. Look at the replay test again: it creates a store, uses a token, replays it against the same store, in the same process, in the same millisecond. Within that world, the test is correct. The bug doesn’t live in that world. It lives in the question the test never asked: what remembers the nonces, and for how long?
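Here is what asking that question looks like as code. This is a sketch under stated assumptions: Token, nonceSet, and this Store are hypothetical stand-ins shaped like the example, not the shipped types, and a "restart" is simulated by building a fresh store in the same process.

```go
package main

import (
	"fmt"
	"time"
)

// Token is a hypothetical stand-in; sigOK replaces real signature checking.
type Token struct {
	Nonce     string
	ExpiresAt time.Time
	sigOK     bool
}

func (t Token) SignatureValid() bool { return t.sigOK }

// nonceSet is an in-memory set; Seen records the nonce on first sight.
type nonceSet map[string]bool

func (n nonceSet) Seen(nonce string) bool {
	if n[nonce] {
		return true
	}
	n[nonce] = true
	return false
}

type Store struct{ nonces nonceSet }

func NewStore() *Store { return &Store{nonces: nonceSet{}} }

func (s *Store) CanUseSession(t Token, now time.Time) bool {
	return t.SignatureValid() &&
		now.Before(t.ExpiresAt) &&
		!s.nonces.Seen(t.Nonce)
}

// replayAcrossRestart uses one token three times: first use, replay against
// the same store, and replay against a fresh store standing in for a redeploy.
func replayAcrossRestart() (first, sameProcess, afterRestart bool) {
	now := time.Now()
	tok := Token{Nonce: "n-1", ExpiresAt: now.Add(time.Hour), sigOK: true}

	store := NewStore()
	first = store.CanUseSession(tok, now)
	sameProcess = store.CanUseSession(tok, now)

	store = NewStore() // simulated restart: the nonce map comes up empty
	afterRestart = store.CanUseSession(tok, now)
	return first, sameProcess, afterRestart
}

func main() {
	first, sameProcess, afterRestart := replayAcrossRestart()
	fmt.Println("first use accepted:", first)
	fmt.Println("replay, same process:", sameProcess)
	fmt.Println("replay, after restart:", afterRestart)
}
```

The same-process replay is rejected, which is all the original test checks. The replay after the simulated restart is accepted, and nothing in the original suite can observe that.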

That’s the part that stings. MC/DC did its job perfectly — every condition in the decision was proven to matter. But it can only interrogate the logic I wrote. It can’t interrogate the logic I didn’t write. The bug wasn’t in a branch. It was in a question:

What happens to seen nonces across a restart?
Across two instances behind a load balancer?
Whose clock is now, and how far do our servers drift?
What does the client learn from a rejection — does the error message tell an attacker which check failed?

None of these are coverage questions. A suite could hit 100% of anything a tool can measure and never touch one of them. Nobody writes them down, so nobody tests them, so production tests them for us.

A missing test is annoying. A missing question is dangerous.

I eventually got a word for those questions: obligations.

An obligation isn’t a test case. It’s a class of behavior the change must account for. “Replayed token rejected” was a test case, and we had it. “Replay protection survives restarts and failover” was an obligation, and it existed nowhere — not in the ticket, not in the tests, not in review, only in the gap between what the ticket said and what production required.

This is why I now think of coverage as two different numbers. The first 100% is structural: did the code run, was the logic exercised. Line coverage, branch coverage, MC/DC — they all live here, and MC/DC is the honest end of it. The second 100% is semantic: did we cover what the code is responsible for — the requirement, the obligations, the assumptions, the risks. Every tool we have measures the first. Every incident I can remember came from the second.

Not every obligation becomes a test. Some are handled elsewhere, some are accepted risks, some need fuzzing or a runtime assertion or one honest sentence in a doc. Clock skew might be “accepted, ±30 seconds, here’s why”, and that’s fine. What’s not fine is the obligation silently not existing, so that the difference between “considered and accepted” and “never thought about” is invisible.
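One way to keep an accepted risk from being invisible is to give it a name in the code. The sketch below assumes the team chooses to encode the ±30 seconds as explicit leeway on the expiry check. Keeping the check strict and only documenting the drift is an equally valid choice. AcceptedClockSkew, notExpired, and expiry are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// AcceptedClockSkew records the accepted risk in one greppable place.
// The value comes from the "±30 seconds" example; it is an assumption here.
const AcceptedClockSkew = 30 * time.Second

// notExpired allows a token until ExpiresAt plus the accepted skew, so a
// server whose clock runs slightly ahead does not reject a live token.
func notExpired(now, expiresAt time.Time) bool {
	return now.Before(expiresAt.Add(AcceptedClockSkew))
}

// expiry is an arbitrary fixed instant used for the demonstration.
var expiry = time.Date(2026, time.July, 8, 12, 0, 0, 0, time.UTC)

func main() {
	fmt.Println("29s past expiry:", notExpired(expiry.Add(29*time.Second), expiry))
	fmt.Println("31s past expiry:", notExpired(expiry.Add(31*time.Second), expiry))
}
```

The design point is not the leeway itself. It is that the tolerance has a name, a value, and a comment that a reviewer can question.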

And these aren’t exotic cases. Restarts, failover, cache staleness, error messages that say too much — they’re Tuesday. If nobody writes them down, they live in memory: the author’s, the reviewer’s, maybe nobody’s.

Tests are evidence, not truth

This is part two of what avionics knew — the part I skipped earlier. In that world, a test that doesn’t trace to a requirement is evidence of nothing. Every test exists to support a claim; every requirement must have evidence behind it. They understood forty years ago that executing code proves nothing by itself. The industry kept the percentage and threw away both ideas that gave it meaning.

For most of my career, this was my whole model of testing:

code -> tests -> CI -> confidence

Tests were code that exercised other code. More tests, more coverage, more confidence. The replay failure above fits this model perfectly and the model never blinks — the code is exercised, CI is green, confidence is high, and the bug ships anyway.

The model I’d steal from avionics, minus the paperwork:

requirement -> obligations -> evidence -> confidence

Same code, same tests, same CI — but now they sit inside an argument. The requirement says what must be true. The obligations say what must be considered. The tests are demoted from “the thing itself” to what they always actually were: evidence for specific claims.

Here’s my session check under that model:

Requirement:
A session token is accepted only if its signature is valid,
it has not expired, and it has never been used before.

Obligations:                              Evidence:
- forged signature rejected               TestCanUseSession (MC/DC)
- expired token rejected                  TestCanUseSession (MC/DC)
- replayed token rejected, same process   TestCanUseSession (MC/DC)
- replay survives restart                 MISSING — nonce store is in-memory
- replay survives failover                MISSING
- clock skew between servers              accepted risk: ±30s, documented
- malformed token rejected safely         fuzz target, nightly
- rejection reveals which check failed    MISSING

Same function, same tests. But now the failure isn’t a surprise buried in an architecture diagram — it’s a row that says MISSING, visible in review, before the deploy. The test suite didn’t get better. The *argument* got better, and the argument is what caught it.
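The table is small enough to live next to the code. What follows is a minimal sketch of that idea, not the format of any existing tool: the Obligation type, the Status values, and the missing function are all hypothetical, and the rows are copied from the table above.

```go
package main

import "fmt"

// Status says how an obligation is currently accounted for.
type Status int

const (
	Covered  Status = iota // there is evidence
	Accepted               // a documented, accepted risk
	Missing                // no evidence and no decision
)

// Obligation is one row: a claim and whatever supports it.
type Obligation struct {
	Claim    string
	Evidence string
	Status   Status
}

var sessionObligations = []Obligation{
	{"forged signature rejected", "TestCanUseSession (MC/DC)", Covered},
	{"expired token rejected", "TestCanUseSession (MC/DC)", Covered},
	{"replayed token rejected, same process", "TestCanUseSession (MC/DC)", Covered},
	{"replay survives restart", "nonce store is in-memory", Missing},
	{"replay survives failover", "", Missing},
	{"clock skew between servers", "accepted risk: ±30s, documented", Accepted},
	{"malformed token rejected safely", "fuzz target, nightly", Covered},
	{"rejection reveals which check failed", "", Missing},
}

// missing returns the claims that have neither evidence nor a decision.
func missing(obs []Obligation) []string {
	var claims []string
	for _, o := range obs {
		if o.Status == Missing {
			claims = append(claims, o.Claim)
		}
	}
	return claims
}

func main() {
	for _, claim := range missing(sessionObligations) {
		fmt.Println("MISSING:", claim)
	}
}
```

Run against the rows above, it lists the three MISSING obligations. A check like this could run in review or CI, but where it runs is a separate decision.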

This changes what review means. “Do we have tests?” becomes “which claim does this test prove?” “What’s the coverage number?” becomes “which obligations have no evidence?” It’s much harder to fake an argument than a number.

Code does not remember why

Engineers love saying code is the source of truth, because code runs and docs rot. Half right. Code is the source of truth for what the system does. It’s a terrible source of truth for what the system was supposed to do.

Code can’t tell you whether a behaviour is intentional or accidental. Whether the weird branch exists because of a customer, a migration, or a 3am incident nobody wrote down. Whether clock skew was accepted or never considered. So every change starts with archaeology: reconstruct intent from the diff, stale tickets, Slack threads, and whoever hasn’t left yet.

This was already failing when humans wrote all the code. AI makes it fail faster, because generating plausible code is now cheap and verifying intent is not. And when the code and the tests come from the same prompt — the same incomplete understanding — they agree with each other perfectly. The replay bug above, generated fluently in seconds: in-memory nonce store, matching in-process replay test, green CI, high coverage, missing obligation still missing. That’s not verification. That’s a model grading its own homework.

Humans always did this too. AI just industrialised the polished misunderstanding.

What the second 100% looks like

Not MC/DC everywhere. Not a requirements document for a button color. Start where being wrong is expensive: auth, billing, deletion, migrations, tenant boundaries, anything that moves money or can’t be un-shipped.

Write the requirement in plain English. List the obligations. Attach evidence — and let MISSING be visible. That’s the entire practice. The obligations table above is not a big artifact; it’s twenty lines that would expose the failure before it became an incident, and it answers questions a coverage report can’t: Which obligations are handled? Which are accepted risks — accepted by a person, on the record? Which decisions need MC/DC instead of a happy path? What goes stale if the requirement changes?

The first 100% covers the structure of the code. The second covers its responsibility. One asks whether the code ran. The other asks whether anyone understood what the code had to protect.

Why I’m building Proof

This bothered me long enough that I’m building a tool around it. Requirements that survive the ticket closing. Obligations that live in the repo instead of someone’s head. Tests that know which claim they support. Code, tests, and requirements that invalidate each other when they drift.

Not compliance theatre. Not another number to gamify. An evidence chain that answers the only question that matters in a review: Why do we trust this change?

Coverage can’t answer that. It can only tell you which lines ran.

The second 100% is everything your coverage tool cannot see — and after enough incidents, you learn that’s exactly where the bugs live.


Bugs Are Misunderstandings Made Executable

Leonid Bugaev — Tue, 30 Jun 2026 16:40:49 GMT

A software bug is often a misunderstanding that has become executable.

At the deepest level, a bug is not merely “a programmer made a typo.” It is a mismatch among four things: what people wanted, what was specified, what was implemented, and the world in which the software actually runs. Software is hard because humans speak in intentions, but computers execute exact instructions. The gap between intention and exactness is where bugs live.

Dijkstra captured one side of the problem: “Program testing can be used to show the presence of bugs, but never to show their absence.” His point was not that testing is useless; it is that testing samples behavior, while a nontrivial program may have an enormous or effectively unbounded space of possible behaviors. At the theoretical limit, computability theory gives a hard boundary: there is no general algorithm that can decide all interesting behavioral questions about arbitrary programs; Stanford’s entry on computability summarizes Turing’s proof that the halting problem is unsolvable.

1. The philosophical reason software has bugs

Software is an attempt to turn an ambiguous, changing, social world into precise, repeatable machinery.

That creates several layers of failure.

First, the world is underspecified. A user says, “make checkout fast,” “don’t lose my data,” “show the correct price,” or “make login secure.” Each sounds clear until you ask: correct in which currency, time zone, tax jurisdiction, discount rule, retry scenario, fraud case, browser, network condition, and accessibility context?

Second, software is exact but our understanding is approximate. A tiny missing condition can matter. A human can infer “obviously don’t charge the customer twice.” A payment service needs an explicit idempotency rule, retry behavior, database constraint, observability path, and failure recovery plan.

Third, software composes abstractions that leak. An app depends on libraries, operating systems, networks, CPUs, databases, caches, browsers, cloud services, humans, and business rules. Each layer promises something. Each layer has exceptions.

Fourth, software changes faster than most engineered artifacts. A bridge is not redeployed ten times a day. A web service might be. Features, dependencies, infrastructure, data distributions, regulations, and user behavior all shift underneath it.

Fifth, bugs are partly observer-relative. The same behavior can be correct for engineering, wrong for product, acceptable for one customer, illegal in one jurisdiction, and catastrophic in a safety-critical context. A bug is not only a broken instruction; it is a violated expectation.

A useful mental model:

Bug risk ≈ ambiguity × complexity × change × coupling × consequence ÷ feedback speed

That is not a formal equation, but it captures the industry’s lived reality: bugs grow where systems are unclear, large, changing, interconnected, high-stakes, and slow to learn from.

2. A classification of bugs

1. Requirements bugs

What goes wrong: The team builds the wrong thing.

Why it happens: The desired behavior was ambiguous, incomplete, misunderstood, or changed after implementation began.

Example: “Cancel subscription” does not clearly specify whether refunds are prorated, immediate, delayed, or handled differently by country.

2. Domain and modeling bugs

What goes wrong: The software’s model of reality is too simple or simply wrong.

Why it happens: Real-world rules are messier than the abstraction chosen by the developers.

Example: Tax rules, calendar rules, currency conversion, medical workflows, legal requirements, logistics, and identity systems often contain exceptions that the original model did not capture.

3. Algorithmic and logic bugs

What goes wrong: The code computes the wrong result.

Why it happens: The programmer’s reasoning was flawed, an edge case was missed, or the algorithm was implemented incorrectly.

Example: Off-by-one errors, wrong rounding, incorrect sorting, missing conditions, or a branch that handles most cases but fails on one unusual input.

4. State bugs

What goes wrong: The system remembers the wrong thing, forgets something important, duplicates an action, or corrupts stored information.

Why it happens: State is hard. Software must track sessions, caches, retries, database writes, background jobs, user actions, and partially completed operations.

Example: A customer clicks “pay,” the network times out, the system retries, and the customer is charged twice because the payment operation was not designed to be idempotent.
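A minimal sketch of the idempotency rule this example is missing, with hypothetical names (Ledger, Charge, the "order-42" key). It keeps the set of processed keys in memory only to stay short. A real payment path would need that set to be durable, for example as a unique constraint in the database, or it fails on restart.

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger is a hypothetical in-memory payment record.
type Ledger struct {
	mu      sync.Mutex
	applied map[string]bool // idempotency keys already charged
	total   int             // total charged, in cents
}

func NewLedger() *Ledger { return &Ledger{applied: map[string]bool{}} }

// Charge applies the charge at most once per idempotency key and reports
// whether this particular call moved money.
func (l *Ledger) Charge(key string, cents int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.applied[key] {
		return false // a retry of a charge that already happened
	}
	l.applied[key] = true
	l.total += cents
	return true
}

// Total returns the amount charged so far, in cents.
func (l *Ledger) Total() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.total
}

func main() {
	l := NewLedger()
	fmt.Println("first attempt charged:", l.Charge("order-42", 500))
	fmt.Println("retry charged:", l.Charge("order-42", 500))
	fmt.Println("total cents:", l.Total())
}
```

The client sends the same key on the retry, so the timeout and the retry collapse into one charge.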

5. Interface and integration bugs

What goes wrong: Two parts of the system disagree about how they are supposed to communicate.

Why it happens: Different components make different assumptions about data formats, units, versions, schemas, encodings, or API behavior.

Example: One service sends a timestamp in milliseconds, while another service interprets it as seconds.
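One common defence is to put the unit in the type, so the disagreement becomes a compile error instead of a production surprise. A small sketch, with hypothetical type names UnixSeconds and UnixMillis:

```go
package main

import (
	"fmt"
	"time"
)

// UnixSeconds and UnixMillis are distinct types, so a value of one cannot be
// passed where the other is expected without an explicit conversion.
type UnixSeconds int64
type UnixMillis int64

// Time converts seconds since the Unix epoch to a time.Time.
func (s UnixSeconds) Time() time.Time { return time.Unix(int64(s), 0) }

// Time converts milliseconds since the Unix epoch to a time.Time.
func (ms UnixMillis) Time() time.Time { return time.UnixMilli(int64(ms)) }

func main() {
	sent := UnixMillis(1700000000000)

	// Correct reading: the sender's unit travels with the value.
	fmt.Println("as milliseconds:", sent.Time().UTC().Year())

	// The bug from the example: the raw number read as seconds.
	fmt.Println("as seconds:", time.Unix(int64(sent), 0).UTC().Year())

	// var wrong UnixSeconds = sent // does not compile: mismatched types
}
```

This only protects code inside one program. Across a service boundary, the unit still has to be part of the contract, in the field name or the schema.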

6. Data bugs

What goes wrong: The code may be logically correct, but the data it receives is wrong, incomplete, stale, duplicated, malformed, or unexpected.

Why it happens: Real production data is often messier than test data. It may contain null values, old records, migration artifacts, inconsistent formats, or historical exceptions.

Example: A report works perfectly in testing but crashes in production because one old customer record is missing a field that newer records always have.

7. Concurrency bugs

What goes wrong: The software behaves differently depending on timing.

Why it happens: Multiple operations happen at the same time and interact in unexpected ways.

Example: Two users try to buy the last ticket at the same moment, and both transactions appear to succeed because the system did not correctly lock or coordinate access to shared state.
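A minimal sketch of coordinated access for the last-ticket case, assuming a single process and hypothetical names (Inventory, Buy, sell). Across several service instances a mutex is not enough, and the same check-then-decrement would need to happen atomically in the shared database.

```go
package main

import (
	"fmt"
	"sync"
)

// Inventory guards the remaining ticket count with a mutex.
type Inventory struct {
	mu        sync.Mutex
	remaining int
}

// Buy checks and decrements under one lock, so the check and the update
// cannot be interleaved with another buyer's.
func (inv *Inventory) Buy() bool {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	if inv.remaining == 0 {
		return false
	}
	inv.remaining--
	return true
}

// sell runs the given number of buyers concurrently and returns how many
// purchases succeeded.
func sell(inv *Inventory, buyers int) int {
	results := make([]bool, buyers)
	var wg sync.WaitGroup
	for i := 0; i < buyers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = inv.Buy() // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()

	sold := 0
	for _, ok := range results {
		if ok {
			sold++
		}
	}
	return sold
}

func main() {
	inv := &Inventory{remaining: 1}
	fmt.Println("tickets sold:", sell(inv, 50))
}
```

Without the lock, two goroutines could both read a remaining count of one before either decrements it, which is the double sale in the example.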

8. Distributed-systems bugs

What goes wrong: Different machines, services, or regions develop different views of reality.

Why it happens: Networks are unreliable. Messages can be delayed, duplicated, reordered, dropped, or retried. Systems can be temporarily unavailable while other parts continue running.

Example: Two services both believe they own the same task because a coordination message was delayed or lost.

9. Resource and performance bugs

What goes wrong: The application works under normal conditions but fails under pressure.

Why it happens: The system consumes too much memory, CPU, database capacity, network bandwidth, or time. Small inefficiencies become serious at scale.

Example: A page loads quickly with 100 users but times out with 100,000 because every page view triggers dozens of unnecessary database queries.

10. Configuration and deployment bugs

What goes wrong: The right code runs with the wrong settings, in the wrong environment, or against the wrong dependency.

Why it happens: Modern software depends heavily on environment variables, secrets, permissions, feature flags, regions, infrastructure settings, and deployment pipelines.

Example: A production service accidentally points to a staging database, or a feature flag is enabled for all users instead of a small test group.

11. Security bugs

What goes wrong: The software behaves correctly for ordinary users but becomes exploitable when used maliciously.

Why it happens: The system was designed for cooperative use, but attackers search for unexpected paths through input fields, permissions, APIs, dependencies, and trust boundaries.

Example: A form accepts user input and passes it directly into a database query, creating an injection vulnerability.

12. UX and human-factor bugs

What goes wrong: The user makes a mistake because the interface encourages, hides, or fails to prevent it.

Why it happens: Software is not only code; it is also a conversation with the user. Bad design can make the wrong action look safe, obvious, or reversible.

Example: A destructive action uses vague wording, so users click it without realizing they are permanently deleting data.

13. Toolchain and platform bugs

What goes wrong: The application fails because of something beneath or around the code.

Why it happens: Software depends on compilers, runtimes, operating systems, browsers, hardware, cloud platforms, libraries, and frameworks. Those layers can also contain bugs or behave differently than expected.

Example: An application works in one browser but fails in another because of a subtle difference in how they implement a web standard.

14. Process and organizational bugs

What goes wrong: The organization creates the conditions for defects.

Why it happens: Bugs are not always born inside code. They can come from rushed deadlines, unclear ownership, poor communication, weak review practices, bad incentives, or fragmented teams.

Example: A risky migration ships without a rollback plan because no single team clearly owns the full production impact.

Security has its own mature bug vocabulary. MITRE’s CWE is a community-developed list of software and hardware weaknesses, and the CWE Top 25 highlights common, impactful weaknesses that can guide engineering investment, SDLC changes, and architectural prevention.

3. Why bugs keep happening

The naive explanation is: “programmers make mistakes.”

The deeper explanation is: software development is knowledge work under uncertainty.

A programmer is not simply typing instructions. They are translating a fuzzy goal into a formal mechanism while negotiating incomplete requirements, legacy constraints, hidden dependencies, deadlines, user expectations, economic tradeoffs, and future change.

Many bugs are not failures of syntax. They are failures of shared understanding.

For example:

A requirements bug is a failed conversation.

An integration bug is a failed contract.

A concurrency bug is a failed mental model of time.

A security bug is a failed imagination of adversarial behavior.

A performance bug is a failed extrapolation from small to large.

A deployment bug is a failed connection between code and environment.

A process bug is a failed organization design.

This is why “just hire better programmers” does not solve the problem. Better programmers help, but most serious software failures emerge from the system around the programmer: unclear goals, changing context, inadequate feedback, insufficient isolation, weak testing, poor observability, and incentives that reward shipping over understanding.

4. The industry response

The industry’s response is not one thing. It is a layered immune system.

Prevention: make whole classes of bugs harder or impossible

This includes better requirements, design reviews, type systems, memory-safe languages, static analysis, linters, code review, architecture constraints, secure coding standards, threat modeling, and safer frameworks.

For security, NIST’s Secure Software Development Framework organizes practices into preparing the organization, protecting software, producing well-secured software, and responding to vulnerabilities; NIST explicitly frames SSDF as a risk-based way to reduce vulnerabilities, mitigate impact, and address root causes rather than as a mere checklist. OWASP SAMM similarly defines business functions and security practices across governance, design, implementation, verification, and operations to help organizations improve software security posture.

For safety-critical systems, industries use standards and assurance processes. In aviation, RTCA describes DO-178C as the current core document for design assurance and product assurance for airborne software, referenced by FAA guidance. NASA’s software assurance materials emphasize systematic software assurance, software safety, independent verification and validation, and formal inspections to detect and eliminate defects early in the lifecycle.

Detection: find bugs before users do

This includes unit tests, integration tests, end-to-end tests, property-based tests, fuzzing, static analysis, dynamic analysis, penetration testing, model checking, simulation, staging environments, chaos experiments, and formal verification.

The key point: different techniques see different bug classes. Unit tests catch local logic errors. Integration tests catch contract mismatches. Fuzzing finds weird inputs. Static analysis finds certain patterns without running the program. Formal methods can prove specific properties. Production monitoring catches what pre-release methods missed.

Containment: assume some bugs will escape

Modern systems often assume failure and try to reduce blast radius.

That means feature flags, canary releases, staged rollouts, circuit breakers, rate limits, graceful degradation, retries with idempotency, backups, rollback plans, isolation boundaries, and observability.

Site reliability engineering formalizes this with service-level objectives and error budgets. Google’s SRE material describes SLOs as a way to measure reliability and error budgets as a way to balance reliability work against other engineering work.
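The arithmetic behind an error budget is short enough to show. This sketch assumes a time-based budget over a fixed window, which is a simplification, since many teams count failed requests instead. The function name errorBudget and the basis-point parameter are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget returns the unavailability allowed in the window for an SLO
// given in basis points (9990 means 99.90%). Integer math keeps it exact.
func errorBudget(sloBasisPoints int64, window time.Duration) time.Duration {
	return window * time.Duration(10000-sloBasisPoints) / 10000
}

// thirtyDays is the measurement window assumed in this example.
const thirtyDays = 30 * 24 * time.Hour

func main() {
	fmt.Println("99.90% over 30 days:", errorBudget(9990, thirtyDays))
	fmt.Println("99.99% over 30 days:", errorBudget(9999, thirtyDays))
}
```

At 99.9% over thirty days the budget is 43 minutes and 12 seconds. One more nine shrinks it to about four minutes, which is why the target is a product decision and not only an engineering one.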

Measurement: make software delivery visible

The DevOps/DORA movement responds to bugs partly by measuring flow and instability. DORA’s software delivery metrics include change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate; DORA frames these as ways to understand both throughput and instability in software delivery.

The important cultural shift is that high-performing teams do not merely ask, “How many bugs did we write?” They ask:

How quickly do we detect them?

How many users are affected?

Can we roll back safely?

Do we learn from incidents?

Did we remove a class of defect, or only patch one instance?

Learning: treat incidents as information

Good organizations do postmortems, root-cause analysis, vulnerability disclosure, bug bounty programs, incident reviews, and process improvements.

NASA’s own lesson summaries illustrate this mindset: its Software Engineering Handbook discusses the Mars Climate Orbiter loss as a mismatch of imperial vs. metric units and ties the lesson to stronger software assurance, requirements validation, interface verification, and data-consistency testing.

The mature industry response is therefore not “be perfect.” It is:

Prevent what you can, detect what you cannot prevent, contain what escapes, recover quickly, and learn so the same class of failure becomes less likely.

5. Is bug-free software possible?

It depends what “bug-free” means.

In the absolute, everyday sense: no, not for large real-world applications.

A large application cannot be proven “bug-free” in the same way a theorem can be proven, because the word “bug” depends on user expectations, changing requirements, unstated assumptions, environment behavior, business rules, legal rules, and future situations nobody has imagined yet.

Also, even mathematically, there are hard limits. General automatic verification of arbitrary program behavior runs into undecidability barriers: not every meaningful question about arbitrary programs can be mechanically decided.

But in a narrower, more precise sense: yes, sometimes.

You can build software that is bug-free relative to a formal specification, for specified properties, under stated assumptions.

That distinction matters enormously.

The seL4 project, for example, proves that the seL4 OS kernel implements its specification correctly, with computer-checked mathematical evidence; its documentation also describes proofs of functional correctness, security properties, compilation correctness on certain architectures, and the fact that proof work evolves with hardware and features. CompCert is a formally verified C compiler intended for high-assurance software; its core guarantee is that generated executable code behaves as prescribed by the semantics of the source program, although its own documentation is careful about scope, noting parts such as source-text transformation and assembling/linking that are not fully formally verified.

So the honest answer is:

Perfect software is possible only after you shrink the meaning of “perfect.”

You must specify:

Perfect with respect to which behavior?

For which inputs?

Under which hardware assumptions?

With which compiler/runtime assumptions?

Against accidental failure, adversarial attack, or user misunderstanding?

For how long, as the environment changes?

A formally verified system may still have a bad specification. It may correctly implement the wrong requirement. It may be secure in one threat model and vulnerable in another. It may be logically correct and still unusable. It may satisfy every stated property and still surprise users.

6. A note on Proof

If bugs are misunderstandings made executable, then the obvious question is: how do we catch misunderstandings before they become executable?

That question is one of the reasons I started building Proof.

By Proof, I do not mean a magical guarantee that software can never fail. I mean a more practical layer between requirements, code, tests, documentation, and review. Tests tell us whether selected examples worked. Proof asks whether we described the intended behavior clearly enough that a tool, a reviewer, or an AI agent can challenge it before production does.

This matters even more in the age of AI-generated software. AI can now produce code very quickly. It can also produce plausible tests, plausible documentation, and plausible explanations. But plausibility is not evidence. If the original intent is vague, AI may simply automate the misunderstanding faster.

That is the gap Proof is meant to address: preserving intent as software changes. A requirement should not be a disposable note that dies once implementation begins. It should remain connected to the code that implements it, the tests that check it, and the evidence that shows what has actually been verified.

Proof does not eliminate all bugs. Nothing does. But it can move some bugs earlier — from production incidents into requirements, obligations, inconsistencies, missing edge cases, and untested assumptions. In that sense, Proof is not about perfection. It is about debugging intent before reality does it for us.

7. The practical goal is not “zero bugs”

For most software, “zero bugs” is not an engineering goal. It is a slogan.

A better goal is:

No catastrophic bugs, no repeated bugs, no silent bugs, no unactionable bugs, and fewer entire classes of bugs over time.

The best teams try to make bugs:

less likely through design,

less severe through isolation,

more visible through observability,

faster to fix through deployment discipline,

less repeatable through learning,

and sometimes impossible through stronger languages, constraints, and formal methods.

The deepest lesson is that software quality is not only a technical property. It is a property of a whole system: people, incentives, tools, architecture, feedback loops, and philosophy of risk.

A bug is what happens when reality finds a path through your assumptions.

Six things I realised after Mythos disappeared

Leonid Bugaev — Tue, 16 Jun 2026 19:15:32 GMT

Once I heard that Anthropic was releasing Mythos 5, and that it would be available in the subscription tier for only 3 weeks, my first action was to buy one more account to squeeze as much intelligence out of it as possible. The feds removed it 3 days later; the rest is history.

I can’t say that my inner world has changed, as I naturally always try to imagine the worst-case scenarios, but I defo did not expect it so soon. It made me reflect on my current AI usage, both personal and business, and re-evaluate all the risks.

TLDR:

The gains I get from Claude Code and Codex (LLM + tight harness) are unbelievable, there is no true OSS alternative yet, and I am ready to pay way more for them.
I do not need more intelligence; the current level is already amazing. I need more throughput.
LLM costs are killing margins for businesses, and geopolitical risks create so much uncertainty and turbulence, with even more fragmentation ahead. Those who own their harness and tune their products to avoid vendor lock-in will survive.
Like any constraint, tuning your products for cheaper and dumber models makes them better and makes you way more creative. It also scales much better.

1. My productivity is now limited by LLM throughput

First of all, I don’t regret buying the additional account. Actually, I already have at least five now: one on Codex, two on Claude, and also a couple of open-source or hosted models, and I already pay around $700/month for them. But that is not really the point.

My productivity is now basically measured by how much access I have to the capability of the language model. I am not limited by ideas. I am not limited by tasks. I am limited by how many high-quality model sessions I can run before I hit a limit.

As an individual, of course, I cannot spend millions on this like a business. But I try to squeeze as much intelligence as I can out of the current subsidised AI market before it ends.

So yes, I got hooked on Codex and Claude Code, and my output depends on them.

2. I don’t need more intelligence. I need more bandwidth.

The funny thing is that I don’t even feel I need some dramatically smarter model right now. Of course, everyone wants a better model, and if tomorrow something twice as smart appears, I will obviously try to burn it down with tokens immediately. But honestly, the current level of intelligence from Opus and Codex 5.5 is already enough for the majority of tasks I am doing.

It already does almost everything I wanted it to do. The problem is not that I am waiting for AGI to become productive. I am productive already. The problem is that I do not have enough throughput.

I am not really limited by intelligence anymore. I am limited by bandwidth, limits, price, speed, and access. If I had cheap, almost unlimited access to the current level of models, I could do a ridiculous amount of things. Not in the future, not with some theoretical next generation, but with what already exists now.

3. We got spoiled, and everything can be cut off

The second thing is that the risk is real. Everything can be cut off at any moment.

We have become spoiled by products like Claude Code and Codex. They are good in a dangerous way, because after using them it becomes painful to go back. You start expecting the model to understand the task, keep the context, follow the direction, and behave like something close to a useful engineering partner.

Then you try open-source models… In benchmarks, some of them look close to Opus; in real life, it is much more nuanced. They have bugs and edge cases, the harness is not optimised for them, and they are hard to steer mid-task.

But this is also useful. They show where your system depends too much on the best model. If your workflow only works when the model is extremely smart, maybe the workflow is not good enough.

4. Weak models are annoying, but they make your system better

Limitations always enforce creativity. Weak models are annoying because they do not forgive you. They do not magically understand what you meant. They do not compensate for bad context, unclear prompts, or lazy workflow design. But this is exactly why they are useful.

When I experimented with weaker models, they forced me to improve my own applications. Better prompting, better context, better task boundaries, better harness. A strong model can hide a lot of bad architecture. A weaker model exposes it immediately.

Of course I would prefer Opus-level models with no rate limits and very low prices, but we live in the real world.

The good news is that many business tasks do not need the strongest model. If the task is clear, constrained, and repeatable, you can often use cheaper models, simpler models, or even deterministic tools. This is where the economics start to make sense.

So one lesson is simple: prepare for weaker models. Not because they are great, but because your system should not collapse without the best one.

5. If the model can disappear, the harness can disappear too

If it is so easy to cut access to a model, then it is also easy to cut access to the harness around it. Claude Code itself, Codex, OpenCode, or anything similar can change, disappear, become expensive, become limited, or move in a direction that does not work for you.

This is why depending on someone else’s harness in serious development is a mistake, especially if this is your business.

I think I made the right decision when I started building my own harness, my own workflow engine. Not because I want to rebuild everything for fun, although apparently I do enjoy making my life more complicated. But because I want independence.

I want to own how models are called. I want to own how context is prepared. I want to own routing, evals, constraints, tools, and the workflow itself. Even open-source harnesses can become someone else’s roadmap. That is fine for experiments. For business, I do not want my core workflow to be someone else’s pricing experiment.
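A minimal sketch of what owning the call path could look like. Everything here is an assumption made for illustration: the Model interface, the Router type, and the model names are hypothetical, and the stub stands in for real provider clients so the example runs offline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Model is a hypothetical provider-agnostic interface, not any vendor's SDK.
type Model interface {
	Name() string
	Complete(ctx context.Context, prompt string) (string, error)
}

// stubModel stands in for a real provider client so the sketch runs offline.
type stubModel struct {
	name string
	err  error // non-nil simulates a provider that is down or cut off
}

func (m stubModel) Name() string { return m.name }

func (m stubModel) Complete(_ context.Context, prompt string) (string, error) {
	if m.err != nil {
		return "", m.err
	}
	return "[" + m.name + "] " + prompt, nil
}

// Router tries models in preference order and falls back when one fails.
type Router struct {
	Models []Model
}

// Complete returns the first successful answer and the name of the model that gave it.
func (r Router) Complete(ctx context.Context, prompt string) (answer, used string, err error) {
	if len(r.Models) == 0 {
		return "", "", errors.New("no models configured")
	}
	var errs []error
	for _, m := range r.Models {
		out, callErr := m.Complete(ctx, prompt)
		if callErr == nil {
			return out, m.Name(), nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", m.Name(), callErr))
	}
	return "", "", errors.Join(errs...)
}

// demo wires a revoked frontier model in front of a weaker local one.
func demo() (answer, used string, err error) {
	r := Router{Models: []Model{
		stubModel{name: "frontier", err: errors.New("access revoked")},
		stubModel{name: "local-small"},
	}}
	return r.Complete(context.Background(), "summarise the diff")
}

func main() {
	answer, used, err := demo()
	fmt.Println(answer, used, err)
}
```

The workflow only ever sees the Router, so swapping, reordering, or losing a provider is a configuration change rather than a rewrite. A real harness would add routing by task type, evals, and cost limits on top of this.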

Building your own harness is realistic, especially if you have a lot of real material to test on.

6. For individuals, this is a subscription. For business, it is unit economics.

For individual use, I am ready to pay even more for products like Claude Code or Codex. They are so good that even if they cost twice as much, I would probably still use them. As a top-tier engineer, the economics work very well for me. If it increases my output, improves my loops, and lets me keep working, it is worth it.

But business economics are completely different. You do not want to build a business where your whole margin is eaten by the LLM. That simply does not work. Price and speed become key business components. If you can optimize your application for a model that is 10 or 100 times cheaper, and it still works well, that changes the game.

Some businesses may only become possible when this pricing vs intelligence problem is solved. Plan-based pricing is dying, and API-based pricing is already changing how management looks at LLM costs.

There is also the regulatory part. Mythos was one signal. Manus and Meta was another. Governments will intervene more. Companies will want privacy and protection. Serious enterprise players will not want to depend on a model, a provider, a country, or a harness they do not control.

So the conclusion for me is very simple.

Be independent from the model. Optimize your application for weaker, not-so-smart models. Stop thinking about this as open-ended magic and start thinking about workflows.

And most importantly: own the harness.

Death of Security by Obscurity

Leonid Bugaev — Thu, 28 May 2026 18:21:35 GMT

I should feel very scared right now. Everyone should be freaking scared. But I think we have had so many emotional events happening in the world recently that people have stopped feeling much of anything, and that is the most dangerous part of where we are. The line we used to tell ourselves — “we are not a bank, we are not NASA, we don’t need that level of security” — is no longer an option. We just haven’t felt it yet.

Try this thought experiment. Imagine your company’s source code is made public tomorrow. All of it. How would you feel? I bet most of you would be freaking scared. Not because of IP. Because of the quality. Because of the spaghetti conditions in some files. Because of the strange customer-specific branch nobody touched in three years. Because of the comment that says “// TODO: fix this before prod” still sitting in prod. Because of the auth path that “almost” works. Because of the secret that probably should have been rotated.

For years, a lot of teams treated security as something between a badge, a process, and a hope. Of course everyone said the right thing. “We treat security as a first-class citizen,” and so on. But in practice, unless you were in a regulated industry, security was often optional in the only sense that matters — optional in priority. You followed best practices. You ran dependency scanners. Maybe you had a penetration test every six months because customers asked for it. Maybe you had a badge in your sales deck. But you weren’t really thinking about security as part of how the product works.

Banks, automotive, and aerospace were different. In those industries, security is existential: if a bank loses trust, the bank is dead, and if an automotive system fails the wrong way, people die. So they built heavy processes around it: requirements, reviews, evidence, traceability, release gates. All the painful stuff. For a long time it was easy for the rest of us to look at that and say: yes, but we are not a bank.

I used to think this way too. At Tyk, I work with banks, governments, and large enterprises, and I was often annoyed by how slow some of their security processes were. Every release needed another check. Every patch had to go through another team. Every dependency update could become a discussion. Sometimes it took weeks or months. From the outside it looked like bureaucracy, and a lot of it was bureaucracy. But I changed my mind on the core idea. Those industries understood something the rest of us could safely ignore for a while: security is not something you add at the end. It is part of what the system is.

Security is a market

This is the part most engineers do not internalise. There is a real economy of people who make money by finding vulnerabilities in software. Some sell to bug bounty programmes. Some sell to brokers. Some sell to whoever is buying. And some just use what they find directly — exfiltrate data, sell the data, blackmail companies, take systems hostage.

That market used to be expensive to enter. Finding bugs took time. Understanding a custom system took time. Building an exploit took time. So attackers focused where the return was high — WordPress, Drupal, popular CMS plugins, well-known SaaS — anything they could exploit a million times after building it once. If you were niche, you were maybe scanned, but rarely understood. That asymmetry was your moat. Nobody admitted it out loud, but it was the moat.

A few years ago at Tyk we had a slightly crazy idea: let’s find open-source Tyk users across the world and see if some of them could become paid users. The idea wasn’t crazy. The crazy part was how easy the technical side was. I wrote a scanner that could scan the public IPv4 internet in a matter of hours. The whole internet. Once you do something like that yourself, “nobody will find us” stops sounding like a serious argument.

You can reproduce a small version of this at home. Run a basic HTTP application on a fresh public IP, expose a port, and watch the logs. Within minutes you start seeing requests: WordPress paths, admin URLs, old plugin routes, random probes, exploit attempts for software you are not even running. Most of it is dumb traffic. That is exactly the point. The internet does not need to know who you are before it starts touching your system.
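A minimal sketch of such a "watch the logs" application. The names formatProbe and probeLogger are made up for this example. To keep the sketch self-terminating, main runs against a local test server; on a real public host you would serve the same handler with http.ListenAndServe instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// formatProbe renders the fields worth watching in a probe: source, method, path, client.
func formatProbe(remote, method, path, userAgent string) string {
	return fmt.Sprintf("%s %s %s ua=%q", remote, method, path, userAgent)
}

// probeLogger answers 404 to everything and reports each request through logf.
func probeLogger(logf func(string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logf(formatProbe(r.RemoteAddr, r.Method, r.URL.RequestURI(), r.UserAgent()))
		http.NotFound(w, r)
	})
}

func main() {
	// Local demo so the sketch terminates. On a public host, replace this with
	// something like http.ListenAndServe(":8080", probeLogger(...)); the port
	// is an arbitrary choice.
	srv := httptest.NewServer(probeLogger(func(line string) { fmt.Println(line) }))
	defer srv.Close()

	// Simulate the kind of request scanners send to software you are not running.
	resp, err := http.Get(srv.URL + "/wp-login.php")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```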

So scanning was already cheap. The thing that just changed is understanding.

What AI actually changed

AI did not invent insecure software. We were already very good at writing insecure software. What AI changed is the cost of finding the insecurity, and the cost of understanding an unfamiliar system. A model can read a codebase and ask the questions a tired team will never ask, because everyone is busy shipping the next thing. It can build a personalised exploit for an unfamiliar service in hours, sometimes minutes. It can chain three small bugs that nobody would have chained manually because the manual cost was too high.

The reason this is genuinely scary is the floor, not the ceiling. The economics have flipped to the point where a kid in a basement with a decent model can build a personalised exploit for your specific codebase, scan the whole public internet in an afternoon, find every instance of your software, and run that exploit against all of them. Nothing about that sentence requires a state actor.

And this is not theoretical. Anthropic’s Project Glasswing reports that Anthropic and around 50 partners used Claude Mythos Preview to find more than ten thousand high- or critical-severity vulnerabilities across important software in the first month, with the bottleneck shifting from finding vulnerabilities to verifying and patching them. Cloudflare pointed the same model at more than fifty of their own repositories and found that real vulnerability research needs a harness — architecture context, narrow tasks, validation — but with that harness, it works. Anthropic also says Mythos-class capabilities will soon exist in many AI labs. Open source historically catches up in around six months. The gap closes; it does not stay open.

So the relevant question is not whether this exact model is public today. The question is whether your security model assumes this capability stays rare. I would not bet a company on that.

Assume your source code is already out

Here is the uncomfortable truth: you should stop wondering whether your source code is going to leak. You should assume it already has.

I know how that sounds. But look at what just happened to GitHub. On May 20, 2026, GitHub said it had detected and contained a compromise of an employee device involving a poisoned third-party VS Code extension. GitHub’s assessment was that GitHub-internal repositories were exfiltrated, with the attacker’s claim of around 3,800 repositories being directionally consistent with the investigation. GitHub said it had no evidence of impact to customer repositories outside its own internal ones, but it still had to rotate critical secrets and continue analysing logs.

Think about who this happened to. This is GitHub. Owned by Microsoft. With MDM, EDR, hardened endpoints, mature processes, more security engineering than almost any company on earth. One developer. One poisoned VS Code extension. Source code gone.

And the exposure window was tiny. The Nx Console advisory says the malicious version was live in the Visual Studio Marketplace for about 18 minutes and in OpenVSX for about 36 minutes. Eighteen minutes was enough.

Now compare that to your company. If this happened to GitHub, with everything GitHub has, do you really believe nobody has done the equivalent to your team? Be honest. How many extensions did your team install last week? Do you know what any of them do? When did each one last update? Who reviewed it?

So assume it. Assume some snapshot of your source code is already out there. It does not even need to be the latest one — an older snapshot is enough to know where to look. And once an attacker has that, they can use AI to do exactly what defenders are starting to do: read everything, model the system, and build the exploit shaped specifically for you.

That changes the threat model fundamentally. Black-box testing — sending requests, observing responses, fuzzing endpoints, inferring behaviour — is already dangerous. White-box is a different category. With your code in hand, the attacker does not need to guess. They can follow your authentication logic. They can inspect your authorisation paths. They can see your tenant isolation, find the one resolver that does not validate ownership, see the timeout path nobody tested, find the internal endpoint that was “safe” because nobody knew it existed, find the retry that is not idempotent, read the comment that says “this should never happen,” and find the strange customer-specific branch that everyone forgot about.

If your answer to that scenario is “well, they probably do not know how the system works,” then the source code itself was part of your security boundary. And that boundary is gone.

CI/CD is not plumbing anymore

The other place this hits hard is the build system. We used to think of CI/CD as plumbing. From an attacker’s point of view, CI/CD is one of the most interesting machines in the company — it usually has source code, deployment credentials, package publishing tokens, cloud access, GitHub tokens, and secrets for half of the internal systems.

The Trivy incident in March 2026 is the most uncomfortable example because Trivy is a security tool. A trusted security scanner became the attack vector — version tags in aquasecurity/trivy-action were force-pushed to credential-stealing malware, and the action stole everything CI runners had access to. CanisterWorm followed a similar pattern: attackers stole npm tokens from compromised pipelines and used them to publish backdoored versions across every namespace they could reach. The malicious packages ran on postinstall — install was enough.

So zero trust now has to include code. But notice the ladder. First, don’t trust your dependencies — pin versions, quarantine updates. Then, don’t trust the actions running in your pipelines — pin those to commit SHAs, not tags. Then ask the harder question: what if the platform itself is compromised, the way GitHub just was? At that point pinning helps but is no longer a complete answer. Each step up the ladder gives the attacker less leverage, but no step makes the problem zero.
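The "pin to commit SHAs, not tags" step is easy to check mechanically. Below is a rough sketch, assuming a line-based scan of workflow text rather than a real YAML parser; the function name unpinned and the action names in the sample are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// usesRef captures the action and the ref from lines like "- uses: owner/repo@ref".
var usesRef = regexp.MustCompile(`^\s*(?:-\s*)?uses:\s*["']?([^@\s"']+)@([^\s"'#]+)`)

// fullSHA matches a full 40-character commit SHA.
var fullSHA = regexp.MustCompile(`^[0-9a-f]{40}$`)

// unpinned returns every action reference that uses a tag or branch instead of
// a full commit SHA. Local actions (no "@") and docker:// images are skipped.
func unpinned(workflow string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(workflow))
	for sc.Scan() {
		m := usesRef.FindStringSubmatch(sc.Text())
		if m == nil || strings.HasPrefix(m[1], "docker://") {
			continue
		}
		if !fullSHA.MatchString(m[2]) {
			out = append(out, m[1]+"@"+m[2])
		}
	}
	return out
}

func main() {
	// The action names below are hypothetical.
	workflow := `
jobs:
  build:
    steps:
      - uses: example-org/scanner-action@v1
      - uses: example-org/build-action@0123456789abcdef0123456789abcdef01234567
      - name: local action, nothing to pin
        uses: ./.github/actions/lint
`
	for _, ref := range unpinned(workflow) {
		fmt.Println("not pinned to a commit SHA:", ref)
	}
}
```

A tag can be force-pushed to point at different code; a full commit SHA cannot. A check like this covers only that one rung of the ladder and says nothing about whether the pinned commit itself is trustworthy.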

This sounds exhausting. It is. But the alternative is worse.

The CVE model is dead

The CVE model still matters, but it cannot be the centre of your security process anymore. For a lot of teams the hidden workflow is still: wait for the CVE, check the severity, patch by priority, hope customers update before something bad happens. That workflow was already fragile. The new world breaks it.

VulnCheck found that in the first half of 2025, 32.1% of known exploited vulnerabilities had exploitation evidence on or before the day the CVE was issued. For a large share of exploited vulnerabilities, the CVE was not the start of danger. It was already late.

The system producing CVEs is also overloaded. NIST said CVE submissions increased 263% between 2020 and 2025. NIST enriched nearly 42,000 CVEs in 2025, more than any prior year, and still said it was not enough to keep up. In April 2026, NIST moved to a risk-based enrichment model where some CVEs are listed but not immediately enriched.

And none of that machinery will know the weird things inside your own system. A CVE will not know that your GraphQL resolver crashes a customer’s system on one malformed input. It will not know that your retry path is unsafe when the downstream service writes but your load balancer times out first. It will not know that your PII is in the same database as everything else, and one forgotten SQL injection path could expose more than anyone expected. You need a model of your own system, not just a feed of public bugs.

The intuitive response is “patch faster.” That is not enough either. No matter how fast you patch, if your architecture and development flow do not enforce security boundaries by default — if they do not force you to think about security as you write the code — patching will not save you. Every new release becomes a new attack surface, and a personalised exploit can be ready in five minutes. I am not exaggerating.

Cloudflare made the more important point. They described teams talking about a two-hour SLA from CVE release to patch in production, but if regression testing takes a day, getting to two hours means skipping something. Their conclusion is architectural: make exploitation harder even when a bug exists, put defences in front of the application, design the system so one flaw does not give access to everything else, and make fixes deployable everywhere at once.

That is the difference between reactive security and designed security. Reactive security tries to outrun the attacker. Designed security assumes bugs exist and limits what one bug can do.

So how do you live in this world?

The first thing is to actually accept where you are. You have to go through the stages: anger, fear, denial, and eventually acceptance. Most teams stop at denial. They tell themselves the old story: we are not big enough, we are not interesting enough, who would target us. That story is over.

Once you accept it, the next move is not “make everything secure.” That is too vague and usually becomes theatre. Start with the worst case. What can kill your company? How would I leak all customer data? How would I bypass authentication? How would I bypass tenant isolation? How would I poison a release? How would I get production credentials out of CI? How would I make one customer’s system go down with a single malformed request? How would I turn a slow downstream service into a cascading outage?

This is uncomfortable, but it gives you priorities. The worst case is not the same for every company. For a bank, anything touching the money is existential. If a bank loses money, the bank is dead. For a SaaS company automating LinkedIn outreach, leaking a list of customer emails is bad but survivable. The same company being used to impersonate its users and send messages on their behalf is not survivable. That is the death of the company.

If everything is critical, nothing is critical. But some things really are: authentication, authorisation, tenant isolation, secrets, CI/CD, package publishing, admin APIs, customer data, billing data, and anything that turns one bug into many customers affected. Be honest about which of these are existential for you, and put real process around those.

By real process I do not mean bureaucracy for its own sake. I mean what banks and government suppliers actually do, and I am saying this from experience — I spent years building software that went through that kind of pen-testing, the long kind, with people who do this for a living. Pin everything that runs in your build. Make secrets short-lived. Quarantine dependency updates before they reach your main branch. Treat anything that executes code in your dev or build environment as part of the product, because it is.

And then there is the boring part, the part that works much better than people want to admit: checklists.

Checklists are not exciting. They are one of the main reasons software engineering has ever shipped anything reliable. They are why planes fly. They are why a CT scanner does not lie to a radiologist. The reason they work is not because they are clever. They work because they force you to think about things you would otherwise skip.

If you have a login system, you should be forced to think about password reset, previous password reuse, account enumeration, timing attacks, brute force, lockout, and what happens when the email provider is down. If you make an outbound HTTP call, you should be forced to think about timeouts, DNS hangs, retries, idempotency, downstream slowness, partial success, and what happens if the service receives your request but your load balancer times out before you get the response. If your Go code starts a goroutine, you should be forced to think about cancellation, ownership, leaks, blocked channels, and behaviour under load.
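To make the outbound-call items concrete, here is a minimal sketch of a call that has a per-attempt timeout, an overall deadline, and a retry rule that respects idempotency. The helper names (idempotent, retryable, call), the backoff value, and the retry policy itself are assumptions for illustration, not a recommended production policy.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// idempotent reports whether HTTP semantics allow the method to be repeated safely.
func idempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPut, http.MethodDelete, http.MethodOptions:
		return true
	}
	return false
}

// retryable decides whether a failed attempt may be repeated. A timed-out POST
// is the dangerous case: the downstream service may already have applied it.
func retryable(method string, status int, transportErr bool) bool {
	if !idempotent(method) {
		return false
	}
	switch status {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return transportErr
}

// call makes up to attempts tries, each bounded by perTry, all bounded by ctx.
func call(ctx context.Context, method, url string, attempts int, perTry time.Duration) (int, error) {
	client := &http.Client{Timeout: perTry}
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequestWithContext(ctx, method, url, nil)
		if err != nil {
			return 0, err
		}
		status := 0
		resp, err := client.Do(req)
		if err == nil {
			status = resp.StatusCode
			resp.Body.Close()
			if status < 500 {
				return status, nil
			}
			err = fmt.Errorf("server returned %d", status)
		}
		lastErr = err
		if !retryable(method, status, status == 0) {
			break
		}
		select {
		case <-ctx.Done(): // the caller gave up: stop instead of retrying forever
			return 0, ctx.Err()
		case <-time.After(50 * time.Millisecond): // fixed backoff, short for the demo
		}
	}
	return 0, lastErr
}

func main() {
	hits := 0
	// Test server that fails every odd-numbered request with a 503.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hits++
		if hits%2 == 1 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	fmt.Println(call(ctx, http.MethodGet, srv.URL, 3, time.Second))  // retried after the 503
	fmt.Println(call(ctx, http.MethodPost, srv.URL, 3, time.Second)) // not retried
}
```

The point is not this particular policy. The point is that each checklist item (timeout, retry, idempotency, cancellation) ends up as an explicit decision in the code instead of a default nobody chose.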

None of this is advanced security research. It is basic engineering. But it only happens when the process forces it to happen. The hardest bugs to find are not the ones you wrote wrong; they are the ones you never wrote at all, because you never thought about that case.

And if you decide not to handle something, fine — say it. Log it as a known issue. Put an expiration on it. Hidden gaps are the dangerous ones.

Putting security on autopilot

After years of pen-testing work with banks and governments, I noticed the same kinds of gaps kept coming back. Login systems that didn’t think about timing attacks. HTTP calls that didn’t handle partial success. Goroutines that leaked under load. Not exotic bugs — basic ones, in different codebases, over and over.

So I started writing them down. Open-source projects I had investigated, real incidents I had seen up close, every checklist I had built across years of audits — all of it into a catalogue. That catalogue is what Proof runs on.

Then I built automation around it, because checklists in someone’s head do not scale. This is also why I care about MC/DC. Line coverage tells you a line ran. Modified Condition/Decision Coverage — required by the FAA for Level A software, where failure could be catastrophic — asks whether every logical part of a decision independently affected the outcome. The bug is rarely “this function was never tested.” It is “this function was tested, but not when auth is false, tenant is different, feature flag is enabled, downstream state is stale, and the retry path is active.” Not the happy path. The combination nobody specified.
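The MC/DC question can be asked of a test suite mechanically. The sketch below assumes a hypothetical three-condition session check and the unique-cause form of MC/DC: for each condition, the suite must contain two cases that differ only in that condition and produce different results. The function names are made up for this example.

```go
package main

import "fmt"

// decision is a hypothetical session check reduced to booleans:
// c[0] = signature valid, c[1] = expired, c[2] = nonce seen.
func decision(c []bool) bool {
	return c[0] && !c[1] && !c[2]
}

// differsOnlyIn reports whether a and b differ in condition i and nowhere else.
func differsOnlyIn(a, b []bool, i int) bool {
	for j := range a {
		if (a[j] != b[j]) != (j == i) {
			return false
		}
	}
	return true
}

// uncovered lists the conditions for which the test set has no independence
// pair: two cases that differ only in that condition and flip the result.
func uncovered(f func([]bool) bool, n int, tests [][]bool) []int {
	var missing []int
	for i := 0; i < n; i++ {
		found := false
		for _, a := range tests {
			for _, b := range tests {
				if differsOnlyIn(a, b, i) && f(a) != f(b) {
					found = true
				}
			}
		}
		if !found {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// Happy path plus a tampered token: both branches hit, weak MC/DC.
	suite := [][]bool{{true, false, false}, {false, false, false}}
	fmt.Println("conditions never forced to matter:", uncovered(decision, 3, suite))

	// Add an expired token and a replayed nonce.
	suite = append(suite, []bool{true, true, false}, []bool{true, false, true})
	fmt.Println("conditions never forced to matter:", uncovered(decision, 3, suite))
}
```

With only the first two cases, the expiry and nonce conditions have no independence pair, so either check could be deleted without a test noticing. Four cases close the gap for three conditions, which is the usual n+1 minimum.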

Proof works in both directions. From spec to code: if your requirement says “user can log in,” Proof attaches sub-requirements for password reset, timing attacks, enumeration, lockout, dependency failures — and the CI check literally fails if a required item has no test attached. If you decide not to handle one of them, fine — mark it as a known issue, attach an expiry, and it stays visible instead of disappearing. From code to spec: static patterns scan for signals (HTTP client, goroutine, database call, queue), and when one is found, Proof asks the spec whether you have described what happens when the service is slow, down, or partially successful. If the spec is missing, the code is shouting at the spec.
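The spec-to-code gate can be pictured with a small sketch. This is not Proof's actual implementation or data model; the Item type, its fields, and the sample requirement IDs are assumptions made for illustration. The rule it encodes is the one described above: an item passes if it has a test, or if it is a known issue whose expiry has not passed.

```go
package main

import (
	"fmt"
	"time"
)

// Item is one sub-requirement attached to a parent requirement (hypothetical shape).
type Item struct {
	ID          string
	Tests       []string  // names of tests that claim to cover the item
	WaivedUntil time.Time // zero value means no known-issue entry
}

// day builds a UTC date; it keeps the examples short.
func day(y, m, d int) time.Time {
	return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC)
}

// failures returns the items that should fail the build: no test attached and
// no known-issue entry that is still valid at now.
func failures(items []Item, now time.Time) []string {
	var out []string
	for _, it := range items {
		switch {
		case len(it.Tests) > 0:
			// covered by at least one test
		case it.WaivedUntil.IsZero():
			out = append(out, it.ID+": no test and no known-issue entry")
		case !now.Before(it.WaivedUntil):
			out = append(out, it.ID+": known-issue entry expired")
		}
	}
	return out
}

func main() {
	items := []Item{
		{ID: "login/lockout", Tests: []string{"TestLockoutAfterFiveFailures"}},
		{ID: "login/timing-attack", WaivedUntil: day(2026, 3, 1)},
		{ID: "login/enumeration"},
	}
	for _, f := range failures(items, day(2026, 6, 1)) {
		fmt.Println("FAIL", f)
	}
}
```

The expiry is what keeps a conscious "not now" from quietly turning into "never": once the date passes, the gap fails the build again and someone has to decide a second time.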

Over time the spec stops being a static document and becomes a living source of truth — in my case, a graph of small interconnected requirements that I treat as more authoritative than the code itself, because the code can drift and the spec is the contract. Spec links to code, code to tests, tests back to spec. If one of them changes, the link becomes suspect, and you have to look again.

This does not replace human security work. It just makes the questions impossible to skip.

Security is everyone’s problem now

Not every company needs to copy bank-level process. That would kill a lot of teams. But every software company needs to copy the posture. Security is not something you bolt on at the end. Your moat is no longer the software you build, or the market expertise, or being first. Your moat is being something people can trust to depend on for a long time.

So assume the internet will find you. Assume your dependencies are not safe by default. Assume one developer tool can become the entry point. Assume your source code has already leaked. Assume attackers can use AI to understand your system faster than you can.

Then ask what still protects you. Do you know which parts of the system can kill the company? Can you rotate secrets quickly? Do you have tests for the behaviours that matter, not just the lines that execute? Do you know when code, tests, and specs drift apart?

Security by obscurity is what dies when attention gets cheap. That world is going away. The practical question now is simple. When someone looks closely at your system — with automation, with your source code in front of them, with AI, and with patience — what will they find?

And what will be your answer?

If any of this is hitting close to home and you want to see what putting security on autopilot looks like in practice, get in touch. I'd love to hear what your team is dealing with — and show you how Proof works on real code. reqproof.com

Source of truth: Code, Spec, or Requirement?

Leonid Bugaev — Thu, 14 May 2026 14:50:13 GMT

The code runs. The code breaks. The code is what production uses.

Specs and docs can help, but they often become stale. So we learned not to trust them too much.

Code is honest in a way documents are not. It may be wrong, but it does exactly what it does.

But I think there was another reason we trusted code so much.

For a long time, code was manual work. We wrote it ourselves. We spent time with it. We were not only typing syntax; we were thinking through the system while writing it. The thinking and the implementation were almost the same activity.

That is why code deserved this level of trust. Not because code was perfect. It was not. But because the code carried a lot of the human judgement that produced it.

With agentic coding, this becomes more complicated.

As an individual contributor, I can move incredibly fast now. I can open Claude Code or a similar tool, give it direction, and shape the project while it is being built. I may start with a rough spec or just an idea, but the real decisions often happen during implementation. I try something. I see the code. I realise the original idea was not quite right. I adjust. The agent tries another version.

This is not bad. Actually, it is one of the best parts of working this way. Implementation gives feedback. Sometimes the code teaches you what the spec should have been.

So code-first is very tempting. It keeps speed. It keeps flow. It works especially well when the person steering the agent has the whole picture in their head.

But that is also the problem.

In that case, the real source of truth is not the code and not the spec. It is the experienced person.

They know the edge cases. They know the dependencies. They know which interface is fragile. They know why some strange behaviour exists. They know when the generated code is technically correct but still wrong. They are continuously filling the gaps.

The agent is not really working from a complete spec. It is working with a human who carries the missing context.

This works while the system fits inside one brain.

But real systems do not stay like this. They grow. They get delegated. Teams split. People leave. New people join. Some engineers understand the product but not the architecture. Some understand the architecture but not the domain. Some are junior. Some are moving fast. And the agent only knows what we gave it, plus whatever it can infer.

At that point, memory becomes a bad source of truth.

We forget dependencies. We forget edge cases. We forget why something was built in a strange way. We forget which downstream system depends on a behaviour. Not because people are bad, but because we are humans.

Agents do not magically solve this. If something is not described, they will improvise. They will choose something plausible. And plausible is often enough to pass the first review.

The same is true for humans. If something is not specified, we should not expect it to behave exactly as we imagined.

This is where I think the word “spec” is not enough.

In many software teams, a spec is a temporary artifact. You write it to start the task. It helps the engineer or the agent. Then implementation happens, some things change, and the spec is effectively dead. Maybe it still exists in Notion, Linear, Jira, GitHub, or a markdown file. But nobody really trusts it six months later.

If the spec is temporary, of course code wins.

But maybe the better word is requirement.

And this is not a new idea. This is basically how regulated industries already work. In aerospace, medical, automotive, and similar places, requirements are normal. They can produce multi-hundred-page documents explaining the whole system, but usually it is not really one big document in the simple sense. It is a set of requirements, sub-requirements, interfaces, tests, evidence, and links. A graph that can be turned into a document when needed.

That difference matters.

A spec often says: here is what we want to build.

A requirement says: here is what the system must do, what it must not do, how we know it was implemented, and where the evidence is.

The requirement does not die when the task is done. The code implements it. The test proves it. The evidence is attached to it. If something changes, the requirement becomes part of the review again.

This is the part I think normal software engineering may need to borrow, but without copying all the heaviness.

Not because we suddenly want bureaucracy. I don’t. But because agentic engineering makes implementation faster, and the faster we implement, the easier it becomes to lose intent.

The dangerous drift is not always a bug. The code can work. The tests can pass. CI can be green. But the product intent may have moved a little. The domain behaviour may not be exactly right. A security assumption may be weaker. An interface may have changed in a way nobody noticed.

Everything is green, but it is green around slightly wrong intent.

I also do not think the answer is just “write a huge spec first.” That can fail too. A detailed spec can make wrong assumptions before reality pushes back. Then implementation starts, the spec is challenged, and now you have spec drift instead of code drift.

So for me the real question is not code-first or spec-first.

The real question is how we manage the drift between intent and implementation.

If the code does something different from the requirement, it should not silently become the new truth. But if the requirement is wrong, it should not block reality forever either. There should be a stop. The team should ask: is the code wrong, is the requirement wrong, or did we learn something?

Until that is resolved, something is broken in the process.

This is where traceability matters. Not as documentation for documentation’s sake, but as an invalidation mechanism.

If code changes, the related requirement and tests should become suspicious. If a test changes, the requirement should be checked. If a requirement changes, the code and tests should definitely be reviewed. The system should say: this thing changed, so these other things are no longer fully trusted.
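To make the invalidation idea concrete, here is a minimal sketch in Go. The `Graph` type, its methods, and the artifact IDs are all hypothetical, invented for illustration; no existing tool is being described. It only shows the core rule: when one artifact changes, everything linked to it stops being trusted until someone reviews it.

```go
package main

import (
	"fmt"
	"sort"
)

// Graph links requirements, code, and tests. All IDs used here are hypothetical.
type Graph struct {
	links   map[string][]string // undirected trace links between artifact IDs
	suspect map[string]bool     // artifacts whose evidence is no longer trusted
}

func NewGraph() *Graph {
	return &Graph{links: map[string][]string{}, suspect: map[string]bool{}}
}

// Link records that two artifacts trace to each other.
func (g *Graph) Link(a, b string) {
	g.links[a] = append(g.links[a], b)
	g.links[b] = append(g.links[b], a)
}

// Changed marks every artifact directly linked to id as suspect.
// The changed artifact itself is not marked: it is the thing under review.
func (g *Graph) Changed(id string) {
	for _, n := range g.links[id] {
		g.suspect[n] = true
	}
}

// Suspects returns the suspect artifact IDs in sorted order.
func (g *Graph) Suspects() []string {
	out := make([]string, 0, len(g.suspect))
	for id := range g.suspect {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func main() {
	g := NewGraph()
	g.Link("REQ-SESSION-1", "store.go:CanUseSession")
	g.Link("REQ-SESSION-1", "store_test.go:TestCanUseSession")
	g.Link("store.go:CanUseSession", "store_test.go:TestCanUseSession")

	g.Changed("store.go:CanUseSession")
	fmt.Println(g.Suspects()) // [REQ-SESSION-1 store_test.go:TestCanUseSession]
}
```

A real system would also need a way to clear the suspect flag after review, and probably transitive propagation. This sketch stops at direct links on purpose.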

This is also why evidence matters.

I do not trust humans. I do not trust AI either. I want the evidence.

A green CI run is useful, but evidence of what? That the code passes current tests? Or that the system still matches the original intent?

Those are not the same thing.

In larger projects, the original intent becomes a spec, then tasks, then subtasks, then test cases. Every step narrows the focus. Everyone implements their piece. Everyone tests their piece. Locally, everything can look correct. But who validates the final system against the original intent?

Usually not systematically.

This is the part I want to think more about. Maybe requirement management, in some lighter and more modern form, becomes much more important for normal software teams. Not because requirements are new, but because agentic development changes the cost of not having them.

Code is still the runtime truth. It tells us what the system does.

But the requirement is the intent truth. It tells us what the system is supposed to mean.

And evidence is what connects them.

I do not have the full answer yet. I still do not know how detailed requirements should be before they become harmful. I do not know which parts should be formal and which should stay flexible. I do not know how to make traceability work inside Git, PRs, CI, and agentic coding tools without making everyone hate it.

But I think the direction is becoming clearer.

In the old world, code was expensive to produce, so code naturally became the main asset. In the agentic world, code may become cheaper to produce, but intent becomes easier to lose.

And if intent is the thing we can lose, maybe intent is the thing we need to manage much more seriously.

Trust Is the Bottleneck

Leonid Bugaev — Tue, 05 May 2026 16:01:15 GMT

Everyone is asking the same question now: if AI can help us create much more code, why aren’t engineering teams suddenly moving much faster?

I think the question is right, but the answer usually stops too early.

AI does make some things dramatically faster. MVPs are faster. Prototypes are faster. The time to validate an idea is reduced a lot. You can explore directions that previously weren’t worth the effort. This is real, and I don’t want to pretend otherwise. But creating the first version of something isn’t the same as maintaining a product, and creating more pull requests isn’t the same as creating more trusted change.

This is where the economics breaks. If your team can create ten times more pull requests, your product doesn’t automatically move ten times faster. Your company doesn’t become ten times faster. The economy doesn’t double or triple. Because the expensive part of mature engineering was never only the typing of code.

The expensive part is trust.

Can I trust this change? Does it match the intent? Does it break a hidden customer flow? Does it affect backwards compatibility? Are the docs updated? Are the tests proving the right thing? Did we think about security, performance, malformed input, error states, release notes, migration, support?

A pull request doesn’t answer all of this.
A pull request is just something asking to be trusted.

So I don’t think the interesting question is “can AI create more code?” It can. The interesting question is: what needs to exist around the code so we can safely absorb more change?

If we can scale trust, we can unlock the real scaling of AI. But not by sending maintainers ten times more PRs. That only moves the bottleneck. What I want is a pull request that comes with enough context that I can actually believe it: why this change exists, what it affects, which tests prove it, which docs changed, what can break, and what still needs a human decision.

If that is the kind of engineering problem you care about, subscribe.

If your trust model is green CI, you are in trouble

AI isn’t going back in the box. Even if you personally didn’t join the hype train, people around you probably did. Engineers use it to write code. PMs use it to write specs. Someone asks it to validate the plan, then write the code, then write the tests, then check everything against the same plan again.

I do it too. I ask AI to help me write the spec. Then I ask it to validate the plan. Then I ask it to write the code. Then validate its own code. Then write the tests. Then check everything against the original plan. It’s tempting because it works surprisingly well. In a lot of cases, it feels almost magical.

But this is exactly why it becomes dangerous.

For a long time, our basic engineering trust model was something like this: write the code, write the tests, pass CI/CD, review the pull request, ship. It was never perfect, but it was much better than nothing. Green CI never meant the product was correct. It meant the code passed the checks we had.

The problem is that those checks don’t prove intent. They don’t prove the requirement was correct. They don’t prove the tests were testing the right thing. They don’t prove the documentation was complete. They don’t prove the change matched the real product behavior we needed.

They prove that the current artifacts agreed with the current checks.

With AI, the whole chain can be generated. The spec can be wrong, the code can follow the wrong spec, the tests can validate the wrong code, the docs can describe the wrong behavior, and CI can still be green. Everything agrees with everything, but the intent is wrong.

That’s not trust. That’s a consistent mistake.

This is why high coverage isn’t enough either. In my previous article about jsonparser, the painful part wasn’t that I had no tests. I had near-100% coverage in the area that mattered. The problem was that malformed input behavior was never properly described. So the tests proved what existed, not what should have existed.

You cannot test what you never described.

Security makes this even less optional. For years, many teams survived with some quiet version of security by obscurity. Not officially, of course. Everyone says security matters. But in practice, a lot of software depended on nobody looking too closely, or on attackers moving slowly enough that maintainers had time to react.

That assumption is breaking. VulnCheck reported that in the first half of 2025, 32.1% of known exploited vulnerabilities had exploitation evidence on or before the day the CVE was issued. This doesn’t mean every vulnerability becomes an exploit in hours, but it does mean the old time cushion isn’t something you can build your product around anymore.

So things that felt optional before become normal engineering requirements: malformed input, authorization boundaries, resource limits, timeout behavior, error states, data exposure, public API behavior. These aren’t enterprise extras. They’re product requirements.

This is the uncomfortable part: the trust problem is now everyone’s problem. Even if your company hasn’t “adopted AI,” your people probably have. Even if your CI is green, it may be green against the wrong intent. Even if your coverage is high, it may cover the behavior you remembered to describe, not the behavior the product actually needs.

So we need a different source of truth. Not instead of CI/CD, not instead of tests, not instead of code review. Above them. Something that says what the system is supposed to do, which obligations apply, what evidence proves them, and what becomes suspicious when something changes.

Otherwise AI won’t only help us move faster.
It will help us move faster with a false feeling of safety.

The outside structure is not the product

I know this problem from open source. For at least the last 12 years, I have worked a lot in open source. I have had my own popular open source projects, and today at Tyk we build an open source API Gateway.

Open source is hard. Not because people are bad. Usually it’s the opposite. Someone from the outside sends you a pull request. Maybe it’s a bug fix. Maybe a new feature. Maybe it’s useful. Maybe it’s technically correct. Maybe they spent their evening on it.

But as a maintainer, you still need to get inside the context. You need to understand what’s happening and why this person is doing it. You can be fast and accept too much, or stay picky and make people unhappy. Neither option really solves the trust problem.

The real issue isn’t that contributors are bad. The issue is that they see the outside structure. They see the code. Maybe they see the tests. Maybe they see the docs. But they don’t see the intent in the same way the owner of the project sees it. They don’t know all the small product promises made over the years. They don’t know which ugly thing is accidental and which ugly thing is load-bearing. They don’t know which customer flow depends on some behavior that looks strange from the outside.

They are not inside of this bubble.

And this isn’t only open source. The same thing happens inside a company. Someone from support knows the product very well. They see customer pain every day. They may even be technical enough to raise a pull request. Someone from solutions architecture can do the same. Another team can contribute to your service. AI makes all of this easier.

But internal doesn’t automatically mean trusted.

A support engineer may understand the product from the customer side, but not the architecture. Another team may understand code, but not the local history. AI may generate something that looks clean, but it has no real ownership unless someone gives it context and checks it.

These contributions can become shallow. Not useless. Shallow. They touch the visible layer of the system, but they aren’t backed by the deep intent of the people who own this part of the product.

We tried relaxing quality gates a few times. More people contributing sounds obviously good, especially when every company has more backlog than humans. But we had cases where a simple line, a simple fix, broke everything. We had other cases where the fix was so big in scope that it was too dangerous to merge.

The conclusion wasn’t “only engineers can write code.”
The conclusion was: if you want to scale engineering, it’s always about trust.

This is also why “move fast” changes meaning when you have customers. When you’re still searching for an MVP, you can break things and call it learning. But when customers put your product inside their infrastructure, the product is no longer fully yours. They pay you for stability, security, and predictable behavior. In a way, you give away part of the ownership.

At Tyk, this is very real. We build software used by banks, governments, and large enterprises. Quality assurance isn’t some internal slogan. It’s part of the relationship with customers. All software has bugs; I don’t want to pretend otherwise. But the price of a bug isn’t the same everywhere. Sometimes it’s legal. Sometimes it’s regulatory. Sometimes it’s very big money. Forget even the money for a second: what if the bank goes down? It can become a national-level issue.

Speed isn’t how quickly you can make a change.
Speed is how quickly you can safely absorb change.

Lehman’s software evolution work has a phrase that fits here: “The safe rate of change per release is constrained by the process dynamics.” In the same passage, he says that as the number, size, and architectural distance of changes increase, complexity and fault rate grow more than linearly.

This sounds academic, but it matches product reality. You can move only as fast as your safety norms allow. Your current team, your architecture, your customer base, your process, your quality gates, your review culture — all of this defines your real speed.

If AI gives you more change than your trust system can absorb, you aren’t scaling engineering. You’re scaling incoming work.

Temporary specs become archaeology

One of the deeper problems is how we treat specifications in consumer engineering.

Most of what we call a specification is a temporary artifact. You start with all the best practices. Maybe an RFC. Then it becomes a detailed Jira ticket. Maybe later there is an ADR. There are comments in GitHub. A Slack thread. A Confluence page. A few decisions made during review because reality was different from the original assumption.

In the moment, this feels normal. This is how software gets built. But after some time, all these artifacts pile up. If you want to understand how a component works, you need to dig through history. You need to understand why it ended up in this final state. Why this was done and not that. You may be lucky and find the exact explanation. In most cases it’s lost in someone’s head.

This is archaeology, not development.

The bigger problem is that these artifacts are independent. The RFC isn’t connected to all the code. The Jira ticket isn’t connected to all the tests. The docs are scattered across ten pages. The final implementation isn’t connected back to the original assumptions. It’s not a graph.

So we trust a person — engineer, architect, lead, PM — to hold the high-level picture in their head. We trust them to find all dependencies. We trust them to notice backwards compatibility issues. We trust them to know which docs need updating. We trust them to remember which customer flow can break.

And of course people forget. Not because they’re careless. Because this is too much context for one person to carry.

You fix a bug and break some other flow. You build a feature and forget a dependency with another service. You update two documentation pages and miss the other eight. The feature exists, but it’s unusable for one group of users. The implementation works, but not in the real production shape.

The spec was supposed to create clarity.
But because it was temporary, it becomes one more historical artifact.

This is also why spec-first isn’t enough. Spec-driven development is better than no spec. Planning before coding is obviously better than jumping into implementation. But if the spec is still treated as a temporary artifact, after a few iterations you end up in the same position, with intent chaos.

During development, the spec always changes. You start with assumptions. Researched assumptions, but still assumptions. Then implementation begins and reality appears. The architecture doesn’t work. A limitation appears. A reviewer notices a security issue. QA finds a case. A customer dependency changes the direction. And in many teams, the spec isn’t updated.

The real knowledge moves into GitHub comments, Slack messages, review threads, and people’s heads. The Jira ticket becomes stale. The implementation says one thing. The ticket says another.

Imagine someone from QA comes back from vacation and needs to test the feature. They see the ticket. They see the implementation. They have no idea what is happening. Why this? Why not that? Is this intended?

It’s so common. I bet a lot of you feel the same.

How can I know what I don’t know?

A lot of bugs are not in the code first. They are in the missing specification.

Have you actually described what should happen if the input is malformed? Have you described that this functionality must not allow SQL injection? What happens if the third-party service times out? What is the error state? Have you described authorization boundaries? Resource limits? Performance boundaries? What happens when something goes from ten requests per second in a test environment to thousands in production?

There are also more subtle cases. Concurrency. Non-deterministic behavior. Map iteration. Merge order. I’m looking at you, Go.

Have you described that the behavior should be deterministic? Did you write a test for it? Did the test prove the requirement, or did it just execute the code?
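As a small illustration of what "describe determinism, then test it" can look like in Go, here is a sketch. `mergeHeaders` and `isDeterministic` are hypothetical names made up for this example. The function sorts map keys so its output cannot depend on Go's randomized map iteration order, and the helper checks that repeated calls agree.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeHeaders is a hypothetical function whose output must not depend on
// map iteration order. Sorting the keys first makes it deterministic.
func mergeHeaders(h map[string]string) string {
	keys := make([]string, 0, len(h))
	for k := range h {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + h[k] + ";")
	}
	return b.String()
}

// isDeterministic calls f repeatedly and reports whether every result
// matched the first one.
func isDeterministic(f func() string, runs int) bool {
	first := f()
	for i := 1; i < runs; i++ {
		if f() != first {
			return false
		}
	}
	return true
}

func main() {
	h := map[string]string{"b": "2", "a": "1", "c": "3"}
	fmt.Println(mergeHeaders(h)) // a=1;b=2;c=3;
	fmt.Println(isDeterministic(func() string { return mergeHeaders(h) }, 100))
}
```

Repeated runs can expose non-determinism but cannot prove its absence. The stronger evidence is the requirement itself ("output is ordered by key") plus a test that asserts the exact expected string.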

This is where checklists, obligations, processes, discipline, and all the boring stuff come in. I know people hate boring process. I hate fake process too. Documents nobody reads. Boxes people tick after the fact. Quality theatre.

But the useful version is different. An obligation is not a test case. It’s a category of behavior you are required to describe: malformed input, boundary behavior, error handling, access denied, determinism, idempotency, atomicity, nil safety, overflow safety, encoding safety.

The obligation doesn’t tell you the answer.
It forces you to ask the question.

That is why I like it. It turns “maybe someone remembers” into a deterministic process. The checklist itself is human judgment. But checking whether the spec covered the checklist can be mechanical.
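The mechanical part can be very small. The sketch below assumes a hypothetical `Spec` record that lists which obligation categories a requirement explicitly addresses; the type, the function, and the category names are illustrative, not the format of any real tool.

```go
package main

import (
	"fmt"
	"sort"
)

// Spec is a hypothetical requirement record: an ID plus the obligation
// categories it explicitly addresses (even if the answer is "not applicable").
type Spec struct {
	ID        string
	Addressed map[string]bool
}

// MissingObligations returns the checklist entries the spec never mentions,
// in sorted order.
func MissingObligations(checklist []string, s Spec) []string {
	var missing []string
	for _, ob := range checklist {
		if !s.Addressed[ob] {
			missing = append(missing, ob)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	checklist := []string{"malformed-input", "boundary", "error-handling", "determinism"}
	spec := Spec{
		ID:        "REQ-PARSE-INT",
		Addressed: map[string]bool{"boundary": true, "error-handling": true},
	}
	// Prints: REQ-PARSE-INT missing: [determinism malformed-input]
	fmt.Println(spec.ID, "missing:", MissingObligations(checklist, spec))
}
```

Choosing what goes on the checklist stays a human decision. The check only reports which questions nobody answered.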

This is also where AI can help without pretending to own the product. If the code uses goroutines, the system can ask where cancellation, lifecycle, and error propagation are described. If code depends on map iteration or merge logic, it can ask whether determinism or commutativity matters. If code reads time directly, it can ask whether time is part of the behavior and how this is tested. If code changes a public API, it can ask where compatibility and documentation obligations are.

This isn’t AI judging architecture taste. It’s tooling surfacing missed questions.
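Here is a rough sketch of how code shape could surface such questions, using Go's standard `go/parser` and `go/ast` packages. The `Questions` function and its two rules are my own illustration under simplifying assumptions: it is purely syntactic, so it misses aliased imports, and detecting map iteration would need type information that this sketch does not load.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Questions parses Go source and returns spec questions suggested by its shape.
func Questions(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, 0)
	if err != nil {
		return nil, err
	}
	var qs []string
	ast.Inspect(file, func(n ast.Node) bool {
		switch node := n.(type) {
		case *ast.GoStmt:
			line := fset.Position(node.Pos()).Line
			qs = append(qs, fmt.Sprintf("line %d: goroutine started; where are cancellation and error propagation described?", line))
		case *ast.CallExpr:
			// Match only the syntactic form time.Now().
			sel, ok := node.Fun.(*ast.SelectorExpr)
			if !ok {
				return true
			}
			if pkg, ok := sel.X.(*ast.Ident); ok && pkg.Name == "time" && sel.Sel.Name == "Now" {
				line := fset.Position(node.Pos()).Line
				qs = append(qs, fmt.Sprintf("line %d: reads the clock; is time part of the specified behavior?", line))
			}
		}
		return true
	})
	return qs, nil
}

func main() {
	src := `package demo

import "time"

func run(work func()) time.Time {
	go work()
	return time.Now()
}
`
	qs, err := Questions(src)
	if err != nil {
		panic(err)
	}
	for _, q := range qs {
		fmt.Println(q)
	}
}
```

The output is a list of questions, not findings. Whether cancellation or time actually matters for this code is still for the owner of the requirement to decide.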

That is the “how can I know what I don’t know” loop. Spec obligations force code and test evidence. Code shape can reveal missing spec questions.

What regulated industries got right

Consumer engineering and regulated engineering live in different worlds. Different tools. Different conferences. Different language. Some of it is archaic. Some of it is bureaucracy. I don’t want every SaaS team to become an avionics certification team.

But we shouldn’t ignore what they learned.

I expected to find paperwork. Annoyingly, I found a lot of things our world forgot to learn.

In aviation, automotive, medical devices, space systems, the spec isn’t treated as a temporary note. It’s a source of truth that lives together with the software. Requirements have IDs. They have layers. They are linked to documentation, tests, implementation, verification evidence. You can see blast radius. You can see what a change affects. During review, if implementation differs from the spec, the spec must be updated.

The useful idea is not the paperwork.
The useful idea is that intent is durable, traceable, and connected to evidence.

NASA’s FRET is one example of this direction. It lets users enter hierarchical system requirements in structured natural language, gives those requirements unambiguous semantics, and can show them as natural language, formal logic, diagrams, and interactive simulation.

That doesn’t mean every product team needs FRET or formal methods everywhere. It means the requirement is not just a document. It’s something you can analyze, link, verify, and keep alive.

This is where requirement management becomes interesting again for consumer engineering. Not the old heavy version copied blindly from regulated industries. Not paperwork for paperwork. But the useful part: a source of truth, cross-links, invalidation, traceability, and evidence.

Combined with everything consumer engineering learned over the years: CI/CD, pull requests, fast feedback, developer experience, automated tests, observability, docs, release automation.

We should not throw away modern engineering.
We should add the missing trust layer.

From pull request to evidence pack

Today a pull request usually gives me code, maybe tests, maybe a description. But it doesn’t give me the whole chain.

It doesn’t tell me the original intent. It doesn’t tell me which obligations apply. It doesn’t show the blast radius. It doesn’t show which docs changed or should have changed. It doesn’t show which specs this conflicts with. It doesn’t show what changed during implementation compared to the plan.

So the reviewer has to reconstruct all of that.

Again, archaeology.

What I want instead is an evidence pack. Not enterprise theatre. Not documents for the sake of documents. A practical package that makes the change reviewable.

Here is the intent. Here are the requirements. Here are the obligations. Here are the tests that witness them. Here are the docs. Here is the blast radius. Here is how it aligns with existing specs and where we checked for conflicts. Here is what changed during implementation. Here is what still needs human judgment.

Then the pull request isn’t only code. It’s the full chain of development.
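One way to picture an evidence pack is as a data structure that can report its own holes. The sketch below is hypothetical: the `EvidencePack` fields and the `Gaps` rules are illustrative choices, not the format used by Proof or any other tool.

```go
package main

import "fmt"

// EvidencePack is a hypothetical shape for what travels with a pull request.
type EvidencePack struct {
	Intent        string
	Requirements  []string            // requirement IDs the change implements
	Witnesses     map[string][]string // requirement ID -> tests that witness it
	DocsChanged   []string            // documentation pages touched by the change
	BlastRadius   []string            // components the change can affect
	OpenQuestions []string            // decisions left for a human reviewer
}

// Gaps lists what a reviewer would still have to reconstruct by hand.
func (p EvidencePack) Gaps() []string {
	var gaps []string
	if p.Intent == "" {
		gaps = append(gaps, "no stated intent")
	}
	for _, req := range p.Requirements {
		if len(p.Witnesses[req]) == 0 {
			gaps = append(gaps, "no test witnesses "+req)
		}
	}
	if len(p.BlastRadius) == 0 {
		gaps = append(gaps, "blast radius not stated")
	}
	return gaps
}

func main() {
	pack := EvidencePack{
		Intent:       "Reject replayed session tokens",
		Requirements: []string{"REQ-1", "REQ-2"},
		Witnesses:    map[string][]string{"REQ-1": {"TestReplayRejected"}},
		BlastRadius:  []string{"session store"},
	}
	for _, g := range pack.Gaps() {
		fmt.Println("gap:", g) // gap: no test witnesses REQ-2
	}
}
```

An empty gap list does not mean the change is correct. It means the reviewer can spend their attention on judgment instead of reconstruction.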

This matters for open source. It matters for support engineers contributing fixes. It matters for other internal teams. It matters for AI agents. You don’t trust the contributor blindly. You don’t trust AI blindly. You trust the evidence chain, and then you still apply human judgment where judgment is needed.

This will feel slower at first. Writing obligations is slower than writing a vague ticket. Linking tests to requirements is slower than writing random tests. Updating docs through the graph is slower than pushing a change and hoping someone remembers. But not all friction is bad.

The question is whether the friction creates trust.

Bureaucracy gives you friction without trust.
Evidence gives you friction that lets more people move safely.

If this trust exists, then AI can actually help us scale. Not by dumping more pull requests into the same review bottleneck, but by making more changes reviewable, traceable, and safe to absorb in parallel.

Without trust, maintainers become managers of incoming things. Instead of thinking about architecture, future, and vision, they review an endless stream of pull requests, fixes, and generated artifacts.

That is not the scaling I want.

Why I am building Proof

This is why I am building Proof.

I don’t want another tool whose main purpose is to create more code. We already have many of those. The problem isn’t that we can’t produce enough artifacts. The problem is that the artifacts don’t preserve intent.

I want specs to stop being temporary. I want requirements to live with the software. I want obligations to force the boring questions before they become production bugs. I want code, tests, docs, and requirements to invalidate each other when they drift. I want a reviewer to see the evidence chain instead of rebuilding it from memory.

AI will make engineering faster. That part is already happening. But faster without trust is not enough.

For me, the real question is this: how can I end up in the position where it’s not just a pull request coming from someone from the outside, but a well-thought-out evidence pack that makes me believe I can merge it as soon as possible?

That is the scaling I care about.

Not just more code.

More trusted change.

I Had Near 100% Test Coverage. It Didn’t Matter.

Leonid Bugaev — Wed, 29 Apr 2026 17:06:16 GMT

I woke up and saw a wall of emails in my personal account. Then I logged into my corporate Slack, and it was filled with Zendesk messages from customers. Everyone was looking for me.

jsonparser, the library I wrote and that a lot of projects use, had just received its very own public CVE. So everyone who saw it in their scanners started freaking out.

“So this is what fame is,” was my first thought.

Then I remembered some notifications I had kept ignoring from Google’s OSS-Fuzz project, which I had signed up for several years earlier.

This lib was written in the pre-AI-agents era (so weird to say that now!). Every piece was handcrafted manually, using best practices, with full test coverage.

I checked the function which had the issue, and it literally had near 100% test coverage. But it did not matter, because the issue was in the handling of malformed input: an edge case I had missed. In other words, the issue was in the specification of what this function should do and how it should behave in edge cases.

But it opened one more can of worms. I wrote this library about 6 years ago. I don’t remember anything. And my only source of truth is the code and the tests, which are rather cryptic. Reading them feels more like archaeology.

The issue is fixed now. But how do I prevent such issues happening in the future? And if 100% code coverage is not the answer, what is? And what is my source of truth?

So I started digging. And it went way deeper than I expected, and changed the way I look at software engineering forever.

Down the rabbit hole

I started thinking about what the gold standard of software quality is. My first answer was NASA. How does NASA solve these kinds of issues?

AI now produces so much code that I feel like I am losing ownership of it. Not only of the code. Of the intent.

I wanted to understand how people work when tests passing is still not enough and the price of being wrong is huge.

The surprising thing is that a lot of NASA’s work is public. Their software engineering requirements are public. FRET is public. Kind2 is public. A lot of the case studies are public. There are papers about aircraft, Mars rovers, superconducting magnets, and formal requirements that found bugs before code existed.

I started reading all of this not as an academic exercise, but because I had a very dumb practical problem: my tests were green, my coverage looked fine, and still one missed edge case was enough to create a public CVE.

Then I went deeper into automotive and aerospace. It opened a whole new world of software engineering for me. For some reason, our world of consumer software engineering and regulated software engineering in those industries almost do not intersect. Different tools, different conferences, different language. Sometimes it feels like they live in a parallel universe.

Some of it looks archaic. Some methodologies are weird.

Our engineering progressed a lot too. We got very good at moving fast and catching damage quickly. CI/CD, linters, tests, canaries, observability, rollbacks. I don’t want to pretend every SaaS product should behave like avionics certification.

But we optimized for speed. They optimized for evidence.

Their industry spends much more time asking what evidence they need before they are allowed to trust the change. Some of it is painful. Some of it is bureaucracy. But the idea underneath is not stupid: if you claim the system should behave in some way, you need a durable chain from that statement to tests, code, and evidence.

There is real proof there, but it is not the fantasy version I had in my head, where every line of every product is mathematically proven end-to-end. They prove specifications. They use model checking. They simulate models, like with Simulink, against many input/output cases. They measure structural coverage. They use formal proof where the criticality justifies it.

And they still use testing, code review, static analysis, and all the normal engineering work around it. The difference is that proof and evidence are attached to the parts where being wrong is not acceptable.

That actually made the idea useful for normal engineering.

This is a huge topic, which I will cover in future articles. But the first concrete thing I found was MC/DC. It is one of the ways safety-critical industries look at coverage, and it made standard line coverage look very weak to me.

Line coverage says a line was touched at runtime. It does not say that the decision was tested.

Why 90% line coverage can still mean 60% real coverage

I still use line coverage. I still look at it.

But line coverage is bullshit. You should not trust it. Not on its own.

In Go, when you run:

go test -cover ./...

you mostly get statement coverage. The tool tells you whether a statement executed during the test run. That’s useful. But it doesn’t tell you whether the decision was tested.

Take a tiny parser-style example:

func isDigit(c byte) bool {
	return c >= '0' && c <= '9'
}

Now test it like this:

func TestIsDigit(t *testing.T) {
	if !isDigit('5') {
		t.Fatal("5 should be a digit")
	}
	if isDigit('x') {
		t.Fatal("x should not be a digit")
	}
}

Looks fine. The line ran. The function returned true once. The function returned false once. Your coverage report can look perfect.

But what did you actually prove?

You tested '5'. You tested 'x'. You didn’t prove either boundary. You didn’t prove that '/' fails because it’s before '0'. You didn’t prove that ':' fails because it’s after '9'.

The line is covered. The boundary is not.
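A sketch of what pinning the boundary looks like, written as a runnable program rather than a `_test.go` file so it stands alone. The `boundaryCases` table and the `failures` helper are my own illustrative names; in a real repository the same table would sit inside a `func TestIsDigit(t *testing.T)` and call `t.Errorf`.

```go
package main

import "fmt"

func isDigit(c byte) bool {
	return c >= '0' && c <= '9'
}

// boundaryCases pins both edges of the range: the last byte outside and
// the first byte inside, on each side.
var boundaryCases = []struct {
	in   byte
	want bool
}{
	{'/', false}, // one below '0'
	{'0', true},  // lower boundary
	{'9', true},  // upper boundary
	{':', false}, // one above '9'
}

// failures returns a message for every case where isDigit disagrees
// with the table.
func failures() []string {
	var out []string
	for _, tc := range boundaryCases {
		if got := isDigit(tc.in); got != tc.want {
			out = append(out, fmt.Sprintf("isDigit(%q) = %v, want %v", tc.in, got, tc.want))
		}
	}
	return out
}

func main() {
	fmt.Println(len(failures()), "failures") // 0 failures
}
```

Change `>=` to `>` or `<=` to `<` in `isDigit` and this table fails, while the '5' and 'x' test keeps passing.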

MC/DC stands for Modified Condition/Decision Coverage. It asks the question line coverage does not ask: did each condition independently affect the outcome?

When your code says if a && b, line coverage tells you the if was hit. MC/DC asks whether a alone can change the result, and whether b alone can change the result.

For this line:

return c >= '0' && c <= '9'

there are two conditions:

c >= '0'
c <= '9'

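A simplified MC/DC table looks like this:

input   c >= '0'   c <= '9'   result
'5'     true       true       true
'/'     false      true       false   // lower bound matters
':'     true       false      false   // upper bound matters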

The table is just a way to say: these are the cases that matter. This is the part ordinary coverage does not force you to say.
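To show what "independently affect the outcome" means mechanically, here is a small sketch for the isDigit decision. It is not how any real MC/DC tool is implemented; `proven`, `vectors`, and the other helpers are hypothetical names. For each condition, it looks for two test inputs whose condition values differ only in that condition and whose results differ.

```go
package main

import "fmt"

// decision is the boolean structure of isDigit:
// conds[0] is c >= '0', conds[1] is c <= '9'.
func decision(conds []bool) bool { return conds[0] && conds[1] }

// conditionsOf evaluates both conditions for an input byte.
func conditionsOf(c byte) []bool { return []bool{c >= '0', c <= '9'} }

// differsOnlyAt reports whether a and b differ at index i and nowhere else.
func differsOnlyAt(a, b []bool, i int) bool {
	for j := range a {
		if (a[j] != b[j]) != (j == i) {
			return false
		}
	}
	return true
}

// proven reports, for each of n conditions, whether some pair of tests
// flips only that condition and also flips the decision.
func proven(tests [][]bool, n int) []bool {
	out := make([]bool, n)
	for i := 0; i < n; i++ {
		for _, a := range tests {
			for _, b := range tests {
				if differsOnlyAt(a, b, i) && decision(a) != decision(b) {
					out[i] = true
				}
			}
		}
	}
	return out
}

// vectors turns test inputs into condition vectors.
func vectors(inputs ...byte) [][]bool {
	out := make([][]bool, 0, len(inputs))
	for _, c := range inputs {
		out = append(out, conditionsOf(c))
	}
	return out
}

func main() {
	fmt.Println(proven(vectors('5', 'x'), 2))      // [false true]
	fmt.Println(proven(vectors('5', '/', ':'), 2)) // [true true]
}
```

With only '5' and 'x', the `c >= '0'` condition is never shown to matter: no test makes it false. Adding '/' closes that gap.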

This used to be mostly a safety-critical tooling conversation. DO-178C requires MC/DC for the highest-criticality aviation software. The tooling was expensive, slow, and hard for normal teams to justify.

That changed. GCC 14 has -fcondition-coverage. Clang 18 has -fcoverage-mcdc. Rust is moving in the same direction with richer branch and condition coverage work, even if I would not call Rust MC/DC stable yet. Go does not have native MC/DC support, so I ended up adding code-level Go MC/DC measurement to Proof, and we have been extending the same direction to JavaScript and TypeScript as well.

What aerospace and automotive had because they were slow and diligent is now becoming available to normal engineering teams because AI changed the economics. You don’t need a certification lab to ask a harder question about your tests. You also don’t need to apply all of this to the whole company on day one. Start with the part where wrong behavior actually hurts.

The jsonparser numbers weren’t subtle

After the CVE fix, I wanted to understand why my previous approach didn’t make this kind of missing behavior obvious enough.

So I applied the MC/DC and requirements approach to jsonparser in a later public PR: buger/jsonparser#281.

Again: this PR didn’t fix the original CVE. It was the follow-up work after the CVE fix. But it was not just a paperwork exercise. The hardening pass found and fixed more real issues and removed dead code that my previous process had not made obvious.

That was the uncomfortable part for me. I started by asking: what did my tests actually prove?

On the main branch before that work, ordinary Go statement coverage was already decent: 85.3%.

85.3% coverage isn’t bad. Most teams would see that and move on. But decision coverage told a different story: only 66% of decisions were fully covered, and only 69.2% of conditions were proven independently.

And the more interesting part: some functions already looked perfect by ordinary coverage.

Examples from the before state:

parseInt                     100% statement coverage
Unescape                     100% statement coverage
decodeSingleUnicodeEscape    100% statement coverage

But MC/DC still found missing independent-condition evidence:

bytes.go:21   parseInt missing proof for c < '0'
escape.go:148 Unescape missing proof for len(in) > 0
escape.go:47  decodeSingleUnicodeEscape missing proof for h1 == badHex
escape.go:47  decodeSingleUnicodeEscape missing proof for h2 == badHex
escape.go:47  decodeSingleUnicodeEscape missing proof for h3 == badHex

100% line coverage can still leave a condition unproven.

The code ran. The decision wasn’t tested.
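
To make that concrete, here is a minimal sketch. It is not the jsonparser source: isDigit is a hypothetical two-condition check in the same spirit as the parseInt finding, and isDigitMutant is the same function with the c < '0' condition deleted. If a test suite cannot tell the two apart, that condition has no independent evidence behind it.

```go
package main

import "fmt"

// isDigit is a hypothetical digit check (an assumption for illustration,
// not the actual jsonparser code).
func isDigit(c byte) bool {
	return !(c < '0' || c > '9')
}

// isDigitMutant is isDigit with the c < '0' condition removed.
func isDigitMutant(c byte) bool {
	return !(c > '9')
}

// distinguishes reports whether any input makes the original and the
// mutant disagree. If none does, the suite never forced c < '0' to matter.
func distinguishes(inputs []byte) bool {
	for _, c := range inputs {
		if isDigit(c) != isDigitMutant(c) {
			return true
		}
	}
	return false
}

func main() {
	weak := []byte{'5', 'x'}        // both outcomes hit, every statement runs
	strong := []byte{'5', 'x', '/'} // '/' sits just below '0' in ASCII

	fmt.Println(distinguishes(weak))   // false
	fmt.Println(distinguishes(strong)) // true
}
```

The weak suite reaches both outcomes and every statement, yet only the '/' input proves the lower bound does anything.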

The bug was in what I forgot to describe

Coverage does not paint the whole picture. Even MC/DC. The bug can still be in the spec.

That is what happened with jsonparser. It was a classic case: you are building something, moving forward, and not looking back. You don’t know what you don’t know. I did not think about what would happen if this edge case appeared. I think most of us do not think about it this way.

I did not have any specs driving development or anything that forced me to think about the edge cases before writing the code. So of course I did not test for them. You cannot test for what you never described.

Testing assumes the specification is correct. That is the NASA/formal-methods lesson that changed how I think about this. The hard part is not testing the implementation. The hard part is questioning the specification itself.

This is where I found two different questions that I had been mashing together.

The first question starts from my specification: if this is what I claim the system should do, which logical cases need to be witnessed?

Not the code. The intent.

NASA built an open-source tool called FRET (Formal Requirements Elicitation Tool) that lets you write requirements in structured English and translates them into formal logic.

FRET includes an algorithm called FLIP (FuLl Independence Pair). FLIP takes a formalized requirement and generates the minimum set of test cases proving each boolean variable independently affects the outcome. Not every possible combination. Just the ones that matter.

I still have to write the requirement. I still have to decide what malformed input, boundaries, errors, and edge cases mean. FLIP does not do that for me.

But once the requirement is formalized, FLIP tells me exactly which test cases that requirement needs.
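
The idea of an independence pair can be sketched with a brute-force search. To be clear, this is not FRET's FLIP algorithm, and it does not promise a minimum set in general. It is a naive illustration under my own assumptions: the requirement is already a boolean function, and allow below encodes the session check from the start of this article, with bit positions I picked for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// decision is a boolean function over n inputs packed into the low bits of row.
type decision func(row uint) bool

// independencePairs finds, for each variable, one pair of rows that differ
// only in that variable and produce different outcomes.
func independencePairs(n int, f decision) map[int][2]uint {
	pairs := make(map[int][2]uint)
	for v := 0; v < n; v++ {
		bit := uint(1) << uint(v)
		for row := uint(0); row < 1<<uint(n); row++ {
			if row&bit != 0 {
				continue // only start from rows where the variable is false
			}
			if f(row) != f(row|bit) {
				pairs[v] = [2]uint{row, row | bit}
				break
			}
		}
	}
	return pairs
}

// witnessRows flattens the pairs into a sorted list of unique rows.
func witnessRows(pairs map[int][2]uint) []uint {
	seen := map[uint]bool{}
	var rows []uint
	for _, p := range pairs {
		for _, r := range p {
			if !seen[r] {
				seen[r] = true
				rows = append(rows, r)
			}
		}
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i] < rows[j] })
	return rows
}

// allow encodes the session check: bit 0 = signature valid,
// bit 1 = expired, bit 2 = nonce seen (bit layout is an assumption).
func allow(row uint) bool {
	return row&1 != 0 && row&2 == 0 && row&4 == 0
}

func main() {
	rows := witnessRows(independencePairs(3, allow))
	fmt.Println(len(rows), rows) // 4 [0 1 3 5]
}
```

Three conditions have eight possible combinations, and four of them are enough here: the allow row plus one row per condition showing that flipping it alone changes the result.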

I built a tool called Proof that implements this approach.

That is the part I care about: how many tests are enough for this requirement?

Not “how many tests did I happen to write?” Enough for what I described.

The second question starts from my actual code: did my tests exercise every boolean condition in the implementation so each one independently affects the outcome?

This side does not care what I meant. It looks at what I wrote.

And sometimes it shows that my code has many more logical cases than my spec. So maybe my spec is not accurate enough.

Or my spec says this edge case matters, but my tests don’t witness it.

Or my tests cover implementation details, but the behavior is under-described.

I learned this the hard way on jsonparser. The spec side and the code side kept disagreeing in useful ways, and that is where code drift and spec drift become visible.

The gap goes in both directions. Sometimes the code is wrong. Sometimes the tests are weak. Sometimes the spec is too vague.

Sometimes all of it combined badly.

Checklists, not memory

What can be more deterministic than a checklist? In aerospace and automotive, everything has its own checklist. The price of a mistake is too high to rely on someone’s memory. I think checklists are the driving force behind quality engineering in those industries.

When you do not have specifications, it is very hard to create a checklist. When you are building a feature, you can have test cases, but that is a moving target. The items are constantly changing. You need something that will be the same all the time.

In this context, an obligation is not a test case. It is a category of behavior you are required to describe. Malformed input is an obligation. Boundary behavior is an obligation. Error handling is an obligation. For each one that applies to your requirement, you need at least one test case that proves how the system behaves in that category. The obligation does not tell you the answer. It forces you to ask the question.

You cannot rely on humans here. Even on me, to be frank. I can miss these items too. You need deterministic checklists.

In practice, the questions are very simple:

What will happen if this is malformed data?
What will happen if this is slow and the request times out?
What will happen if the database is down?
What will happen if you have a very large object?
What will happen if the function returns different values with the same inputs?

These are the cases where security issues and data bugs tend to live. For jsonparser, these are the exact cases I had not thought about.

Without obligations, edge cases depend on memory. Maybe I remember to test malformed data. Maybe the AI remembers. Maybe a reviewer notices. Maybe no one does.

At the moment, it comes down to whether someone remembers to test it or forgets.

This is where the CVE fix actually changed how I work. The fix itself was mechanical. But the obligations I wrote afterward forced me to think about the cases I had skipped. Every one of those became an explicit question I had to answer. Not “did someone remember to test this?” but “here is the list, and each item needs a witness.”

Obligations turn edge cases from “someone remembered to test this” into a deterministic process.

When I first started writing obligations for jsonparser, it was actually quite easy with modern AI tooling. I reviewed all of the specs. The flow is: you cannot pass this check until the checklist is green, until you define obligations for all of those cases, and until you define test cases for all of those cases as well.
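
A gate like that can be sketched in a few lines. This is not how Proof stores obligations. The Requirement type, the obligation names, the requirement ID, and the test name are all hypothetical, chosen only to show the shape of the check: every obligation that applies needs at least one witness, or the checklist is not green.

```go
package main

import "fmt"

// Obligation is a category of behavior a requirement must describe.
type Obligation string

// Hypothetical obligation names for illustration.
const (
	MalformedInput Obligation = "malformed-input"
	Boundary       Obligation = "boundary"
	ErrorHandling  Obligation = "error-handling"
)

// Requirement lists the obligations that apply to it and the tests
// that witness each one.
type Requirement struct {
	ID          string
	Obligations []Obligation
	Witnesses   map[Obligation][]string
}

// Missing returns the applicable obligations that have no witness test,
// in the order they were declared.
func (r Requirement) Missing() []Obligation {
	var missing []Obligation
	for _, o := range r.Obligations {
		if len(r.Witnesses[o]) == 0 {
			missing = append(missing, o)
		}
	}
	return missing
}

func main() {
	req := Requirement{
		ID:          "REQ-EXAMPLE-1", // hypothetical ID
		Obligations: []Obligation{MalformedInput, Boundary, ErrorHandling},
		Witnesses: map[Obligation][]string{
			MalformedInput: {"TestGet_TruncatedEscape"}, // hypothetical test name
		},
	}
	fmt.Println(req.Missing()) // [boundary error-handling]
}
```

The check does not know what the right behavior is. It only refuses to go green while a question is still unanswered.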

This is what the double link looks like in practice:

// In the code — annotated with the requirement it implements:
// SYS-REQ-863
func (s *Service) lookupCache(req Request) (*Result, bool) {
    // ...
}

// In the test — annotated with both the requirement AND the specific MC/DC row:
// Verifies: SYS-REQ-863
// MCDC SYS-REQ-863: cache_lookup_requested=T, component_inputs_unchanged=F,
//                    cached_component_result_reused=F => TRUE
func TestMCDC_SYS_REQ_863_Row1(t *testing.T) {
    evalVerifyScenario(t, "SYS-REQ-863", map[string]bool{
        "cache_lookup_requested":         true,
        "component_inputs_unchanged":     false,
        "cached_component_result_reused": false,
    }, true)
}

Each test is not just “test the function.” Each test is: “prove that this specific variable independently affects the outcome of this specific requirement.”

If I change the spec, I can see exactly which MC/DC rows are affected and which tests need to be reviewed. If I change a test, I can see which spec requirement it was proving and check whether the spec still says the same thing. If I add a new variable to the requirement, FLIP will generate new witness rows, and the missing tests become immediately visible.

This is the double link. Change the spec, review the tests. Change the tests, review the spec. If you have not touched the spec, why would you touch the test?
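
One way to keep that link honest is to check it mechanically. The sketch below is an assumption about how such a check could look, not a description of Proof's implementation. It reads the "// Verifies:" comments shown above out of test source and compares them with the requirement IDs in the spec. SYS-REQ-864 and SYS-REQ-001 are made-up IDs for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// verifiesRe matches comment lines of the form "// Verifies: <ID>".
var verifiesRe = regexp.MustCompile(`(?m)^\s*//\s*Verifies:\s*(\S+)`)

// verifiedIDs extracts the requirement IDs that the test source claims to verify.
func verifiedIDs(src string) map[string]bool {
	ids := map[string]bool{}
	for _, m := range verifiesRe.FindAllStringSubmatch(src, -1) {
		ids[m[1]] = true
	}
	return ids
}

// drift returns requirements no test verifies, and IDs that tests
// reference but the spec no longer contains.
func drift(spec []string, tests map[string]bool) (unverified, orphaned []string) {
	known := map[string]bool{}
	for _, id := range spec {
		known[id] = true
		if !tests[id] {
			unverified = append(unverified, id)
		}
	}
	for id := range tests {
		if !known[id] {
			orphaned = append(orphaned, id)
		}
	}
	sort.Strings(unverified)
	sort.Strings(orphaned)
	return unverified, orphaned
}

func main() {
	testSrc := `
// Verifies: SYS-REQ-863
func TestMCDC_SYS_REQ_863_Row1(t *testing.T) {}

// Verifies: SYS-REQ-001
func TestOld(t *testing.T) {}
`
	spec := []string{"SYS-REQ-863", "SYS-REQ-864"} // 864 is hypothetical
	unverified, orphaned := drift(spec, verifiedIDs(testSrc))
	fmt.Println(unverified, orphaned) // [SYS-REQ-864] [SYS-REQ-001]
}
```

An unverified requirement is spec without evidence. An orphaned test is evidence for something the spec no longer says. Both are drift, just in opposite directions.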

This is where the “how many tests are enough?” question changed for me. Before, the answer was always vibes. Write enough tests. Cover important paths. Don’t overdo it. Be pragmatic.

All true, and also not very helpful.

Now I think about it differently. Enough tests means enough evidence that every condition I described, or every condition my code actually contains, can independently affect the behavior I care about.

It is not about how many tests I have. It is about whether I really, really trust my system and whether it actually does what I described.

The true challenge is legacy

You can always start a new project and have a really nice experience with all of this. But the true challenge lies in the big legacy projects. They make up something like 90% of all software. They bring in most of the money. And they are the ones where wrong behavior actually hurts.

I work with very complex software. At Tyk, we build API gateway software used by banks, governments, and other serious enterprise customers. I am a very sceptical person. I always want some proof. At the same time, I understand that software is always about compromises.

But the game is changing. What was not possible in the past is now possible for small teams in terms of quality and processes. The wind is changing with AI.

The real payoff comes when you can apply some of those approaches to large legacy enterprise codebases. If it works there, it will work everywhere.

I know how challenging it is. You cannot do it in one go. You cannot just make a switch and start using a new process.

This is not only about the technical part. It is also about the people part. Even at the size of Tyk, with around a hundred people, it is not about the implementation. It is about the processes and the people. The technical part is the easiest one.

In order to convince people that you can actually make it, you need to be able to do it in parts. Start small, then scale.

Can you take small parts, turn them into a repeatable process, and then start scaling? That is how it works in the majority of cases.

So I picked the policy engine. Authorization and gateway policy decisions are obviously critical. If the policy engine behaves incorrectly, you are not talking about a cosmetic bug.

I applied the same kind of thinking to the Tyk policy package in a public PR: TykTechnologies/tyk#7932.

81% ordinary coverage. 64.3% decision coverage. The normal coverage number says most statements ran. The MC/DC number says a lot of policy decisions still do not have independent evidence.

For a policy engine, the second number is the one I care about.

Code coverage is not about a metric

It is about trust.

What do we trust? In classical software engineering, we say: here is the code and here are the tests, the tests are the source of truth. If you want to know how the system works, read the tests.

I do not believe that anymore. Not with AI writing code. Not with AI writing tests. Not with AI validating its own assumptions.

The source of truth cannot just be tests anymore. AI can write those too.

A passing test can prove that the code agrees with the test. It cannot prove that both agree with my intent.

So I moved the source of truth up. For me, it has to be the specification: the static description of what I expect the system to do.

Then code implements it. Tests witness it. Coverage measures evidence around it. Traceability keeps the chain from silently rotting.

I started this whole journey because of one CVE in a library I wrote six years ago. I ended up in a completely different place.

I thought the problem was in the code. It was in what I forgot to describe.

I thought coverage was the answer. It was the wrong question.

The first article was about losing intent. This one is about binding intent back to code.

AI Writes Your Code. Nobody Verifies the Intent.

Leonid Bugaev — Thu, 23 Apr 2026 15:09:01 GMT

I live in two different worlds now.

In one, AI made me more productive than I have ever been.

I have written more software in the last two years than across the rest of my career. I have barely written any code manually in the last year.

That part is real.

The speed boost is real.

The weird part is what came with it.

AI helps me ship more.

But it also asks me to trust more.

That is the uncomfortable part.

I am not just delegating typing.

I am delegating thinking, validation, and judgment too.

And I am still not sure where the safe line is.

In the other world, I lead engineering for software used by banks, governments, and other regulated environments, where mistakes are expensive and confidence matters more than speed.

And if you ask whether AI made us ship features 2x faster there, the honest answer is no.

Not even close.

That does not mean AI was useless.

It helped somewhere else.

It reduced noise.

A lot of engineering time in a big system does not go into writing the feature. It goes into interruption-based work: support engineers trying to understand how a feature behaves, PMs trying to figure out whether something is a bug or intended behavior, solution architects pulling in senior engineers just to inspect a corner of the system.

Tools that let people talk to the codebase, inspect it safely, and even generate tests or benchmarks to validate a hypothesis helped a lot with that.

People were less interrupted.

Context switching got better.

Engineers were happier.

But the main bottleneck did not move.

Implementation got dramatically faster. Trust did not.

That is the wall I keep hitting in both worlds.

The part people keep smoothing over

The industry keeps talking as if faster code generation automatically means faster engineering.

It does not.

In a lot of teams, it just means mistakes can scale faster than judgment.

As an individual engineer, I can create software much faster than before. Good software too. Clean structure. Tests. Refactors. Nice terminal output.

And still I trust it less than I want to.

Maybe less than before, because I know how much invisible reasoning I no longer fully own.

As a Head of Engineering, I can see the same problem from the other side.

We can accelerate some parts of the flow.

But we still have to verify whether the thing we built is actually the right thing, and whether it behaves correctly in the bigger system.

In a complex product, implementation is a relatively small slice of the work.

Validation and verification are the bigger slice.

That is why I keep coming back to the same phrase:

verification gap

The verification gap is the distance between what I mean and what I can actually prove.

Between intended behavior and demonstrated behavior.

That gap always existed.

AI did not invent it.

It just made it wider, faster, and easier to ignore until production forces the issue.

Why this got worse with AI

When humans wrote the code, the same brain often held the intent, the implementation, and the validation loop together.

Not perfectly.

People still shipped bugs. Specs were incomplete. Tests missed things.

But there was at least one place where the system could be understood as a whole: the person writing it.

That is no longer the default.

Now the human writes the prompt.

The model writes the code.

The model writes the tests.

The human skims the diff.

The model writes the cleanup.

The CI passes.

The feature ships.

And if the original intent was slightly wrong, incomplete, or misunderstood, that mistake does not stay in one place anymore.

It gets propagated through the whole stack.

The plan is based on the wrong assumption.
The implementation is based on the wrong assumption.
The tests are based on the wrong assumption.
The “manual validation” is often you asking the same model to sanity-check itself.

And then you look at the whole thing and it feels solid.

But it is solid on top of the wrong assumption.

So what exactly are we proving at that point?

That the system is internally consistent with the assumption it invented for itself.

Not that it matches your intent.

That is why so much AI productivity discourse feels fake to me.

A lot of teams did not automate engineering.

They automated typing.

That difference matters more than most people want to admit.

Bug-free is not the same as intent-correct

People keep saying: just write better tests.

I do write tests.

AI writes tests for me too.

That is not the point.

Tests verify behavior for cases somebody thought of.

That somebody used to be a human.

Now it is often a human plus a model.

That is still not the same thing as verifying intent.

You can have 100% line coverage and still completely miss the thing that matters.

You can have a green CI run and still not know whether the software behaves the way you intended.

You can even have bug-free code in a narrow sense and still have software that is wrong.

A green pipeline can still be a polished misunderstanding.

That is one of the biggest traps in the current AI coding wave.

We are getting very good at generating artifacts.

Code.

Tests.

Docs.

Migration scripts.

Benchmarks.

RFC drafts.

None of that answers the deeper question:

does the system actually do what we mean?

Software is not flat. It is layers.

The problem gets worse as the software gets bigger.

Software is not flat.

It is layers.

It is wide, deep, and full of interacting components, hidden assumptions, backwards compatibility constraints, old decisions nobody remembers, and behavior that only makes sense if you know four other subsystems.

Any project that lives long enough eventually reaches a point where one brain is no longer enough.

That was true before AI.

It is still true now.

AI does not remove that limit.

In some cases it makes you hit it faster, because you can generate change faster than you can understand its consequences.

That is why the industry created all the layers around engineering in the first place:

CI/CD
QA
RFCs
Architecture reviews
Team ownership boundaries
Support escalation paths
Approval workflows

These are not random rituals.

They are patches over the same underlying problem:

software complexity grows beyond what one brain can safely manage.

Where does intent live now?

I think mainstream software engineering is still missing something fundamental.

We do not maintain a real source of truth for intent.

If I ask where the intended behavior of a system lives right now, the honest answer in most teams is:

all of it combined badly.

Some of it is in source code.

Some of it is in tests.

Some of it is in RFCs.

Some of it is in Jira tickets.

Some of it is in Confluence.

Some of it is in the heads of senior engineers.

None of those is the place where I can go and see, clearly, how the system is supposed to behave right now.

That is not a source of truth.

That is archaeology.

And that feels like a drastic difference from fields like aerospace or automotive.

They have their own fragmentation problems too. Different groups write requirements, validate them, implement them, monitor them. Those worlds often barely talk to each other.

But at least intended behavior is treated as a first-class artifact.

There is an SRS.

There are explicit requirements.

There is a recognized place where intent is supposed to live.

In mainstream software, especially for something complex like an API gateway, that still feels almost unimaginable.

We mostly reconstruct intent after the fact from scattered artifacts.

And then we act surprised when regressions keep happening.

Why enterprise teams do not get the full AI payoff

This is also why the conversation about AI productivity is often too shallow.

Yes, implementation is faster.

Sometimes dramatically faster.

But if speed of implementation is no longer the hard part, then what is?

That is the real question.

If a feature can be implemented in hours instead of weeks, why have so many teams not seen the full payoff?

Because implementation was never the only bottleneck.

The harder part is deciding what should be built, making that intent explicit enough, and then verifying that the resulting system still matches it after the code, tests, and surrounding context have all changed.

That is where the time goes.

That is also where a lot of current AI hype becomes unserious.

People showcase how fast a model can produce code.

Fine.

Show me how fast your team can decide what is correct, verify that the behavior matches the intent, and avoid turning six months of hyperproductivity into twelve months of regression cleanup.

At work, we effectively built a zero-trust environment.

We do not blindly trust humans.

We do not blindly trust AI.

We review the code.

We validate the assumptions.

We check the tests.

That posture protected quality when AI adoption accelerated.

But it also meant we did not suddenly become 10x faster.

We became less noisy.

More focused.

Better at answering questions.

Faster in implementation.

Still constrained by verification.

Not everyone needs safety. Everyone needs trust.

As an individual engineer, the same tension shows up in a different shape.

I can move incredibly fast.

But I know that if I let trust slide too far, I eventually stop building and start doing bug fixing and regression management full-time.

The software turns into glue and patches.

You can feel your taste slipping if you are not careful.

It all kind of works, but you are no longer fully sure why.

The safety bar differs. Obviously.

A bank flow is not the same thing as a weekend prototype.

One component inside a product may deserve a much stricter baseline than another.

But trust? Everyone needs that.

If I built a website, a product, a service, an internal tool, whatever it is, I need to trust that it actually follows my intent closely enough for the context it lives in.

That is the standard I care about.

Not some abstract perfection.

Not a fantasy of zero bugs.

Not a productivity screenshot.

Trust.

Can I tell how my software behaves right now?

Do my docs, specs, tests, and code align with each other?

Do I know which parts are intentional, which parts are accidental, and which parts are cargo cult left over from earlier decisions?

When I change something, am I making the system better, or just shifting uncertainty around?

So what is engineering now, exactly?

Where is the place of the human?

Where is the place of judgment?

And which part should I never offload, even if AI is very good at pretending it can carry it for me?

Those were already hard questions before AI.

AI did not create them.

It amplified them.

It exposed how incomplete our current software practices already were.

Why I am writing this

That is why I do not think a smarter model or a shinier coding assistant will solve this by itself.

The missing layer is verification.

Not just whether the code runs.

Not just whether the tests pass.

Not just whether the reviewer approved.

I mean verification of intent.

That is what I have been thinking about for a long time now, and why I am starting this newsletter.

I want to write about the gap itself, what causes it, why it compounds, why mainstream software and regulated engineering barely learn from each other, and what it would take to close it.

Not with slogans.

With examples, systems, failures, tools, and uncomfortable questions.

AI did not remove the hard part of engineering.

It moved it from writing to verification.

If this problem feels familiar, subscribe.

This is what I am writing about now.