Trust Is the Bottleneck
AI can write specs, code, tests, and docs. If all of them agree on the wrong intent, green CI isn’t enough.
Everyone is asking the same question now: if AI can help us create much more code, why aren’t engineering teams suddenly moving much faster?
I think the question is right, but the answer usually stops too early.
AI does make some things dramatically faster. MVPs are faster. Prototypes are faster. The time to validate an idea shrinks. You can explore directions that previously weren’t worth the effort. This is real, and I don’t want to pretend otherwise. But creating the first version of something isn’t the same as maintaining a product, and creating more pull requests isn’t the same as creating more trusted change.
This is where the economics breaks down. If your team can create ten times more pull requests, your product doesn’t automatically move ten times faster. Your company doesn’t become ten times faster. The economy doesn’t double or triple. Because the expensive part of mature engineering was never only the typing of code.
The expensive part is trust.
Can I trust this change? Does it match the intent? Does it break a hidden customer flow? Does it affect backwards compatibility? Are the docs updated? Are the tests proving the right thing? Did we think about security, performance, malformed input, error states, release notes, migration, support?
A pull request doesn’t answer all of this.
A pull request is just something asking to be trusted.
So I don’t think the interesting question is “can AI create more code?” It can. The interesting question is: what needs to exist around the code so we can safely absorb more change?
If we can scale trust, we can unlock the real scaling of AI. But not by sending maintainers ten times more PRs. That only moves the bottleneck. What I want is a pull request that comes with enough context that I can actually believe it: why this change exists, what it affects, which tests prove it, which docs changed, what can break, and what still needs a human decision.
If that is the kind of engineering problem you care about, subscribe.
If your trust model is green CI, you are in trouble
AI isn’t going back in the box. Even if you personally didn’t join the hype train, people around you probably did. Engineers use it to write code. PMs use it to write specs. Someone asks it to validate the plan, then write the code, then write the tests, then check everything against the same plan again.
I do it too. I ask AI to help me write the spec. Then I ask it to validate the plan. Then I ask it to write the code. Then validate its own code. Then write the tests. Then check everything against the original plan. It’s tempting because it works surprisingly well. In a lot of cases, it feels almost magical.
But this is exactly why it becomes dangerous.
For a long time, our basic engineering trust model was something like this: write the code, write the tests, pass CI/CD, review the pull request, ship. It was never perfect, but it was much better than nothing. Green CI never meant the product was correct. It meant the code passed the checks we had.
The problem is that those checks don’t prove intent. They don’t prove the requirement was correct. They don’t prove the tests were testing the right thing. They don’t prove the documentation was complete. They don’t prove the change matched the real product behavior we needed.
They prove that the current artifacts agree with the current checks.
With AI, the whole chain can be generated. The spec can be wrong, the code can follow the wrong spec, the tests can validate the wrong code, the docs can describe the wrong behavior, and CI can still be green. Everything agrees with everything, but the intent is wrong.
That’s not trust. That’s a consistent mistake.
This is why high coverage isn’t enough either. In my previous article about jsonparser, the painful part wasn’t that I had no tests. I had near-100% coverage in the area that mattered. The problem was that malformed input behavior was never properly described. So the tests proved what existed, not what should have existed.
You cannot test what you never described.
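Here is a minimal Go sketch of that gap. Parse is a hypothetical stand-in, not the real jsonparser API. The first test keeps coverage green; only the second one witnesses a described obligation:

```go
package parser

import (
	"encoding/json"
	"testing"
)

// Parse is a hypothetical stand-in for a parser whose error contract
// was never written down.
func Parse(data []byte) (map[string]any, error) {
	var out map[string]any
	err := json.Unmarshal(data, &out)
	return out, err
}

// Coverage-style test: exercises the happy path and keeps the numbers green.
func TestParseValid(t *testing.T) {
	if _, err := Parse([]byte(`{"name":"tyk"}`)); err != nil {
		t.Fatalf("valid input failed: %v", err)
	}
}

// Obligation-style test: witnesses a written requirement,
// "malformed input must return an error, never panic".
func TestParseMalformed(t *testing.T) {
	for _, in := range [][]byte{
		[]byte(`{"name":`), // truncated object
		[]byte(`{,}`),      // stray comma
		nil,                // empty input
	} {
		if _, err := Parse(in); err == nil {
			t.Errorf("malformed input %q parsed without error", in)
		}
	}
}
```

If nobody writes the malformed-input requirement down, the second test never gets written, and coverage still looks near perfect.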
Security makes this even less optional. For years, many teams survived with some quiet version of security by obscurity. Not officially, of course. Everyone says security matters. But in practice, a lot of software depended on nobody looking too closely, or on attackers moving slowly enough that maintainers had time to react.
That assumption is breaking. VulnCheck reported that in the first half of 2025, 32.1% of known exploited vulnerabilities had exploitation evidence on or before the day the CVE was issued. This doesn’t mean every vulnerability becomes an exploit in hours, but it does mean the old time cushion isn’t something you can build your product around anymore.
So things that felt optional before become normal engineering requirements: malformed input, authorization boundaries, resource limits, timeout behavior, error states, data exposure, public API behavior. These aren’t enterprise extras. They’re product requirements.
This is the uncomfortable part: the trust problem is now everyone’s problem. Even if your company hasn’t “adopted AI,” your people probably have. Even if your CI is green, it may be green against the wrong intent. Even if your coverage is high, it may cover the behavior you remembered to describe, not the behavior the product actually needs.
So we need a different source of truth. Not instead of CI/CD, not instead of tests, not instead of code review. Above them. Something that says what the system is supposed to do, which obligations apply, what evidence proves them, and what becomes suspicious when something changes.
Otherwise AI won’t only help us move faster.
It will help us move faster with a false feeling of safety.
The outside structure is not the product
I know this problem from open source. For at least the last 12 years, I have worked a lot in open source. I had my own popular open source projects, and today at Tyk we build an open source API Gateway.
Open source is hard. Not because people are bad. Usually it’s the opposite. Someone from the outside sends you a pull request. Maybe it’s a bug fix. Maybe a new feature. Maybe it’s useful. Maybe it’s technically correct. Maybe they spent their evening on it.
But as a maintainer, you still need to get inside the context. You need to understand what’s happening and why this person is doing it. You can be fast and accept too much, or stay picky and make people unhappy. Neither option really solves the trust problem.
The real issue isn’t that contributors are bad. The issue is that they see the outside structure. They see the code. Maybe they see the tests. Maybe they see the docs. But they don’t see the intent in the same way the owner of the project sees it. They don’t know all the small product promises made over the years. They don’t know which ugly thing is accidental and which ugly thing is load-bearing. They don’t know which customer flow depends on some behavior that looks strange from the outside.
They are not inside this bubble.
And this isn’t only open source. The same thing happens inside a company. Someone from support knows the product very well. They see customer pain every day. They may even be technical enough to raise a pull request. Someone from solutions architecture can do the same. Another team can contribute to your service. AI makes all of this easier.
But internal doesn’t automatically mean trusted.
A support engineer may understand the product from the customer side, but not the architecture. Another team may understand code, but not the local history. AI may generate something that looks clean, but it has no real ownership unless someone gives it context and checks it.
These contributions can become shallow. Not useless. Shallow. They touch the visible layer of the system, but they aren’t backed by the deep intent of the people who own this part of the product.
We tried relaxing quality gates a few times. More people contributing sounds obviously good, especially when every company has more backlog than humans. But we had cases where a simple line, a simple fix, broke everything. We had other cases where a fix was so big in scope that it was too dangerous to accept.
The conclusion wasn’t “only engineers can write code.”
The conclusion was: if you want to scale engineering, it’s always about trust.
This is also why “move fast” changes meaning when you have customers. When you’re still searching for an MVP, you can break things and call it learning. But when customers put your product inside their infrastructure, the product is no longer fully yours. They pay you for stability, security, and predictable behavior. In a way, you give away part of the ownership.
At Tyk, this is very real. We build software used by banks, governments, and large enterprises. Quality assurance isn’t some internal slogan. It’s part of the relationship with customers. All software has bugs; I don’t want to pretend otherwise. But the price of a bug isn’t the same everywhere. Sometimes it’s legal. Sometimes it’s regulatory. Sometimes it’s very big money. Forget even the money for a second: what if the bank goes down? It can become a national-level issue.
Speed isn’t how quickly you can make a change.
Speed is how quickly you can safely absorb change.
Lehman’s software evolution work has a phrase that fits here: “The safe rate of change per release is constrained by the process dynamics.” In the same passage, he says that as the number, size, and architectural distance of changes increase, complexity and fault rate grow more than linearly.
This sounds academic, but it matches product reality. You can move only as fast as your safety norms allow. Your current team, your architecture, your customer base, your process, your quality gates, your review culture — all of this defines your real speed.
If AI gives you more change than your trust system can absorb, you aren’t scaling engineering. You’re scaling incoming work.
Temporary specs become archaeology
One of the deeper problems is how we treat specifications in consumer engineering.
Most of what we call a specification is a temporary artifact. You start with all the best practices. Maybe an RFC. Then it becomes a detailed Jira ticket. Maybe later there is an ADR. There are comments in GitHub. A Slack thread. A Confluence page. A few decisions made during review because reality was different from the original assumption.
In the moment, this feels normal. This is how software gets built. But after some time, all these artifacts pile up. If you want to understand how a component works, you need to dig through history. You need to understand why it ended up in this final state. Why this was done and not that. You may be lucky and find the exact explanation. In most cases it’s lost in someone’s head.
This is archaeology, not development.
The bigger problem is that these artifacts are independent. The RFC isn’t connected to all the code. The Jira ticket isn’t connected to all the tests. The docs are scattered across ten pages. The final implementation isn’t connected back to the original assumptions. It’s not a graph.
So we trust a person — engineer, architect, lead, PM — to hold the high-level picture in their head. We trust them to find all dependencies. We trust them to notice backwards compatibility issues. We trust them to know which docs need updating. We trust them to remember which customer flow can break.
And of course people forget. Not because they’re careless. Because this is too much context for one person to carry.
You fix a bug and break some other flow. You build a feature and forget a dependency with another service. You update two documentation pages and miss the other eight. The feature exists, but it’s unusable for one group of users. The implementation works, but not in the real production shape.
The spec was supposed to create clarity.
But because it was temporary, it becomes one more historical artifact.
This is also why spec-first isn’t enough. Spec-driven development is better than no spec. Planning before coding is obviously better than jumping into implementation. But if the spec is still treated as a temporary artifact, after a few iterations you end up in the same position, with intent chaos.
During development, the spec always changes. You start with assumptions. Researched assumptions, but still assumptions. Then implementation begins and reality appears. The architecture doesn’t work. A limitation appears. A reviewer notices a security issue. QA finds a case. A customer dependency changes the direction. And in many teams, the spec isn’t updated.
The real knowledge moves into GitHub comments, Slack messages, review threads, and people’s heads. The Jira ticket becomes stale. The implementation says one thing. The ticket says another.
Imagine someone from QA comes back from vacation and needs to test the feature. They see the ticket. They see the implementation. They have no idea what is happening. Why this? Why not that? Is this intended?
It’s so common. I bet a lot of you feel the same.
How can I know what I don’t know?
A lot of bugs are not in the code first. They are in the missing specification.
Have you actually described what should happen if the input is malformed? Have you described that this functionality must not allow SQL injection? What happens if the third-party service times out? What is the error state? Have you described authorization boundaries? Resource limits? Performance boundaries? What happens when something goes from ten requests per second in a test environment to thousands in production?
There are also more subtle cases. Concurrency. Non-deterministic behavior. Map iteration. Merge order. I’m looking at you, Go.
Have you described that the behavior should be deterministic? Did you write a test for it? Did the test prove the requirement, or did it just execute the code?
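Go makes this trap easy to show. A hedged sketch, with function names I invented for illustration:

```go
package merge

import "sort"

// BuildHeaderUnstable joins key=value pairs by iterating the map
// directly. Go randomizes map iteration on purpose, so the output
// order changes between runs.
func BuildHeaderUnstable(opts map[string]string) string {
	out := ""
	for k, v := range opts {
		out += k + "=" + v + ";"
	}
	return out
}

// BuildHeaderStable sorts the keys first. This version only exists if
// someone wrote down the requirement "output order must be stable".
func BuildHeaderStable(opts map[string]string) string {
	keys := make([]string, 0, len(opts))
	for k := range opts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := ""
	for _, k := range keys {
		out += k + "=" + opts[k] + ";"
	}
	return out
}
```

A test that calls BuildHeaderUnstable once will pass and count toward coverage. Only a test written against the stated requirement, one that runs the function many times and compares outputs, exposes the difference.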
This is where checklists, obligations, processes, discipline, and all the boring stuff come in. I know people hate boring process. I hate fake process too. Documents nobody reads. Boxes people tick after the fact. Quality theatre.
But the useful version is different. An obligation is not a test case. It’s a category of behavior you are required to describe: malformed input, boundary behavior, error handling, access denied, determinism, idempotency, atomicity, nil safety, overflow safety, encoding safety.
The obligation doesn’t tell you the answer.
It forces you to ask the question.
That is why I like it. It turns “maybe someone remembers” into a deterministic process. The checklist itself is human judgment. But checking whether the spec covered the checklist can be mechanical.
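To make “mechanical” concrete, here is a tiny Go sketch. The obligation names and the Spec shape are my own illustrative assumptions, not a standard taxonomy:

```go
package obligations

import "fmt"

// Obligation is a category of behavior the spec must describe.
// These names are illustrative, not a standard taxonomy.
type Obligation string

const (
	MalformedInput Obligation = "malformed-input"
	ErrorStates    Obligation = "error-states"
	AccessDenied   Obligation = "access-denied"
	Determinism    Obligation = "determinism"
)

// Spec is a hypothetical shape for the source of truth: a map from
// obligation to the spec section that addresses it.
type Spec struct {
	Covered map[Obligation]string
}

// Missing is the mechanical part. The judgment lives in the checklist,
// not in the checker.
func Missing(spec Spec, required []Obligation) []Obligation {
	var missing []Obligation
	for _, ob := range required {
		if _, ok := spec.Covered[ob]; !ok {
			missing = append(missing, ob)
		}
	}
	return missing
}

func ExampleMissing() {
	spec := Spec{Covered: map[Obligation]string{MalformedInput: "SPEC-12.3"}}
	fmt.Println(Missing(spec, []Obligation{MalformedInput, Determinism}))
	// Output: [determinism]
}
```

The checker is trivial on purpose. The value isn’t in the code; it’s that “did we describe determinism?” stops depending on someone’s memory.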
This is also where AI can help without pretending to own the product. If the code uses goroutines, the system can ask where cancellation, lifecycle, and error propagation are described. If code depends on map iteration or merge logic, it can ask whether determinism or commutativity matters. If code reads time directly, it can ask whether time is part of the behavior and how this is tested. If code changes a public API, it can ask where compatibility and documentation obligations are.
This isn’t AI judging architecture taste. It’s tooling surfacing missed questions.
That is the “how I know what I don’t know” loop. Spec obligations force code and test evidence. Code shape can reveal missing spec questions.
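A sketch of what that tooling could look like, under loud assumptions: real tooling would inspect the AST, and these rules are only examples, not a complete list:

```go
package shape

import "strings"

// Rule maps a code pattern to the spec question it should raise.
// Real tooling would inspect the AST; substring matching here is only
// to keep the sketch short.
type Rule struct {
	Pattern  string
	Question string
}

var rules = []Rule{
	{"go func(", "Where are cancellation, lifecycle, and error propagation described?"},
	{":= range ", "Does iteration or merge order matter? Is determinism described?"},
	{"time.Now(", "Is time part of the behavior, and how is it tested?"},
}

// Questions returns the spec questions a changed file should answer.
func Questions(src string) []string {
	var qs []string
	for _, r := range rules {
		if strings.Contains(src, r.Pattern) {
			qs = append(qs, r.Question)
		}
	}
	return qs
}
```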
What regulated industries got right
Consumer engineering and regulated engineering live in different worlds. Different tools. Different conferences. Different language. Some of it is archaic. Some of it is bureaucracy. I don’t want every SaaS team to become an avionics certification team.
But we shouldn’t ignore what they learned.
I expected to find paperwork. Annoyingly, I found a lot of things our world forgot to learn.
In aviation, automotive, medical devices, space systems, the spec isn’t treated as a temporary note. It’s a source of truth that lives together with the software. Requirements have IDs. They have layers. They are linked to documentation, tests, implementation, verification evidence. You can see blast radius. You can see what a change affects. During review, if implementation differs from the spec, the spec must be updated.
The useful idea is not the paperwork.
The useful idea is that intent is durable, traceable, and connected to evidence.
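Stripped of the paperwork, the mechanism is small. A Go sketch with an invented structure: requirements, code, tests, and docs as nodes, and blast radius as simple reachability:

```go
package trace

// Node is anything in the chain: a requirement, a source file,
// a test, a doc page. The shape is my illustration.
type Node struct {
	ID    string   // e.g. "REQ-101", "gateway/auth.go", "docs/auth.md"
	Kind  string   // "requirement", "code", "test", "doc"
	Links []string // IDs this node is connected to
}

type Graph map[string]*Node

// BlastRadius walks the links out from a changed node and returns
// every artifact that now deserves a second look.
func (g Graph) BlastRadius(changed string) []string {
	seen := map[string]bool{changed: true}
	queue := []string{changed}
	var affected []string
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		node := g[id]
		if node == nil {
			continue
		}
		for _, next := range node.Links {
			if !seen[next] {
				seen[next] = true
				affected = append(affected, next)
				queue = append(queue, next)
			}
		}
	}
	return affected
}
```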
NASA’s FRET is one example of this direction. It lets users enter hierarchical system requirements in structured natural language, gives those requirements unambiguous semantics, and can show them as natural language, formal logic, diagrams, and interactive simulation.
That doesn’t mean every product team needs FRET or formal methods everywhere. It means the requirement is not just a document. It’s something you can analyze, link, verify, and keep alive.
This is where requirement management becomes interesting again for consumer engineering. Not the old heavy version copied blindly from regulated industries. Not paperwork for paperwork. But the useful part: a source of truth, cross-links, invalidation, traceability, and evidence.
Combined with everything consumer engineering learned over the years: CI/CD, pull requests, fast feedback, developer experience, automated tests, observability, docs, release automation.
We should not throw away modern engineering.
We should add the missing trust layer.
From pull request to evidence pack
Today a pull request usually gives me code, maybe tests, maybe a description. But it doesn’t give me the whole chain.
It doesn’t tell me the original intent. It doesn’t tell me which obligations apply. It doesn’t show the blast radius. It doesn’t show which docs changed or should have changed. It doesn’t show which specs this conflicts with. It doesn’t show what changed during implementation compared to the plan.
So the reviewer has to reconstruct all of that.
Again, archaeology.
What I want instead is an evidence pack. Not enterprise theatre. Not documents for the sake of documents. A practical package that makes the change reviewable.
Here is the intent. Here are the requirements. Here are the obligations. Here are the tests that witness them. Here are the docs. Here is the blast radius. Here is how it aligns with existing specs and where we checked for conflicts. Here is what changed during implementation. Here is what still needs human judgment.
Then the pull request isn’t only code. It’s the full chain of development.
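Concretely, an evidence pack could start as nothing more than a manifest attached to the pull request. This struct is my illustration, not a standard and not a final format:

```go
package evidence

// Pack is one possible shape for the chain a reviewer needs.
// The field names are mine, not the final Proof format.
type Pack struct {
	Intent       string   // why this change exists
	Requirements []string // requirement IDs this change claims to satisfy
	Obligations  []string // obligation categories that apply
	Tests        []string // tests that witness the requirements
	Docs         []string // doc pages updated (or deliberately untouched)
	BlastRadius  []string // artifacts this change can affect
	Conflicts    []string // existing specs checked for contradictions
	Deviations   []string // what changed during implementation vs the plan
	NeedsHuman   []string // decisions that still require judgment
}
```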
This matters for open source. It matters for support engineers contributing fixes. It matters for other internal teams. It matters for AI agents. You don’t trust the contributor blindly. You don’t trust AI blindly. You trust the evidence chain, and then you still apply human judgment where judgment is needed.
This will feel slower at first. Writing obligations is slower than writing a vague ticket. Linking tests to requirements is slower than writing random tests. Updating docs through the graph is slower than pushing a change and hoping someone remembers. But not all friction is bad.
The question is whether the friction creates trust.
Bureaucracy gives you friction without trust.
Evidence gives you friction that lets more people move safely.
If this trust exists, then AI can actually help us scale. Not by dumping more pull requests into the same review bottleneck, but by making more changes reviewable, traceable, and safe to absorb in parallel.
Without trust, maintainers become managers of incoming things. Instead of thinking about architecture, future, and vision, they review an endless stream of pull requests, fixes, and generated artifacts.
That is not the scaling I want.
Why I am building Proof
This is why I am building Proof.
I don’t want another tool whose main purpose is to create more code. We already have many of those. The problem isn’t that we can’t produce enough artifacts. The problem is that the artifacts don’t preserve intent.
I want specs to stop being temporary. I want requirements to live with the software. I want obligations to force the boring questions before they become production bugs. I want code, tests, docs, and requirements to invalidate each other when they drift. I want a reviewer to see the evidence chain instead of rebuilding it from memory.
AI will make engineering faster. That part is already happening. But faster without trust is not enough.
For me, the real question is this: how do I get to the point where what arrives from outside isn’t just a pull request, but a well-thought-out evidence pack that makes me believe I can merge it as soon as possible?
That is the scaling I care about.
Not just more code.
More trusted change.

