Daniele Polencic

The Specification Is the Product Now

Published in March 2026



The Software Development Lifecycle Is Dead by Boris Tane.

The argument: AI agents haven't merely accelerated the software development lifecycle (SDLC).

They've collapsed it.

Requirements, design, implementation, testing, code review, deployment, and monitoring — those used to be separate phases.

Now the loop is shorter: intent, agent builds and deploys, observe, repeat.

The new skill is context engineering, and the new safety net is observability.

I find the article compelling in parts.

I think it's wrong in the most important part.

The human isn't the bottleneck

Boris argues that code review via PRs is a relic.

Agents generate too much code for human review queues.

He advocates agents committing to main with automated verification, human review only on exceptions.

Just because an agent can spin off 500 PRs doesn't mean those are correct or desirable.

The review queue is there for a reason.

I explored a related argument in Software Is Cheap Now: Thorsten Ball's "I Am the Bottleneck Now" video, where a bug came in on Slack, he pasted it into Codex, and it was done in five minutes.

He was the bottleneck.

The pipeline could have gone straight from Slack to Codex to a review thread.

The entire ticket/triage/sprint system exists because human engineers are expensive.

If that constraint is lifted, the loop needs to change.

Pieter Levels takes this to the extreme.

He runs Claude Code directly on his production servers with --dangerously-skip-permissions.

No deployment step, no code review, no checking.

He claims to have delivered 10x his normal output for the week and to have emptied his entire feature/bug board for the first time across eight products.

He's sitting on $3M in revenue per year, and I'm not, so something is clearly working.

But AI adds code; it doesn't refactor.

It hacks the codebase toward something more complicated rather than simplifying.

At the pace Pieter ships across eight products, the technical debt accrual must be enormous.

And Claude makes security mistakes constantly without explicit guidance.

One of his week's accomplishments was migrating from URL-based login tokens to session tokens, a basic security practice.

The combination of no permissions, direct production deployment, and zero review is a security incident waiting to happen.

Boris, Thorsten, and Pieter all frame the human as a speed constraint to be optimised out of existence.

I don't buy it.

A human who ships a bad deployment feels the weight of the 2am pager, the post-mortem, the reputation hit.

An LLM feels nothing.

Humans persist in the loop because they're the only ones who bear consequences.

We should redesign the SDLC to optimise for humans (the accountable reviewers), not machines (the prolific generators).

Where Boris is right

He's right that agents should monitor rollouts with feedback loops.

The vision: agents adjusting traffic percentages based on error rates and auto-rolling back during latency spikes.

LLMs perform well when they receive feedback, and production telemetry is a powerful signal.

This is probably the most immediately actionable idea in the piece.
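As a rough sketch of what such a feedback loop could look like, here's a decision function over telemetry. The thresholds, type names, and three-way decision are my assumptions for illustration, not Boris's design:

```typescript
// Sketch: one iteration of an agent's rollout-control loop.
// Thresholds are invented; a real system would tune them per service.
type Telemetry = { errorRate: number; p99LatencyMs: number };
type Decision = { action: "advance" | "hold" | "rollback"; trafficPercent: number };

function evaluateRollout(t: Telemetry, currentPercent: number): Decision {
  if (t.errorRate > 0.05 || t.p99LatencyMs > 2000) {
    // Clear regression: revert immediately.
    return { action: "rollback", trafficPercent: 0 };
  }
  if (t.errorRate > 0.01) {
    // Noisy signal: keep traffic where it is and wait for more data.
    return { action: "hold", trafficPercent: currentPercent };
  }
  // Healthy: double the canary's share, capped at full traffic.
  return { action: "advance", trafficPercent: Math.min(100, currentPercent * 2) };
}
```

The interesting part isn't the thresholds; it's that the policy is code the agent executes against telemetry, not prose it's asked to follow.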

Charity Majors made the argument that observability is the last line of defence back in 2019 in I test in prod.

Production has always been the real test environment, whether we like it or not.

It's just more true now that agents ship faster than humans can review.

If code is disposable, what survives?

The more interesting question Boris doesn't ask: if code is disposable, what's the lasting artifact?

Thorsten says you'd build everything and throw away what you don't like.

Fine.

But then what do you keep?

The code gets rebuilt from scratch whenever you need it.

What you keep is the specification.

Nicholas Zakas takes this seriously.

He describes building a chain of enterprise documents: PRDs (product requirements documents), ADRs (architecture decision records), technical design documents (TDDs), and task lists.

When something breaks, you trace back through the chain to find where ambiguity crept in.

A "save for later" bug was traced to a TDD that implied but didn't explicitly state a read-through cache pattern.

The fix wasn't in the code, but in the spec.

If I'm building an app in three years, I don't care about npm dependencies or framework versions.

I only need my specification to rebuild from scratch.

Writing detailed specifications and keeping them current is the only thing with lasting value.

Sau Sheong from GovTech Singapore pushes back on this.

If AI can generate specs from code as easily as code from specs, why not keep code as the primary artifact?

Code is unambiguous.

It compiles, runs, and can be tested.

Specifications are prone to drift and interpretation.

It's a fair objection.

His nuanced answer: for frequently rebuilt systems where reasoning is expensive, specifications are more durable.

For long-lived systems where implementation embodies hard-won edge cases, code remains the better source of truth.

We have decades of tooling and ecosystem optimised for code as the source of truth.

I think the answer depends on how disposable your code actually is.

If you're rebuilding from scratch regularly (and agents are making that more realistic), the spec wins.

If the code has survived five years of production edge cases, the code is the spec, whether you like it or not.

There's a problem with this, though.

Specifications without verification

Specifications without verification are just prose that the model can ignore.

Alex Jones (Principal Engineer at AWS, creator of K8sGPT) calls this "Provable Autonomy."

As agents become more autonomous, we need invariants and properties that can be mathematically reasoned about and falsified.

Observability and policy alone won't cut it.

You need constraints with teeth.
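One way to read "constraints with teeth" is as executable, falsifiable invariants rather than prose. A minimal sketch with an invented rollout invariant (the names and property are mine, not from K8sGPT):

```typescript
// An invariant "with teeth": a property the system checks and can falsify,
// rather than a sentence an agent is asked to respect.
type Rollout = { desiredReplicas: number; readyReplicas: number; maxSurge: number };

function assertRolloutInvariant(r: Rollout): void {
  // Invariant: ready replicas never exceed desired plus the surge budget.
  if (r.readyReplicas > r.desiredReplicas + r.maxSurge) {
    throw new Error(
      `invariant violated: ready=${r.readyReplicas} > desired=${r.desiredReplicas} + surge=${r.maxSurge}`
    );
  }
}
```

A runtime assertion is the weakest form of this; provers and model checkers verify the same kind of property over all states instead of one.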

So I went deep on specification languages to see if any of them deliver this.

Allium is the most interesting new entrant.

It bills itself as an "LLM-native language" for behavioural specifications: when/requires/ensures rules, entity models, config blocks.

The key claim is that the LLM is the interpreter.

No compiler, no parser, no runtime.

You write structured constraints, and the LLM consumes them directly.

The problem is that the gap between Allium and Gherkin (without a test runner) is thin.

Gherkin's Given/When/Then provides the same structured constraint on the author.

You can use .feature files without Cucumber.

What Allium adds over Gherkin is first-class entity definitions with derived values, universal quantification (rules over all matching entities versus concrete examples), a config block, and cross-rule entity references.

It's debatable whether that justifies creating a new language rather than adopting a format that's been around for 15 years.

I looked at the rest of the landscape.

TLA+, Event-B, Dafny.

The pattern is consistent: everything that captures behaviour at the level of detail you'd want also verifies it:

  • Event-B has a prover (a tool that can mathematically prove properties hold).
  • TLA+ has a model checker (a tool that exhaustively tests every possible state).
  • Dafny has an SMT solver (a tool that automatically checks logical constraints).

Everything that skips verification is much less structured (think .cursorrules files).

Allium sits in an unstable middle: structured enough to create a maintenance burden, not verified enough to guarantee the structure means anything.

Typed pseudocode proved a stronger alternative to any of these for LLM consumption.

The insight comes from the PAL paper (Program-Aided Language Models, Gao et al. 2022): LLMs reason better when given structured code than prose.

The idea is to write TypeScript function signatures with real preconditions as code and natural language for the implementation holes:

function requestPasswordReset(user: User, email: string): ResetToken | Error {
  if (user.status !== 'active') {
    return new Error('User not active')
  }

  if (email !== user.email) {
    return new Error('Email mismatch')
  }

  return /* create token, send email */ // <- write the comment as if the implementation were there
}

This is strictly more rigorous than Allium.

The type checker validates structural consistency (Allium has zero verification).

Preconditions are real, executable code.

The spec and code are the same artifact, so there's no sync problem.

And you get free IDE support.

The strongest argument for a separate spec language has nothing to do with LLMs.

Code is imperative by nature, while specs are declarative.

TypeScript can express "this thing exists and has this type," but cannot express "this thing is derived from those other things" without becoming implementation.

The moment you write the derivation formula, you've committed to a computation strategy: class? getter? function? in-memory?

A spec should just state the relationship.
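A small TypeScript sketch of that commitment problem (the Order/LineItem names are mine, for illustration): every way of writing the derivation bakes in a computation strategy.

```typescript
type LineItem = { price: number; quantity: number };

// Option A: a getter commits to lazy, in-memory computation on an object.
class Order {
  constructor(public items: LineItem[]) {}
  get total(): number {
    return this.items.reduce((sum, i) => sum + i.price * i.quantity, 0);
  }
}

// Option B: a free function commits to recomputing at every call site.
const orderTotal = (items: LineItem[]): number =>
  items.reduce((sum, i) => sum + i.price * i.quantity, 0);

// The spec only wants to say: total = sum of price × quantity.
// Both options over-specify how and when that sum is computed.
```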

SQL, of all things, handles this well.

DDL (data definition language) plus views is declarative.

Views state derivations without dictating computation.

Foreign keys express relationships.

CHECK constraints are real preconditions, and LLMs have seen enormous amounts of SQL in training.

For the entity model part of a specification, SQL is more rigorous, better-tooled, and more widely understood than any new spec language.
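A sketch of what that entity model looks like in plain DDL (the schema is invented for illustration, not from any of the posts above):

```sql
-- Entities with real preconditions, not prose.
CREATE TABLE users (
  id     INTEGER PRIMARY KEY,
  email  TEXT NOT NULL UNIQUE,
  status TEXT NOT NULL CHECK (status IN ('active', 'suspended'))
);

-- A foreign key states the relationship declaratively.
CREATE TABLE orders (
  id      INTEGER PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  amount  REAL NOT NULL CHECK (amount >= 0)
);

-- A view states the derivation without dictating how or when it's computed.
CREATE VIEW user_totals AS
  SELECT user_id, SUM(amount) AS total_spent
  FROM orders
  GROUP BY user_id;
```

The view is the part TypeScript struggles with: `user_totals` declares that the derived value exists and how it relates to `orders`, and leaves the computation strategy to the engine.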

But I tested this thinking against a concrete problem, and the conclusion shifted.

I maintain a CLI for producing podcast and interview videos (the same one I wrote about in You Can't Hide a Secret from a Process That Runs as You).

It's a multi-phase workflow with dozens of steps, approval gates, and derived artifacts where later outputs depend on earlier ones.

The documentation is detailed, with exact commands for every step.

An LLM can read that documentation and understand every step, but the failures aren't about understanding.

The LLM skips steps, reorders them, cuts corners on repetitive work, and forgets soft constraints buried in prose once the context grows long enough.

When the context window compacts, it loses track of where it is entirely.

A specification language would be a nicer description of the same workflow.

But the LLM already understands the workflow fine; it just does what it wants.

A better description doesn't help if nothing enforces it.

What actually worked was treating the workflow like a build system.

A function that reads the current state, compares it to the workflow definition, and returns what's done, what's next, and what's stale.

When a fix in an early phase invalidates everything downstream, the function marks all derived artifacts as stale.

The LLM can't skip ahead because the tooling keeps reporting stale outputs until the source is fixed and everything is rebuilt.

This is Make and Bazel thinking applied to a content pipeline.

The specification isn't a document the LLM reads. It's code that runs, checks the state, and refuses to proceed.
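The build-system idea can be sketched as a pure function over workflow state. The step names and the shape of the state are illustrative, not the actual CLI:

```typescript
// A workflow step and when it was last completed (0 = never).
type Step = { name: string; dependsOn: string[] };
type State = Record<string, number>; // step name -> completion timestamp

// Steps are assumed to be listed in dependency (topological) order.
function planWorkflow(steps: Step[], state: State) {
  const staleSet = new Set<string>();
  const done: string[] = [], next: string[] = [], stale: string[] = [];
  for (const step of steps) {
    const builtAt = state[step.name] ?? 0;
    const depTimes = step.dependsOn.map((d) => state[d] ?? 0);
    const depStale = step.dependsOn.some((d) => staleSet.has(d));
    const depsReady = depTimes.every((t) => t > 0) && !depStale;
    if (builtAt === 0) {
      // Only runnable when every upstream artifact exists and is fresh.
      if (depsReady) next.push(step.name);
    } else if (depStale || depTimes.some((t) => t > builtAt)) {
      // An upstream fix invalidates everything downstream.
      staleSet.add(step.name);
      stale.push(step.name);
    } else {
      done.push(step.name);
    }
  }
  return { done, next, stale };
}
```

Because staleness propagates, redoing an early step keeps every derived artifact in `stale` and blocks later steps until the chain is rebuilt; the LLM can't talk its way past that.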

What context engineering misses

Boris's conclusion, that "context is all that's left," is provocative but incomplete.

Software is about modelling a problem and finding solutions.

There's a code aspect, but the harder part is domain-specific problem-solving that the LLM may not grasp.

Context engineering helps, but domain understanding and judgment go beyond "providing context to an agent."

Code used to be an expensive, scarce resource.

Now the scarce resources are judgment, domain understanding, and the willingness to carry the pager at 2am when the agent ships something wrong.

Enjoyed this post?

I write about Kubernetes, TypeScript, software design, and AI. You can get new posts delivered to your inbox or via RSS.