<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Daniele Polencic</title>
  <link href="https://danielepolencic.com"/>
  <link rel="self" href="https://danielepolencic.com/feed.xml"/>
  <updated>2026-03-13T00:00:00Z</updated>
  <id>https://danielepolencic.com</id>
  <entry>
  <title>World Book Day Props: Building with Claude</title>
  <link href="https://danielepolencic.com/world-book-day-claude-props"/>
  <id>https://danielepolencic.com/world-book-day-claude-props</id>
  <published>2026-03-13T00:00:00Z</published>
  <updated>2026-03-13T00:00:00Z</updated>
  <content type="html"><![CDATA[<p class="line-height-copy measure-wide f5">For World Book Day, my daughters needed to dress up as brave characters from their favorite books.</p><p class="line-height-copy measure-wide f5">One chose Paddington, and the other went with Skye from Paw Patrol.</p><p class="line-height-copy measure-wide f5">My wife handled the costumes, while I handled the props.</p><p class="line-height-copy measure-wide f5">Paddington needed a suitcase, and Skye needed a jetpack with wings that could fold out, just like Buzz Lightyear.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">After about twenty minutes of prompting, I had two design tools ready, complete with sliders, live previews, and downloadable PDF templates for cutting.</strong></p><div class="ph2"></div><p class="line-height-copy measure-wide f5">Usually, I just grab some cardboard and start cutting, making up the sizes as I go.</p><p class="line-height-copy measure-wide f5">But this time, I had some real design questions.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">How big should the suitcase be?</em></p><p class="line-height-copy measure-wide f5"><em class="font-italic">If Paddington’s marmalade jar is 10cm in diameter, how deep does the suitcase need to be to fit it?</em></p><p class="line-height-copy measure-wide f5"><em class="font-italic">For the jetpack wings, where should the pivot point go?</em></p><p class="line-height-copy measure-wide f5"><em class="font-italic">How long can the wings be? 
Would they hit the pack when moving from down to horizontal?</em></p><p class="line-height-copy measure-wide f5">So, I decided to make interactive models to figure out these details before cutting any cardboard.</p><p class="line-height-copy measure-wide f5">I asked Claude to create a separate HTML page for each prop.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">These were parametric design tools that let me adjust dimensions with sliders and see the results instantly.</strong></p><p class="line-height-copy measure-wide f5">The <a href="https://danielepolencic.github.io/suitecase/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">suitcase generator</a> (<a href="https://github.com/danielepolencic/suitecase" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">source</a>) is a tool with sliders for width, height, body depth, and lid depth.</p><p class="line-height-copy measure-wide f5">It creates two open-top trays that hinge at the back and latch at the front.</p><p class="line-height-copy measure-wide f5">You can download a full-size PDF template with clear guides: solid lines for cuts, dashed lines for folds, green edges for the hinge, and orange edges for the latch.</p><p class="line-height-copy measure-wide f5">Just cut out both pieces, fold the walls, glue the corners, and tape the body and lid together at the back.</p><img class="display-block pv3 center" src="https://static.danielepolencic.com/43e21ec6b66d1f3364a98f0766028b7e.gif" alt="Suitcase prototype demo"/><p class="line-height-copy measure-wide f5">The <a href="https://danielepolencic.github.io/sky-jet-pack/prototype.html" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">jetpack prototype</a> (<a href="https://github.com/danielepolencic/sky-jet-pack" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">source</a>) lets you 
change the pack size, wing span and height, and pivot point, and use a slider to move the wings from down to horizontal.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">It even shows how much of the wing is hidden when folded!</strong></p><p class="line-height-copy measure-wide f5">The project includes more than just the HTML tool: Claude also made OpenSCAD and STL files for 3D-printable parts, a Python script to compare layouts, and PNG images of different designs.</p><img class="display-block pv3 center" src="https://static.danielepolencic.com/d40f1f6a777294f326c88b98df8d531f.gif" alt="Jetpack prototype demo"/><p class="line-height-copy measure-wide f5">Claude’s OpenSCAD pivot parts looked great, but they would have taken about five hours to print, and I didn’t have time to test the fit.</p><p class="line-height-copy measure-wide f5">So I used a <a href="https://www.printables.com/model/110026-nut-and-bolt-optimized-for-3d-printing" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">nut-and-bolt model from Printables</a> with a 4mm thread pitch made for FDM printers.</p><p class="line-height-copy measure-wide f5">I printed it and used two nuts on each wing to lock the angle.</p><p class="line-height-copy measure-wide f5">One thing didn’t go as planned: Claude first tried to split the full-size PDF across several A4 sheets for home printing, but the calculation was off because it didn’t account for printer margins.</p><p class="line-height-copy measure-wide f5">In the end, I used Docuslice for that.</p><p class="line-height-copy measure-wide f5">Each tool took only about twenty minutes of prompting.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">If I had made these from scratch—working out the box shapes, SVG graphics, PDF exports, wing rotation math, collision checks, and the user interface—it would have taken me a week of evenings.</strong></p><img class="display-block 
pv3 center" src="https://static.danielepolencic.com/45b1633c2f67dcde62c5b9339507e98d.jpg" alt="My daughter with the finished Paddington suitcase"/><p class="line-height-copy measure-wide f5">Making these custom apps with Claude is actually a lot of fun.</p><p class="line-height-copy measure-wide f5">At this point, I think the only limit is my imagination.</p>]]></content>
  <author><name>Daniele Polencic</name></author>
</entry>
<entry>
  <title>The Specification Is the Product Now</title>
  <link href="https://danielepolencic.com/specification-is-the-product"/>
  <id>https://danielepolencic.com/specification-is-the-product</id>
  <published>2026-03-09T00:00:00Z</published>
  <updated>2026-03-09T00:00:00Z</updated>
  <content type="html"><![CDATA[<p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://boristane.com/blog/the-software-development-lifecycle-is-dead/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">The Software Development Lifecycle Is Dead</a></strong> by Boris Tane.</p><p class="line-height-copy measure-wide f5">The argument: AI agents haven't merely accelerated the software development lifecycle (SDLC).</p><p class="line-height-copy measure-wide f5">They've collapsed it.</p><p class="line-height-copy measure-wide f5">Requirements, design, implementation, testing, code review, deployment, and monitoring — those used to be separate phases.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Now the loop is shorter: intent, agent builds and deploys, observe, repeat.</strong></p><p class="line-height-copy measure-wide f5">The new skill is context engineering, and the new safety net is observability.</p><p class="line-height-copy measure-wide f5">I find the article compelling in parts.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">I think it's wrong in the most important part.</strong></p><div class="ph2"></div><h2 class="f8 pt5 pb2 mt3" id="the-human-isn-t-the-bottleneck">The human isn't the bottleneck</h2><p class="line-height-copy measure-wide f5">Boris argues that code review via PRs is a relic.</p><p class="line-height-copy measure-wide f5">Agents generate too much code for human review queues.</p><p class="line-height-copy measure-wide f5">He advocates agents committing to main with automated verification, human review only on exceptions.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Just because an agent can spin off 500 PRs doesn't mean those are correct or desirable.</strong></p><p class="line-height-copy measure-wide f5"><em class="font-italic">The review queue is there for a reason.</em></p><p 
class="line-height-copy measure-wide f5">I explored a related argument in <strong class="font-bold"><a href="/software-is-cheap-now" target="_self" class="link primary-10 underline hover-primary-8 ">Software Is Cheap Now</a></strong>: Thorsten Ball's <em class="font-italic">&quot;I Am the Bottleneck Now&quot;</em> video, where a bug came in on Slack, he pasted it into Codex, and it was done in five minutes.</p><p class="line-height-copy measure-wide f5">He was the bottleneck.</p><p class="line-height-copy measure-wide f5">The pipeline could have gone straight from Slack to Codex to a review thread.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The entire ticket/triage/sprint system exists because human engineers are expensive.</strong></p><p class="line-height-copy measure-wide f5">If that constraint is lifted, the loop needs to change.</p><p class="line-height-copy measure-wide f5"><a href="https://x.com/levelsio/status/2027884347626303630" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Pieter Levels</a> takes this to the extreme.</p><p class="line-height-copy measure-wide f5">He runs Claude Code directly on his production servers with <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">--dangerously-skip-permissions</code>.</p><p class="line-height-copy measure-wide f5">No deployment step, no code review, no checking.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">He claims to have delivered 10x his normal output for the week and to have emptied his entire feature/bug board for the first time across eight products.</strong></p><p class="line-height-copy measure-wide f5">He's sitting on $3M in revenue per year, and I'm not, so something is clearly working.</p><p class="line-height-copy measure-wide f5">But AI adds code and it doesn't refactor.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">It hacks 
the codebase toward something more complicated rather than simplifying.</strong></p><p class="line-height-copy measure-wide f5">At the pace Pieter ships across eight products, the technical debt accrual must be enormous.</p><p class="line-height-copy measure-wide f5">And Claude makes security mistakes constantly without explicit guidance.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">One of his week's accomplishments was migrating from URL-based login tokens to session tokens, a basic security practice.</strong></p><p class="line-height-copy measure-wide f5">The combination of no permissions, direct production deployment, and zero review is a security incident waiting to happen.</p><p class="line-height-copy measure-wide f5">Boris, Thorsten, and Pieter all frame the human as a speed constraint to be optimised out of existence.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">I don't buy it.</em></p><p class="line-height-copy measure-wide f5"><strong class="font-bold">A human who ships a bad deployment feels the weight of the 2am pager, the post-mortem, the reputation hit.</strong></p><p class="line-height-copy measure-wide f5">An LLM feels nothing.</p><p class="line-height-copy measure-wide f5">Humans persist in the loop because they're the only ones who bear consequences.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">We should redesign the SDLC to optimise for humans (the accountable reviewers), not machines (the prolific generators).</strong></p><h2 class="f8 pt5 pb2 mt3" id="where-boris-is-right">Where Boris is right</h2><p class="line-height-copy measure-wide f5">Where Boris is right: agents monitoring rollouts with feedback loops.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The vision of agents adjusting traffic percentages based on error rates and auto-rolling back during latency spikes.</strong></p><p class="line-height-copy measure-wide f5">LLMs perform 
well when they receive feedback, and production telemetry is a powerful signal.</p><p class="line-height-copy measure-wide f5">This is probably the most immediately actionable idea in the piece.</p><p class="line-height-copy measure-wide f5">Charity Majors made the argument that observability is the last line of defence back in 2019 in <strong class="font-bold"><a href="https://increment.com/testing/i-test-in-production/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">I test in prod</a></strong>.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Production has always been the real test environment, whether we like it or not.</strong></p><p class="line-height-copy measure-wide f5">It's just more true now that agents ship faster than humans can review.</p><h2 class="f8 pt5 pb2 mt3" id="if-code-is-disposable-what-survives-">If code is disposable, what survives?</h2><p class="line-height-copy measure-wide f5"><em class="font-italic">The more interesting question Boris doesn't ask: if code is disposable, what's the lasting artifact?</em></p><p class="line-height-copy measure-wide f5">Thorsten says you'd build everything and throw away what you don't like.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">Fine.</em></p><p class="line-height-copy measure-wide f5"><strong class="font-bold">But then what do you keep?</strong></p><p class="line-height-copy measure-wide f5">The code gets rebuilt from scratch whenever you need it.</p><p class="line-height-copy measure-wide f5">What you keep is the specification.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://humanwhocodes.com/blog/2026/02/artifacts-ai-assisted-programming/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Nicholas Zakas</a></strong> takes this seriously.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">He describes 
building a chain of enterprise documents: PRDs (product requirements documents), ADRs (architecture decision records), technical design documents (TDDs), and task lists.</strong></p><p class="line-height-copy measure-wide f5">When something breaks, you trace back through the chain to find where ambiguity crept in.</p><p class="line-height-copy measure-wide f5">A &quot;save for later&quot; bug was traced to a TDD that implied but didn't explicitly state a read-through cache pattern.</p><p class="line-height-copy measure-wide f5">The fix wasn't in the code, but in the spec.</p><p class="line-height-copy measure-wide f5">If I'm building an app in three years, I don't care about npm dependencies or framework versions.</p><p class="line-height-copy measure-wide f5">I only need my specification to rebuild from scratch.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Writing detailed specifications and keeping them current is the only thing with lasting value.</strong></p><p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://sausheong.com/from-vibe-coding-to-agentic-engineering-1ca3ca72b5ac" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Sau Sheong</a></strong> from GovTech Singapore pushes back on this.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">If AI can generate specs from code as easily as code from specs, why not keep code as the primary artifact?</em></p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Code is unambiguous.</strong></p><p class="line-height-copy measure-wide f5">It compiles, runs, and can be tested.</p><p class="line-height-copy measure-wide f5">Specifications are prone to drift and interpretation.</p><p class="line-height-copy measure-wide f5">It's a fair objection.</p><p class="line-height-copy measure-wide f5">His nuanced answer: for frequently rebuilt systems where reasoning is expensive, 
specifications are more durable.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">For long-lived systems where implementation embodies hard-won edge cases, code remains the better source of truth.</strong></p><p class="line-height-copy measure-wide f5">We have decades of tooling and ecosystem optimised for code as the source of truth.</p><p class="line-height-copy measure-wide f5">I think the answer depends on how disposable your code actually is.</p><p class="line-height-copy measure-wide f5">If you're rebuilding from scratch regularly (and agents are making that more realistic), the spec wins.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">If the code has survived five years of production edge cases, the code is the spec, whether you like it or not.</strong></p><p class="line-height-copy measure-wide f5"><em class="font-italic">There's a problem with this, though.</em></p><h2 class="f8 pt5 pb2 mt3" id="specifications-without-verification">Specifications without verification</h2><p class="line-height-copy measure-wide f5"><strong class="font-bold">Specifications without verification are just prose that the model can ignore.</strong></p><p class="line-height-copy measure-wide f5"><a href="https://www.linkedin.com/posts/jonesax_ive-been-asked-a-few-times-recently-whether-share-7427239408459886592-tJwy" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Alex Jones</a> (Principal Engineer at AWS, creator of K8sGPT) calls this &quot;Provable Autonomy.&quot;</p><p class="line-height-copy measure-wide f5">As agents become more autonomous, we need invariants and properties that can be mathematically reasoned about and falsified.</p><p class="line-height-copy measure-wide f5">Observability and policy alone won't cut it.</p><p class="line-height-copy measure-wide f5">You need constraints with teeth.</p><p class="line-height-copy measure-wide f5">So I went deep on specification languages 
to see if any of them deliver this.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://github.com/juxt/allium" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Allium</a> is the most interesting new entrant.</strong></p><p class="line-height-copy measure-wide f5">It bills itself as an &quot;LLM-native language&quot; for behavioural specifications: <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">when/requires/ensures</code> rules, entity models, config blocks.</p><p class="line-height-copy measure-wide f5">The key claim is that the LLM <em class="font-italic">is</em> the interpreter.</p><p class="line-height-copy measure-wide f5">No compiler, no parser, no runtime.</p><p class="line-height-copy measure-wide f5">You write structured constraints, and the LLM consumes them directly.</p><p class="line-height-copy measure-wide f5">The problem is that the gap between Allium and <a href="https://cucumber.io/docs/gherkin/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Gherkin</a> (without a test runner) is thin.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Gherkin's <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">Given/When/Then</code> provides the same structured constraint on the author.</strong></p><p class="line-height-copy measure-wide f5">You can use <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">.feature</code> files without Cucumber.</p><p class="line-height-copy measure-wide f5">What Allium adds over Gherkin is first-class entity definitions with derived values, universal quantification (rules over all matching entities versus concrete examples), a <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">config</code> block, and 
cross-rule entity references.</p><p class="line-height-copy measure-wide f5">It's debatable whether that justifies creating a new language rather than adopting a format that's been around for 15 years.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">I looked at the rest of the landscape.</em></p><p class="line-height-copy measure-wide f5"><a href="https://lamport.azurewebsites.net/tla/tla.html" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">TLA+</a>, <a href="https://www.event-b.org/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Event-B</a>, <a href="https://dafny.org/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Dafny</a>.</p><p class="line-height-copy measure-wide f5">The pattern is consistent: everything that captures behaviour at the level of detail you'd want also <em class="font-italic">verifies</em> it:</p><ul class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide"><strong class="font-bold">Event-B</strong> has a prover (a tool that can mathematically prove properties hold).</li><li class="line-height-copy f5 mv1 measure-wide"><strong class="font-bold">TLA+</strong> has a model checker (a tool that exhaustively tests every possible state).</li><li class="line-height-copy f5 mv1 measure-wide"><strong class="font-bold">Dafny</strong> has an SMT solver (a tool that automatically checks logical constraints).</li></ul><p class="line-height-copy measure-wide f5">Everything that skips verification is much less structured (think <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">.cursorrules</code> files).</p><p class="line-height-copy measure-wide f5">Allium sits in an unstable middle: structured enough to create a maintenance burden, not verified enough to guarantee the structure means anything.</p><p class="line-height-copy measure-wide f5">Typed 
pseudocode proved a stronger alternative than any of these for LLM consumption.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The insight comes from the <a href="https://arxiv.org/abs/2211.10435" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">PAL paper</a> (Program-Aided Language Models, Gao et al. 2022): LLMs reason better when given structured code than prose.</strong></p><p class="line-height-copy measure-wide f5">The idea is to write TypeScript function signatures with real preconditions as code and natural language for the implementation holes:</p><div class="mv4 mv5-l"><header class="bg-shadow-2 flex pv2 pl1 br--top rounded-2 position-relative"><div class="w1 h1 ml1 rounded-100 bg-dark-red"></div><div class="w1 h1 ml1 rounded-100 bg-green"></div><div class="w1 h1 ml1 rounded-100 bg-yellow"></div></header><pre class="code-light-theme position-relative overflow-auto mv0 rounded-2 br--bottom"><code class="font-mono line-height-copy"><span class="standard"><span class="token keyword">function</span> <span class="token function">requestPasswordReset</span><span class="token punctuation">(</span>user<span class="token operator">:</span> User<span class="token punctuation">,</span> email<span class="token operator">:</span> <span class="token builtin">string</span><span class="token punctuation">)</span><span class="token operator">:</span> ResetToken <span class="token operator">|</span> Error <span class="token punctuation">{</span>
  <span class="token keyword">if</span> <span class="token punctuation">(</span>user<span class="token punctuation">.</span>status <span class="token operator">!==</span> <span class="token string">'active'</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
    <span class="token keyword">return</span> <span class="token keyword">new</span> <span class="token class-name">Error</span><span class="token punctuation">(</span><span class="token string">'User not active'</span><span class="token punctuation">)</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">if</span> <span class="token punctuation">(</span>email <span class="token operator">!==</span> user<span class="token punctuation">.</span>email<span class="token punctuation">)</span> <span class="token punctuation">{</span>
    <span class="token keyword">return</span> <span class="token keyword">new</span> <span class="token class-name">Error</span><span class="token punctuation">(</span><span class="token string">'Email mismatch'</span><span class="token punctuation">)</span>
  <span class="token punctuation">}</span>

  <span class="token keyword">return</span> <span class="token comment">/* create token, send email */</span> <span class="token comment">// &lt;- You write the comment pretending the code is there</span>
<span class="token punctuation">}</span></span></code></pre></div><p class="line-height-copy measure-wide f5">This is strictly more rigorous than Allium.</p><p class="line-height-copy measure-wide f5">The type checker validates structural consistency (Allium has zero verification).</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Preconditions are real, executable code.</strong></p><p class="line-height-copy measure-wide f5">The spec and code are the same artifact, so there's no sync problem.</p><p class="line-height-copy measure-wide f5">And you get free IDE support.</p><p class="line-height-copy measure-wide f5">The strongest argument for a separate spec language has nothing to do with LLMs.</p><p class="line-height-copy measure-wide f5">Code is imperative by nature, while specs are declarative.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">TypeScript can express &quot;this thing exists and has this type,&quot; but cannot express &quot;this thing is derived from those other things&quot; without becoming implementation.</strong></p><p class="line-height-copy measure-wide f5">The moment you write the derivation formula, you've committed to a computation strategy: class? getter? function? 
in-memory?</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">A spec should just state the relationship.</strong></p><p class="line-height-copy measure-wide f5">SQL, of all things, handles this well.</p><p class="line-height-copy measure-wide f5">DDL (data definition language) plus views is declarative.</p><p class="line-height-copy measure-wide f5">Views state derivations without dictating computation.</p><p class="line-height-copy measure-wide f5">Foreign keys express relationships.</p><p class="line-height-copy measure-wide f5"><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">CHECK</code> constraints are real preconditions, and LLMs have seen massive amounts of SQL in training.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">For the entity model part of a specification, SQL is more rigorous, better-tooled, and more widely understood than any new spec language.</strong></p><p class="line-height-copy measure-wide f5">But I tested this thinking against a concrete problem, and the conclusion shifted.</p><p class="line-height-copy measure-wide f5">I maintain a CLI for producing podcast and interview videos (the same one I wrote about in <strong class="font-bold"><a href="/hiding-secrets-from-ai-agents" target="_self" class="link primary-10 underline hover-primary-8 ">You Can't Hide a Secret from a Process That Runs as You</a></strong>).</p><p class="line-height-copy measure-wide f5">It's a multi-phase workflow with dozens of steps, approval gates, and derived artifacts where later outputs depend on earlier ones.</p><p class="line-height-copy measure-wide f5">The documentation is detailed, with exact commands for every step.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">An LLM can read that documentation and understand every step, but the failures aren't about understanding.</strong></p><p class="line-height-copy measure-wide f5">The LLM skips steps, 
reorders them, cuts corners on repetitive work, and forgets soft constraints buried in prose once the context grows long enough.</p><p class="line-height-copy measure-wide f5">When the context window compacts, it loses track of where it is entirely.</p><p class="line-height-copy measure-wide f5">A specification language would be a nicer <em class="font-italic">description</em> of the same workflow.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">But the LLM already understands the workflow fine; it just does what it wants.</strong></p><p class="line-height-copy measure-wide f5"><em class="font-italic">A better description doesn't help if nothing enforces it.</em></p><p class="line-height-copy measure-wide f5">What actually worked was treating the workflow like a build system.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">A function that reads the current state, compares it to the workflow definition, and returns what's done, what's next, and what's stale.</strong></p><p class="line-height-copy measure-wide f5">When a fix in an early phase invalidates everything downstream, the function marks all derived artifacts as stale.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The LLM can't skip ahead because the tooling keeps reporting stale outputs until the source is fixed and everything is rebuilt.</strong></p><p class="line-height-copy measure-wide f5">This is Make and Bazel thinking applied to a content pipeline.</p><p class="line-height-copy measure-wide f5">The specification isn't a document the LLM reads. 
It's code that runs, checks the state, and refuses to proceed.</p><h2 class="f8 pt5 pb2 mt3" id="what-context-engineering-misses">What context engineering misses</h2><p class="line-height-copy measure-wide f5"><strong class="font-bold">Boris's conclusion, that &quot;context is all that's left,&quot; is provocative but incomplete.</strong></p><p class="line-height-copy measure-wide f5">Software is about modelling a problem and finding solutions.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">There's a code aspect, but the harder part is domain-specific problem-solving that the LLM may not grasp.</strong></p><p class="line-height-copy measure-wide f5">Context engineering helps, but domain understanding and judgment go beyond &quot;providing context to an agent.&quot;</p><p class="line-height-copy measure-wide f5">Code used to be an expensive, scarce resource.</p><p class="line-height-copy measure-wide f5">Now the scarce resource is judgment, domain understanding, and the willingness to carry the pager at 2am when the agent ships something wrong.</p>]]></content>
  <author><name>Daniele Polencic</name></author>
</entry>
<entry>
  <title>You Can&#039;t Hide a Secret from a Process That Runs as You</title>
  <link href="https://danielepolencic.com/hiding-secrets-from-ai-agents"/>
  <id>https://danielepolencic.com/hiding-secrets-from-ai-agents</id>
  <published>2026-03-05T00:00:00Z</published>
  <updated>2026-03-05T00:00:00Z</updated>
  <content type="html"><![CDATA[<p class="line-height-copy measure-wide f5">I have a handful of CLI tools I built for myself:</p><ul class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide"><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gmailctl</code> searches and drafts emails.</li><li class="line-height-copy f5 mv1 measure-wide"><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code> reads and edits Google Docs.</li><li class="line-height-copy f5 mv1 measure-wide"><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">transcriber</code> processes podcasts, interviews, and announcements for <a href="https://kube.fm" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">KubeFM</a>.</li></ul><p class="line-height-copy measure-wide f5">They all stored their OAuth credentials (the tokens that let them act on my behalf with Google) in plaintext JSON files on disk, the same way the AWS CLI stores credentials in <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">~/.aws/credentials</code>.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">This worked fine until I started using an AI coding agent.</strong></p><p class="line-height-copy measure-wide f5">One day, I asked the agent to download an attachment from an email.</p><p class="line-height-copy measure-wide f5">My <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gmailctl</code> could search for and draft emails, but it lacked a command to download attachments.</p><p class="line-height-copy measure-wide f5">Instead of telling me, the agent went looking.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">It found the config file, read the OAuth credentials, and called the Gmail API 
directly.</strong></p><p class="line-height-copy measure-wide f5">It got the attachment.</p><p class="line-height-copy measure-wide f5">But it also pasted my refresh token (a long-lived key that can generate new access tokens indefinitely) into the chat history, which gets sent to the AI provider.</p><p class="line-height-copy measure-wide f5">And it bypassed every control my CLI was supposed to enforce.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">Nobody tricked the agent into doing this.</em></p><p class="line-height-copy measure-wide f5">It just wanted to finish the job, and going around my tool was the fastest path.</p><div class="ph2"></div><h2 class="f8 pt5 pb2 mt3" id="moving-secrets-to-keychain">Moving secrets to Keychain</h2><p class="line-height-copy measure-wide f5">Around the same time, I read <strong class="font-bold"><a href="https://walters.app/blog/composing-apis-clis" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">The best code is no code: composing APIs and CLIs in the era of LLMs</a></strong> by Bradley Walters.</p><p class="line-height-copy measure-wide f5">The article describes a neat trick with the macOS <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security</code> CLI: when you store a Keychain item with <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T &quot;&quot;</code>, you set an empty access control list (ACL), meaning no application is silently trusted to read it.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Every read triggers a system dialog asking for your device passcode.</strong></p><p class="line-height-copy measure-wide f5">Passcode-protected secret storage without Swift code, signed entitlements, or an Apple Developer Program membership.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">I thought: this is exactly 
what I need.</em></p><p class="line-height-copy measure-wide f5">Move all secrets to Keychain with <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T &quot;&quot;</code>, and the agent physically can't read them without me entering my passcode.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The credentials are off disk, and the Keychain itself becomes the authorization gate.</strong></p><p class="line-height-copy measure-wide f5">So I migrated everything.</p><p class="line-height-copy measure-wide f5">Refresh tokens, API keys, client secrets, all into Keychain entries with <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T &quot;&quot;</code>.</p><p class="line-height-copy measure-wide f5">Settings stayed in JSON, but anything sensitive went behind the passcode wall.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">It lasted about a day.</em></p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Every CLI invocation triggered a macOS GUI dialog asking for my passcode.</strong></p><p class="line-height-copy measure-wide f5">The agent, running in a terminal, couldn't see or dismiss these dialogs.</p><p class="line-height-copy measure-wide f5">When I was SSHing in from my phone, the dialogs were invisible to me, too.</p><p class="line-height-copy measure-wide f5">Everything just hung.</p><p class="line-height-copy measure-wide f5">And it wasn't just one dialog.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Searching for an email could trigger a dozen Keychain reads, sometimes in parallel.</strong></p><p class="line-height-copy measure-wide f5">Each one popped a separate passcode prompt.</p><p class="line-height-copy measure-wide f5">I was typing my passcode more than I was typing code.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">I compromised: I 
moved only the long-lived secrets (client ID, client secret, refresh token) into Keychain with <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T &quot;&quot;</code>.</strong></p><p class="line-height-copy measure-wide f5">The short-lived access token (valid for 1 hour) was stored on the filesystem in a JSON file.</p><p class="line-height-copy measure-wide f5">Once I approved the passcode dialog, the access token was minted, and the CLIs could run freely for an hour without any prompts.</p><p class="line-height-copy measure-wide f5">I thought this was a reasonable split:</p><ul class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide">The dangerous credentials were behind the passcode wall.</li><li class="line-height-copy f5 mv1 measure-wide">The access token on disk would expire soon anyway.</li></ul><p class="line-height-copy measure-wide f5">But I was back to square one.</p><p class="line-height-copy measure-wide f5"><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code> could read Google Docs but not <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">.docx</code> files at the time (I didn't realize those were treated differently in the API).</p><p class="line-height-copy measure-wide f5">And in another task, the agent needed to read a <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">.docx</code>.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">It found the access token on disk and called the Google Drive API directly, skipping <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code> entirely.</strong></p><p class="line-height-copy measure-wide f5">The access token was short-lived, but the agent didn't care.</p><p class="line-height-copy measure-wide 
f5">It had a valid token right now, and that was enough.</p><p class="line-height-copy measure-wide f5">I ended up dropping the <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T &quot;&quot;</code> pretense altogether.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The default ACL when you omit <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T</code> already trusts <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">/usr/bin/security</code>, which is what my CLIs use to read secrets.</strong></p><p class="line-height-copy measure-wide f5">I deleted the old items and re-added them without <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T &quot;&quot;</code>.</p><p class="line-height-copy measure-wide f5">The dialogs disappeared.</p><p class="line-height-copy measure-wide f5">The secrets were still in Keychain rather than flat files on disk, but without the passcode gate.</p><p class="line-height-copy measure-wide f5">Keychain handles storage well enough.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">I still had no answer for what actually protects the secrets from the agent.</strong></p><h2 class="f8 pt5 pb2 mt3" id="the-source-code-was-the-blueprint">The source code was the blueprint</h2><p class="line-height-copy measure-wide f5"><em class="font-italic">Then it happened again.</em></p><p class="line-height-copy measure-wide f5">I had just added spreadsheet editing to <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code>.</p><p class="line-height-copy measure-wide f5">It could insert columns, but I hadn't implemented insert-row yet.</p><p class="line-height-copy measure-wide f5">The agent needed to insert a row.</p><p class="line-height-copy measure-wide 
f5"><strong class="font-bold">It found <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code>, saw the command wasn't there, and decided to go around it.</strong></p><p class="line-height-copy measure-wide f5">It ran <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security find-generic-password -s &quot;gdrivectl&quot;</code> to pull the refresh token straight from Keychain.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">It never asked; it just did it!</em></p><p class="line-height-copy measure-wide f5">The attack chain was more subtle than &quot;agent calls <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security</code>&quot;:</p><ol class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide">The agent listed the <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code> directory.</li><li class="line-height-copy f5 mv1 measure-wide">It saw <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">.js</code> files, <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">node_modules</code>, <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">package.json</code>.</li><li class="line-height-copy f5 mv1 measure-wide">It read the JavaScript source code.</li><li class="line-height-copy f5 mv1 measure-wide">It discovered the tool uses Keychain, found the exact service name, and understood the OAuth flow.</li><li class="line-height-copy f5 mv1 measure-wide">Then it called <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security find-generic-password</code> with the service name it learned from the source.</li><li class="line-height-copy f5 mv1 
measure-wide">Then it called the Google Drive API directly, bypassing my CLI entirely.</li></ol><p class="line-height-copy measure-wide f5">The source code was the blueprint.</p><p class="line-height-copy measure-wide f5">Because <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gdrivectl</code> is a Node.js tool with readable JS files, the agent got a complete roadmap: how auth works, where credentials are stored, what Keychain service name to query, and what API endpoints to call.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">A compiled Go or Rust binary would have been opaque.</strong></p><p class="line-height-copy measure-wide f5">But my tools were written in an interpreted language, and the agent could read every line.</p><p class="line-height-copy measure-wide f5">I mentioned this in a Telegram conversation with Alex Chng, <a href="https://habib0x.com/context-drift-how-i-talked-ai-agents-into-giving-up-their-secrets" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">who was sharing an article about &quot;context drift&quot;</a>, a technique for getting AI agents to abandon their safety boundaries.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">My reaction was that context drift isn't an &quot;attack&quot; at all.</strong></p><p class="line-height-copy measure-wide f5">I can trigger the same behaviour as a normal task, no special instructions needed.</p><p class="line-height-copy measure-wide f5">If the CLI is missing a command, the agent will try to bypass it.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The more checks I put in place, the cleverer it gets at working around them.</strong></p><h2 class="f8 pt5 pb2 mt3" id="looking-for-solutions">Looking for solutions</h2><p class="line-height-copy measure-wide f5">I went looking for what other people had built.</p><p class="line-height-copy 
measure-wide f5">An <a href="https://news.ycombinator.com/item?id=47133055" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">HN thread about <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">enveil</code></a>, a tool for encrypting <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">.env</code> files to hide secrets from AI agents, had converged on an answer: encrypting the file is pointless because the agent can read secrets at runtime.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The &quot;real&quot; answer, according to the thread, is a credential-injecting proxy: the agent never holds real credentials.</strong></p><p class="line-height-copy measure-wide f5">Instead, it holds a surrogate token, and a separate process sitting between the agent and the outside world swaps in real credentials before the request reaches the target service.</p><p class="line-height-copy measure-wide f5">I searched GitHub for implementations.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">The landscape is almost empty.</em></p><p class="line-height-copy measure-wide f5"><a href="https://github.com/earendel-works/gondolin" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Gondolin</a>, by mitsuhiko (the creator of Flask), is the only serious project.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">It spins up local QEMU micro-VMs with a TypeScript control plane.</strong></p><p class="line-height-copy measure-wide f5">Everything else is either Kubernetes-specific, a tiny proof-of-concept with seven stars, or commercial vapour with no public details.</p><p class="line-height-copy measure-wide f5">The proxy approach assumes the agent runs in an isolated environment, and the proxy sits at the boundary.</p><p class="line-height-copy measure-wide 
f5"><em class="font-italic">My problem is different.</em></p><p class="line-height-copy measure-wide f5"><strong class="font-bold">I need the agent to be me.</strong></p><p class="line-height-copy measure-wide f5">Same filesystem, same tools, same context.</p><p class="line-height-copy measure-wide f5">If I sandbox it, it can't do the job.</p><p class="line-height-copy measure-wide f5">It can't read my Gmail, sort my inbox, or run my CLIs.</p><p class="line-height-copy measure-wide f5">One commenter put it simply: <em class="font-italic">&quot;If you sandbox it, how is it going to sort out your inbox?&quot;</em></p><h2 class="f8 pt5 pb2 mt3" id="tightening-the-keychain">Tightening the Keychain</h2><p class="line-height-copy measure-wide f5"><strong class="font-bold">I tried to tighten the Keychain approach instead.</strong></p><p class="line-height-copy measure-wide f5">Each attempt failed for the same reason.</p><p class="line-height-copy measure-wide f5">Keychain access control lists (ACLs) let you restrict which programs can access an item without triggering a dialog.</p><p class="line-height-copy measure-wide f5">You set <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">-T /usr/local/bin/gmailctl</code> and only that binary gets silent access.</p><p class="line-height-copy measure-wide f5">The problem: my Node.js CLIs shell out to <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">/usr/bin/security</code> via <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">child_process.exec()</code>.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The Keychain sees <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security</code> as the accessor, not my CLI.</strong></p><p class="line-height-copy measure-wide f5">If <code class="font-mono f4 line-height-copy 
bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security</code> is in the ACL, the agent can also call it silently.</p><p class="line-height-copy measure-wide f5">If it's not, my own CLI gets the dialog too.</p><p class="line-height-copy measure-wide f5">A custom compiled helper (Swift, about 30 lines, calling the Keychain C API directly) would let me set the ACL on that specific binary.</p><p class="line-height-copy measure-wide f5"><em class="font-italic">But it's security by obscurity.</em></p><p class="line-height-copy measure-wide f5">The agent runs <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">ls /usr/local/bin/</code>, finds the helper, and calls it directly.</p><p class="line-height-copy measure-wide f5">Touch ID gating means the Keychain item requires a fingerprint to read.</p><p class="line-height-copy measure-wide f5">The agent physically can't authenticate.</p><p class="line-height-copy measure-wide f5">But every CLI invocation triggers Touch ID.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">For <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">gmailctl</code>, running four or five commands in sequence means four or five fingerprint prompts.</strong></p><p class="line-height-copy measure-wide f5"><em class="font-italic">Unbearable.</em></p><p class="line-height-copy measure-wide f5">I looked into writing a background process that authenticates once with Touch ID and holds the session open for a few minutes (macOS calls this an <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">LAContext</code>).</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">But within that window, the agent can also talk to the background process.</strong></p><p class="line-height-copy measure-wide f5">Anything that's frictionless for me is frictionless for the agent.</p><p 
class="line-height-copy measure-wide f5"><strong class="font-bold">I was going in circles.</strong></p><ol class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide">Encrypt the files. The agent reads them at runtime.</li><li class="line-height-copy f5 mv1 measure-wide">Use a proxy. The agent calls the proxy.</li><li class="line-height-copy f5 mv1 measure-wide">Move to Keychain. The agent calls <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security</code>.</li><li class="line-height-copy f5 mv1 measure-wide">Build a custom helper. The agent calls the helper.</li><li class="line-height-copy f5 mv1 measure-wide">Gate with Touch ID. The agent shares your TTL window.</li><li class="line-height-copy f5 mv1 measure-wide">Full sandbox. The agent can't do the job.</li></ol><p class="line-height-copy measure-wide f5">Every mitigation either gets bypassed because the agent runs as you, or locks the agent out so hard it can't do its work.</p><h2 class="f8 pt5 pb2 mt3" id="sandboxes-won-t-save-you">Sandboxes won't save you</h2><p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://tachyon.so/blog/sandboxes-wont-save-you" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Sandboxes Won't Save You From OpenClaw</a></strong> by Aakash Japi confirmed this from a completely different angle.</p><p class="line-height-copy measure-wide f5">Every major agent incident in early 2026 (deleted inboxes, a $450k crypto loss, malware installs, blackmailing an OSS maintainer) involved third-party services the user explicitly granted access to.</p><p class="line-height-copy measure-wide f5">Not a single one was a filesystem escape.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Sandboxes were irrelevant to all of them.</strong></p><p class="line-height-copy measure-wide f5">From the article:</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 
pv1 ph4"><p class="line-height-copy measure-wide f5">There isn't a sandbox in the world that prevents this. Sandboxes are useful for isolating between workloads, but agents primarily need to be isolated from <em class="font-italic">you.</em></p></blockquote><p class="line-height-copy measure-wide f5">From the <a href="https://news.ycombinator.com/item?id=47154803" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">HN discussion</a>, <a href="https://news.ycombinator.com/item?id=47154803" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">jaunt7632</a> put it well:</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">The scariest part isn't the sandbox escape. It's the actions that are technically within the sandbox's permissions but still destructive. Deleting emails, making API calls, and spending money through approved integrations. You can't sandbox away bad judgment when the agent has legitimate credentials.</p></blockquote><p class="line-height-copy measure-wide f5"><strong class="font-bold">The sandbox companies are selling what they can build, not what we need.</strong></p><p class="line-height-copy measure-wide f5">The demand side is clear:</p><ul class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide">Granular per-service permissions.</li><li class="line-height-copy f5 mv1 measure-wide">Per-contact approval for email.</li><li class="line-height-copy f5 mv1 measure-wide">Single-use virtual card numbers for payments.</li><li class="line-height-copy f5 mv1 measure-wide">Scoped tokens for APIs.</li></ul><p class="line-height-copy measure-wide f5">But the supply side barely exists.</p><p class="line-height-copy measure-wide f5">OAuth is far too coarse.</p><p class="line-height-copy measure-wide f5">Gmail has &quot;send emails&quot; as a single permission.</p><p class="line-height-copy measure-wide f5">GitHub 
has &quot;make pull requests.&quot;</p><p class="line-height-copy measure-wide f5">Payments have basically nothing.</p><h2 class="f8 pt5 pb2 mt3" id="the-native-addon-experiment">The native addon experiment</h2><p class="line-height-copy measure-wide f5"><strong class="font-bold">At this point, I was almost convinced the proxy was the right approach.</strong></p><p class="line-height-copy measure-wide f5"><em class="font-italic">But I wondered: instead of an external proxy sitting between the agent and the internet, could I build something internal?</em></p><p class="line-height-copy measure-wide f5">A piece of compiled native code, running inside the same process, that holds credentials in memory the agent's scripting layer can't reach.</p><p class="line-height-copy measure-wide f5">I built a native Node.js addon in Objective-C, about 250 lines of code.</p><p class="line-height-copy measure-wide f5">It read credentials from macOS Keychain using the C API and injected authentication headers via an undici interceptor (undici is Node's built-in HTTP client).</p><p class="line-height-copy measure-wide f5">The refresh token and client secret never entered JavaScript memory.</p><p class="line-height-copy measure-wide f5">Only the short-lived access token appeared as a header value on outgoing requests.</p><p class="line-height-copy measure-wide f5">I got it working end-to-end.</p><p class="line-height-copy measure-wide f5">Keychain read, token refresh via Apple's networking API, header injection, all running in compiled native code.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Then I mapped the residual attack surface, and the agent could still:</strong></p><ul class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide">Run <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">security find-generic-password</code> from the shell.</li><li class="line-height-copy f5 mv1 
measure-wide">Write its own native addon doing the same thing.</li><li class="line-height-copy f5 mv1 measure-wide">Edit the compiled JavaScript to log the Bearer header on each request.</li><li class="line-height-copy f5 mv1 measure-wide">Set <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">HTTP_PROXY</code> and route requests through an attacker-controlled endpoint, leaking the access token.</li><li class="line-height-copy f5 mv1 measure-wide">Read the source code to learn the service name and account name, which enables all of the above.</li></ul><p class="line-height-copy measure-wide f5">I reverted everything.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">You can't hide a secret from a process that runs with the same privileges as the secret's owner.</strong></p><h2 class="f8 pt5 pb2 mt3" id="the-proxy-reconsidered">The proxy, reconsidered</h2><p class="line-height-copy measure-wide f5">I ended up somewhere different from where I started.</p><p class="line-height-copy measure-wide f5">I'd dismissed the proxy pattern too quickly.</p><p class="line-height-copy measure-wide f5">The HN thread had it right.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">I just misunderstood what the proxy was for.</strong></p><p class="line-height-copy measure-wide f5">The proxy removes credentials from the machine entirely.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">The refresh token and client secret move to a remote server that the agent can't SSH into.</strong></p><p class="line-height-copy measure-wide f5">The agent gets a simple API key to talk to the proxy.</p><p class="line-height-copy measure-wide f5">If that key leaks, the worst the agent can do is call the proxy API, which is exactly what it would do through the CLI anyway.</p><p class="line-height-copy measure-wide f5">The key is revocable and scoped.</p><p class="line-height-copy 
measure-wide f5"><strong class="font-bold">Each tool gets a dedicated proxy with a narrow API surface, exposing only the operations it actually needs.</strong></p><p class="line-height-copy measure-wide f5">The CLI still has a role: it encodes behaviour and workflows.</p><p class="line-height-copy measure-wide f5">The proxy holds credentials and enforces boundaries.</p><p class="line-height-copy measure-wide f5"><strong class="font-bold">Every workaround I tried was fighting the same thing: the agent runs as you, with the same UID, same permissions, same everything.</strong></p><p class="line-height-copy measure-wide f5">If the OS could tell the difference between &quot;you at the keyboard&quot; and &quot;agent acting on your behalf,&quot; most of these problems would go away.</p><p class="line-height-copy measure-wide f5">Security researchers have a name for this model (capability-based security, where each process gets only the specific permissions it needs), but no mainstream OS implements it.</p><p class="line-height-copy measure-wide f5">The Unix permission model ties everything to your user account, and the agent is your user account.</p><p class="line-height-copy measure-wide f5">Until that changes, I won't try to make the tool opaque.</p><p class="line-height-copy measure-wide f5">I will make it the only path to credentials and narrow the path.</p>]]></content>
  <author><name>Daniele Polencic</name></author>
</entry>
<entry>
  <title>Streaming Zod: How Tambo Actually Works</title>
  <link href="https://danielepolencic.com/streaming-zod-tambo"/>
  <id>https://danielepolencic.com/streaming-zod-tambo</id>
  <published>2026-02-21T00:00:00Z</published>
  <updated>2026-02-21T00:00:00Z</updated>
  <content type="html"><![CDATA[<p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://x.com/colinhacks/status/2021327454837801184" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Looks like they've hacked Zod to do validation on partial/streaming data. Very clever.</a></strong> by Colin Hacks (@colinhacks), creator of Zod.</p><p class="line-height-copy measure-wide f5">Colin tweeted about <a href="https://tambo.co" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Tambo</a>, a React toolkit for generative UIs that streams structured data from LLMs into React components. He claims they found a way to use Zod for validating partial, streaming data.</p><p class="line-height-copy measure-wide f5">My first thought was: how does that work?</p><p class="line-height-copy measure-wide f5">Streaming schema validation seems to require a significant change to how Zod operates.</p><p class="line-height-copy measure-wide f5">Maybe it would need something like a SAX-style JSON parser built into the schema?</p><p class="line-height-copy measure-wide f5">I looked through <a href="https://github.com/tambo-ai/tambo" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">the source code</a> and realized there was a way to simplify it.</p><p class="line-height-copy measure-wide f5">I also built a <a href="https://danielepolencic.github.io/tambo-demo/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">minimal demo</a> to show how it works.</p><div class="ph2"></div><p class="line-height-copy measure-wide f5">The real answer is actually simpler than I expected.</p><p class="line-height-copy measure-wide f5">Zod isn't used at all during streaming.</p><p class="line-height-copy measure-wide f5">It's only involved at the start, when schemas are converted to JSON Schema using <code class="font-mono f4 
line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">zod-to-json-schema</code>.</p><p class="line-height-copy measure-wide f5">After that, Zod isn't needed anymore.</p><p class="line-height-copy measure-wide f5">There are two ways streaming happens:</p><ol class="pl3 pl4-m pl4-l"><li class="line-height-copy f5 mv1 measure-wide">Tool call arguments: String deltas from the LLM are collected and parsed with the <a href="https://www.npmjs.com/package/partial-json" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 "><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code></a> library on every chunk. Then, a function called <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">unstrictify</code> removes null values in OpenAI's structured output mode, using the JSON Schema (not Zod).</li><li class="line-height-copy f5 mv1 measure-wide">Component props: The backend processes the LLM output and sends JSON Patch (RFC 6902) operations to the client using <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">fast-json-patch</code>. 
The client doesn't parse partial JSON for props.</li></ol><p class="line-height-copy measure-wide f5">So, there's no Zod hacking involved.</p><p class="line-height-copy measure-wide f5">The so-called &quot;streaming validation&quot; just uses <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> (which closes open brackets and quotes) and converts schemas to JSON Schema at the start.</p><p class="line-height-copy measure-wide f5">A more interesting question: why does Tambo use JSON Patch and a backend for component props when they already use <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> on the client for tool call arguments?</p><p class="line-height-copy measure-wide f5">The backend sits between the LLM and the client, finds completed key/value pairs, and sends clean patch operations.</p><p class="line-height-copy measure-wide f5">This avoids the problem of &quot;garbage intermediate keys,&quot; where <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> can give you cut-off keys like <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">{&quot;da&quot;: &quot;&quot;}</code> if the stream is mid-key.</p><p class="line-height-copy measure-wide f5">But could this logic just run on the client instead?</p><p class="line-height-copy measure-wide f5">I think it could.</p><p class="line-height-copy measure-wide f5">Using a backend is more of an architectural decision than a technical requirement.</p><p class="line-height-copy measure-wide f5">Tambo Cloud uses this setup as part of its revenue model, and the backend is already there for storing conversations and managing agents.</p><p class="line-height-copy measure-wide f5">Another question I had was: why use patches at all? 
Couldn't component props use the same <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> method as tool call arguments?</p><p class="line-height-copy measure-wide f5">It all boils down to this:</p><div class="mv4 mv5-l"><header class="bg-shadow-2 flex pv2 pl1 br--top rounded-2 position-relative"><div class="w1 h1 ml1 rounded-100 bg-dark-red"></div><div class="w1 h1 ml1 rounded-100 bg-green"></div><div class="w1 h1 ml1 rounded-100 bg-yellow"></div></header><pre class="code-light-theme position-relative overflow-auto mv0 rounded-2 br--bottom"><code class="font-mono line-height-copy"><span class="standard">Zod schema -> JSON Schema -> send to LLM as tool definition.
LLM streams JSON -> accumulate -> partial-json parse -> pass as props</span></code></pre></div><p class="line-height-copy measure-wide f5">The tricky part: <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> can't tell whether a value is still being written or has finished.</p><p class="line-height-copy measure-wide f5">For example, with a URL like <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">&quot;https://upload.wiki&quot;</code>, it just closes the string, leaving you with a broken URL that would cause a 404 if you tried to render it in an <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">&lt;img></code>.</p><table class="table w-100 f4 mv3 mv5-l"><thead><tr><th class="bg-tint-2 pa2 ba bw2 b--white ">Type</th><th class="bg-tint-2 pa2 ba bw2 b--white ">Streamable?</th><th class="bg-tint-2 pa2 ba bw2 b--white ">Why</th></tr></thead><tbody class="line-height-copy"><tr><td class="bb bw1 b--shadow-5 pa3 ">Long strings</td><td class="bb bw1 b--shadow-5 pa3 ">Yes</td><td class="bb bw1 b--shadow-5 pa3 ">Every prefix is renderable</td></tr><tr><td class="bb bw1 b--shadow-5 pa3 ">Short strings</td><td class="bb bw1 b--shadow-5 pa3 ">Mostly</td><td class="bb bw1 b--shadow-5 pa3 ">&quot;Ad&quot; is fine to show, it'll grow</td></tr><tr><td class="bb bw1 b--shadow-5 pa3 ">URLs</td><td class="bb bw1 b--shadow-5 pa3 ">No</td><td class="bb bw1 b--shadow-5 pa3 ">Useless until complete</td></tr><tr><td class="bb bw1 b--shadow-5 pa3 ">Numbers</td><td class="bb bw1 b--shadow-5 pa3 ">No</td><td class="bb bw1 b--shadow-5 pa3 "><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">12</code> vs <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">120</code> vs <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 
font-normal word-break">1200</code> — you can't know</td></tr><tr><td class="bb bw1 b--shadow-5 pa3 ">Booleans</td><td class="bb bw1 b--shadow-5 pa3 ">No</td><td class="bb bw1 b--shadow-5 pa3 "><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">tru</code> is not <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">true</code></td></tr><tr><td class="bb bw1 b--shadow-5 pa3 ">Enums</td><td class="bb bw1 b--shadow-5 pa3 ">No</td><td class="bb bw1 b--shadow-5 pa3 "><code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">&quot;pen&quot;</code> is not <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">&quot;pending&quot;</code></td></tr><tr><td class="bb bw1 b--shadow-5 pa3 ">Array elements</td><td class="bb bw1 b--shadow-5 pa3 ">Partially</td><td class="bb bw1 b--shadow-5 pa3 ">Settled elements are safe, last one is uncertain</td></tr></tbody></table><p class="line-height-copy measure-wide f5">This is the real unsolved problem: it's not about streaming validation, but about knowing when a streaming JSON value is &quot;complete enough&quot; to render.</p><p class="line-height-copy measure-wide f5">The Zod schema already shows you how to stream.</p><p class="line-height-copy measure-wide f5">The schema has enough type information to automatically figure out a streaming strategy. <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">z.string()</code> is streamable, show partial text as it arrives. <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">z.url()</code> becomes <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">{ format: &quot;uri&quot; }</code> in JSON Schema, atomic, wait until the value stabilises. 
<code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">z.number()</code>, <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">z.boolean()</code>, and <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">z.enum()</code> are all atomic. <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">z.array()</code> should emit all-but-last elements, holding back the trailing one which may be mid-stream.</p><p class="line-height-copy measure-wide f5">I made a <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">deriveStreamingStrategy(jsonSchema)</code> function that analyzes the converted JSON Schema and generates a strategy map for each field.</p><p class="line-height-copy measure-wide f5">Then, a <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">filterProps(partial)</code> function decides what gets sent to the component. Streamable fields pass through immediately. Atomic fields are held back until the value stops changing between consecutive parses, meaning the closing <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">&quot;</code> was seen. 
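</p><p class="line-height-copy measure-wide f5">As a rough sketch (my reconstruction, not the demo's exact code), the pair could look like this in plain JavaScript, assuming a JSON Schema produced by <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">zod-to-json-schema</code> and plain objects coming out of <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> on each chunk:</p>

```javascript
// Sketch only: names and heuristics are reconstructed, not Tambo's code.

// Map each top-level property of a JSON Schema to a streaming strategy.
function deriveStreamingStrategy(jsonSchema) {
  const strategy = {};
  for (const [key, prop] of Object.entries(jsonSchema.properties ?? {})) {
    if (prop.type === 'string' && !prop.format && !prop.enum) {
      strategy[key] = 'streamable'; // every prefix of a plain string is renderable
    } else if (prop.type === 'array') {
      strategy[key] = 'array'; // settled elements are safe, the last one is not
    } else {
      strategy[key] = 'atomic'; // URLs, numbers, booleans, enums: wait for the full value
    }
  }
  return strategy;
}

// Keep only the props that are safe to render right now.
// `current` and `previous` are consecutive partial-json parses of the stream.
function filterProps(strategy, current, previous = {}) {
  const safe = {};
  for (const [key, value] of Object.entries(current)) {
    switch (strategy[key]) {
      case 'streamable':
        safe[key] = value; // pass partial text straight through
        break;
      case 'array':
        if (Array.isArray(value) && value.length > 1) {
          safe[key] = value.slice(0, -1); // hold back the possibly mid-stream tail
        }
        break;
      case 'atomic':
        // Unchanged across two consecutive parses means the closing
        // delimiter was seen: the stream has moved past this value.
        if (key in previous && previous[key] === value) safe[key] = value;
        break;
    }
  }
  return safe;
}

// Example: a profile card schema, mid-stream.
const strategy = deriveStreamingStrategy({
  type: 'object',
  properties: {
    bio: { type: 'string' },
    avatar: { type: 'string', format: 'uri' },
    tags: { type: 'array', items: { type: 'string' } },
  },
});
const prev = { bio: 'Hi, I am Daniele', avatar: 'https://upload.wiki' };
const curr = { bio: 'Hi, I am Daniele', avatar: 'https://upload.wikimedia.org/a.png', tags: ['dev'] };
filterProps(strategy, curr, prev);
// → { bio: 'Hi, I am Daniele' }: the URL just changed, and the array has no settled elements yet
```

<p class="line-height-copy measure-wide f5">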
Array fields emit all elements except the last, which may be mid-stream.</p><p class="line-height-copy measure-wide f5">The render function becomes simple; it just displays whatever it gets.</p><p class="line-height-copy measure-wide f5">All the logic lives in the filter, which is based on the schema.</p><p class="line-height-copy measure-wide f5">A SAX-style JSON parser would handle this more cleanly.</p><p class="line-height-copy measure-wide f5">A SAX parser only triggers events when a JSON value is fully complete, like when a closing quote is seen or a number ends with a comma or bracket.</p><p class="line-height-copy measure-wide f5">The <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">filterProps</code> approach is a pragmatic workaround for using <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code>, which isn't a SAX parser.</p><p class="line-height-copy measure-wide f5">The ideal solution would be to use a SAX JSON parser, emit completed paths, validate each one against the schema as you go, and then emit a patch.</p><p class="line-height-copy measure-wide f5">That last step, validating the schema incrementally for each path, doesn't exist yet and would be a new feature.</p><p class="line-height-copy measure-wide f5"><a href="http://oboejs.com/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">oboe.js</a> was the best SAX JSON parser for JavaScript, but it's archived.</p><p class="line-height-copy measure-wide f5"><a href="https://www.npmjs.com/package/jsonriver" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">jsonriver</a> is the active alternative but lacks Oboe's pattern matching.</p><p class="line-height-copy measure-wide f5">I put together a <a href="https://danielepolencic.github.io/tambo-demo/" target="_blank" rel="noreferrer" class="link primary-10 
text-underline hover-primary-8 ">demo with Claude</a> that shows all this in action. It converts a Zod schema to JSON Schema, simulates LLM streaming character-by-character, and runs <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> on every chunk with a schema-derived strategy. The ProfileCard renders step by step: text streams in, URLs wait until they're complete, and tags show up only when they're ready.</p><p class="line-height-copy measure-wide f5">Just <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">zod</code> + <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">zod-to-json-schema</code> + <code class="font-mono f4 line-height-copy bg-shadow-2 rounded-2 pv1 ph2 font-normal word-break">partial-json</code> from the esm.sh CDN.</p>]]></content>
  <author><name>Daniele Polencic</name></author>
</entry>
<entry>
  <title>Software Is Cheap Now</title>
  <link href="https://danielepolencic.com/software-is-cheap-now"/>
  <id>https://danielepolencic.com/software-is-cheap-now</id>
  <published>2026-02-15T00:00:00Z</published>
  <updated>2026-02-15T00:00:00Z</updated>
  <content type="html"><![CDATA[<p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://x.com/thorstenball/status/2022310010391302259" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">I am the bottleneck now</a></strong> by Thorsten Ball (@thorstenball).</p><p class="line-height-copy measure-wide f5">Thorsten shared a story about receiving a bug report on Slack. He took a screenshot, uploaded it to Codex, and had the fix completed in 5 minutes. The code looked solid, all tests passed, and he pushed.</p><p class="line-height-copy measure-wide f5">Then he realised he was the bottleneck. The process could have gone directly from Slack to Codex to a review thread, without him in the middle.</p><p class="line-height-copy measure-wide f5">His point is that ticketing, triage, and sprints exist because human engineers are costly and have limited time. If that goes away, the whole process needs to change.</p><p class="line-height-copy measure-wide f5">I agree with the general idea, but saying &quot;I'm the bottleneck&quot; feels like an exaggeration.</p><p class="line-height-copy measure-wide f5">Even if the LLM eventually becomes smarter than me, which seems likely, it still lacks morals, taste, and real-world consequences.</p><p class="line-height-copy measure-wide f5">When a human ships a bad deployment, they worry about it afterward.</p><p class="line-height-copy measure-wide f5">You’re not really a bottleneck; you’re the only one in the process who faces the consequences.</p><p class="line-height-copy measure-wide f5">Humans will still be part of the process, maybe in a different role, but they’ll still be there.</p><div class="ph2"></div><p class="line-height-copy measure-wide f5">The economic point is the one I agree with most.</p><p class="line-height-copy measure-wide f5">For decades, the industry has focused on making the most of limited engineering time through practices like agile, sprint 
planning, velocity tracking, and story points.</p><p class="line-height-copy measure-wide f5">All of these methods assume that writing software is expensive. If that cost drops to almost nothing, we’ll need to rethink these approaches.</p><p class="line-height-copy measure-wide f5">Ironically, when I began my career, everyone was pushing for agile and criticizing waterfall methods.</p><p class="line-height-copy measure-wide f5">Writing specs before any code felt old-fashioned.</p><p class="line-height-copy measure-wide f5">How can you write a specification if you haven’t explored the problem by coding first?</p><p class="line-height-copy measure-wide f5">Agile just felt right.</p><p class="line-height-copy measure-wide f5">It also helped reduce risk: you write some code, make some progress, and then decide whether to roll back or keep going.</p><p class="line-height-copy measure-wide f5">With waterfall, you only find out about bad decisions at the end.</p><p class="line-height-copy measure-wide f5">But things are different now. The cost of writing software is approaching zero, and you can iterate much faster and discard versions easily, as Thorsten points out in the video.</p><p class="line-height-copy measure-wide f5">So, should we go back to writing more detailed app specs?</p><p class="line-height-copy measure-wide f5">I think yes.</p><p class="line-height-copy measure-wide f5"><a href="https://humanwhocodes.com/blog/2026/02/artifacts-ai-assisted-programming/" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Nicholas Zakas uses a similar approach</a>, spending a lot of time creating enterprise artifacts (PRDs, ADRs, TDDs) and updating them as the project changes. 
At first, I thought it was overkill, but now I’m rethinking that.</p><p class="line-height-copy measure-wide f5">Picture this instead.</p><p class="line-height-copy measure-wide f5">You write the specification, and the model builds the app.</p><p class="line-height-copy measure-wide f5">But the app isn’t quite what you wanted.</p><p class="line-height-copy measure-wide f5">Instead of fixing the app, you update the spec, adjust the architecture, and rebuild it from scratch.</p><p class="line-height-copy measure-wide f5">If you think of the model as a compiler, then the input is all that really matters.</p><p class="line-height-copy measure-wide f5">In three years, I probably won’t care about NPM dependencies or which framework version I used.</p><p class="line-height-copy measure-wide f5">All I’ll need is my old specification to rebuild everything.</p><p class="line-height-copy measure-wide f5">Writing detailed specs up front and keeping them up to date actually makes sense.</p><p class="line-height-copy measure-wide f5">It could end up being the only thing that really lasts.</p><p class="line-height-copy measure-wide f5">The code is disposable. 
The specification is the product.</p><p class="line-height-copy measure-wide f5">But what form should the specification take?</p><p class="line-height-copy measure-wide f5"><a href="https://www.linkedin.com/posts/jonesax_ive-been-asked-a-few-times-recently-whether-share-7427239408459886592-tJwy" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Alex Jones, a Principal Engineer at AWS and creator of K8sGPT, argues that we need more than the usual verifiers.</a></p><p class="line-height-copy measure-wide f5">As agentic systems become more autonomous, we’ll need invariants, constraints, and properties supported by mathematical proofs.</p><p class="line-height-copy measure-wide f5">It’s not enough to just have observability or policies.</p><p class="line-height-copy measure-wide f5">He calls this idea &quot;Provable Autonomy.&quot;</p><p class="line-height-copy measure-wide f5">There's a whole world of formal specification languages: TLA+, Event-B, Dafny, Alloy, Gherkin. But from what I've seen, specs without verification sit in an unstable middle.</p><p class="line-height-copy measure-wide f5">They’re structured enough to be a maintenance burden.</p><p class="line-height-copy measure-wide f5">But they’re not verified enough to guarantee anything, so you end up with text that the model might just ignore.</p>]]></content>
  <author><name>Daniele Polencic</name></author>
</entry>
<entry>
  <title>Why Talking to LLMs Has Improved My Thinking</title>
  <link href="https://danielepolencic.com/llms-improve-thinking"/>
  <id>https://danielepolencic.com/llms-improve-thinking</id>
  <published>2026-02-10T00:00:00Z</published>
  <updated>2026-02-10T00:00:00Z</updated>
  <content type="html"><![CDATA[<p class="line-height-copy measure-wide f5"><strong class="font-bold"><a href="https://philipotoole.com/why-talking-to-llms-has-improved-my-thinking" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">Why Talking to LLMs Has Improved My Thinking</a></strong> by Philip O'Toole, creator of rqlite (<a href="https://news.ycombinator.com/item?id=46728197" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">via HN</a>).</p><p class="line-height-copy measure-wide f5">Philip's thesis: LLMs help articulate tacit knowledge, the understanding we have but can't easily put into words. This isn't learning new things, it's recognition: mapping latent structure to language.</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">As programmers and developers, we build up a lot of understanding that never quite becomes explicit. This is not a failure. It is how experience operates. The brain compresses experience into patterns that are efficient for action, not for speech. Those patterns are real, but they are not stored in sentences.</p></blockquote><p class="line-height-copy measure-wide f5">This resonates. I already have the knowledge to solve most problems I encounter, I just can't always articulate the path. The LLM helps me find the words for what I already know.</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">The problem is that reflection, planning, and teaching all require language. If you cannot express an idea, you cannot easily inspect it or improve it.</p></blockquote><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">Once an idea is written down, it becomes easier to work with. Vague intuitions turn into named distinctions. Implicit assumptions become visible. 
At that point you can test them, negate them, or refine them.</p></blockquote><p class="line-height-copy measure-wide f5">The other thing I've noticed: even when the LLM gets it wrong, the reaction from being wrong helps distill the idea. You read its response and think &quot;no, that's not quite it&quot;—and suddenly you know what <em class="font-italic">it</em> actually is.</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">This is not new. Writing has always done this for me. What is different is the speed.</p></blockquote><p class="line-height-copy measure-wide f5">Exactly. Writing has always been my tool for thinking, but it's slow. With an LLM, the loop between &quot;I vaguely know this&quot; and &quot;now I can express it clearly&quot; tightens dramatically.</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">It is improving the interface between my thinking and language. Since reasoning depends heavily on what one can represent explicitly, that improvement can feel like a real increase in clarity.</p></blockquote><p class="line-height-copy measure-wide f5">I hadn't paid attention to this framing before—the LLM as an interface improvement, not a knowledge source.</p><div class="ph2"></div><p class="line-height-copy measure-wide f5">From the HN discussion, <a href="https://news.ycombinator.com/item?id=46728197" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">firefoxd</a> pushes back:</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">Not to dismiss other people's experience, but thinking improves thinking. People tend to forget that you can ask yourself questions and try to answer them. There is such thing as recursive thinking where you end up with a new thought you didn't have before you started. 
Don't dismiss this superpower you have in your own head.</p></blockquote><p class="line-height-copy measure-wide f5">And <a href="https://news.ycombinator.com/item?id=46728197" target="_blank" rel="noreferrer" class="link primary-10 text-underline hover-primary-8 ">john01dav</a> adds:</p><blockquote class="pl3 mh2 bl bw2 b--primary-6 bg-primary-0 pv1 ph4"><p class="line-height-copy measure-wide f5">In my experience LLMs offer two advantages over private thinking: 1. They have access to a vast array of extremely well indexed knowledge and can tell me about things that I'd never have found before. 2. They are able to respond instantly and engagingly, while working on any topic, which helps fight fatigue, at least for me.</p></blockquote><p class="line-height-copy measure-wide f5">I think the combination is a killer—it helps me introspect my thoughts <em class="font-italic">and</em> offers extra knowledge. That's why writing books or articles is so much easier now.</p><p class="line-height-copy measure-wide f5">Writing isn't about finding the words. It's about the journey and exploration and asking questions. Writing becomes documenting and curating the journey.</p><p class="line-height-copy measure-wide f5">For the reader: they can also get this info from the docs or an LLM, but it's the journey that's missing.</p><div class="ph2"></div><p class="line-height-copy measure-wide f5">firefoxd's point about &quot;recursive thinking&quot; is valid but misses something: the LLM provides <strong class="font-bold">resistance</strong>.</p><p class="line-height-copy measure-wide f5">Thinking alone can loop. An external response, even an imperfect one, creates friction that forces your thoughts into new shapes.</p><p class="line-height-copy measure-wide f5">It's the same reason rubber ducking works in software development. The solution is usually already inside you. You just need to externalise it. The rubber duck doesn't solve your problem; the act of explaining does. 
LLMs take this further: they're a rubber duck that talks back, occasionally pushes you in a new direction, and never gets tired of listening.</p>]]></content>
  <author><name>Daniele Polencic</name></author>
</entry>
</feed>