Sure, but couldn't you say the same for letting other people contribute code too? In either case, you make the choice of how deeply you want to review it. You can ask the AI or the human to explain things that aren't clear.
For me it's case by case in either scenario. Sometimes it's not that important to look closely at a specific subsystem that's self-contained or just simple, other times I need to carefully audit whatever touches a different system. You need a good sense of the existing codebase/architecture in the first place to make these determinations.
What I've noticed reviewing all my colleagues' AI generated code PRs is: it really is just code, and the rare comment here and there is still added by the human.
We're already trying to light tokens on fire as fast as possible to stay on acceptable required use leaderboards, why not light some more for system understanding and housekeeping.
That is: they either reiterate what the code does, or would if the code were slightly clearer, or they tell half truths that are more confusing than helpful. Mostly they fail to emphasise the salient things, like the why over the what, that are not obvious from the code.
"Being reduced to an inconsequential middle manager is more exhausting than being reduced to a code monkey" is the hot insight I've been hanging my hat on.
To be clear: This is really horrible in an IC-for-the-paycheck role. I quit my job on principle because of code/token maxing. Very few are in a position to do this. I've been enjoying AI as an independent, but I still mean to fight the good fight for every line engineer.
On all of my projects I now keep an AI journal that stands as a ledger for every change the AI has made, and why it was made. I don’t read it that hard personally because I spend so much time planning with my agent before letting it code. However, I have found it very useful for sharing code between people, or for having Claude look through the journal to gain context when modifying or adding a feature.
There seems to be a big push for 0 reviewing, which is insane.
Even for most small changes I will ask it to do a simple "production review", then I use my experience and judgment to decide which items actually need to be addressed.
The same output that is such a bad thing in this article can also be used to gain context, by making a thorough plan with your ai first, reading through the plan and proposing changes just like you would with a real developer.
You can also use this output to have the AI write a journal. The journal can be as detailed as you like, essentially a ledger of all of the changes your AI has made to the code. This not only gives teammates reviewing your PR greater context, but can also be used by you, or even by the AI itself, to figure out why a particular implementation was done the way it was, even far into the future.
Lastly how many of us ever deploy code without actually checking the feature works e2e? I would gather not many of us do, I don’t, because even though we may have a greater understanding of the code, we can make mistakes in the code or in our logic. And I keep coming back to why would we treat llms any differently? I believe we should be spending our energy thoroughly manually testing a feature to make sure when we brainstormed we actually did get every edge case, and it works well.
I did one small side web project by only writing spec tests and prompts and testing the results in a browser, never reading nor editing a single line of generated code. It was something for home and so low stakes, but it worked remarkably well and was much better tested than the typical 2022-era home project of mine.
Honestly I don’t even write tests manually because of coverage checks. Being that the coverage check is not something easily manipulated, I always tell the ai, don’t ever change configs, and make the coverage pass whatever I set it to, most times > 95%. I just tell the AI, make this coverage pass.
I find tremendous success with this technique, or anytime really I can find an objective way for the ai to test its work.
If you don’t read the tests to check that they match your intent or specifications, they’re more like tautologies than tests, you know?
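To make the "tautology" worry concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `apply_discount` is an invented function with a deliberate bug, and the two checks are illustrations, not anyone's real test suite. The first check runs every line of the function, so a line-coverage gate would be satisfied, yet it only restates the implementation. The second states the requirement independently and catches the bug.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount."""
    # Deliberate bug for illustration: the discount is added instead of subtracted.
    return price_cents + price_cents * percent // 100


def tautological_test() -> None:
    # Executes every line of apply_discount, so line coverage is 100%,
    # but the expected value is derived from the implementation itself.
    expected = 1000 + 1000 * 20 // 100
    assert apply_discount(1000, 20) == expected


def intent_test() -> None:
    # Encodes the requirement independently: 20% off 1000 cents is 800 cents.
    assert apply_discount(1000, 20) == 800
```

Under these assumptions `tautological_test()` passes while `intent_test()` raises `AssertionError`, which is the gap a coverage number alone cannot show.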
I wrote the tests. That was how I expressed the spec and my intentions.
(NOT a lawyer)
Previously, liability and indemnification could be bureaucratically laundered to "engineers", because it was a huge diffuse set of people.
Now whoever sits at the top of the chain and authorized the LLMs is left holding the bag. Jia Tan went the hard way with xz. LLM-trolling is the new social engineering.
You are missing another dimension: how easy it would be to migrate if adding a new feature hits a ceiling and the LLM keeps breaking the system.
Imagine all tests are passing and the code conforms to the spec, but everything is denormalized because the LLM thought this was a nice idea at the beginning, since no one mentioned that requirement in the spec. After a while you want to add a feature which requires normalized tables and the LLM keeps failing, but you also have no idea how this complex system works.
Don't forget that a very, very detailed spec is actually the code.
The tests, sure. But certainly not the code itself, as that sits far too close to the implementation (i.e. it is the implementation). An almost infinite number of implementations can fulfill “does foo when bar”, so how can we prove that ours is the spec itself?
It’s kind of like a scientist coming up with a hypothesis post-hoc to fit the results of the experiment.
A more complete spec will capture performance requirements, input preconditions and output postconditions, error handling and recovery behaviors, threading behaviours, hardware assumptions, etc. It's hard to do these things without leaning at least somewhat on the specific language runtime you are using, otherwise you'd end up regurgitating the C standard each time you design a software system.
It's this sort of stuff that is meant when people say "sufficiently detailed".
If you're actually testing all these things, then I might agree with you that you can do it in the tests, but almost no one actually is. I'd struggle to write a test suite that tests all the specification-level assumptions I draw from my language and target platforms.
Also, the quality of tests in projects is often so-so in general, and that's reflected in the output of LLMs even more so.
Came here to say this, but you said it for me. If the problem were merely one of insufficient rigour or detail in specs, it would have been solved long before LLMs.
In the age of AI this is more true than you know. Given a detailed enough spec and test suite you can effectively rewrite any application with any language in a fully automated way.
I've coined that as "Duck coding" :D If it quacks like a duck, walks like a duck and looks like a duck - it's duck enough as far as the spec is concerned. Does it matter what is inside the duck?
Right now we are just starting out with vibe-coded software; nobody knows how it will behave in 2 years or 5 or 10. My guess is it won't. So we will enter the age of scratch software. You build it, ship it. And after a few months you will ship an entirely new one. And then again. And again. And again. Because maintenance is hard and costly, and writing from scratch will cost like $1k in tokens.
And users will have the problem of migrating their data, if that is possible at all. But if migration is hard and everything changes all the time, does it even matter whether you are using X or Y software? Does it even matter, since you can write your own software and migrate your data there?
I think we saw how this ends with Chinese manufacturing. You buy some stuff from AliExpress for $2 and throw it away in two weeks and buy a new one. So quality does not matter anymore.
“The LLMs produce non-deterministic output and generate code much faster than we can read it, so we can’t seriously expect to effectively review, understand, and approve every diff anymore. But that doesn’t necessarily mean we stop being rigorous, it could mean we should move rigor elsewhere.“
Direct reports, when delegated tasks by managers, produce non-deterministic outputs much faster than team leads/managers can review, understand or approve every diff. Being a manager of software developers has always been a non-deterministic form of software engineering.
https://simonwillison.net/2026/May/6/vibe-coding-and-agentic...
“The thing that really helps me is thinking back to when I’ve worked at larger organizations where I’ve been an engineering manager. Other teams are building software that my team depends on.
If another team hands over something and says, “hey, this is the image resize service, here’s how to use it to resize your images”... I’m not going to go and read every line of code that they wrote.
I’m going to look at their documentation and I’m going to use it to resize some images. And then I’m going to start shipping my own features. And if I start running into problems where the image resizer thing appears to have bugs or the performance isn’t good, that’s when I might dig into their Git repositories and see what’s going on. But for the most part I treat that as a semi-black box that I don’t look at until I need to.”
Let's say for example it caches on something stupid like the CRC32 of the input image -- good enough that the couple dozen images in your test dataset don't collide, you don't see it in smoke testing your app, but real world data has collisions on a daily basis.
This gets into production and customer A sees a resized version of customer B's document for a thumbnail. Now customer A is wondering how many other customers are seeing resized versions of their private documents in thumbnail images. They are very very mad.
If the image resize service was built by "another team" then that other team is responsible for the bug and will take most of the heat for it. If it was built by an "agent swarm" or "gas town" or whatever under my direction then I'm 100% responsible for it and rightly deserve the heat.
That is why I cannot understand any approach that doesn't involve reading the code at all. Testing alone is not sufficient. MTTR is not sufficient because you can't make a customer less mad about a data privacy bug by fixing it.
1. You can treat software like a black box when other people developed it for you because they can stand behind it. They have their own reputations to uphold. You can't when AI developed it for you because YOU are responsible for 100% of the bugs in it. If you take this trendy stance of "I never read or write code, just specs", you are just rolling the dice on what you stamp your name on.
2. Just because you have unit tests and you've tested the software by clicking through the app doesn't mean you've found every bug. There have always been bug types, like the example checksum collision, that are easier to detect by reading the code than by running the code because it will work most of the time even though the approach is wrong.
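The checksum example can be put in numbers with a rough sketch. It uses the standard birthday-bound approximation and assumes hash values are uniformly distributed; the function names and the `tenant_id` parameter are invented for illustration. The "safer" key is one possible alternative, not a claim about what any real service does.

```python
import hashlib
import math
import zlib


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: chance that at least two of num_items
    distinct inputs share a hash value, assuming uniformly distributed hashes."""
    pairs = num_items * (num_items - 1) / 2
    return -math.expm1(-pairs / 2.0 ** hash_bits)


def weak_cache_key(image_bytes: bytes) -> str:
    # The risky scheme from the example: a 32-bit checksum as the sole cache key.
    return f"{zlib.crc32(image_bytes):08x}"


def safer_cache_key(image_bytes: bytes, tenant_id: str) -> str:
    # One possible alternative (an assumption, not a prescription): a 256-bit
    # digest, namespaced per tenant so a key can never be shared across customers.
    return f"{tenant_id}:{hashlib.sha256(image_bytes).hexdigest()}"


# A couple dozen test images, a mid-sized dataset, and a production-sized one.
for n in (24, 77_163, 1_000_000):
    print(n, f"{collision_probability(n, 32):.6f}")
```

With 32 bits, a couple dozen test images essentially never collide, around 77,000 items already give roughly even odds of at least one collision, and a million items make one a near certainty. That is why a smoke test passes while production data does not.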
> There have always been bug types, like the example checksum collision, that are easier to detect by reading the code than by running the code
AI seems radically, insanely more qualified to not write bugs like that. I doubt that if you polled developers 99% would be able to tell you what a CRC32 even is, let alone why it's insufficient as a cache key.
The original example from Simon Willison referred not to pulling in a 3rd party library, but working "at larger organizations" where "another team hands over something". In other words we are all working on the same product for the same company, they have been assigned another part of it and I'm expected to use their code.
In that scenario of course I care that someone else is responsible! It may affect whether I get fired or not!
It's different if you're a solo founder of a startup and for everything you ship, the buck stops with you. But proportionally many many more devs are in a situation where they are a cog in a machine.
> AI seems radically, insanely more qualified to not write bugs like that. I doubt that if you polled developers 99% would be able to tell you what a CRC32 even is, let alone why it's insufficient as a cache key.
I actually do agree that AI generally writes pretty good code. Doesn't mean I'm not gonna check. Sometimes it is too clever for its own good, such as re-implementing from scratch something that already exists and is well-proven.
The whole example is kind of contrived in the first place (how many environments don't have an excellent "image resizing" solution to reach for off the shelf?), so I hope you don't mind my bug example is also contrived.
I've personally had a LLM write an image resizing library for me. It's a fairly basic one, I didn't need anything fancy. I could have used something off the shelf but it was at a time when I was testing what Claude could do. And to be honest, it just worked. One shot, if I recall correctly, or at least, one session with a few tweaks and never touched again. It's been embedded in a larger app for several months and I don't recall hitting a single bug with that, specifically. So I'm not sure your complaints about "the 5th iteration" being broken have much grounds here.
> one session with a few tweaks and never touched again
> and I don't recall hitting a single bug with that, specifically.
And there you got your answer. If every scenario was as simple as that, we wouldn't really need software development teams. I'm not saying that you can't get good results with an LLM tool, but most software is in constant flux and software engineering is about keeping the cost of making new changes minimal.
So if you have a dependency, you want to treat it as a black box, because it lowers the cognitive load. But you don't want it to suddenly change its contract, including breaking it in some strange way. And that brings me to...
> That's why you have them write tons of tests.
Tests are not an implementation guarantee. They are a canary to warn about some errors. You assume the code is going to be written in good faith, but you place alert points to warn you about possible mistakes. Because you can't really test the full implementation without having a brittle test suite (which you have to maintain).
And tests rely on a lot of assumptions (mocks, initial cases, fakes,...). Those should be treated with care. Because as soon as one is wrong, the test cases it affects are make-believe.
The only true testing of your software is done in production. Everything else is about avoiding the easy mistakes.
Now I have been on HN long enough to know that we used to despise code written by contractors which we now depend on.
The single person who did the service might just quit and go to another job. They might be external consultants that rotate away when the contract ends. It might be a SaaS service where you don't control the code at all - nor the composition of their team.
We have trusted services, contractors and teams within our companies before. Now suddenly _everyone_ has ALWAYS read and meticulously analyzed every single line of code they have ever imported to a project?
LLMs have no reputation to lose. Their work may or may not be aligned with your goals and they can’t care if they messed up.
I disagree. Being a manager of programmers requires that you trust your programmers and have some way of occasionally verifying the correctness and efficacy of what they build to make sure that that trust is still properly placed.
But on top of that, the user of an LLM isn't really akin to a manager of programmers. Human programmers are responsible for what they write [0], and even the ones that only cost ~50% of a senior's total comp [1] are still going to be able to fairly reliably explain to you why they made the decisions they did, and fairly reliably be able to follow instruction. LLMs just aren't there yet, and the major LLM providers may never care to get them there.
A programmer who's using LLMs is a programmer who's using LLMs... not a manager of other programmers. I'm not going to say that the tech will never advance to that point, but it's simply not there yet.
[0] Unless management decides otherwise, of course.
[1] Nvidia's CEO recently mentioned that he'd be "deeply alarmed" if senior staff aren't spending at least half of their total compensation [2] on LLM providers, so I'm going to use that as my benchmark for "expected annual LLM spend".
[2] ...meaning that each senior programmer costs their employer at least 50% more than their total compensation....
Unless the manager is also a principal/architect, I don’t find this to be agreeable.
It’s similar to saying that you are a non-deterministic chef when you order food from a restaurant.
These things just can't be in the critical path. They are ridiculously unreliable.
You can tell a human (an IC) the same things. You can then tell another human (a manager) "hey, check that IC A did these things". So far, these are the same. But there's now a critical difference: you can then hold those people accountable if they don't. They can be in the critical path. LLMs can't. People can improve. LLMs can't. People can work together. LLMs can't.
This doesn't always matter. You don't need things like accountability or improvement or teamwork all the time. But you do in reliable software.
This might be fine if you're building a tiny app, or if you're building a medium-sized app that follows a strict existing architecture (like a web app consisting mostly of forms). In which case, have fun.
But if you're building something slightly novel and interesting, then Claude is surprisingly bad at architecture and taste, and it tends to "fix" problems by spewing more slop. What you need instead is actual insight that leads to simplifying principles. This, in turn, allows breaking up the exponential complexity into disciplined patterns. This allows your code complexity to scale far more slowly, allowing an essentially linear number of tests to provide coverage.
I actually download and try people's vibe-coded developer tools. And frankly, those tools are some of the worst software I've used in my life, worse than even Unix-vendor Motif implementations from the early 90s.
Like, I'm super happy that people can vibe-code themselves simple, one-off personal tools. That's incredibly empowering. But that doesn't mean you can build big, novel stuff the same way without a competent human actively in the loop.
Is the code bad, or do they not do what they claim to do? Those are very different issues.
And I want to be clear that this isn't some non-technical novice vibe coding this garbage. This is often extremely talented developers with decades of experience who have apparently decided that they don't need to look at their code anymore.
You can get very good results out of AI agents. But mostly the people who get good results are the ones who still read the LLM output in detail, and who introduce the structure the LLMs are missing. But like I said, this distinction mostly becomes apparent past a certain size and novelty level.
I must be an anomaly because all of the vibe coded apps I'm running 24/7 don't keep crashing or suddenly stop working.
The constant urge I have today is for some sort of spec or simpler facts to be continuously verified at any point in the development process; Something agents would need to be aware of. I agree with the blog and think it's going to become a team sport to manage these requirements. I'm going to try this out by evolving my open source tool [1] (used to review specs and code) into a bit more of a collaborative & integrated plane for product specs/facts - https://plannotator.ai/workspaces/
I also tend to find especially that there's a lot of cruft in human written spec languages - which makes them overly verbose once you really get into the details of how all of this works, so you could chop a lot of that out with a good spec language
I nominate that we call this completely novel, evolving discipline: 'programming'
In ancient times we had tech to do exactly that: Programming languages and tests.
That's theorem provers and they're awful for anything of any reasonable complexity.
Shame us all for moving away from something so perfect, precise, and that "doesn't have edge cases."
Hey - if you invent a programming language that can be used in such a way and creates guaranteed deterministic behavior based on expressed desires as simple as natural language - I'll pay a $200/m subscription for it.
Another agent: Code -> Inverted Spec
then compare Spec and Inverted Spec.
If there is a Gap, a Human fixes and clarifies the Gap.
This is like the Generator and Discriminator in GAN models, or like Autoencoder models.
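A minimal sketch of the comparison step, in Python. It assumes both the original spec and the "inverted" spec have already been normalized into comparable requirement statements, which is the hard part and is not solved here. The names `SpecGap` and `compare_specs` and the sample requirements are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecGap:
    missing_from_code: frozenset[str]    # stated in the spec, not recovered from the code
    unrequested_in_code: frozenset[str]  # recovered from the code, never stated in the spec

    def needs_human(self) -> bool:
        # Any difference in either direction is handed to a person to clarify.
        return bool(self.missing_from_code or self.unrequested_in_code)


def compare_specs(spec: set[str], inverted_spec: set[str]) -> SpecGap:
    """Compare the original spec with the spec derived back from the code."""
    return SpecGap(
        missing_from_code=frozenset(spec - inverted_spec),
        unrequested_in_code=frozenset(inverted_spec - spec),
    )


# Hypothetical requirement statements, for illustration only.
spec = {"resize preserves aspect ratio", "rejects files over 10 MB"}
inverted = {"resize preserves aspect ratio", "caches results by checksum"}
gap = compare_specs(spec, inverted)
print(sorted(gap.missing_from_code), sorted(gap.unrequested_in_code))
```

The second bucket is arguably the more interesting one: behaviour the code has that nobody asked for is exactly what a spec-only review would never surface.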
One of my biggest fears with using AI at work is that I will subconsciously start talking and writing like a bot, despite making conscious efforts to do the opposite. Just like how when you read a lot of books by one author, their style infects your own writing style.
Kidding, nah no worries. I do worry people become overly paranoid of bots as time passes.
It all failed. For a simple reason, popularized by Joel Spolsky: if you want to create a specification that describes precisely what software is doing and how it is doing its job, then, well, you need to write that damn program using MS Word or Markdown, which is neither practical nor easy.
The new buzzword is "spec driven development", maybe it will work this time, but I would not bet on that right now.
BTW: when we get to that point, it no longer makes sense to generate code in the programming languages we have today; an LLM can simply generate binaries or at least some AST that will be directly translated to binary. In this way LISP would, eventually, take over the world!
In the new world of mostly-AI code that is mostly not going to be properly reviewed or understood by humans, one potential answer is an increasingly robust manifestation, enforcement, and regeneration of the specs via the coding harness configuration, combined with good old-fashioned deterministic checks.
Taken to an extreme, the code doesn’t matter, it’s just another artifact generated by the specs, made manifest through the coding harness configuration and CI. If cost didn’t matter, you could re-generate code from scratch every time the specs/config change, and treat the specs/config as the new thing that you need to understand and maintain.
“Clean room code generation-compiler-thing.”
The critical insight is that this is not true. When people depend on your software, replacing it with an entirely different program satisfying all of your specs and configurations is a large, months-long project requiring substantial effort and coordination even after the new program is written. It seems to work in vibe coded side projects because you don't have those dependencies; if you got an angry email from a CEO saying that moving a critical button ruined their monthly review cycle, and demanding 7 days notice before you move any buttons going forwards, you'd just tell them no.
This makes sense, since a given piece of higher-level code deterministically produces a given piece of lower-level code, while an LLM offers no such guarantee. If the transpiled JS code doesn't work, we can track the bug down in the minifier or transpiler, but one cannot figure out why an LLM fails at a task, especially considering that LLMs, even SOTA ones, can be strongly affected by even small prompt changes. Taking this into consideration, I don't think this is sound reasoning for why we don't need to review AI-generated code.
> The LLMs produce non-deterministic output and generate code much faster than we can read it, so we can’t seriously expect to effectively review, understand, and approve every diff anymore.
Exactly. However, this could also argue for a lighter review standard rather than dropping review altogether. For example, devs could mainly review code design and interfaces, leveraging their *taste*, while leaving strict logical reasoning, validation, and testing to other tools or approaches. I am not persuaded that the nature of LLM code generation must lead to abandoning code review completely.
Anyway, I'm not opposing this article, and its thinking about where things shift in the future is really good.
I'm seeing in my experience that Claude has become better with every version at producing uniformity in its code output. Especially where the architecture is clear and documented. And even more so in languages with built in uniformity (Go, HTMX, SQL) where there is intentionally only one or two ways of doing things. In such environments, the output is nearly deterministic.
Is it? All the electricity and capital investment in computing hardware costs real money. Is this properly reflected in the fees that AI companies charge or is venture capital propping each one up in the hope that they will kill off the competition before they run out of (usually other people's) money?
This is too weird for me. At least with programming languages I can consult the documentation and if the programming language isn’t behaving as documented, it’s obviously a defect and if you’re savvy enough you often have open channels that accept contributions. Can we say the same for Claude or other AI solutions?
how can a local LLM with an open source agent harness provide the same trustworthiness?
I recall working on a project that used (MSVC) VC++ and a coworker found a bug in the compiler. We reported the issue to Microsoft and they eventually patched it.
You may find yourself arguing explicitly for open source dev tools if you continue down this line. There are many commercial cases where "you can fix it" does not apply to the dev toolchain and you will find yourself reliant on a provider. At that point, the trustworthiness of "compiler provider" and "local LLM provider" is the pertinent discussion (e.g. provider vs. provider instead of LLM vs compiler).
That’s only on the hobbyist level. On the enterprise level, there are lots of contracts involved that require speedy bug fixes.
well sure, of course i would :) but ig i meant more so "can be fixed" in a way it can't with llms, open source or not
So something which must be true if this author is right is that whatever the new language is (the thing people are typing into markdown) must be able to express the same rigor in fewer words than existing source code.
Otherwise the result is just legacy coding in a new programming language.
And this is why starting with COBOL and through various implementations of CASE tools, "software through pictures" or flowcharts or UML, etc, which were supposed to let business SMEs write software without needing programmers, have all failed to achieve that goal.
I think it's an open question of whether we achieve the holy grail language as the submission describes. My guess is that we inch towards the submission's direction, even if we never achieve it. It won't surprise me if new languages take LLMs into account just like some languages now take the IDE experience into account.
Yes but also no. Writing source means rigorously specifying the implementation itself in deep detail. Most of the time, the implementation does not need to be specified with this sort of rigor. Instead the observable behavior needs to be specified rigorously.
Certainly you could write a specification for a piece of software, and the software could meet the specification while also leaking credentials. Obviously, that would be a problem. But at some point, this starts to feel artificial and silly. The same software could reformat your hard disk, right?
At some point, we aren’t discussing whether or not AI is doing a bad job writing software. We’re discussing whether or not it’s actively malicious.
Memory leaks, deleting the hard drive, spending money would all be observable behavior.
By your reasoning that the "observable behavior needs to be specified rigorously" it seems like you'd have to list these all out. We do, after all, already have cases of AI deleting data.
That sounds harder and more error prone than what we're doing now by rigorously defining these defects out of existence in code.
The entire reason we have functions and components and modules etc is to isolate engineers from the things we do not need to care about. I should not need to care about the implementation details of most software, only if it meets my requirements.
The move to AI first software development will not happen because we find a way to specify as much in English as we previously would have specified in a programming language. The move will happen when and as we figure out how to specify the things that matter. We don’t need the same rigor. We need the correct rigor.
The only reason those details don’t matter to you is because someone has gone through the pain of ironing out every detail that has not made it into the specifications. On one side you have the platform and on the other side you have the interface contract (requirements). Saying what’s in the middle doesn’t matter is strange. Because both the platform and the interface are dynamic and can shift drastically from their 1.0 version.
I disagree. Very often the reason the details don’t matter is that they are irrelevant. There are a million ways an app might remember my personal settings, as a simple example. SQLite db, json file, ini, cloud storage, registry, etc. The specific implementation matters very little so long as it’s sane.
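The settings example can be sketched as one observable contract with two interchangeable backends. This is a minimal Python illustration using only the standard library; `SettingsStore`, `JsonFileSettings`, `SqliteSettings`, and `check_contract` are invented names, and the contract covers only get/set behaviour, not consistency, availability, or security properties.

```python
import json
import sqlite3
from pathlib import Path
from typing import Optional, Protocol


class SettingsStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class JsonFileSettings:
    """Hypothetical backend: all settings in a single JSON file."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def _load(self) -> dict[str, str]:
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text(encoding="utf-8"))

    def get(self, key: str) -> Optional[str]:
        return self._load().get(key)

    def set(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        self._path.write_text(json.dumps(data), encoding="utf-8")


class SqliteSettings:
    """Hypothetical backend: one row per setting in a SQLite table."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM settings WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def set(self, key: str, value: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                (key, value),
            )


def check_contract(store: SettingsStore) -> None:
    """The observable behaviour a caller relies on, independent of the backend."""
    assert store.get("theme") is None
    store.set("theme", "dark")
    store.set("theme", "light")
    assert store.get("theme") == "light"
```

Both backends pass `check_contract`, which is the sense in which the implementation "matters very little" to a caller. Whether the properties the contract leaves out also matter is the part of the disagreement this sketch does not settle.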
> Saying what’s in the middle doesn’t matter is strange
I understand your point but do not agree. I think over the next decade we will get increasingly good at specifying rigorously the parts of the surface that matter while increasingly caring less and less about the rest. We will not find a way to write rigorous code in English because that would necessarily be less efficient than just using a programming language.
It may not matter if you’re just an end user. But if you’re the one deciding the tools to be used, you may wonder about consistency (sqlite is better than a json file or ini file), availability (local storage is better than a cloud service), security risks,… Trusting an LLM to take care of that looks like negligence to me.
You are of course entitled to hold your opinion. How to work with LLMs successfully will be determined by those who believe it’s possible rather than those who argue it’s fundamentally negligent, though.
The arguments against AI coding have rapidly evolved from “it’s not possible” to “it’s possible but breaks down as soon as the system gets complex” to this “it works but it’s negligent” argument. The industry will continue to move on.
> Coding may be abstract, but execution of the resulting program is not. And results of the execution is driven by real world needs. Truth is that a human can invent things because it can pattern match across whole domains. You can say there is a mechanic solution to that, how can we do an algorithm that have the same result. AI cannot unless the algorithm was already created. I think the current state of AI is great for searching and creating starting point, but it can never get us to the finish line.
I've not seen anything since then that has changed my point of view. My job as a developer is always about creating pragmatic solutions for problems that exist outside of the computer world. I'm not attached to code and will gladly rewrite it if it's lacking or faulty. But the actual purpose is to get something that works well in the hands of the user. But the user's needs are not static, so I also create something that is flexible enough to be adapted later when those needs change.
So when I read comments that says they don't care about code, but also have no answer about how they will solve their user's problems or how will they modify the software to future changes, it seems so strange to me. Like is your belief backed by real world experience?
As a software engineer, I prefer the scenario where AI is incompetent at writing effective software. But I don’t think that scenario describes the real world.
> AI cannot unless the algorithm was already created
We have documented cases of AI producing novel mathematical proofs. The narrative that AI can only regurgitate existing info has been pretty soundly disproven at this point. I would not consider it a good thing that your perspective is unchanging in the face of new information.
This sort of assertion is also lacking merit because most coding is not about novel algorithms. Most coding is in fact straightforward regurgitation of ideas and code a thousand other developers have already written. So even if this were all AI can possibly do, it could still be a significant improvement over manually coding the same uninteresting boilerplate and CRUD logic.
I think this is perhaps a big piece of the disconnect between your perspective and mine. When we talk about “the middle”, you focus on the bit of critical logic there while I am focusing on the vast amount of logic there that is deeply uninteresting. It’s kind of like the “premature optimization” thing. Put your mental energy where it matters.
> So when I read comments that says they don't care about code, but also have no answer about how they will solve their user's problems or how will they modify the software to future changes, it seems so strange to me. Like is your belief backed by real world experience?
This is a bizarre statement because we haven’t talked about a user’s problems. You’ve seemingly manufactured a scenario in your head where I have AI producing code for no purpose. It’s no wonder you can’t see how this makes sense, because it doesn’t.
Why does everyone assume that anyone that is slightly critical has no experience whatsoever with the thing they criticize? That is very dismissive. As soon as you start working on something complex, the agent requires a lot of guidance from an expert. That's not what's being marketed.
If someone told you they have a supersonic jet, but then you see a small Cessna, would you take them at face value?
> Most coding is in fact straightforward regurgitation of ideas and code a thousand other developers have already written.
Which is why everyone has been using libraries and framework in the past decades. And why people goes to conferences and buy books. No one is keen to reinvent everything from scratch. If you think that we manually code everything that means you don't know anything about developer practice. Most of development is keeping a strong model of the software and tweak things here and there.
> When we talk about “the middle”, you focus on the bit of critical logic there while I am focusing on the vast amount of logic there that is deeply uninteresting.
Why is it uninteresting? It seems that the only people not interested in code are the ones that are not responsible for it when it's in production.
> You’ve seemingly manufactured a scenario in your head where I have AI producing code for no purpose
No, I'm asking how are you maintaining the code that you found uninteresting. Because all of it will be going in the software. What I've seen is a weird focus on Devex (you don't want to code uninteresting stuff) or HR (no need to hire expensive engineers), but not a peep about the users.
It’s because most of the criticisms around the capability (as opposed to the ethical concerns or costs) are shallow. (“But it could leak your creds.” Come on. More likely one of the ten thousand NPM dependencies you install will do that.)
I’m interested in meaningful discussion of AI’s complexities and concerns. “It can’t possibly work” reeks of ignorance, not experience.
> As soon as you start working on something complex, the agent requires a lot of guidance from an expert. That's not what's being marketed.
Ignore the marketing. The question is whether the AI makes you more efficient. We don’t have artificial super intelligence that’s going to replace all the experts in a domain. What we have is an extremely efficient coding machine that does need guidance at times. But it has also learned to do a whole lot of stuff without detailed guidance.
“It can’t fully replace the judgement of an expert so it’s useless” is a shallow argument. We don’t accept that logic for any other tool.
> Which is why everyone has been using libraries and framework in the past decades. And why people goes to conferences and buy books.
So you believe this stuff is complicated enough to go to conferences and buy books to understand it but you can’t see the value of a tool that intrinsically knows the major frameworks and knows how to write code in them quickly. Ok.
> No one is keen to reinvent everything from scratch.
That’s honestly not true. Developers love to reinvent stuff. I feel like every time I look at web dev in particular the world has moved to yet another framework.
But also that’s not what I meant. So much code is basically duplicative of code that others have written. If you were tasked with adding a view to an application that could show the user’s latest photos, you would fetch the photos (from disk, from a service, whatever), load them into memory, create some sort of UI collection object and create a bunch of photo objects to stuff into the collection. You might add sorting. You might add lazy loading or even unloading for out-of-view images to save memory. And none of that is novel. It’s all been written a thousand times in dozens of languages. And yeah, I’m sure there’s something in there that’s unique to your use case or has some special requirements you need to think through. But most of it is a well worn path.
> If you think that we manually code everything that means you don't know anything about developer practice. Most of development is keeping a strong model of the software and tweak things here and there.
I don’t know anything about you but I’ve been coding professionally for over 20 years. I’ve shipped small plugins that maybe a few dozen people used and I’ve shipped code that millions of people use.
Most of a software engineer’s job is managing complexity. Which is why we use libraries and frameworks that abstract away much of the complexity.
This is exactly why so much AI criticism rings hollow. We cannot pretend that we need to control every tiny choice an LLM might make when we happily use libraries that hide thousands or millions of such choices from us.
> Why is it uninteresting? It seems that the only people not interested in code are the ones that are not responsible for it when it's in production.
As explained above. Uninteresting as in not novel.
> No, I'm asking how are you maintaining the code that you found uninteresting. Because all of it will be going in the software.
The same way you maintain code today. Possibly more efficiently with AI able to reason over the code and save you time.
Surely you are not intending to suggest that only the person who writes code initially can maintain it.
> What I've seen is a weird focus on Devex (you don't want to code uninteresting stuff) or HR (no need to hire expensive engineers), but not a peep about the users.
I love to code uninteresting stuff. If I could get paid well to reimplement a standard library from scratch, I’d absolutely take the job. But I don’t have infinite time and no one wants to pay me for that. The interest in AI coding is largely not because it reduces uninteresting work for its own sake but because it can drastically speed up the pace of development in general.
The HR problem is very real and worries me a lot, but that’s not relevant to whether AI coding works.
The users are the last people who care about what your software looks like underneath. They care that it works to solve their problem. They don’t care if it’s pretty code. They don’t care if it’s a pain to change the code because it’s riddled with tech debt. They don’t care if you need to employ a million SREs to maintain service quality. Those are all real problems but they are not the customer’s concerns.
My workflow is centered on correct output, not speed or efficiency. But what makes it easy to get a correct result also makes it easy to be fast and economical in resource consumption. Except that you need to average over the usual period of time you need to support a project (months or years). Nice tricks like slinging a PR over the wall in a day don't matter if it's not sustainable.
Things like writing code that's been written over and over again also don't matter. Either I know how to do it (or at least the general pattern), and then it will be a walk in the park (a rest period even, like walking between bouts of running when jogging), or I don't know the pattern and I need to be careful to get it right.
So what saves me time is ultimately the reliability of my software: when I'm not busy fixing stuff left and right and can be fairly confident when releasing. I've not found any methodology where AI tooling is actually helpful in that regard.
But yes, agree to disagree. I’m not here to evangelize AI and I hope you have success regardless of whether you use AI coding.
We should only need to specify observable behavior... but observable behavior is so broad that it includes common defects... but also we shouldn't need to specify the lack of defects even though they are observable.
It feels like, "the AI should read my mind and do the right thing without me fully specifying it."
I've found that adopting RFC Keywords (e.g. RFC 2119 [1]; MUST, SHOULD, MAY) at least makes the LLM report satisfaction. I'd love to see a proper study on the usage of RFC keywords and their effect on compliance and effectiveness.
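For anyone who wants to measure how heavily a spec leans on these keywords, here is a minimal sketch. It only tallies the ten uppercase requirement-level terms that RFC 2119 defines; the function name `keyword_counts` and the sample spec text are made up for illustration, and it says nothing about whether an LLM actually complies.

```python
import re
from collections import Counter

# The requirement-level keywords defined by RFC 2119. Negated forms come
# first in the alternation so "MUST NOT" is not counted as a bare "MUST".
KEYWORDS = re.compile(
    r"\b(MUST NOT|SHALL NOT|SHOULD NOT|MUST|SHALL|SHOULD|"
    r"REQUIRED|RECOMMENDED|MAY|OPTIONAL)\b"
)


def keyword_counts(spec: str) -> Counter:
    """Count uppercase RFC 2119 keywords in a spec (lowercase words are ignored)."""
    return Counter(KEYWORDS.findall(spec))


# Hypothetical spec text, for illustration only.
sample = (
    "The client MUST retry on timeout. "
    "The client MUST NOT retry more than 3 times. "
    "Logging MAY be enabled, and it may be noisy."
)
counts = keyword_counts(sample)
```

Matching only the uppercase forms mirrors the usual convention that lowercase "may" or "should" carries its ordinary English meaning.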
The downside is the ospx markdown specs sometimes end up too granular, focusing on the wrong or less important details, so reading the specs feels like a slog.
Also, at times aspects of the English-language spec end up way more verbose than just giving a code example would be.
Why not? You just make every task faster. Not everything has to be an uncontrollable rocket launch.
> We need a virtually infinite supply of requirements, engineers acting as pseudo-product designers, owning entire streams of work
Why? To build what? You can only build as fast as you understand the business and your users.
It should be possible to go faster by having AI understand the business and users.
This just sounds like typical requirements management software (IBM DOORS for example, which has been around since the 90s).
It's kind of funny how AI evangelists keep re-discovering the need for work methods and systems that have existed for decades.
When I worked as a software developer at a big telecom company, I had no say in what the software was supposed to do. That was up to the software design people--they were the ones responsible for designing the software and defining all the requirements--I was just responsible for implementing that behavior in code.
And now that step can be 100% automated.
Information systems design was a solved problem in the 1970s. PRIDE turned it from an art into a proven, repeatable science. Programmers, afraid of losing their perceived importance, resisted the discipline it imposes as the mustang resists the bit, but now that they're going the way of buggy-whip makers, maybe systems design as a science will make a comeback after 50 years.
It was gratifying to build the confidence of learning a new language quickly that I had never even heard of before. DXL was also pretty awful.
Opened a lot of doors for me though, no pun intended.
The author is nibbling at the same problem ultimately, but i don't think "hey one strategy is we could just let cognitive debt accumulate so we can go faster!" is a particularly insightful tool in the toolbox. Don't misread me, i'm not denying it can be a valid strategy.
Instead i want to read about insightful strategies for optimising that system-wide bottleneck we have: understanding.
Tell me about how you managed to shift to a higher level of abstraction, tell me about how and when that abstraction leaks. Tell me how you reduced the amount of information that has to flow through the system bottleneck.
I suppose all the money floating around AI helps dummify everything, as people glom on to narratives, regardless of merit, that might position them to partake.
What we actually have now is the ability to bang out decent quality code really fast and cheaply.
This is massive, a huge change, one which upends numerous assumptions about the business of software development.
...and it only leaves us to work through every other aspect of software development.
The approach this article advocates is to essentially pretend none of this exists. Simple, but will rarely produce anything of value.
This paragraph from the post gives you the gist of it:
> ...we need to remove humans-in-the-loop, reduce coordination, friction, bureaucracy, and gate-keeping. We need a virtually infinite supply of requirements, engineers acting as pseudo-product designers, owning entire streams of work, with the purview to make autonomous decisions. Rework is almost free so we shouldn’t make an effort to prevent incorrect work from happening.
As if the only reason we ever had POs or designers or business teams, or built consensus between multiple people, or communicated with others, or reviewed designs and code, or tested software, was because it took individual engineers too long to bang out decent code.
AI has just gotten people completely lost. Or I guess just made it apparent they were lost the whole time?
Using an LLM to one shot a small function (something i would do with a very specific search on Google or SO) is handy. Giving it a harness and free access to a code base leads to some terrible code, and doubling down with more instructions and agents in the loop means more time writing the Rube Goldberg orchestration rather than just opening up an editor and writing code.
To me what AI is doing is changing the economics of human thought, but the change is happening way faster than individuals, let alone organizations, can absorb the implications. What I've seen is that AI magnifies the judgment of individuals who know how to use it, and so far it's mostly software engineers who have learned to use it most effectively because they are the ones able to develop an intuition about its limitations.
The idea of removing the human from the loop is nonsense. The question is more what loops matter, and how can AI speed them up. For instance, building more prototypes and one-off hacky tools is a great use of vibe coding, changing the core architecture of your critical business apps is not. AI has simultaneously increased my ability to call bullshit, while amplifying the amount of bullshit I have to sift through.
When the dust settles I don't really see that the value or importance of reading code has changed much. The whole reason agentic coding is successful is because code provides a precise specification that is both human and machine readable. The idea that we'll move from code to some new magical form of specification is just recycling the promise of COBOL, visual programming, Microsoft Access, ColdFusion, no-code tools, etc, to simplify programming. But actually the innovations that have moved the state of the art of professional programming forward, are the same ones that make agentic coding successful.
The point I’m making is that we give the spotlight to people who are making absurd claims. We have not achieved the ability to remove the human from the loop and continually produce valuable outputs. Until we do, I don’t see how any of the claims made in this article are even close to anything more than simply gate-keeping slop.
> What then, what are humans for?
why, to make money for the boss of course!
As I understand, this is an unsolved problem.
"Somewhat easier than an impossible task" is not a particularly strong claim about when (or whether) this problem will be solved, though.
Those were written by humans, and don't involve unsolved mathematics.
Is your claim that you just need to solve comprehensibility of LLMs?
Figuring out epistemology and cognition to have a chance to reason about the outputs of an LLM seems to me way harder than traditional attempts to reason directly about algorithms.
But that aside, it's such a shame that many drinking the AI Kool-Aid aren't even aware of the theoretical limits of a computer's capabilities.
Computers are finite machines. There is a theorem that although a machine with finite memory can add, multiplication requires unbounded memory. Somehow we muddle along and use computers for multiplication anyway.
More to your point there is a whole field of people who write useful programs using languages in which every program must be accompanied by a proof that it halts on all inputs.
(See for example https://lean-lang.org/ or David Turner's work on Total Functional Programming from about 20 years ago.)
Other examples are easy to find. The simplex algorithm for linear optimization requires exponential time in the worst case, but in practice works well on problems of interest and is widely used. Or consider the dynamic programming algorithms for NP-hard problems like subset-sum.
Theory is important, but engineering is also important.
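To make the subset-sum remark concrete, here is a sketch of the textbook dynamic programming approach. It assumes non-negative integer inputs, and the function name is just illustrative. The running time is proportional to `len(values) * target`, which is pseudo-polynomial: fine for the modest targets that show up in practice, even though the problem is NP-hard in general.

```python
def subset_sum(values: list[int], target: int) -> bool:
    """Return True if some subset of `values` sums exactly to `target`.

    Assumes every value is a non-negative integer.
    """
    if target < 0:
        return False
    # reachable[s] is True when some subset of the values seen so far sums to s.
    reachable = [False] * (target + 1)
    reachable[0] = True  # the empty subset sums to 0
    for v in values:
        # Iterate downwards so each value is used at most once.
        for s in range(target, v - 1, -1):
            if reachable[s - v]:
                reachable[s] = True
    return reachable[target]
```

For example, `subset_sum([3, 34, 4, 12, 5, 2], 9)` finds 4 + 5, while a target of 30 is unreachable with the same values.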
What theorem is that?
The multiplication of any two integers below a certain size (called "words") fits in a "double word" and the naive multiplication algorithm needs to store the inputs, an accumulator and at most another temporary for a grand total of 6*word_size.
Sure, you can technically "stream" carry-addition (which is obvious from the way adders are chained in ALU-101) and thus in a strict sense addition is O(1) memory but towards your final point:
> Theory is important, but engineering is also important.
In practice, addition requires unbounded memory as well (the inputs). And it's definitely compute-unbounded, if your inputs are unbounded.
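The "streamed" carry addition mentioned above can be sketched as follows. This is only an illustration under assumed conventions (digits arrive least significant first, and the name `stream_add` is made up): the adder itself keeps a single carry between steps, while the inputs and output can be arbitrarily long streams, which is exactly the distinction being argued about.

```python
from itertools import zip_longest
from typing import Iterable, Iterator


def stream_add(a: Iterable[int], b: Iterable[int], base: int = 10) -> Iterator[int]:
    """Add two numbers given as digit streams, least significant digit first."""
    carry = 0  # the only state kept between digits
    for da, db in zip_longest(a, b, fillvalue=0):
        carry, digit = divmod(da + db + carry, base)
        yield digit
    if carry:
        yield carry


# 999 + 1, digits least significant first
result = list(stream_add([9, 9, 9], [1]))
```

Here `result` is `[0, 0, 0, 1]`, i.e. 1000 read from the least significant digit.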
I dislike the term "we muddle along". IEEE 754 has well specified error bars and cases, and so does all good data science. LLMs do not, or at least they do not expose them to the end user.
So then, how exactly do we go about proving that the result of chaining prompts is within a controllable margin of error of the intended result? Because despite all the specs, numerical stability is the reason people don't write their own LAPACK.
LLMs address this problem by just making things up (and they don't do a great job of comprehending the natural language, either), which I think qualifies as "hoping for the best", but I'm not sure there is another way, unless you reframe the problem to allow the algorithm to request the information it's missing.
"is this implementation/code actually aligned with what i want to do?"
human responsibility's focus will move entirely from implementing code to deciding whether it should be implemented or not.
u probably mean unsolved as in "not yet able to be automated", and that's true.
if pull-request checks verifying that tests are conforming to the spec are automated, then we'd have AGI.
LLMs do not understand prose or code in the same way humans do (such that "understand" is misleading terminology), but they understand them in a way that's way closer to fuzzy natural language interpretation than pedantic programming language interpretation. (An LLM will be confused if you rename all the variables: a compiler won't even notice.)
So we've built a machine that makes the kinds of mistakes that humans struggle to spot, used RLHF to optimise it for persuasiveness, and now we're expecting humans to do a good job reviewing its output. And, per Kernighan's law:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?
And that's the ideal situation where you're the one who's written it: reading other people's code is generally harder than reading your own. So how do you expect to fare when you're reading nobody's code at all?
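The aside above that a compiler won't notice renamed variables can be checked in a small experiment. This sketch uses CPython, where local variables are addressed by slot index in the bytecode, so it is an illustration of the point rather than a statement about compilers in general. Both function names are made up.

```python
def total_price(unit_price, quantity):
    subtotal = unit_price * quantity
    return subtotal


def f(a, b):
    c = a * b
    return c


# The names differ, and CPython keeps them as metadata...
names_differ = total_price.__code__.co_varnames != f.__code__.co_varnames
# ...but the instruction stream refers to locals by index, so it is unchanged.
same_bytecode = total_price.__code__.co_code == f.__code__.co_code
```

A human reviewer (or an LLM) gets a lot of signal from `unit_price` and `quantity` that is simply absent from `a` and `b`; the interpreter executes the same instructions either way.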
say: human wants to make a search engine that makes money for them.
1. for a task, ask several agents to make their own implementation and a super agent to evaluate each one and interrogate each agent and find the best implementation/variable names, and then explain to the human what exactly it does. or just mythos
2. the feature is something like "let videos be in search results, along with links"
3. human's job "is it worth putting videos in this search engine? will it really drive profits higher? i guess people will stay on the search engine longer, but hmmm maybe not. maybe let's do some a/b testing and see whether it's worth implementing???" etc...
this is where the developer has to start thinking like a product manager. meaning his position is abolished and the product manager can do the "coding" part directly.
now this should be basic knowledge in 2026. i am just reading and writing back the same thing on HN omds.
Having the code-writing part automated would have a negligible impact on the total project time.
No, thank you
LLMs struggle with TDD. They want to generate a bunch of code and tests in large passes. You can instruct them to do red/green TDD, but the results aren't great.
SDD starts before implementation, and formalizes intent and high-level design. LLMs eat it up. The humans can easily reinvent the worst parts of waterfall if they're not careful.
They're not mutually exclusive.
> SDD starts before implementation
No different from TDD.
So guardrails, i.e. sufficiently precise spec and tests, will need to be as strict as the LLM is bad at getting the right context and asking back the right questions. I suppose at that point not much difference between a human engineer and it.
My opinion is very close to this. Currently the reason it's bad to not review/test the code LLMs generate is that LLMs can sometimes generate bad code. But that's a bug that can be improved. One day you'll have LLMs generating code consistently better than what a human could write. And then you just stop needing to review it. (And that's probably also the time when most programmers/developers get fired too)
Don't be surprised if any day LLMs start to generate binaries directly. THAT will be impossible to read and will cost more time to analyze.
Sometimes?
I am heavily into vibe coding and I think they almost always generate bad code. At least as soon as you're distant enough from the code to call it vibe coding.
When you're still in touch with the code, have at least been recently talking to it about code rather than 100% about features, and its context is filled with good code, it can generate good code.
Agree, this is how you make the development loop more deterministic and ultimately autonomous. It's how I've been using coding agents myself for the past few months (by building my own to support this natively [1]).
If you have a spec you approve/agree on, have an agent code against it, and then have a review phase verify the implementation didn't drift from the spec (either by adding or removing features), you get to a position where you can trust the outcome.
There's still a lot to be said about spec definition and what if during implementation gaps are discovered, and that's where HITL comes into play.
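The drift check in the review phase can be pictured as a two-way set difference. This is a deliberately naive sketch that assumes features can be reduced to stable identifiers, which is the hard part in practice; `spec_drift` and the feature names are hypothetical and not taken from any particular agent tool.

```python
def spec_drift(spec: set[str], implemented: set[str]) -> dict[str, set[str]]:
    """Report drift in both directions between a spec and an implementation."""
    return {
        "missing": spec - implemented,    # in the spec, not implemented
        "unplanned": implemented - spec,  # implemented, never specified
    }


# Hypothetical feature identifiers, for illustration only.
drift = spec_drift(
    spec={"login", "photo-view", "sorting"},
    implemented={"login", "photo-view", "lazy-loading"},
)
```

An empty report in both directions is the condition under which the outcome can be trusted without rereading the code; anything else goes back to a human.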
I might even start my own blog to write about things I've found.
1. Always get the agent to create a plan file (spec). Whatever prompt you were going to yolo into the agent, do it in Plan Mode first so it creates a plan file.
2. Get agents to iterate on the plan file until it's complete and thorough. You want some sort of "/review-plan <file>" skill. You extend it over time so that the review output is better and better. For example, every finding should come with a recommended fix.
3. Once the plan is final, have an agent implement it.
4. Check the plan in with the impl commit.
The plan is the unit of work really since it encodes intent. Impl derives from it, and bugs then become a desync from intent or intent that was omitted. It's a nicer plane to work at.
From this extends more things: PRs should be plan files, not code. Impl is trivial. The hard part is the plan. The old way of deriving intent from code sucked. Why even PR code when we haven't agreed on a plan/intent?
This process also makes me think about how code implementation is just a more specific specification about what the computer should do. A plan is a higher level specification. A one-line prompt into an LLM is the highest level specification. It's kinda weird to think about.
Finally, this is why I don't have to read code anymore. Over time, my human review of the code unearthed fewer and fewer issues and corrections to the point where it felt unnecessary. I only read code these days so I can impose my preferences on it and get a feel for the system, but one day you realize that you can accumulate your preferences (like, use TDD and sum types) in your static prompt/instructions. And you're back to watching this thing write amazing code, often better than what you would have written unless you have maximum time + attention + energy + focus no matter how uninteresting the task, which you don't.
Doesn’t it bother you that the outcome of each PR is different every time you/CI “run it”?
Basically zero plan. Or rather, the "internal" plan that the human implementor used while writing the code is hidden from us because it's a mix of ideas they held in their head, jotted in some notes, existed in a sequence of commits that were lost when squashed into a PR, etc. There's zero reproducibility in the implementation.
So take my idea and pretend we still don't have AI yet: the main point is that we move to a pipeline where we work on a first-class plan first before we begin implementation. This gets us closer to reproducible implementation no matter who is implementing it.
It just so happens that now with implementation becoming automated, we have more attention and energy freed up to focus on this plan-based model.
React team seems to really have set a precedent with their "dangerouslySetInnerHTML" idea.
Or did they borrow it somewhere?
I'm just curious about that etymology, of course the idea is not universally helpful: for example, for dd CLI parameters, it would only make a mess.
But when there's a flag/option that really requires you to be vigilant and understand the input and output and all edge cases, calling it "dangerous" is quite a feat!
React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED
https://github.com/reactjs/react.dev/issues/3896
There are stances that say they should, browse a large SPA with complex working source maps enabled, DevTools open, cache disabled and a long session (relevant because of HMR in dev), and you can see why this matters.
Browsers only fetch and process source maps in a development environment in production, that's why this flag name exists.
That being said, I still have a hobby project with an (in my opinion) sensible (at the time) Webpack configuration, and glossed over this being in the minified bundle, after 1-2 days at the time.
But if my hobby project would have been something production-relevant, I'd have continued to hunt down this artifact.
I think, with Vite et al this should not appear anymore in current JS bundles ready for prod, so the name is apt.
But the underlying problem is still a neverending source of frustration: minification is (by definition, when it's statically verifiable), not equipped to change object property names without provoking breakage.
We will have code full of unknown bugs, that is unfixable.
The solution is to replace it with more of the same but with some new specification (fix some bug add some new feature).
And this will be done by using astounding amounts of compute in massive new data centres.
Technology, implementation may change, but general point of "why!?" stays.
This is what we're building for at Saldor (https://saldor.com). It's a hard problem, to get a team in the habit of writing good specs. Probably because it's a hard thing to do: thinking of the behavior of your program, especially at the edges. But I agree (biased) that this is probably the way forward for writing code in the near future. I'm excited to see other people thinking about it.
I have my team do this using CLAUDE.md telling Claude to do it in a set of interconnected steps, but in brief: they are to make it write every aspect of the transcript somewhere: PRD, research notes, spec, dev log and debate log, break/fix/retro notes, commit log, PR, release notes, README, docs .mds... heavy emphasis on the edges in our thinking, and just as important, the edges in its ability to provide good leverage.
It needs a core set of guidance on the ordering and how to write "as of" a given phase or release so context stays current, trusting the old info is in git history it can navigate for the story of how we got here.
CC's /insights claims I have 10:1 md edits to code edits, and we both note this way of working is resulting in far fewer error loops per higher quality outcome.
// So yes, interested in your product. Baking something more broadly battle tested in so we don't have to reinvent it makes sense.
Just a data point: this month I had a gnarly bug in generated bpf code. The C code was correct but the compiler produced a bug that corrupted packets. I spent around 8 hours debugging _where_ the issue was and how to work around it, never really understanding what went wrong. That knowledge came with several more days on and off looking at it--after I had mitigated the production issue.
So if I extrapolate this experience to LLMs, which are not deterministic and which will make larger systems: what we trade for velocity we will pay for with hours of debugging because we won't understand how things work. I think this is unavoidable.
Another way I'm looking at it: after some time of not writing code, it will be analogous to instructing the LLM and the output being assembly--where I simply don't have the muscle to grok the output. How do I mitigate that knowledge gap? I see microservices coming back. Today it is easy to slop up disposable scripts. Our services need to be modular so we can dispose of broken things--so they are only coupled with each other by strict APIs.
Your app takes 20 seconds to load, pulling 50 megabytes of minified JS. Your backend is a mess of 20 Rust microservices, each a 300-megabyte Docker image.
Nobody has actually been reading and understanding code in your org for the past 15 years. And nobody has ever been responsible, everybody has just been job hopping for a 15% total comp bump.
Now the secret is out.
LOL. I had to check if this was published on April 1st.
user experience/what the app actually does >>> actually implementing it.
elon musk said this a looong time ago. we move from layer 1 (coding, how do we implement this?) to layer 2 thinking (what should the code do? what do we code? should we implement this? (what to code to get the most money?))
this is basic knowledge
We need the pragmatic engineer more than ever.
This was a follow up to a previous article[1] and the pair tried to express what I still think today (using AI daily at work): every time I use AI for coding, to some capacity I'm sacrificing system understanding and stability in favor of programming speed. This is not necessarily always a bad tradeoff, but I think it's important to constantly remind ourselves we are making it.
[1] https://olano.dev/blog/tactical-tornado/