If they make the LLMs more productive, it is probably explained by a less complicated phenomenon that has nothing to do with the names of the roles, or their descriptions. Adversarial techniques work well for ensuring quality, parallelism is obviously useful, important decisions should be made by stronger models, and using the weakest model for the job helps keep costs down.
For instance, if an agent only has to be concerned with one task, its context can be massively reduced. Further, the next agent can just be told the outcome; its context load is also reduced, because it doesn't need the inner workings, just the result.
For instance, a security testing agent just needs to review code against a set of security rules, and then list the problems. The next agent then just gets a list of problems to fix, without needing a full history of working it out.
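A minimal sketch of that handoff (the runAgent helper, the prompts, and the Finding shape are all hypothetical, not any particular framework's API):

```typescript
// Hypothetical stand-in for whatever model call you use; swap in your client.
async function runAgent(systemPrompt: string, input: string): Promise<string> {
  throw new Error("wire up your model client here");
}

type Finding = { file: string; rule: string; detail: string };

// The reviewer sees the code plus the rules and returns only a compact list of findings.
async function securityReview(code: string, rules: string[]): Promise<Finding[]> {
  const report = await runAgent(
    "Review the code against these security rules and return ONLY a JSON array " +
      "of {file, rule, detail} findings:\n" + rules.join("\n"),
    code,
  );
  return JSON.parse(report) as Finding[];
}

// The fixer never sees the reviewer's working, just the list of problems to address.
async function fixFindings(findings: Finding[]): Promise<string> {
  return runAgent(
    "Fix each listed problem in the referenced file and output a unified diff.",
    JSON.stringify(findings, null, 2),
  );
}
```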
That's mostly for throughput, and context management.
It's context management in that no human knows everything, but that's also throughput in a way because of how human learning works.
In other words, when I have a task that specifically should not have project context, then subagents are great. Claude will also summon these “swarms” for the same reason. For example, you can ask it to analyze a specific issue from multiple relevant POVs, and it will create multiple specialized agents.
However, without fail, I’ve found that creating a subagent for a task that requires project context will result in worse outcomes than using “main CC”, because the sub simply doesn’t receive enough context.
However, one of the bigger things is that by focusing on a specific task or a role, you force the LLM to "pay attention" to certain aspects. The models have finite attention, and if you ask them to pay attention to "all things", they just ignore some.
The act of forcing the model to pay attention can be accomplished in alternative ways (a defined process, committee formation in a single prompt, etc.), but defining personas at the sub-agent level is one of the most efficient ways to encode a worldview and responsibilities, versus explicitly listing them.
And this is finite in capacity and emergent from the architecture.
“Organizations are constrained to produce designs which are copies of the communication structures of these organizations.”
Maybe a different separation of roles would be more efficient in theory, but an LLM understands "you are a scrum master" from the get go, while "you are a zhydgry bhnklorts" needs explanation.
https://arxiv.org/abs/2311.10054
Key findings:
- Tested 162 personas across 6 types of interpersonal relationships and 8 domains of expertise, with 4 LLM families and 2,410 factual questions
- Adding personas in system prompts does not improve model performance compared to the control setting where no persona is added
- Automatically identifying the best persona is challenging, with predictions often performing no better than random selection
- While adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random
Fun piece of trivia - the paper was originally designed to prove the opposite result (that personas make LLMs better). They revised it when they saw the data completely disproved their original hypothesis.
What the paper is really addressing is whether keywords like "you are a helpful assistant" give better results.
The paper is not addressing a role such as "you are a system designer" or "you are a security engineer", which will produce completely different results and focus the LLM's output.
In the domain alignment section:
> The coefficient for “in-domain” is 0.004(p < 0.01), suggesting that in-domain roles generally lead to better performance than out-domain roles.
Although the effect size is small, why would you not take advantage of it?
And then you say:
> comprehensively disproven
? I don't think you understand the scientific method
"Comprehensively disproven" was too strong - should have said "evidence suggests the effect is largely random." There's also Gupta et al. 2024 (arxiv.org/abs/2408.08631) with similar findings if you want more data points.
Being a manager is a hard job but the failure mode usually means an engineer is now doing something extra.
When you think about what an LLM is, it makes more sense. It causes a strong activation for neurons related to "code review", and so the model's output sounds more like a code review.
I came across a concept called DreamTeam, where someone was manually coordinating GPT 5.2 Max for planning, Opus 4.5 for coding, and Gemini Pro 3 for security and performance reviews. Interesting approach, but clearly not scalable without orchestration. In parallel, I was trying to do repeatable workflows like API migration, Language migration, Tech stack migration using Coding agents.
Pied-Piper is a subagent orchestration system built to solve these problems and enable repeatable SDLC workflows. It runs from a single Claude Code session, using an orchestrator plus multiple agents that hand off tasks to each other as part of a defined workflow called Playbooks: https://github.com/sathish316/pied-piper
Playbooks allow you to model both standard SDLC pipelines (Plan → Code → Review → Security Review → Merge) and more complex flows like language migration or tech stack migration (Problem Breakdown → Plan → Migrate → Integration Test → Tech Stack Expert Review → Code Review → Merge).
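Not Pied-Piper's actual playbook format (I haven't checked), but a sketch of how such a workflow might be encoded as data, with each stage owned by one agent and gates that can bounce work back:

```typescript
// Hypothetical playbook shape: an ordered pipeline of stages, each owned by one agent.
type Stage = { name: string; agent: string; gate?: boolean };

const languageMigration: Stage[] = [
  { name: "Problem Breakdown", agent: "planner" },
  { name: "Plan", agent: "planner" },
  { name: "Migrate", agent: "coder" },
  { name: "Integration Test", agent: "tester", gate: true },
  { name: "Tech Stack Expert Review", agent: "stack-expert", gate: true },
  { name: "Code Review", agent: "reviewer", gate: true },
  { name: "Merge", agent: "orchestrator" },
];

// The orchestrator walks the stages in order; a rejected gate steps back so the
// previous stage runs again (no retry cap in this sketch).
async function runPlaybook(stages: Stage[], dispatch: (s: Stage) => Promise<boolean>) {
  for (let i = 0; i < stages.length; i++) {
    const ok = await dispatch(stages[i]);
    if (!ok && stages[i].gate) i = Math.max(-1, i - 2);
  }
}
```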
Ideally, it will require minimal changes once Claude Swarm and Claude Tasks become mainstream.
The previous generations of AI (AI in the academic sense), like JASON, when combined with a protocol language like BSPL, seem like the easiest way to organize agent armies in ways that "guarantee" specific outcomes.
The example above is very cool, but I'm not sure how flexible it would be (and there's the obvious cost concern). But, then again, I may be going far down the overengineering route.
Have you been able to build anything productionizable this way, or are you just using this workflow for rapid prototyping?
I've been working on something in this space too. I built https://sonars.dev specifically for orchestrating multiple Claude Code agents working in parallel on the same codebase. Each agent gets its own workspace/worktree and there's a shared context layer so they can ask each other questions about what's happening elsewhere (kind of like your Librarian role but real-time).
The "ask the architect" pattern you described is actually built into our MCP tooling: any agent can query a summary of what other agents have done/learned without needing to parse their full context.
I built a drag and drop UI tool that sets up a sequence of agent steps (Claude code or codex) and have created different workflows based on the task. I'll kick them off and monitor.
Here's the tool I built for myself for this: https://github.com/smogili1/circuit
1. Are you using a Claude Code subscription? Or are you using the Claude API? I'm a bit scared to use the subscription in OpenCode due to Anthropic's ToS change.
2. How did you choose what models to use in the different agents? Do you believe or know they are better for certain tasks?
Not a change, but enforcing terms that have been there all the time.
- I built a system where context (+ the current state + goal) is properly structured and coding agents only get the information they actually need and nothing more. You wouldn't let your product manager develop your backend, and I let the backend dev do only the things it is supposed to and nothing more. If an agent crashes (or quota limits are reached), the agents can continue exactly where the other agents left off.
- Agents are "fighting against" each other to some extent? The Architect tries to design while the CAB tries to reject.
- Granular control. I wouldn't call "the manager" _a deterministic state machine that is calling probabilistic functions_, but that's to some extent what it is? The manager has clearly defined tasks (like "if file is in 01_design —> call Architect"); a rough sketch is below.
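A rough sketch of that deterministic-manager idea (the folder names, agent names, and callAgent helper are illustrative, not the actual setup):

```typescript
import { readdirSync } from "node:fs";

// Illustrative mapping from kanban folder to the agent that handles files sitting in it.
const routes: Record<string, string> = {
  "00_backlog": "scrum-master",
  "01_design": "architect",
  "02_implement": "dev-pair",
  "03_review": "cab",
};

// Hypothetical stand-in for launching an agent on a work item.
async function callAgent(agent: string, workItem: string): Promise<void> {
  console.log(`${agent} <- ${workItem}`);
}

// The manager itself is deterministic: it only looks at where files sit
// and wakes the corresponding (probabilistic) agent.
async function managerTick(root: string): Promise<void> {
  for (const [folder, agent] of Object.entries(routes)) {
    for (const file of readdirSync(`${root}/${folder}`)) {
      await callAgent(agent, `${folder}/${file}`);
    }
  }
}
```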
Here’s one example of an agent log after a feature has been implemented from one of the older codebases: https://pastebin.com/7ySJL5Rg
The models can call each other if you reference them using @username.
This is the .md file for the manager : https://pastebin.com/vcf5sVfz
I hope that helped!
Extrapolating from this concept led me to a hot-take I haven't had time to blog about: Agentic AI will revive the popularity of microservices. Mostly due to the deleterious effect of context size on agent performance.
A real example that happened to me: the agent forgets to rename an expected parameter in the API spec for service 1. Now, when working on service 2, there is no way for the agent to find this mistake other than giving it access to service 1. And now you are back to "... effect of context size on agent performance ...". For context, we might have ~100 services.
One could argue these issues reduce over time as instruction files are updated etc but that also assumes the models follow instructions and don't hallucinate.
That being said, I do use Agents quite successfully now - but I have to guide them a bit more than some care to admit.
I guess this may be dependent on domain, language, codebase, or some combination of the three. The biggest issues I've had with agents are when they go down the wrong path and it snowballs from there. Suddenly they are loading more context unrelated to the task and getting more confused. Documenting interfaces doesn't help if the source is available to the agent.
My agentic sweet spot is human-designed interfaces. Agents cannot mess up code they don't have access to, e.g. by inadvertently changing the interface contract and the implementation.
> Agent forgets to rename an expected parameter in API spec for service 1
Document and test your interfaces/logic boundaries! I have witnessed this break many times with human teams with field renames, change in optionality, undocumented field dependencies, etc, there are challenging trade-offs with API versioning. Agents can't fix process issues.
These tools and services are already expected to do the best job for specific prompts. The work you're doing pretty much proves that they don't, while also throwing much more money at them.
How much longer are users going to have to manually manage LLM context to get the most out of these tools? Why is this still a problem ~5 years into this tech?
Applying distributed human team concepts to a porting task squeezes extra performance from LLMs much further up the diminishing returns curve. That matters because porting projects are actually well-suited for autonomous agents: existing code provides context, objective criteria catch more LLM-grade bugs than greenfield work, and established unit tests offer clear targets.
I guess what I'm trying to say is that the setup seems absurd because it is. Though it also carries real utility for this specific use case. Apply the same approach to running a startup or writing a paid service from scratch and you'd get very different results.
If you have these agents do everything at the "top level" they lose track. The moment you introduce sub-agents, you can have the top level run in a tight loop of "tell agent X to do the next task; tell agent Y to review the work; repeat" or similar (add as many agents as makes sense), and it will take a long time to fill up the context. The agents get fresh context, and you get to manage explicitly what information is allowed to flow between them. It also tends to mean it is a lot easier to introduce quality gates - eg. your testing agent and your code review agent etc. will not decide they can skip testing because they "know" they implemented things correctly, because there is no memory of that in their context.
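A minimal sketch of that tight loop, assuming a hypothetical spawnSubagent helper that starts each call with a fresh context:

```typescript
// Stand-in for however subagents get launched (Task tool, SDK, CLI, ...);
// each call gets a fresh context, so only the strings we pass flow between agents.
async function spawnSubagent(role: string, brief: string): Promise<string> {
  throw new Error("wire up your subagent launcher here");
}

async function runTaskWithGate(task: string, maxRounds = 3): Promise<string> {
  let work = await spawnSubagent("implementer", task);
  for (let round = 0; round < maxRounds; round++) {
    // The reviewer has no memory of the implementation, so it can't "know" it is correct.
    const review = await spawnSubagent(
      "reviewer",
      `Task:\n${task}\n\nResult:\n${work}\n\nReply APPROVED or list required changes.`,
    );
    if (review.trim().startsWith("APPROVED")) return work;
    work = await spawnSubagent("implementer", `Task:\n${task}\n\nReviewer feedback:\n${review}`);
  }
  throw new Error("quality gate not passed after max rounds");
}
```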
Sometimes too much knowledge is a bad thing.
You need to have different skills at different times. This type of setup helps break those skills out.
My current workplace follows a similar workflow. We have a repository full of agent.md files for different roles and associated personas.
E.g. For project managers, you might have a feature focused one, a delivery driven one, and one that aims to minimise scope/technology creep.
same people pushing this crap
100% genuine
Where were you back then? Laughing about them instead of creating intergenerational wealth for a few bucks?
it's not creating wealth, it's scamming the gullible
criminality being lucrative is not a new phenomenon
Weird, I'm getting downvoted for just stating facts again.
It attracts the gamers and LARPers. Unfortunately, management is on their side until they find out after four years or so that it is all a scam.
I guess "agentic swarms" are the next evolution of the meta-game, the perfect nerd-sniping strategy. Now you can spend all your time minmaxing your team, balancing strengths/weaknesses by tweaking subagents, adding more verifiers and project managers. Maybe there's some psychological draw, that people can feel like gods and have a taste of the power execs feel, even though that power is ultimately a simulacra as well.
Recently fixed a problem over a few days, and found that it was duplicated though differently enough that I asked my coworker to try fixing it with an LLM (he was the originator of the duplicated code, and I didn't want to mess up what was mostly functioning code). Using an LLM, he seemingly did in 1 hour what took me maybe a day or two of tinkering and fixing. After we hop off the call, I do a code read to make sure I understand it fully, and immediately see an issue and test it further only to find out.. it did not in fact fix it, and suffered from the same problems, but it convincingly LOOKED like it fixed it. He was ecstatic at the time-saved while presenting it, and afterwards, alone, all I could think about was how our business users were going to be really unhappy being gaslit into thinking it was fixed because literally every tester I've ever met would definitely have missed it without understanding the code.
People are overjoyed with good enough, and I'm starting to think maybe I'm the problem when it comes to progress? It just gives me Big Short vibes -- why am I drawing attention to this obvious issue in quality? I'm just the guy in the casino screaming "does no one else see the obvious problem with shipping this?" And then I start to understand, yes, I am the problem: people have been selling each other dog water product for millennia because, at the end of the day, Edison is the person people remember, not the guy who came after and made it near perfect or hammered out all the issues. Good enough takes its place in history, not perfection. The trick others have found out is they just need to get to the point that they've secured the money and have time to get away before the customer realizes the world of hurt they've paid for.
Corporate has to die
One of the issues that people had, which necessitated this feature, is that you have a task, you tell Claude to work on it, and Claude has to keep checking back in for various (usually trivial) things. This workflow allows for more effective independent work without context management issues (with subagents there is also an issue with how the progress of the task is communicated; by introducing things like a task board, it is possible to manage this state outside of context). The flow is quite complex and requires a lot of additional context that isn't needed with a chat-based flow, but it is a much better way to do things.
The way to think about this pattern - one which many people began concurrently building in the past few months - is an AI which manages other AIs.
It isn't all that hard to bootstrap. It is, however, something most people don't think about and shouldn't have to learn how to cobble together themselves, and I'm sure there will be advantages to getting more sophisticated implementations.
It is very difficult to manage task lists in context. Have you actually tried to do this? i.e. not within a Claude Code chat instance but by one-shot prompting. It is possible that they have worked out some way to do this, but when you have tens of tasks, merge conflicts, you are running that prompt over months, etc. At best, it doesn't work. At worst, you are burning a lot of tokens for nothing.
It is hard to bootstrap because this isn't how Claude Code works. If you are just using OpenRouter, it is also not easy because, after setting up tools/rebuilding Claude Code, it is very challenging to set up an environment so the AI can work effectively, errors can be returned, questions returned, etc. Afaik, this is basically what Aider does... it is not easy, and it is especially not easy in Claude Code, which has a lot of binding choices from the business strategy that Anthropic picked.
You ask if I've tried to do this, and then set constraints that are completely different to what I described.
I have done what I described. Several times for different projects. I have a setup like that running right now in a different window.
> It is hard to bootstrap because this isn't how Claude Code works.
It is how Claude Code works when you give it a number of sub-agents with rules for how to manage files that effectively works like task queues, or skills/mcp servers to interact with communications tools.
> it is not easy
It is not easy to do in a generic way that works without tweaks for every project and every user. It is reasonably easy to do for specific teams where you can adjust it to the desired workflows.
No, it isn't how Claude Code works because Claude Code is designed to work with limited task queues, this is not what this feature is. Again, I would suggest you trying to actually build something like this. Why do you think Anthropic are doing this? They just don't understand anything about their product?
No, it doesn't work within that context. Again: sharing context between subagents, single instance running for months...I am not even sure why someone would think this could work. The constraints that I set are the ones that you require to build this...because I have done this. You are talking about having some CLAUDE.md files like you have invented the wheel, lol. HN is great.
The unlock here is tmux-based session management for the teammates, with two-way communication using agent inbox. It works very well.
How so? I’ve been using “claude -p” for a while now.
But even within an interactive session, an agent call out is non-interactive. It operates entirely autonomously, and then reports back the end result to the top level agent.
You can use Claude Code SDK but it requires a token from Claude Code. If you use this token anywhere else, your account gets shut down.
Claude -p still hits Claude Code with all the tools, all the Claude Code wrapping.
https://code.claude.com/docs/en/sub-agents
Are you talking about the same thing or something else like having Claude start new shell sessions?
If they were able to wrap the API directly, this is relatively easy to implement but they have to do this within Claude Code which is based on giving a prompt/hiding API access. This is obvious if you think carefully about what Claude Code is, what requests it is sending to the API, etc.
Btw, you can use the Claude Agent SDK (the renamed Claude Code SDK) with a subscription. I can tell you it works out of the box, and AFAIK it is not a ToS violation.
Subagents and the auth implementation are linked because Anthropic's initial strategy was to have a prompt-based interaction which, because of the progress in model performance, has ended up being limiting as users want to run things without prompting. This is why they developed Claude Code Web (that product is more similar to what this feature will do than subagents are; subagents only look similar if you have a very shallow understanding... the purpose of this change is to abstract away human interaction. I assume it will use subagents, but the context/prompt management is quite different).
Unless previously approved, we do not allow third party developers to offer Claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead.
I didn't dig deeper, but I'd pick it back up for a little personal project if I could just use my current subscription. Does it just use your local CC session out of the box?

The main driver for those subscriptions is that their monthly cost with Opus 3.7 and up pays for itself in a couple of hours of basic CC use, relative to API prices.
As someone else has mentioned, you can actually use SDK for programmatic access. But that happens within the CC wrapper so it isn't a true API experience i.e. it has CC tools.
imho the plans of Claude Code are not detailed enough to pull this off; they’re trying to do it to preserve context, but the level of detail in the plans is not nearly enough for it to be reliable.
You start to get a sense for what size plan (in kilobytes) corresponds to what level of effort. Verification adds effort, and there's a sort of ... Rocket equation? in that the more infrastructure you put in to handle like ... the logistics of the plan, the less you have for actual plan content, which puts a cap on the size of an actionable plan. If you can hit the sweet spot though... GTFO.
I also like to iterate several times in plan mode with Claude before just handing the whole plan to Codex to melt with a superlaser. Claude is a lot more ... fun/personable to work with? Codex is a force of nature.
Another important thing I do, now that launching a plan clears context: it's good to get out of planning mode early, hit an underspecified bit, go back into planning mode, and say something like "As you can see, the plan was underspecified; what will the next agent actually need to succeed?", then iterate that way before we actually start making moves. This is made possible by lots of explicit instructions in CLAUDE.md for Claude to tell me what it's planning/thinking before it acts. Suppressing the tool-call reflex and getting actual thought out helps so much.
You can make a template and tell Claude to make a plan that follows the template.
Not sure how well it’s working though (my agents haven’t used it yet)
I can imagine some ideas (ask it for more detail, ask it to make a smaller plan and add detail to that) but I’m curious if you have any experience improving those plans.
Effectively it tries to resolve all ambiguities by making all decisions explicit — if something cannot be resolved from the source or the documentation, the user is asked.
It also tries to capture all “invisible knowledge” by documenting everything, so that all these decisions and business context are captured in the codebase again.
Which - in theory - should make long term coding using LLMs more sane.
The downside is that it takes 30min - 60min to write a plan, but it’s much less likely to make silly choices.
Oof, you weren't kidding. I've got your skills running on a particularly difficult problem and it's been running for over three hours (I keep telling it to increase the number of reviews until it's satisfied).
My workflow with it is usually brainstorm -> lfg (planning) -> clear context -> lfg (giving it the produced plan to work on) -> compound if it didn’t on its own.
[^1]: https://github.com/EveryInc/compound-engineering-plugin
Seems like a lot of it aligns with what I’m doing, though.
I had read about them before but for whatever reason it never clicked.
Turns out I already work like this, but I use commits as "PRs in the stack" and I constantly try to keep them up to date and ordered by rebasing, which is a pain.
Given my new insight with the way you displayed it, I had a chat with chatGPT and feel good about giving it a try:
1. 2-3 branches based on a main feature branch
2. can rebase base branch with same frequency, just don't overdo it, conflicts should be base-isolated.
3. You're doing it wrong if conflicts cascade deeply and often
4. Yes, merge order matters, but tools can help and generally the isolation is the important piece.

Do make sure you know how to reset the cache, in case you did a bad conflict resolution, because it will keep biting you. Besides that caveat it's a must.
https://www.atlassian.com/git/tutorials/comparing-workflows/...
Stacking is meant to make development of non-trivial features more manageable and more likely to enter main safer and faster.
it's specific to each developer's workflow and wouldn't necessarily produce artifacts once merged into main (as gitflow seems to intentionally have a stance on)
I use Shelley (their web-based agent) but they have Claude Code installed too.
I can see this approach being useful once the foundation is more robust, has better common sense, and knows when to push back when requirements conflict or are underspecified. But with current models I can only see this approach exacerbating the problem; coding agents' solution is almost always "more code", not less. Makes for a nice demo, but I can't imagine this would build anything that wouldn't have huge operational problems and 10x-100x more code than necessary.
For the first part of this comment, I thought "trying to reinvent Istanbul in a bash script" was meant to be a funny way to say "It was generating a lot of code" (as in generating a city's worth of code)
I also think it’s interesting to see Anthropic continue to experiment at the edge of what models are capable of, and having it in the harness will probably let them fine-tune for it. It may not work today, but it might work at the end of 2026.
Fundamentally, forking your context, or rolling back your context, or whatever else you want to do to your context also has coordination costs. The models still have to decide when to take those actions unless you are doing it manually, in which case you haven't really solved the context problems, you've just given them to the human in the loop.
I see this as being different from a single process loop that directly manages the contexts, models, system prompts, etc. I get that it's not that different; kind of like FP vs OOP you can do the same thing in either. But I think the end result is simpler if we just think about it as a single loop that manages contexts directly to complete a project, rather than building an async communication and coordination system.
Antigravity and others already ask for human feedback on their plans.
Hard to keep up with all the changes and it would be nice to see a high level view of what people are using and how that might be shifting over time.
https://lmarena.ai/leaderboard/code
Also, I'm not sure if it's exactly the case but I think you can look at throughput of the models on openrouter and get an idea of how fast/expensive they are.
> Q5. For which tasks do you use AI assistance most?
This is really tough for me. I haven't done a single one of those mostly-manually over the last month.
Also 75% darwin-arm64
His newsletter put me onto using Opus 4.5 exclusively on Dec 1, a little over a week after it was released. That's pretty good for a few minutes of reading a week.
Why would you want a list with such godawful methodology? Here's [0] what the TIOBE folks have to say about their data analysis process:
Since there are many questions about the way the TIOBE index is assembled, a special page is devoted to its definition. Basically the calculation comes down to counting hits for the search query
+"<language> programming"
The only advantage this methodology has is it's extremely cheap for the surveyor to use.

[0] <https://www.tiobe.com/tiobe-index/programminglanguages_defin...>
Can you help me envision what you're saying? It's async - you will have to wait whether it's good or not. And in theory, the better it is, the more time you'd have to comment here, right?
I don't yolo much code, but even so, there are some times where you reach a point where parallelism starts to make sense.
Once you have a stable workflow and foundation, and you front load design and planning, you might see opportunities appear. Writing tests, testing loops, simple features, documentation, etc.
I'm not in research, but I could imagine trying to solve hard problems by trying different approaches, or testing against different data, all in parallel.
They couldn't even be bothered to write the Tweet themselves...
But the other part of it is, each conversation you have, and each piece of AI output you read online, is written by LLM instance that has no memory of prior conversations, so it doesn't know that, from human perspective, it used this construct 20 times in the last hour. Human writers avoid repeating the same phrases in quick succession, even across different writings (e.g. I might not reuse some phrase in email to person A, because I just used it in email to unrelated person B, and it feels like bad style).
Perhaps that's why reading LLM output feels like reading high school essays. Those essays all look alike because they're all written independently and each is a self-contained piece where the author tries to show off their mastery of language. After reading 20 of them in a row, one too gets tired of seeing the same few constructs being used in nearly every one of them.
Some would argue there’s no point reviewing the code, just test the implementation and if it works, it works.
I still am kind of nervous doing this in critical projects.
Anyone just YOLO code for projects that’s not meant to be one time, but fully intend to have to be supported for a long time? What are learnings after 3-6 months of supporting in production?
I do use them, though, it helps me, search, understand, narrow down and ideate, it's still a better Google, and the experience is getting better every quarter, but people letting tens or hundreds of agents just rip... I can't imagine doing it.
For personal throwaway projects that you do because you want to reach the end output (as opposed to learning or caring), sure, do it, you verify it works roughly, and be done with it.
To me, someone who can code means someone who (unless they're in a detectable state of drunkenness, fatigue, illness, or distraction) will successfully complete a coding task commensurate with some level of experience or, at the very least, explain why exactly the task is proving difficult. While I've seen coding agents do things that truly amaze me, they also make mistakes that no one who "can code" ever makes. If you can't trust an LLM to complete a task anyone who can code will either complete or explain their failure, then it can't code, even if it can (in the sense of "a flipped coin can come up heads") sometimes emit impressive code.
Not quite, but in any event none of the avionics is an LLM or a program generated by one.
I also heard "I see the issue now" so many times because it missed or misunderstood something very simple.
I mean you'd think. But it depends on the motivations.
At Meta, we had league tables for reviewing code. Even then, people only really looked at it if a) they were a nitpicking shit, b) they didn't like you and wanted to piss on your chips, or c) it's another team trying to fix our shit.
With the internal Claude rollout and the drive to vibe code all the things, I'm not sure that situation has got any better. Fortunately it's not my problem anymore.
Where you have shared ownership, meaning once I approved your PR, I am just as responsible if something goes wrong as you are and I can be expected to understand it just as well as you do… your code will get reviewed.
If shipping is the number one priority of the team, and a team is really just a group of individuals working to meet their quota, and everyone wants to simply ship their stuff, managers pressure managers to constantly put pressure on the devs, you’ll get your PR rubber stamped after 20s of review. Why would I spend hours trying to understand what you did if I could work on my stuff.
And yes, these tools make this 100x worse, people don’t understand their fixes, code standards are no longer relevant, and you are expected to ship 10x faster, so it’s all just slop from here on.
People will ask LLM to review some slop made by LLM and they will be absolutely right!
There is no limit to laziness.
My "first pass" of review is usually me reading the PR stack in graphite. I might iterate on the stack a few times with CC before publishing it for review. I have agents generate much of my code, but this workflow has allowed me to retain ownership/understanding of the systems I'm shipping.
To me it feels like building your project on sand. Not a good idea unless it's a sandcastle
Usually about 50% of my understanding of the domain comes from the process of building the code. I can see a scenario where large scale automated code works for a while but then quickly becomes unsupportable because the domain expertise isn't there to drive it. People are currently working off their pre-existing domain knowledge which is what allows them to rapidly and accurately express in a few sentences what an AI should do and then give decisive feedback to it.
The best counter argument is that AIs can explain the existing code and domain almost as well as they can code it to begin with. So there is a reasonable prospect that the whole system can sustain itself. However there is no arguing to me that isn't a huge experiment. Any company that is producing enormous amounts of code that nobody understands is well out over their skis and could easily find themselves a year or two down the track with huge issues.
That example is from a recent bug I fixed without Cursor being able to help. It wanted to create a wrapper around the pool class that would have blocked all threads until a connection was free. Bug fixed! App broken!
Whether you’re reading a book or using an app, you’re communicating with the author by way of your shared humanity in how they anticipate what you’re thinking as you explore the work. The author incorporates and plans for those predicted reactions and thoughts where it makes sense. Ultimately the author is conveying an implicit mental model to the reader.
The first problem is that many of these pathways and edge cases aren’t apparent until the actual implementation, and sometimes in the process the author realizes that the overall app would work better if it were re-specified from the start. This opportunity is lost without a hands on approach.
The second problem is that, the less human touch is there, the less consistent the mental model conveyed to the user is going to be, because a specification and collection of prompts does not constitute a mental model. This can create subconscious confusion and cognitive friction when interacting with the work.
> The second problem is that, the less human touch is there, the less consistent the mental model conveyed to the user is going to be, because a specification and collection of prompts does not constitute a mental model. This can create subconscious confusion and cognitive friction when interacting with the work.
Tbf, this is a trend I see more and more across the industry, LLM or not: so many processes get automated that teams just implement X because PDM Y said so, and that's because they need to meet goal Z for the quarter... and everyone is on scrum autopilot; they can't see the forest for the trees anymore. I feel like the massive automation afforded by these coding agents may make this worse.
Results will vary depending on how automatically checkable a problem is, but I expect a lot of problems are amenable to some variation of this.
It means there is no value in producing more code, only value in producing better, clearer, safer code that can be reasoned about by humans. Which in turn makes me very sceptical about agents other than as a useful parallelisation mechanism akin to multiple developers working on separate features. But in terms of ramping up the level of automation - it's frankly kind of boring to me, because if anything it makes the review part harder, which actually slows us down.
Agents are stateless functions with a limited heap (context window) that degrades in quality as it fills. Once you see it that way, the whole swarm paradigm is just function scoping and memory management cosplaying as an org chart:
Agent = function
Role = scope constraints
Context window = local memory
Shared state file = global state
Orchestration = control flow
The solution isn't assigning human-like roles to stateless functions. It's shared state (a markdown file) and clear constraints.
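Spelled out as code, the analogy looks roughly like this (a sketch only; the agent stub and state.md filename are illustrative):

```typescript
import { appendFileSync, readFileSync } from "node:fs";

const SHARED_STATE = "state.md"; // "global state" is just a markdown file on disk

// Agent = function; role = scope constraints; whatever we pass in = its entire context.
async function agent(role: string, localContext: string): Promise<string> {
  // Stand-in for a model call: a fresh, stateless invocation every time.
  throw new Error("call your model here");
}

// Orchestration = plain control flow around those function calls.
async function step(role: string, task: string): Promise<string> {
  const shared = readFileSync(SHARED_STATE, "utf8");              // read global state
  const result = await agent(role, `${shared}\n\nTask: ${task}`); // scoped invocation
  appendFileSync(SHARED_STATE, `\n- [${role}] ${task}: done`);    // write back to shared state
  return result;
}
```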
Isn’t a “role” just a compact way to configure well-known systems of constraints by leveraging LLM training?
Is your proposal that everybody independently reinvent the constraints wheel, so to speak?
A. Using a role prompt to configure a single function's scope ("you are a code reviewer, focus on X") - totally reasonable, leverages training
B. Building an elaborate multi-agent orchestration layer with hand-offs, coordination protocols, and framework abstractions on top of that
I'm not arguing against A. I'm arguing that B often adds complexity without proportional benefit, especially as models get better at long-context reasoning.
Fairly recent research (arXiv May 2025: "Single-agent or Multi-agent Systems?" - https://arxiv.org/abs/2505.18286) found that MAS benefits over single-agent diminish as LLM capabilities improve. The constraints that motivated swarm architectures are being outpaced by model improvements. I admit the field is moving fast, but the direction of travel appears to be that the better the models get, the simpler your abstractions need to be.
So yes, use roles. But maybe don't reach for a framework to orchestrate a PM handing off to an Engineer handing off to QA when a single context with scoped instructions would do.
This thread seems surreal, I see multiple flow repositories mentioned with 10k+ stars. Comprehensive doc. genAI image as a logo.
Can anyone show me one product these things have accomplished please ?
I used some frontier LLM yesterday to see if it could finally produce a simple cascading style sheet fix. After a few dozen attempts and steering, a couple of hours, and half a million tokens wasted, it couldn't. So I fixed the issue myself and went to bed.
This person is on HN for the same reasons as I am, presumably: reading about hacker stuff. Entering prompts in black boxes and watching them work so you have more time to scratch your balls is not hacker stuff, it's the latest abomination of late stage capitalism and this forum is, sadly, falling for it.
I then went to see the latest batches. Cohorts are heavily building things that would support the fall for whatever this is. It needs supported or we won't make it.
But seems we are heading this way, from initially:
- a Senior Dev pairing with Junior Dev (2024/25)
- a tech lead/architect in charge of several Developers (2025)
- a Product Owner delegating to development teams (2026?)
---
- https://github.com/steveyegge/gastown
- https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
1. Subagents doing the work have a fresh context (i.e. focused, not working on top of a larger monolithic context).
2. Subagents enjoying a more compact context leads to better reasoning, more effective problem solving, and fewer tokens burned.
That’s a pretty basic functionality in Claude code
I wrote up a technical plan with Claude code and I was about to set it to work when I thought, hang on, this would be very easy to split into separate work, let's try this subagent thing.
So I asked Claude to split it up into non- overlapping pieces and send out as many agents as it could to work on each piece.
I expected 3 or 4. It sent out 26 subagents. Drudge work that I estimate would have optimistically taken me several months was done in about 20 minutes. Crazy.
Of course it still did take me a couple of days to go through everything and feel confident that the tests were doing their job properly. Asking Claude to review separate sections carefully helped a lot there too. I'm pretty confident that the tests I ended up with were as good as what I would have written.
This will only compound wasted time on Claude.ai, which exploits that time to train its own models.
Why is it time wasted? Claude's accuracy for shell, Bash, regex, Perl, text manipulation/scripting/processing, and system-level code is effectively negligible (~5%). Such code is scarce in public repositories. For swarms or agents to function, accuracy must exceed 96%. At 5%, it is unusable.
We do also use Claude.ai and we believe it is useful, but strictly for trivial, typing-level tasks. Anything beyond that, at this current point, is a liability.
I've been using that and it's excellent
As time went on I felt like the organization was kind of an illusion. It demanded something from me and steered Claude, but ultimately Claude is doing whatever it's going to do.
I went back to just raw-dogging it with lots of use of planning mode.
GSD might be better right now, but will it continue to be better in the future, and are you willing to build your workflows around that bet?
---
function i8() {
if (Yz(process.env.CLAUDE_CODE_AGENT_SWARMS)) return !1;
return xK("tengu_brass_pebble", !1);
}
---
So, after patch
function i8(){return!0}
---
The tengu_brass_pebble flag is server-side controlled based on the particulars of your account, such as tier. If you have the right subscription, the features may already be available.
The CLAUDE_CODE_AGENT_SWARMS environment variable only works as an opt-out, not an opt-in.
The rate at which a person running these tools can review and comprehend the output properly is basically reached with just a single thread with a human in the loop.
Which implies that this is not intended to be used in a setting where people will be reading the code.
Does that... Actually work for anyone? My experience so far with AI tools would have me believe that it's a terrible idea.
I have also started using it for writing tests.
I will write the first test (the "good path"); it can copy this and tweak the inputs to trigger all the branches far faster than I can.
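For example: the hand-written good-path test plus the copied-and-tweaked cases an agent can grind out (illustrative function and cases; assumes a Jest/Vitest-style runner):

```typescript
import { expect, test } from "vitest"; // or Jest globals; illustrative choice

// Hypothetical function under test.
function parsePort(value: string): number {
  const port = Number(value);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port: ${value}`);
  }
  return port;
}

// The hand-written "good path" test...
test("parses a normal port", () => {
  expect(parsePort("8080")).toBe(8080);
});

// ...and the tweaked-input variants that exercise the remaining branches.
test.each(["0", "65536", "-1", "80.5", "abc", ""])("rejects %s", (bad) => {
  expect(() => parsePort(bad)).toThrow();
});
```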
Executives/product managers/sales often only really care about getting the product working well enough to sell it.
I don't mean this in a disparaging way. But we're at a car-meets-horse-and-buggy moment and it's happening really quickly. We all need to at least try driving a car and maybe park the horse in the stable for a few hours.
Attitudes like that, where you believe that the righteous AI pushers will be saved from the coming rapture while everyone else will be out on the streets, really make people hate the AI crowd.
You have to realize this is targeting manager and team lead types who already mostly ignore the details and quality frankly. "Just get it done" basically.
That's fine for some companies looking for market fit or whatever - and a disaster for some other companies now or in future, just like outsourcing and subcontracting can be.
My personal take is: speed of development usually doesn't make that big a difference for real companies. Hurry up and wait, etc.
That's what you're missing -- the key point is, you don't review and comprehend the output! Instead, you run the program and then issue prompts like this (example from simonw): "fix in and get it to compile" [0]. And I'm not ragging on this at all, this is the future of software development.
[0] https://gisthost.github.io/?9696da6882cb6596be6a9d5196e8a7a5...
I feel like software engineers are taking a lot of license with the idea that if something bad happens, they will just be able to say "oh the AI did it" and no personal responsibility or liability will attribute. But if they personally looked at the code and their name is underneath it signing off the merge request acknowledging responsibility for it - we have a very different dynamic.
Just like artists have to re-conceptualise the value of what they do around the creative part of the process, software engineers have to rethink what their value proposition is. And I'm seeing a large part of it is, you are going to take responsibility for the AI output. It won't surprise me if after the first few disasters happen, we see liability legislation that mandates human responsibility for AI errors. At that point I feel many of the people all in on agent driven workflows that are explicitly designed to minimise human oversight are going to find themselves with a big problem.
My personal approach is I'm building up a tool set that maximises productivity while ensuring human oversight. Not just that it occurs and is easy to do, but that documentation of it is recorded (inherently, in git).
It will be interesting to see how this all evolves.
I do a fair amount of agentic coding, but always periodically review the code even if it's just through the internal diff tool in my IDE.
Approximately 4 months ago Sonnet 4.5 wrote this buried deep in the code while setting up a state machine for a 2d sprite in a relatively simple game:
// Pick exit direction (prefer current direction)
const exitLeft = this.data.direction === Direction.LEFT || Math.random() < 0.5;
I might never have even noticed the logical error but for Claude Code attaching the above misleading comment. 99.99% of true "vibe coders" would NEVER have caught this.

I believe bees call it "bzz bzzt *clockwise dance* *wiggle*"
https://github.com/docker-archive/classicswarm/releases/tag/...
https://github.com/openai/swarm/commit/e5eabc6f0bdc5193d8342...
Looks like claude calls it just "teams" under the covers
Even 90 word tweets are now too long for these people to write without using AI, apparently.
“FTSChunkManager agent is still running but making good progress, let’s wait a bit more for it to complete” (it’s implementing hybrid search) plus a bunch of stack traces and json output.
My implementation was slightly different as there is no shared state between tasks, and I don't run them concurrently/coordinate. Will be interesting to see if this latter part does work because I tried similar patterns and it didn't work. Main issue, as with human devs, was structuring work.
With Swarm mode, it seems there's a new option for an entire team of agents to be working in the wrong direction before they check back in to let you know how many credits they've burned by misinterpreting what you wanted.
Mine also rotate between Claude or Z.ai accounts as they run out of credits.
Honestly if people in AI coding write less hype-driven content and just write what they mean I would really appreciate it.
Way too much code for such a small patch
Incredible.
In his second post he included a link to GitHub: https://github.com/mikekelly/claude-sneakpeek
Manager (Claude Opus 4.5): Global event loop that wakes up specific agents based on folder (Kanban) state.
Product Owner (Claude Opus 4.5): Strategy. Cuts scope creep.
Scrum Master (Opus 4.5): Prioritizes backlog and assigns tickets to technical agents.
Architect (Sonnet 4.5): Design only. Writes specs/interfaces, never implementation.
Archaeologist (Grok-Free): Lazy-loaded. Only reads legacy Java decompilation when Architect hits a doc gap.
CAB (Opus 4.5): The Bouncer. Rejects features at Design phase (Gate 1) and Code phase (Gate 2).
Dev Pair (Sonnet 4.5 + Haiku 4.5): AD-TDD loop. Junior (Haiku) writes failing NUnit tests; Senior (Sonnet) fixes them.
Librarian (Gemini 2.5): Maintains "As-Built" docs and triggers sprint retrospectives.
You might ask yourself the question “isn’t this extremely unnecessary?” and the answer is most likely _yes_. But I never had this much fun watching AI agents at work (especially when CAB rejects implementations). This was an early version of the process that the AI agents are following (I didn’t update it since it was only for me anyway): https://imgur.com/a/rdEBU5I