$500 GPU outperforms Claude Sonnet on coding benchmarks

https://github.com/itigges22/ATLAS

Comments

bloppeMar 27, 2026, 7:13 AM
Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.
bartreadMar 27, 2026, 11:00 AM
I agree. Also good for small changes that need to be applied consistently across an entire codebase.

I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated, and also needed queries updating to exclude soft-deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).

Of course, this is not hard to do manually, but it is a bloody chore and tends toward the error-prone. But the agent made short work of it, for which I was very grateful.
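For anyone unfamiliar with the pattern, here's a minimal sketch of what the migration amounts to (illustrative SQLite schema and names, not our actual system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        deleted_at TEXT  -- NULL while the row is live
    );
    INSERT INTO accounts (name) VALUES ('alice'), ('bob');
""")

# A hard DELETE becomes an UPDATE that stamps the row...
conn.execute("UPDATE accounts SET deleted_at = datetime('now') WHERE name = 'bob'")

# ...and every ordinary query must now exclude soft-deleted rows,
live = conn.execute(
    "SELECT name FROM accounts WHERE deleted_at IS NULL"
).fetchall()

# except in special cases, e.g. an admin restoring deleted data:
conn.execute("UPDATE accounts SET deleted_at = NULL WHERE name = 'bob'")
```

The chore is precisely that every existing DELETE and most SELECTs have to be touched in the same mechanical way, which is why it suits an agent.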

CraigJPerryMar 27, 2026, 11:14 AM
Do you not end up breaking half the value of referential integrity doing it that way? (E.g., you had to update all the queries, but now you have a sharp edge in that all future queries need to remember to be soft-delete aware. Not a blocker for sure, just a sharp edge.)

You know your system better than me, a random commenter on a website, for sure :-D Your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete", and I felt compelled to give an unsolicited and likely wrong opinion.

bartreadMar 27, 2026, 3:49 PM
Yeah, I did consider moving records to shadow tables, but - because of the nature of our data - it requires moving a lot of child records as well, so it's quite a lot of additional churn in WAL, and the same for restore. And this approach has its own challenges with referential integrity.

More than that, though: lots of queries for reporting, and the like, suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is that it doesn't really eliminate complexity for us: it just moves it elsewhere.

Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.

The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.

As I say, not to diss other approaches: in a different situation I might have chosen one of them.

My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.

For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.

andyferrisMar 27, 2026, 11:36 AM
I move the record to another _index_, generally.

It depends whether you reliably control all the DB client code, of course.

dakolliMar 27, 2026, 3:39 PM
Must be something incredibly simple that you're making out to be more complicated than it actually is; I've never seen an LLM do these things well.
bartreadMar 27, 2026, 3:51 PM
This is what gives me the warm fuzzies about the HN community: people jumping to wild conclusions about your domain and systems based on a 4 sentence comment. /s
sigmoid10Mar 27, 2026, 8:47 AM
Probably want to look at SWE-bench Pro or Terminal-Bench 2. They cover these longer-horizon tasks that need more than just writing a bit of code in one file. And SWE-bench Pro in particular is not yet saturated like many other common benchmarks. Normal SWE-bench and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo readme or press release.
jakozaurMar 27, 2026, 11:16 AM
Build systems are tested by CompileBench (Quesma's benchmark).

Disclaimer: I'm the founder.

slashdevMar 27, 2026, 1:12 PM
Generating big chunks of code is all I do, all day.

I don't write code by hand any more, neither at work, nor for side projects.

I work mostly in Rust and TypeScript at a developer tools company.

imiricMar 27, 2026, 1:15 PM
[flagged]
serfMar 27, 2026, 2:24 PM
I have never read a snide comment on this site that I've been more repulsed by.

I think because it's so specifically sharpened to stab at the software developer, my compatriot, one of the primary populations here, rather than just being an overall shitty human insult -- and timed to do so when the person opens up in an honest dialogue about what they're doing.

But good news: every large software house I've talked to in the past two years is touching AI. As tragic as that is for a multitude of good reasons surrounding the workforce/copyright/IP/human-laziness/loss-of-skill/etc., that means imiric is going to be outside of software, by their own rules, in totality in just a few short years!

Happy days!

imiricMar 27, 2026, 5:19 PM
[flagged]
slashdevMar 27, 2026, 5:58 PM
You only hurt yourself with that attitude. AI might take your job.
imiricMar 27, 2026, 6:08 PM
> You only hurt yourself with that attitude.

Funny, others seem more hurt by it.

> AI might take your job.

I'm not the one "grieving the loss of his career". :)

slashdevMar 27, 2026, 1:42 PM
We have the quietest on-call rotation of any company I've ever worked at.

We have a high standard for code review, static verification, and tests.

The fact that the code isn't hand-rolled artisanal code, and is generated by AI now, has so far turned out to have no impact on product quality or bugs reported.

imiricMar 27, 2026, 5:22 PM
Ah, that's great, sounds like the ideal working environment.

So, which company is it again?

dlahodaMar 27, 2026, 2:12 PM
What company or tools are you working with?
aditmagMar 27, 2026, 1:25 PM
Tbf, as long as you really know what you're doing and have the sense to avoid falling into a spaghetti code trap, generating bigger chunks of code absolutely works and should be done. The pitfall happens when

(a) the dev has no idea what the agent is doing, or (b) the dev gives overly broad instructions.

If you give it specific enough tasks (not to the point where it's writing singular functions) but a general class description, you're on a good track.

yohannparisMar 27, 2026, 1:27 PM
Why? Because writing code is the only measure of quality when producing tools? What about unit and integration tests, UX research, and performance tests?
adrian_bMar 27, 2026, 2:15 PM
I agree that for many applications the code written by an LLM can be good enough, as proven by the many commercial applications that contain even worse code.

However, anyone who uses an LLM must remain aware of the limitations of this method.

There are many features of a program that cannot be tested exhaustively and which must be guaranteed by its design. When you do not understand the structure of a program very well, it may be difficult to decide what must be tested.

With performance, the confidence in what an LLM produces is even lower, because you are unlikely to know whether you have really reached a hardware-limited level of performance. Obtaining better performance than a previously existing program does not prove anything, because most existing programs are likely to perform far below what is possible.

In many cases you just want a performance good enough, not the best attainable, so you can be content with your LLM-generated program. But you must not fool yourself by believing that this is really the best that can be done.
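One way to keep yourself honest is to compare achieved throughput against a hardware ceiling rather than against another program. A deliberately naive sketch (the nominal bandwidth figure is a placeholder; substitute your machine's actual spec):

```python
import time

NOMINAL_GBPS = 50.0  # placeholder: look up your machine's real memory bandwidth

n = 20_000_000
data = bytes(n)  # 20 MB of zeros

start = time.perf_counter()
total = sum(data)  # touches every byte of the buffer
elapsed = time.perf_counter() - start

achieved_gbps = n / elapsed / 1e9
print(f"achieved {achieved_gbps:.3f} GB/s vs ~{NOMINAL_GBPS} GB/s nominal")
```

On most machines this pure-Python loop lands orders of magnitude below the ceiling, which is exactly the point: being faster than a slow baseline says nothing about being near the hardware limit.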

BombthecatMar 27, 2026, 9:43 AM
Oh yes! I now let my environments be built by agents via kubectl/helm and let them debug issues.

It's amazing! Saves hours of work!

I create the basic Helm configs, settings, etc., and when there is a conflict or something not working, I let an agent fix it!

seunosewaMar 27, 2026, 1:07 PM
Create it!
d0963319287Mar 27, 2026, 2:31 PM
[flagged]
mmaunderMar 27, 2026, 1:33 AM
I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
vidarhMar 27, 2026, 11:42 AM
I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.

It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.

I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.

If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.

thefourthchimeMar 27, 2026, 3:23 AM
I won’t use anything less than the SOTA. I tried using Opus 4.6 medium and immediately regretted it. High messes up enough.
overfeedMar 27, 2026, 6:03 AM
What were you using 6 months ago?
withinboredomMar 27, 2026, 6:23 AM
Opus 4.5 ~= Opus 4.6 high. Opus 4.5 was nerfed just before or after the release of 4.6.
hhhMar 27, 2026, 8:44 AM
The models don’t change.
tornikeoMar 27, 2026, 8:50 AM
On paper. There's huge financial incentive to quantize the crap out of a good model to save cash after you've hooked in subscriptions.
armchairhackerMar 27, 2026, 10:17 AM
And there’s an incentive to publish evidence of this to discourage it, do you have any?
TeMPOraLMar 27, 2026, 10:54 AM
Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
natebcMar 27, 2026, 11:12 AM
There really always is a man behind the curtain eh?
TeMPOraLMar 27, 2026, 11:35 AM
It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).

ETA: reminds me of biology, too. In life, it turns out the simpler some functional component looks, the more stupidly overcomplicated it is when you look at it under a microscope.

woadwarrior01Mar 27, 2026, 11:23 AM
There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.

[1]: https://marginlab.ai/trackers/claude-code/

nlMar 27, 2026, 11:31 AM
So - as the charts say - no statistical difference?

Isn't this link an argument against the point you are making?

withinboredomMar 27, 2026, 12:47 PM
The chart doesn't cover the 4.6 release, which was in the late-December/early-January time frame. So it's hard to tell from existing data.
coldteaMar 27, 2026, 12:11 PM
Anybody with more than five years in the tech industry has seen this done in all domains, time and again. What evidence do you have that AI is different, which is the extraordinary claim in this case...
seunosewaMar 27, 2026, 5:53 PM
Or just change the reasoning levels.
esskayMar 27, 2026, 9:22 AM
Real-world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" -- one that somehow just keeps happening 4-6 months after a model is released.
yorwbaMar 27, 2026, 11:40 AM
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.

https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
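To put rough numbers on that, a standard two-proportion power calculation (a back-of-the-envelope sketch; the z-values assume ~5% significance and ~80% power):

```python
def n_per_group(delta, p=0.5):
    """Approximate samples per model needed to detect a difference
    `delta` between two pass rates near p."""
    z = 1.96 + 0.84  # ~5% significance, ~80% power
    return 2 * z**2 * p * (1 - p) / delta**2

print(round(n_per_group(0.10)))  # ~392: a 10-point drop needs ~400 runs per model
print(round(n_per_group(0.02)))  # ~9800: a 2-point drop is far beyond casual use
```

At 50 tasks per day, even a large regression takes over a week of pooled samples to distinguish from noise.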

nextaccounticMar 27, 2026, 12:32 PM
It's hard to trust public, high-profile benchmarks, because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro, so everything that gets released would perform well on this benchmark.
yorwbaMar 27, 2026, 1:03 PM
Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.
ferMar 27, 2026, 9:39 AM
They do. I'm currently seeing a degradation in Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".
stavrosMar 27, 2026, 10:12 AM
Make that 2, I told my friends yesterday "Opus got dumb, new model must be coming".
arcanemachinerMar 27, 2026, 10:48 AM
I swear that different sessions will route to different quants. Sometimes it's good, sometimes not.
coldteaMar 27, 2026, 12:09 PM
Only nominally...
pixel_poppingMar 27, 2026, 9:33 AM
Oh yes, they do.
girvoMar 27, 2026, 9:57 AM
I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.
coldteaMar 27, 2026, 12:19 PM
No conspiracy theories. Companies being scumbags, cutting corners, and doctoring benchmarks while denying it. Happens since forever.
rf15Mar 27, 2026, 6:23 AM
You cannot afford the SOTA.
weird-eye-issueMar 27, 2026, 6:33 AM
Why is that? The $200 per month subscription comes with a ton of usage.

Opus 4.6 is available on the $20 plan too

EpholysMar 27, 2026, 9:28 AM
> The $200 per month subscription comes with a ton of usage.

$200 + VAT is half of my rent.

I know HN is not a good place to rant on this subject, but I'm often flabbergasted by the number of people here who live in a bubble with regard to the price of tech. Or just prices in general.

I remember someone who said a few years ago (I'm paraphrasing): "You could just use one of the empty room in your house!". It was so outlandish I believed it was a joke at first.

EDIT: "not", minor grammar

mememememememoMar 27, 2026, 11:38 AM
Thanks for the alternative perspective.

I think I am in the middle. I can afford $200/m but it'd be a brainer. And I don't pay that as I barely use home AI enough to warrant it.

I am also amazed at the richer end of HN, but now I realize I am privileged. Earned it? Like fuck I did. Lucky to be born a geek in the late 20c. I'd be useless as a middle-ages guy.

throwaway173738Mar 27, 2026, 12:38 PM
If I found myself in the middle ages I’d just become a blacksmith or a miller.
FilligreeMar 27, 2026, 3:10 PM
Do you have the genetics for that? It takes a lot of raw strength, and not that much intelligence.
layer8Mar 27, 2026, 11:25 AM
The other part of the bubble is assuming you work on projects that allow disclosing any code or project details to a generic third party with that kind of power asymmetry.
1123581321Mar 27, 2026, 4:19 PM
It’s a good reminder. Claude Max costs about as much as the global poverty line ($3/day). I think it’s okay to invest in it, but we should try to make sure it’s worthwhile, and also invest in charity.
BombthecatMar 27, 2026, 9:45 AM
That's why AI is for the "rich". Poor people, and later on the middle class, will be left behind....
TeMPOraLMar 27, 2026, 10:59 AM
Nah, that's why you cannot not afford the subscriptions these days. Whatever your needs, ever since Claude Code became a thing, subscription costs come out massively cheaper than pay-as-you-go per-token API pricing. Also SOTA models are so much better than anything else, that using older or open models will just cost you more in tokens/electricity than going for SOTA subscription.

Subscriptions are definitely middle-class targeted. $20/month is not much for the value provided, at least not in the western world.

But if by "rich" you just mean "westerners", then in this sense, the same is and has always been true for computing in general.

srousseyMar 27, 2026, 4:28 PM
The subscriptions are purposely sold for less than cost. The subsidy will end some day.
mememememememoMar 27, 2026, 11:43 AM
Not sure. AI is sort of at car-ownership prices. I think while that ain't poor, that is middle class.

So like if you want to start a business of any sort the AI sub is still peanuts.

AI is a car, or a dog, or a mild social life, or a utility bill level of cost. And that's for the level needed by a sane typical developer. (AI maximalists need $250k/y; let them slop it out.)

It is not a Cessna, an infinity pool or a 1 month vacation.

dahartMar 27, 2026, 2:43 PM
$200/mo is a lot, sure, but the shocking part of that comparison is your rent. I didn’t know $400/mo apartments still existed. For most people in the US and EU, $200 would be closer to 15%-20% of rent I think? My cell phone bill for my family is almost $200/mo.

Last year, at first, $200 seemed crazy. Now that I’m getting addicted to coding agents, not so much. Some companies are paying API rates for AI for employees, and it’s a lot more than $200/mo. It seems like funny money, and I’m not sure it’ll last.

AerroonMar 27, 2026, 5:10 PM
It is my belief that rent price scales with the leftover income people have after they've paid for other necessities. I.e., if you're from a poorer country/area, then things like milk and gasoline will cost a similar amount (maybe a 2x difference), but rent will cost a lot less. As people in a country get richer, they start paying a larger and larger share of their income as rent of various forms.

Even the US has places with cheap rent/housing. The downside is that there's no (well-paying) work nearby.

dahartMar 27, 2026, 6:29 PM
It’s true that average rent prices are regional and poorer areas have lower rents, but that doesn’t tend to make much difference in urban areas and large cities, where the majority of people live now. Why do you feel that rent scales with disposable income? Economists generally say the opposite, based on housing being a core necessity: that people pay rent in proportion to their income, and only what’s left over is the disposable amount. That’s why we have the 30% rule, for example.

You’re technically correct, btw: rental housing is a market and is subject to market forces, meaning what people are willing to pay. I’m just not so sure about framing rent as being lower priority than other necessities. And rent prices have been increasing faster than other necessities, and faster than income, so that might be a confounding factor in your argument.

Still, my initial reaction above is due to the fact that in the US and in Europe in most large cities, the average rent is north of $1000/mo.

coldteaMar 27, 2026, 12:23 PM
In the US/Western Europe? Because for devs especially in the former, $200 is pocket change, especially for a core productivity tool. And the rent would be in the $1200 to $3000 range easily. Same for houses. Maybe not in NY or SF, but in most of the US there's no shortage of housing space and spare rooms.
EpholysMar 27, 2026, 12:47 PM
I've seen those comments about $200/month and empty rooms here, so I suppose they mainly come from the US, yes.

So yes, you describe a situation that I feel like a lot of people here don't understand is not the norm.

I compared the subscription with my rent precisely because it's easier to compare: with your numbers it would be like paying from $600 up to $1500 / month. Pretty hard to justify.

lostmsuMar 27, 2026, 1:32 PM
> Because for devs especially

Are you not a dev? If not, what would you use a coding tool for? They still require handholding for anything largeish. Still much cheaper than outsourcing.

weird-eye-issueMar 27, 2026, 10:21 AM
You think I don't understand that? I'm friends with people who make little more than that amount per month.

But it's not all that relevant to this conversation. It's not like this is the first time economic inequality is a thing.

It's about as relevant as me factoring in your salary the next time I go to buy a car.

EpholysMar 27, 2026, 10:49 AM
First, I'd assumed you were in the bubble I described, but that's not the case, so sorry about that.

Also, I think it's relevant to the conversation.

You replied to someone who said that "you" (an undirected pronoun, I suppose) can't afford the SOTA by saying that the $200/month Anthropic subscription comes with a ton of usage. So I interpreted it as a general statement. Was that not what you meant?

I'm a bit lost about who you're talking to/about in your first comment: the person you respond to, a general statement for everyone reading, or yourself?

weird-eye-issueMar 27, 2026, 11:27 AM
I assume when somebody says "you" and is not talking about anyone in particular, they mean that it's infeasible for virtually everybody, which is certainly not the case. Also, you conveniently disregarded the fact that it is available on the $20 per month plan.
EpholysMar 27, 2026, 12:40 PM
Okay, I understand better. I interpreted your answer as "well, it's $200, everybody can afford it". Clearly a misunderstanding.

Going back to the $20 plan, yes, I agree it's much more accessible.

I didn't talk about it because I've seen a lot of comments here, on blogs, and on social media about how a $200 subscription for Claude is a no-brainer. And it got on my nerves, so I wanted to convey how much money that can be. To you (though that was misguided, reading your answers), and to concerned HN commenters in general.

edgyquantMar 27, 2026, 10:15 AM
For me I pass the token costs off to my clients. Not everyone is a hobbyist burning their own cash on personal projects
hyperbovineMar 27, 2026, 10:18 AM
Work pays.
EpholysMar 27, 2026, 10:39 AM
I'm not sure I've correctly understood what you're implying.

If it's that I'm not working, well, I'm employed.

If it's that I'm not working enough to have this money... Well, we still go back to the bubble. Not everywhere in the world can you easily find a job that pays you enough, even if you accept working more. And the employer will not accept giving developers a $200/month subscription, even less for personal use.

If it's that I'm not working enough and I should go freelancing to work as much as I want and get rich (I'm extrapolating). Well, you're right, I could do that. But (at least at first), I would work a lot more for much less money. And even if I become a recognized freelancer, it doesn't change the fact that I'll earn less money compared to the baseline of SF, or even the USA in the tech sector in general. So, bubble again. I could also, like someone said, put the tokens cost into my hourly/daily rate, but I'll be much more expensive than other freelancers.

Also, but that's a "me case" compared to my previous points, health issues can greatly affect how much work you can do.

rob0Mar 27, 2026, 11:51 AM
> I could also, like someone said, put the tokens cost into my hourly/daily rate, but I'll be much more expensive than other freelancers.

Do you have any evidence of that? I think the OPs are assuming this as a premise so their logic is probably valid but may not be sound logic for you.

EpholysMar 27, 2026, 1:00 PM
I don't have any hard evidence, no.

Instinctively, if we suppose all the newbie freelancers without any reputation start with the lowest rate possible to be competitive, passing additional costs to my client will mechanically increase my rate, putting me at a disadvantage in getting any work. And with the difference in monetary value for the same price of tokens, the rate delta is higher.

It's a simplified model of the world, but it feels like simple economic rules.

I assume the comment I'm referring to was written by someone who is already established, and for whom the token cost being passed on is lower relative to my environment.

nextaccounticMar 27, 2026, 12:25 PM
I guess what was meant is that those tools are generally bought by the employer
walletdrainerMar 27, 2026, 9:51 AM
> I'm often flabbergasted by the number of people here who live in a bubble with regard to the price of tech

Sorry, no. You live in the bubble, the people you think are living in a bubble are actually doing the very opposite and taking advantage of the lack of bubbles in our globally connected world.

Today, basically anyone can sell any bullshit to billions of people around the world. We’ve never lived in less of a bubble.

stavrosMar 27, 2026, 10:11 AM
I guess all those people who live in not-SF just can't be bothered to succeed!
TeMPOraLMar 27, 2026, 11:00 AM
$20/month is not above middle class in most of the world.

$200/month is, but you don't need that for anything except beyond-casual use of coding agents.

weird-eye-issueMar 27, 2026, 10:35 AM
To be fair if you think only people in SF can afford that you do kind of live in a bubble.
stavrosMar 27, 2026, 10:40 AM
Nobody in this thread claimed that.
weird-eye-issueMar 27, 2026, 10:48 AM
The person you were replying to was not talking about SF but you specifically called out SF so you were implying that
stavrosMar 27, 2026, 10:51 AM
The thread started with "$200 is a lot for most of the world", the person I was replying to said "no it's not, now anyone can sell to billions of people", and I said "company success being concentrated in SF shows that that's not true".

I didn't say "only SF can afford $200/mo".

weird-eye-issueMar 27, 2026, 11:03 AM
"I guess all those people who live in not-SF just can't be bothered to succeed!"
stavrosMar 27, 2026, 11:05 AM
I explained it in my previous comment, I'm not going to explain it more than that.
weird-eye-issueMar 27, 2026, 11:27 AM
Again, if you think that only successful companies are in SF you live in a bubble.
andoandoMar 27, 2026, 3:15 PM
I dunno how you guys even get through the $200 subscription. I use it every day for work and side projects, doing tasks in parallel, and I'm nowhere near the limit on $100.
m4rtinkMar 27, 2026, 9:54 AM
A subscription for coding - no thanks.
weird-eye-issueMar 27, 2026, 12:55 PM
If you think it's only for coding you don't have much of an imagination :)
RetroSteve0Mar 27, 2026, 2:12 PM
These are the types of individuals who get so left in the dust that they don't realize what's going on anymore, and it's obvious this person is already there. Claude hasn't been a "subscription for coding" product for quite some time now. That's how it started out, and while that's certainly what Claude is known for, Anthropic has been pushing for Claude to also be a general productivity tool -- Claude Code, then Claude Desktop, Claude Work, and now Claude Desktop has Chat, Work, and Code essentially built into a single desktop app that just works wonders for those who are looking for a general productivity tool.

I'd not use it over pure Claude Code because I am at heart a coder and I want the raw terminal experience and there's some features missing from the "Code" tab in Claude Desktop, but just saying "a subscription to code", just goes to show how out of touch that person already is, and that's what resistance does to you when you try to resist making use of any kind of modern tooling or technology.

aleph_minus_oneMar 27, 2026, 9:12 AM
> The $200 per month subscription comes with a ton of usage.

200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.

maleldilMar 27, 2026, 11:12 AM
The $100 already gives plenty of usage and is more than worth it, and I'm definitely not an affluent SV developer. I've only ever hit the 5h limit once in the last month, although I rarely run more than 3 agents at once, and I don't use ridiculously expensive tools like Gas Town.
weird-eye-issueMar 27, 2026, 9:17 AM
"Opus 4.6 is available on the $20 plan too"
revolvingthrowMar 27, 2026, 9:28 AM
Anthropic’s $20 plan gives you such a pittance of tokens that it’s borderline unusable for anything more than a few scripts or a toy app. If $20 is all you have you’d do _much_ better going with chatgpt
maleldilMar 27, 2026, 11:13 AM
The Codex plan for the $20 ChatGPT plan goes much further than Claude's $20 plan, but it's still not enough if you plan to work full-time with it.
WeryjMar 27, 2026, 10:35 AM
My usage is in the $60 tier, but that doesn't exist so I have to cough up $100. And then get all shaky if I don't use up my weekly quota.
weird-eye-issueMar 27, 2026, 10:36 AM
Do you mostly just hit the session limits? If so I know it's not ideal but you could wait an hour or two for that to reset. Not sure if that would work for you but just a suggestion
WeryjMar 27, 2026, 1:51 PM
I get to 80% when on a single session and cap out an hour off the reset if I’m working on two.

But I like to have that forced hour to stop; it’s a moment to take a breath.

It depends on the kind of work though, some things are more token intensive.

weird-eye-issueMar 27, 2026, 10:23 AM
That's simply not true at all.
cpursleyMar 27, 2026, 9:51 AM
Are you kidding me? Even developer salaries in the Philippines can afford that or at least the plan below it. If I used the Anthropic API, my monthly spend would be $4k a month. The Claude Max plan is the best bargain around.
LoganDarkMar 27, 2026, 9:47 AM
> 200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.

Not true, I live in USA PNW and my last remote job paid $12k/mo. I have been jobless for over a month now (currently waiting for the next HN "who wants to be hired"), but I still have enough savings to easily afford to continue that plan for a while.

I don't think it really has to do with affluence but more the job market and economy you're in. Countries with lower salaries or higher costs of living will have less buying power.

komali2Mar 27, 2026, 7:05 AM
I'm starting to think in these conversations we're all often talking about two different things. You're talking about running an LLM service through its provided tooling (codex, Claude, cursor), others seem to be talking token costs because they're integrating LLMs into software or are using harness systems like opencode, pi, or openclaw and balancing tasks across models.
weird-eye-issueMar 27, 2026, 7:21 AM
Fair enough, I read it quickly and assumed the person they replied to was talking about Claude Code

But I run an AI SaaS and we do offer Opus 4.6, too. Our use case is not nearly as token intensive as something like coding so we are still able to offer it with a good profit margin.

Also you can run OpenClaw with your CC subscription. It's what I do.

BoorishBearsMar 27, 2026, 8:29 AM
I wrap Opus 4.5 in a consumer product with 0 economic utility and people pay for it, I'm sure plenty of end users are willing to pay for it in their software.

Edit: I'm not using the term of art, I mean it literally cannot make them money.

eruMar 27, 2026, 8:32 AM
> [...] in a consumer product with 0 economic utility and people pay for it, [...]

Sorry, how do these two things go together?

If people pay for it, it has economic utility, doesn't it? I mean, people pay to watch movies or play video games, too.

XCSmeMar 27, 2026, 1:38 AM
Yup, they do quite poorly on random non-coding tasks:

https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...

rmi_Mar 27, 2026, 7:54 AM
Wild benchmark. Opus 4.6 is ranked #29, Gemini 3 Flash is #1, ahead of Pro.

I'm not saying it's bad, but it's definitely different than the others.

XCSmeMar 27, 2026, 8:28 AM
The main reason is that Claude models tend to ignore instructions. There is a failure example on the Methodology page.
BoorishBearsMar 27, 2026, 8:31 AM
> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.

Yuck. At that point don't publish a benchmark, explains why their results are useless too.

-

Edit since I'm not able to reply to the below comment:

"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.

I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.

XCSmeMar 27, 2026, 8:34 AM
Why not? I described this in more detail in other comments.

Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external APIs, parsing documents, etc.

Most models get this right. Also, this is just one failure mode of Claude.

usagisushiMar 27, 2026, 4:37 AM
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.

Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)

wizeeMar 27, 2026, 3:27 AM
It’s worth also comparing Qwen 3.5, it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.
vidarhMar 27, 2026, 4:57 PM
While I like these models, if you're getting similar results to SOTA models from 6 months ago, I have to question how far you pushed those models 6 months ago. It is really easy to find scenarios where these models really underperform. They take far more advanced harnesses to perform reasonably (and hence the linked project). It's possible to get good results out of them, but it takes a lot of extra work.

I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.

XCSmeMar 27, 2026, 4:01 AM
I used qwen 3.5 plus in production, it was really good at instruction following and tool calling.
redohMar 27, 2026, 12:21 PM
we used Kimi 2.5, its really good
raincoleMar 27, 2026, 9:51 AM
I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.
scotty79Mar 27, 2026, 10:01 AM
GLM 5 here is significantly better than GPT-5.4
anonyggsMar 27, 2026, 10:38 AM
[dead]
comboyMar 27, 2026, 8:23 AM
Not really related, but does anybody know if somebody's tracking same models performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.
XCSmeMar 27, 2026, 8:27 AM
Oh, I didn't think about this, that's a good idea. I also feel generally model performance changes over time (usually it gets worse).

The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.

comboyMar 27, 2026, 9:54 AM
Yeah, good tests are associated with cost. I'd like to see benchmarks on big messy codebases and how models perform on a clearly defined task that's easy to verify.

I was thinking that tokens spent in such a case could also be an interesting measure, though some agents do small useful refactorings along the way. The prompt could specify to do the minimal change required to achieve the goal.

miroljubMar 27, 2026, 8:54 AM
> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence.

I use MiniMax daily, mostly for coding tasks, using pi-coding-agent mostly.

> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.

I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.

> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.

Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.

What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.

When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.

Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.

The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.

tim-projectsMar 27, 2026, 9:15 AM
I've only been using free tokens for a year now: Gemini, until they just dropped Pro, so I switched to MiniMax. Bit of a hurdle switching from Gemini-cli to kilo-cli, but now I can't really see too much difference.

If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.

I've not ever used Claude and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.

When I hit issues with these lower models I think hard about creating the right tooling, agnostic to the harness. Maybe it's more work, but I can carry those tools to any setup going forward. That's how it was in the early Linux days, so why change what clearly works?

bethekindMar 27, 2026, 2:29 PM
I've used Gemini and now claude. Both were meh until I found the superpowers skill. Will be trying chatgpt next month.

You can "feel" the llm being limited with Gemini, less so with Claude. Hopefully even less so with chatgpt

mongrelionMar 27, 2026, 10:26 AM
What is this 10€ per month subscription that you are talking about?
hariasMar 27, 2026, 10:34 AM
throwa356262Mar 27, 2026, 2:06 PM
How is the speed and stability?

These small Chinese companies don't always have access to serious hardware.

moffkalastMar 27, 2026, 8:36 AM
Kimi's been one of my go-to options lately and it oftentimes outperforms both Claude and GPT in debugging, finding the actual problem immediately while the other two flail around drunkenly.

It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
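That failure mode (silent edits during a supposedly verbatim rewrite) is at least easy to catch mechanically: diff the model's output against the original and reject anything that changed when nothing should have. A generic sketch with stdlib `difflib`, not something any vendor ships:

```python
import difflib

# Original snippet and a "verbatim" rewrite with a tiny silent change
# ('aeiou' became 'aeiuo') -- the kind of corruption described above.
original = "def vowels(s):\n    return sum(c in 'aeiou' for c in s)\n"
rewrite  = "def vowels(s):\n    return sum(c in 'aeiuo' for c in s)\n"

# unified_diff yields nothing when the texts are identical,
# so any output at all means the rewrite was not verbatim.
diff = list(difflib.unified_diff(
    original.splitlines(keepends=True),
    rewrite.splitlines(keepends=True),
    fromfile="original", tofile="model_output",
))

if diff:
    print("rewrite is NOT verbatim:")
    print("".join(diff), end="")
```

For full-file rewrites you could gate the agent's output on a check like this and fall back to asking for diffs only.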

smokelMar 27, 2026, 8:40 AM
And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings.
moffkalastMar 27, 2026, 8:43 AM
No tooling, just manual use. When doing these comparisons I gather and format all the data they need to figure out the problem, and paste the same thing into all models so it's a pretty even eval.

I doubt Kimi would do well with most harnesses, its outputs are pretty chaotic in terms of formatting but the intelligence is definitely there.

m00xMar 27, 2026, 6:23 AM
Minimax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend.

They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

LeynosMar 27, 2026, 7:51 AM
Kimi is surprisingly good at Rust.
dvtMar 27, 2026, 6:30 AM
> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.

stuaxoMar 27, 2026, 9:36 AM
10x more code output is 10x more review.

We've gone from doing the first 90% and then the second 90% to the first 90% and the second 990%, it's exhausting.

victorbjorklundMar 27, 2026, 8:36 AM
yea, they are still useful. But yea not close to Claude or GPT. But works good for simple changes. I use a combo of minimax and codex
mkw2000Mar 27, 2026, 5:23 AM
i find kimi to be very very good, minimax not so much
paulddraperMar 27, 2026, 5:37 AM
Agreed.

They are equivalent of frontier models 8+ months ago.

AbanoubRodolfMar 27, 2026, 3:55 AM
[dead]
selcukaMar 27, 2026, 1:04 AM
It's a race to the bottom. DeepSeek beats all others (single-shot), and it is ~50% cheaper than the cost of local electricity only.

> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot

> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline

strangescriptMar 27, 2026, 1:47 PM
I will "suffer" through .004 of electricity if I can run it on my own computer
sourcecodeplzMar 27, 2026, 4:04 AM
I've tested many open models; DeepSeek 3.2 is the only one that's SOTA-similar.
no_shadowban_3Mar 27, 2026, 11:41 AM
[dead]
alifeinbinaryMar 27, 2026, 4:01 PM
All those parameters and it still won't answer questions about Tiananmen Square in 1989... :(
viktorcodeMar 27, 2026, 4:34 PM
It will. The web chat has censorship features, but the model you can download doesn't.
yogthosMar 27, 2026, 2:01 AM
You could use this approach with DeepSeek as well. The innovation here is that you can generate a bunch of solutions, use a small model to pick promising candidates and then test them. Then you feed errors back to the generator model and iterate. In a way, it's sort of like a genetic algorithm that converges on a solution.
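Roughly, that generate/test/repair loop looks like this (a toy sketch: `ask_model` is a hypothetical stand-in for any LLM call, and the "tests" are inlined assertions rather than a real sandbox):

```python
def ask_model(task, feedback=None):
    # Toy "model": returns buggy code on the first attempt and a fixed
    # version once it is shown what went wrong. A real system would call
    # an LLM with the task plus the error feedback appended.
    if feedback is None:
        return "def add(a, b): return a - b"   # wrong on first attempt
    return "def add(a, b): return a + b"

def run_tests(code):
    # Execute the candidate and check it. Returns None on success, or an
    # error description to feed back. NOTE: sandbox this in a real system.
    ns = {}
    exec(code, ns)
    try:
        assert ns["add"](2, 3) == 5
        return None
    except AssertionError:
        return "add(2, 3) returned the wrong value"

def solve(task, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        code = ask_model(task, feedback)
        feedback = run_tests(code)
        if feedback is None:
            return code                        # converged on a passing solution
    return None

print(solve("write add(a, b)"))  # → "def add(a, b): return a + b"
```

The "genetic algorithm" flavor comes from running several of these loops in parallel and keeping only the candidates that survive testing.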
hu3Mar 27, 2026, 3:15 AM
Indeed but:

1) That is relatively very slow.

2) Can also be done, simpler even, with SoTA models over API.

yogthosMar 27, 2026, 3:35 AM
Right, this works with any models. To me, the most interesting part is that you can use a smaller model that you could run locally to get results comparable to SoTA models. Ultimately, I'd far prefer running local, even if slower, for the simple reason of having sovereignty over my data.

Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do, and make changes to their terms of service on a whim.

If locally running models can get to the point where they can be used as a daily driver, that solves the problem.

eruMar 27, 2026, 8:34 AM
Why do you need a small model to pick promising candidates? Why not a bigger one?

(And ideally you'd probably test first, or at least try to feed compiler errors back etc?)

Overall, I mostly agree.

yogthosMar 27, 2026, 1:20 PM
mostly an issue of speed and resource usage, if the model is too big then simply running the tests will be cheaper
mikestorrentMar 27, 2026, 1:15 AM
> cheaper than the cost of local electricity only.

Can you explain what that means?

simonwMar 27, 2026, 1:30 AM
I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.

BoredomIsFunMar 27, 2026, 9:34 AM
> Local model enthusiasts often assume that running locally is more energy efficient than running in a data center,

It is a well known 101 truism in /r/Localllama that local is rarely cheaper, unless run batched - then it is massively, 10x cheaper indeed.

> I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Because it is hosted in China, where energy is cheap. In the ex-USSR where I live it is inexpensive too, and keeping in mind that all winter I had to use a small space heater due to the inadequacy of my central heating, using local came out as 100% free.

jacquesmMar 27, 2026, 3:50 AM
Some of those local model enthusiasts can actually afford solar panels.
jLaForestMar 27, 2026, 4:18 AM
You are still incurring a cost if you use the electricity instead of selling it back to the grid
KodiackMar 27, 2026, 4:26 AM
The extent of that heavily depends on where you are. Where I live in NZ, the grid export rates are very low while the import rates are very high.

Our peak import rate is 3x higher than our solar export rate. In other words, we'd need to sell 3 kWh of energy to offset the cost of using 1 kWh at peak.

We’re currently in the process of accepting a quote for home batteries. The rates here highly incentivise maximising self-use.

jacquesmMar 27, 2026, 12:43 PM
Selling it back to the grid is something that is still possible but much, much less of a financially sound proposition than it was a few years ago because of regulatory capture by the utilities. In some places it is so bad that you get penalized for excess power. Local consumption is the fastest way to capitalize on this, more so if you can make money with that excess power.
dmichulkeMar 27, 2026, 4:59 AM
Luxembourg: Purchase price = 2 x sales price, mostly due to grid costs.

And this is with no income tax or VAT on sold electricity.

croesMar 27, 2026, 5:28 AM
Local enthusiasts don’t have to fear account banning.
pbhjpbhjMar 27, 2026, 1:58 PM
Is it economies of scale, or is it unpaid externalities?
littlestymaarMar 27, 2026, 3:15 AM
I guess it mostly comes from using the model with batch-size = 1 locally, vs high batch size in a DC, since GPU power consumption doesn't grow that much with batch size.

Note that while a local chatbot user will mostly be using batch-size = 1, it's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.

eruMar 27, 2026, 8:35 AM
Well, different parts of the world also have different electricity prices.
littlestymaarMar 27, 2026, 1:15 PM
Usually not multiple orders of magnitude difference though.
atoavMar 27, 2026, 2:18 AM
It means that the electricity you would have to pay for if you did the computations yourself would be more expensive than paying them to do it. Part of that has to do with the fact that China has cheap electricity, also due to their massive push into renewables. Part of that is just economies of scale. A big server farm can run more efficiently than your PC on average.
AuthAuthMar 27, 2026, 3:53 AM
Cheap electricity due to their massive push on non-renewables. There has been no change in the price of electricity during the renewables shift.
jojobasMar 27, 2026, 1:32 AM
China has cheap electricity.
ericdMar 27, 2026, 1:40 AM
Well, also, LLM servers get much more efficient with request queue depth >1 - tokens per second per gpu are massively higher with 100 concurrents than 1 on eg vllm.
DeathArrowMar 27, 2026, 8:42 AM
Yes, but the hardware they use for inference like Huawei Ascend 910C is less efficient than Nvidia H100 used in US due to the difference in the process node.
DanielHallMar 27, 2026, 10:58 AM
These small models, having been fine-tuned for the test, achieve frighteningly high scores, yet perform abysmally in real-world scenarios.
memothonMar 26, 2026, 8:58 PM
I'm always skeptical because you can make it pass the benchmarks, then you use it and it is not practically useful unlike an extremely general model.

Cool work though, really excited for the potential of slimming down models.

kimixaMar 27, 2026, 3:23 AM
I find it's often very language and sector dependent. I still see a massive difference in systems programming (normally c++ and rust) between any open model I've tried and something like sonnet 4.5 (not really tried 4.6). And honestly, even the big models (like Opus 4.6) struggle in many cases.

Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (MiniMax 2.5, GLM-4.7, Qwen3, 3.5 and -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that when they've finished they barely seem to have any "momentum" left to actually solve the problems, as pretty much anything but the most trivial change ends up in another loop of actually trying to get it working again, often losing the intent of that change in the process.

My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".

Having plenty of local GPU power I'd love to be able to actually use that, and I'm already wary about some of the training data use and its interactions with the license of the code I'm "sending" to the cloud models...

vidarhMar 27, 2026, 5:13 PM
> Perhaps these things aren't well represented in the training data for these open models

I know from first-hand experience that at least a couple of the SOTA providers use third-party providers for supervised finetuning with instructions that are heavily geared towards a specific set of languages as well. But of course the base dataset from the major providers is likely to be sufficiently better that it matters less, and the big models are good enough at carrying over training that it at least seems like extra training on the core languages they care about at least somewhat carries over (you see this with natural language too - they do really well for many minor languages that make up a miniscule proportion of the training data).

(I won't say much more regarding the SFT/RLHF work due to NDAs - plural; I know who one of the providers is; I don't know who the one or more others are as the intermediary I did some work for obscured it well enough that I couldn't really violate the NDA even if I wanted to)

yogthosMar 26, 2026, 10:46 PM
You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.

But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.

ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.

These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.

So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
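The select-before-test step above can be sketched in a few lines. Everything here is illustrative (the `embed` function, the `CostField` weights, the candidate strings are all made up, not ATLAS's actual API); the point is just the shape of the pipeline: fingerprint each candidate, score the fingerprints with a tiny pre-trained network, and only run tests on the lowest-scoring one.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(code):
    # Stand-in for the model's embedding of its own output. A real system
    # would call the LLM's embedding endpoint; here we just map the first
    # 64 bytes of the source text into a fixed-length float vector.
    raw = code.encode()[:64].ljust(64, b"\0")
    return np.frombuffer(raw, dtype=np.uint8).astype(np.float32) / 255.0

class CostField:
    """Tiny scorer: low score = likely-correct solution.
    Weights here are random; in the real system they are trained on
    fingerprints of known-correct and known-wrong solutions."""
    def __init__(self, dim=64):
        self.w = rng.normal(size=dim)

    def score(self, fingerprint):
        return float(fingerprint @ self.w)

def pick_candidate(candidates, cost_field):
    # Score every candidate's fingerprint, but only ONE winner gets the
    # expensive sandboxed test run.
    scores = [cost_field.score(embed(c)) for c in candidates]
    return candidates[int(np.argmin(scores))]

candidates = [
    "def f(x): return x + 1",
    "def f(x): return x - 1",
    "def f(x): retur x",      # syntactically broken attempt
]
best = pick_candidate(candidates, CostField())
print(best)
```

With the claimed ~88% pick accuracy, this trades a small chance of testing the wrong candidate first for skipping most of the sandbox runs.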

zar1048576Mar 26, 2026, 11:31 PM
Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.
naaskingMar 27, 2026, 1:10 PM
> it's not clear to me based on the description how this could all be done efficiently.

Depends how you define efficiency. The power use of this rig is a lot less than the large data centers that serve trillion parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than the frontier models charge.

yogthosMar 26, 2026, 11:41 PM
Seems like the key insight is to train a small model that acts as a heuristic for embeddings that resemble quality code. I imagine a lot depends on how well this model is trained. And you could probably create specialized versions for different languages and domains.

Another interesting approach could be to use this set up with a language like Clojure or Common Lisp which facilitates interactive development. If you could hook up the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.

xyzzy123Mar 27, 2026, 1:30 AM
I'm super confused. The small model "cost field" `rag-api/geometric_lens/cost_field.py` was trained on PASS_TASKS like "Write a function that counts vowels in a string." and FAIL_TASKS like "Write a function that converts a regular expression string to an NFA using Thompson's construction, then converts the NFA to a DFA.".

So it seems like it's a difficulty classifier for task descriptions written in English.

This is then used to score embeddings of Python code, which is a completely different distribution.

Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.

But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.

naaskingMar 27, 2026, 12:48 PM
> But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.

It does, because hallucinations and low confidence share characteristics in the embedding vector which the small neural network learns to recognize. And the fact that it continuously learns based on the feedback loop is pretty slick.
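A toy version of that continuous-learning idea (illustrative only, not ATLAS's code): every time a candidate actually gets tested, use the pass/fail outcome as a label and take one SGD step on a tiny logistic scorer over the fingerprints.

```python
import numpy as np

class OnlineScorer:
    """Tiny logistic model: high prob_fail = fingerprint looks buggy."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def prob_fail(self, fp):
        return 1.0 / (1.0 + np.exp(-fp @ self.w))

    def update(self, fp, failed):
        # One SGD step of logistic loss on the observed test outcome.
        self.w -= self.lr * (self.prob_fail(fp) - float(failed)) * fp

# Made-up, deterministic fingerprints for a passing and a failing solution.
good = np.full(8, -1.0)
bad = np.full(8, 1.0)

scorer = OnlineScorer(dim=8)
for _ in range(200):
    scorer.update(good, failed=False)  # test suite passed
    scorer.update(bad, failed=True)    # test suite failed

print(scorer.prob_fail(bad) > scorer.prob_fail(good))  # → True
```

The real cost field is presumably trained offline on a much bigger labeled set, but the online update is what lets it keep adapting from the repair loop's own results.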

yogthosMar 27, 2026, 1:58 AM
I think the goal is to have a light heuristic that helps find plausibly useful solutions. They're still going to go through a testing phase as a next step, so this is just a very simple filter to decide what's even worth testing.
imtringuedMar 27, 2026, 11:59 AM
I tried to read the project documentation, but I got overwhelmed by the aimless AI-generated documentation that has the nebulous goal of documenting absolutely everything while never explaining anything.

If the author actually wanted to explain his project he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference time learning is cost prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference time learning to become viable and that's exactly what ATLAS does to achieve a 74.6% pass rate in LiveCodeBench and thereby outperforms Claude Sonnet with a small 14B open weight model that can be run locally on your $500 GPU."

This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.

Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"

That's a very long sequence of unexplained buzzwords that could mean absolutely anything.

MattRixMar 27, 2026, 1:05 PM
I think this is because when you shrink it down, the model ends up space constrained and each "neuron" ends up having to do multiple duties. It can still be tuned to perform well at specific tasks, but no longer generalizes as well. It's somewhat unintuitive, but larger models are often simpler than smaller ones for this same reason.
tgibaMar 27, 2026, 8:16 AM
Despite skepticism I love to see experiments like that. If we all are able to run an open source model locally on mid-high end machines I'd be very happy.
electroglyphMar 27, 2026, 5:18 AM
what's with the weird "Geometric Lens routing" ?? sounds like a made up GPTism
alkonautMar 27, 2026, 2:14 PM
Great, it became a $1000 gpu while you were reading that.
b3ingMar 27, 2026, 4:19 AM
Will open source or local llms kill the big AI providers eventually? If so when? I can see maybe basic chat, not sure about coding and images yet
Tuna-FishMar 27, 2026, 6:14 PM
Centralized inference is more economically efficient⁰, and should be cheaper for most users once competition squeezes the air out of token prices. It remains very valid for anyone who wants to maintain their privacy, ofc.

0: Because the only way to get cache locality out of a LLM is to batch invocations. A centralized system where the server handles thousands of invocations at the same time only needs a tiny fraction of the total memory throughput as having all of those invocations run locally on different machines would.
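Back-of-envelope for that footnote: decoding one token is memory-bound, since roughly all model weights must stream through memory per step. Batching B requests reads the weights once per step but emits B tokens, so per-token memory traffic falls ~B-fold until compute or KV-cache becomes the limit. The numbers below are illustrative assumptions, not measurements:

```python
# Assumed figures (not measured): a ~15B-param model at fp16 and a
# data-center-class accelerator's memory bandwidth.
weights_gb = 30
mem_bw_gb_s = 1000

def tokens_per_second(batch):
    # Memory-bound estimate: each decode step streams all weights once,
    # and one step produces `batch` tokens.
    steps_per_s = mem_bw_gb_s / weights_gb
    return steps_per_s * batch

for b in (1, 8, 64):
    print(b, round(tokens_per_second(b), 1))
```

A local user at batch 1 pays the full weight-streaming cost per token; a server batching 64 invocations gets ~64x the aggregate throughput from the same memory traffic, which is where the economic efficiency comes from.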

jillesvangurpMar 27, 2026, 8:04 AM
Not necessarily kill; but it will slowly push them off the critical path. Local agents can delegate to remote sub agents as needed but should default to local processing for low cost and latency reasons.

I think the notion of a one size fits all model that is a bit like a sports car in the sense that just get the biggest/fastest/best one is overkill; you use bigger models when needed. But they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems. Or leet coding exercises. Most AI work is mundane plumbing work, summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.

throwaway85825Mar 27, 2026, 4:30 AM
Financial gravity will kill them when returns don't match stratospheric expectations.
bluefirebrandMar 27, 2026, 5:21 AM
I hope so too, but I think it's wishful thinking. Be prepared for the mother of all financial bailouts from the world governments to make sure that doesn't happen
hollerithMar 27, 2026, 5:23 AM
I can understand why banks got bailed out by the US gov in 2008, but why would a government feel the need to bail out AI labs?

I hope you are not going to say, "to avoid a global recession or depression caused by the popping of the AI bubble". That would be unnecessary and harmful (in its second-order effects), and governments do have advisors who are competent enough in economics to advise against such a move.

graemepMar 27, 2026, 9:05 AM
Can you understand why banks were bailed out to the extent of protecting shareholders?

In the UK the first bank to go, Northern Rock, was simply taken over by the government. The shareholders got nothing. The bailout of Lloyds bank required the government taking a 40% stake. This is the way to go - if you need a bailout there should be a cost to the shareholders. otherwise you are just privatising profit and nationalising risk.

Not that UK regulation was great all round or the bailout perfect. It certainly failed to prevent the crisis which could have been done (no doubt the same applies in many countries). I looked at Northern Rock's accounts some time (an year, maybe?) before the crisis and was horrified by their reliance on interbank lending. it was obvious they could not cope with a rise in rates.

nyarghMar 27, 2026, 6:35 AM
Bold of you to assume competency will overpower politics in our current era.
hollerithMar 27, 2026, 6:43 AM
So far, the country I know best, the US, has been competent enough to avoid massive corporate bailouts except the aforementioned banks in 2008 and GM. The bailout of GM was not motivated by a desire to avoid a recession when a bubble pops.

If the AI labs become very influential and powerful, Washington might nationalize them, but that would be very different from bailing them out because they have become unprofitable and cannot attract additional investment from the private sector.

Scottn1Mar 27, 2026, 8:05 AM
You forgot about the $9b bailout to Intel in August of 2025.

With the recent OpenAi deal with the government I am certain they would throw tons of money at OpenAi if it got real bad. But with upcoming IPO where they are expected to be valued at $840b, we would be a LONG way from them needing a bailout. Well past this current admin.

nyarghMar 27, 2026, 7:00 AM
Despite politics, TARP was arguably an economic success story for the US Treasury, public sentiment notwithstanding. Whether it created moral hazard I suppose is up for debate.

GM on the other hand should have been left to die.

However, I was obliquely referring to the open transactionality and patronage encouraged by the current administration, and how the AI / big tech players have, with few exceptions, gleefully joined in.

Unless they run out of money for bribes, I think it's inevitable that the current government will bend over backwards to prop them up.

attila-lendvaiMar 27, 2026, 8:09 AM
a bailout is a popular way in which public funds lose their publicness.
graemepMar 27, 2026, 9:08 AM
Do the examples of the banks and GM suggest that it is likely that AI companies will get a bailout to avoid the bubble popping?

The reason the bank bailouts did not involve nationalisation is that the US is very reluctant to nationalise anything.

Capricorn2481Mar 27, 2026, 1:57 PM
The U.S. has an admin right now that has made it clear the only important metric for country health is the stock market, which is single-handedly propped up by AI right now.

That's why huge concessions nobody asked for were made to the AI industry in the Big Beautiful Bill.

lukanMar 27, 2026, 9:11 AM
"but why would a government feel the need to bail out AI labs"

Oh easy: with all the drones and sensors, AI means military power. Those who dare oppose the bailout of the local AI giants want the other side to win.

/s

eigenspaceMar 27, 2026, 10:30 AM
It'd be nice if they do, but I don't really see how. Training these open-weight local LLMs is still insanely expensive and hard to do, even if it's cheaper and faster than what the big corps are doing.

I don't get the financial motive for someone to keep funding these open-weight model training programs other than just purposefully trying to kill the big AI providers.

qingcharlesMar 27, 2026, 5:18 AM
Unless some really, really major shortcuts are found in inference, it's always going to be hard to run a really great model locally. The cost of the PC + electricity will usually be crazy compared to a $20/mo Claude sub.
3836293648Mar 27, 2026, 8:22 AM
But that $20/month is still heavily subsidised. You have to compare to the API costs, not the direct subscription.
rudolph9Mar 27, 2026, 6:27 PM
When Apple gets their shit together.
nerbertMar 27, 2026, 10:19 AM
Some open source models will cross the chasm, some big AI providers will too, and in both cases they will have their specific use cases.
freekhMar 27, 2026, 7:03 AM
This has been my theory for a while: during this autumn Apple will release a version of Apple Intelligence that runs locally and works better than ChatGPT. They will do this because 1) they do not have an offering in AI yet, and 2) they have amazing hardware that even now can almost pull it off on open models, and this will not be possible to replicate on Android for a long time (presumably)

This will crush OpenAI.

Note: I am not talking about coding here - that will take a while longer, but when it is optimized to the bone and LLM output has stabilized, you will be running that locally too. Cost will come down for Claude and friends too, but why pay 5 when you can have it for free?

oarsinsyncMar 27, 2026, 8:40 AM
> This has been my theory for a while: during this autumn Apple will release a version of Apple Intelligence that runs locally and works better than ChatGPT.

In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?

Eventually, this may be true. This autumn? Highly unlikely.

freekhMar 27, 2026, 12:39 PM
The Google Gemini deal is one of the reasons I think it is likely, since Gemini works pretty well on local hw...
CJeffersonMar 27, 2026, 5:46 AM
They won't for coding and images, but they will socially. Everyone I know who has invested in home AI use is mostly using it for 'things that might get you banned/limited'.
MashimoMar 27, 2026, 6:00 AM
I'm quite impressed what is possible with just 12 to 16 GB of vram in terms of image generation.
emp17344Mar 27, 2026, 2:46 AM
Yet more evidence that the harness matters more than the model.
bilekasMar 27, 2026, 2:00 PM
Where is a RTX 5060 Ti 16 GB 500$?

Edit : The 8GB seems to hit this price but 16 not so much.

hedgehogMar 27, 2026, 2:31 PM
They were $450 or so until recently, now... good luck.
riidomMar 27, 2026, 12:04 AM
Not a word about the tok/sec, unfortunately.
arjieMar 27, 2026, 1:43 AM
It won’t be meaningful considering the architecture: it’s a harness around the model that generates multiple solutions in multiple passes, using the tests to measure compliance and repair broken solutions. The resulting program won’t be streamed to you because it has already existed for minutes as it goes through the cycle. It’s more for an asynchronous use-case.

I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.

Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.
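The generate/test/repair cycle described above can be sketched roughly as follows. This is a hypothetical minimal skeleton, not ATLAS's actual code; `generate` and `run_tests` stand in for the model call and the sandboxed test run, and all names are illustrative:

```python
# Hypothetical skeleton of a best-of-n generate/test/repair harness.
# `generate` and `run_tests` stand in for the model call and the
# sandboxed test run; names and structure are illustrative.
def best_of_n_with_repair(generate, run_tests, n=3, max_repairs=2):
    """Try up to n fresh candidates; give each a few test-driven
    repair rounds before moving on. Returns passing code or None."""
    for _ in range(n):
        code = generate(None)  # fresh sample, no repair feedback yet
        for attempt in range(max_repairs + 1):
            ok, feedback = run_tests(code)
            if ok:
                return code
            if attempt < max_repairs:
                code = generate(feedback)  # repair pass using test output
    return None
```

Because every candidate makes several full model calls plus a sandboxed test run, latency is minutes rather than seconds, which is why this fits an asynchronous workflow rather than interactive use.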

Octoth0rpeMar 27, 2026, 3:07 AM
> A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s)

There seems to be at least some detail on that point.

dwa3592Mar 27, 2026, 2:29 PM
I wonder if it's only working out for the benchmark problems?

One expensive and hard lesson we will learn over time is that you can't compress generality beyond a point.

bdbdbdbMar 27, 2026, 8:22 AM
This is the kind of innovation I love to see. The big AI companies' days are numbered if we can have the same quality in house
AurornisMar 27, 2026, 3:43 PM
This AI-written project is running its own LiveCodeBench on a completely different methodology. The AI-written notes even admit it:

> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.

Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models they would also get significantly higher scores and they'd do it faster.

The AI-coded README is also full of signs of vibecoded slop, like the discoveries that some of the complex structures implemented were not actually being used or contributing anything to the output.
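The pass@1 vs pass@k distinction in the quoted note is quantifiable. The standard unbiased pass@k estimator from the HumanEval/Codex paper is shown below (the function name is mine; the formula is the published one):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (HumanEval/Codex paper): probability
    that at least one of k samples passes, given n generated samples
    of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a model that solves a problem half the time, pass@1 is 0.5 but pass@3 is already above 0.9, which is why comparing a best-of-3 pipeline against single-shot pass@1 scores is not a controlled head-to-head.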

0xbadcafebeeMar 27, 2026, 3:39 AM
This is specifically an experiment using ablation and multiple passes to improve the end result. Other techniques have been found that do this (like multiple passes through the same layers). But this technique - for this one specific model - seems to be more performant, while also taking much longer and requiring more complexity. It's unlikely most people would use this technique, but it's interesting.
Temporary_31337Mar 27, 2026, 8:54 AM
The headline is pretty stupid - it compares a model to a GPU that models run on. Somewhere in that data centre, some part of Sonnet inferencing runs on a $900 GPU, or maybe an even cheaper Google tensor chip
15minutemailMar 27, 2026, 7:25 AM
74% on LCB from a single 5060 Ti. I've been paying Anthropic per task and this guy is running it on electricity money. 20 minutes per task is rough for anything interactive, though.
subroutineMar 27, 2026, 7:55 AM
At 20 min per task you might as well code it yourself. Bill James needs to write a book on sabermetrics for LLM benchmarks.
josefritzishereMar 27, 2026, 1:25 PM
The core problem of AI remains unresolved, with no conceivable path to solvency. The issue is that AI isn't very good. It's OK, sometimes, under very narrow criteria. But providing AI is in reality very costly. Vague promises of it magically becoming better remain very optimistic at best and still provide no route to solvency.
negativegateMar 26, 2026, 11:37 PM
Am I still SOL on AMD (9070 XT) when it comes to this stuff?
0xbadcafebeeMar 27, 2026, 3:48 AM
No? You can run any model that fits in its VRAM, and you can run larger models with layer/MoE offloading. Ask an AI what the best models you can run on that card are, then ask it for newer models than that. Ask what tuning options to pass to llama.cpp, and what the auto-tuning options are. Use ROCm builds.

It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.
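The layer/MoE offloading mentioned above can be sketched with llama.cpp's real flags (`-ngl`, `-c`, `-ot`/`--override-tensor`); the model filenames and numbers below are illustrative, not prescriptive:

```shell
# Hypothetical llama-server launch for a 16GB card (ROCm or Vulkan
# build of llama.cpp). Filenames and numbers are illustrative.
# -ngl: number of layers to offload to the GPU (99 = all that exist)
# -c:   context size; shrink it if you run out of VRAM
llama-server -m qwen3.5-9b-UD-Q6_K_XL.gguf -ngl 99 -c 32768

# For a MoE model too big for VRAM, keep the expert tensors in system
# RAM with --override-tensor while still offloading everything else:
llama-server -m big-moe.gguf -ngl 99 -ot '.ffn_.*_exps.=CPU'
```

The expert-offload pattern works because in a MoE model only a few experts fire per token, so keeping them in system RAM costs far less than offloading dense layers would.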

metalliqazMar 27, 2026, 6:24 PM
I've been running local models on my 9070XT and I have never found ROCm to be faster than Vulkan
patsheadMar 27, 2026, 1:38 AM
No, but yes? OmniCoder 9B at Q6 fits on my 9070 XT with 200k+ tokens of context, and it works pretty well with OpenCode. It is for sure the best local model that I've managed to squeeze onto my GPU, and it even works at 120k context at Q3 on an 8GB RX 580 GPU.

I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.

Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!

hrmtst93837Mar 27, 2026, 6:10 PM
You can cram absurd context into a card now, but none of that matters once you hit the VRAM wall and the whole thing slows to a crawl. Cloud is cheaper. Local still matters for privacy and weird adapter stuff, but 'usable for work' is a much higher bar than 'looks decent on benchmarks' when the task is chewing through a repo without latency going to hell.
dangusMar 26, 2026, 11:45 PM
Well, this specific solution was only set up on specific hardware, and is Nvidia dependent, as the readme states.

That doesn’t mean the 9070XT can’t do AI stuff; quite the opposite. ROCm gets better all the time, and there are many AI workloads you can do on AMD cards.

Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.

dannywMar 27, 2026, 1:50 AM
Unfortunately AMD is much worse with supporting AI features like FSR4 on older hardware generations, despite the capability and leaked INT8 models being there. Totally unlike NVIDIA.

It’s absurd I have to use open source programs to get INT8 FSR4 support.

sznioMar 27, 2026, 8:45 AM
On that topic, has anyone here got a decent local coding AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in the memory and they run quickly, but they're just a tad too dumb. I have 64GB of system RAM so I can run larger models, and those are at least coherent but really slow compared to running from VRAM.
mongrelionMar 27, 2026, 11:14 AM
Not the answer that you are looking for, but I am a fellow AMD GPU owner, so I want to share my experience.

I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.

I tried various AI + VRAM calculators but nothing was as on point as Huggingface's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fit in your card.

From the open source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.

The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you would expect.

Which model have you tried locally? Also, out of curiosity, what is your host configuration?

[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5

kroatonMar 27, 2026, 1:26 PM
For autocomplete, Qwen 3.5 9B should be enough even at Q4_k_m. The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).

Either that or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50 t/s on a 4070 RTX Super 12GB + 64GB of RAM. The weights are 20.7GB + KV cache (which should be lowered soon with the upcoming addition of TurboQuant).

mongrelionMar 27, 2026, 5:56 PM
I am definitely looking forward to TurboQuant. Makes me feel like my current setup is an investment that could pay over time. Imagine being able to run models like MiniMax M2.5 locally at Q4 levels. That would be swell.
superkuhMar 27, 2026, 1:04 AM
If anyone else was hoping this was using Q8 internally and that, converted to Q4, it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB), and the 16GB requirement comes from other parts, not the 14B@8bit + KV cache/etc. you might guess.
limoceMar 27, 2026, 1:47 AM
The title should be "Adaptive Test-time Learning and Autonomous Specialization".
paxrel_aiMar 27, 2026, 2:01 PM
[dead]
eddie-wangMar 27, 2026, 3:40 AM
[dead]
itigges22Mar 27, 2026, 3:54 AM
[dead]
mergeshieldMar 27, 2026, 11:28 AM
[dead]
LuisvelAIMar 27, 2026, 10:12 AM
[flagged]
wiradikusumaMar 27, 2026, 3:18 AM
[dead]
felixagentaiMar 27, 2026, 2:14 AM
[flagged]
dangMar 27, 2026, 3:19 AM
We've banned this account. Please don't post automated comments to HN.

https://news.ycombinator.com/newsguidelines.html#generated

sayYayToLifeMar 27, 2026, 1:22 AM
[dead]
ozgurozkanMar 27, 2026, 1:31 AM
[dead]
bustahMar 27, 2026, 2:58 AM
[dead]
RazenganMar 27, 2026, 8:02 AM
Claude Code has been bleh or meh at best in my experience. There are so many posts on HN fawning over it lately that it could only be a guerrilla marketing campaign.
maipenMar 27, 2026, 10:48 AM
You still need to give it precise context and instructions when dealing with things that are not web apps or some other software cliché.

The reasoning is great in opus, unbeatable at the moment.

I understand what you mean, it becomes disappointing on more niche or specific work. It’s honestly a good thing to see these models are not really intelligent yet.

RazenganMar 27, 2026, 11:26 AM
I still don't trust any AI enough to generate or edit code, except for some throwaway experiments, because every time I tried it's been inefficient or too verbose or just plain wrong.

I use it for reviewing existing code, specifically for a components-based framework for Godot/GDScript at [0]. You can view the AGENTS.md and see that it's a relatively simple enough project: Just for 2D games and fairly modular so the AI can look at each file/class individually and have to cross-reference maybe 1-3 dependencies/dependents at most at any time during a single pass.

I've been using Codex, and it's helped me catch a lot of bugs that would have taken a long time on my own to even notice at all. Most of my productivity and the commits from the past couple months are thanks to that.

Claude on the other hand, oh man… It just wastes my time. It's had way more gaffes than Codex, on the exact same code and prompts.

[0] https://github.com/InvadingOctopus/comedot

dr_kiszonkaMar 27, 2026, 5:47 PM
I had a similar experience, and the answer appears to be learning how to use a specific model for a specific task with a specific harness (model × task × harness). Another, somewhat related, lesson learned is understanding how to work with a given model and not against it.

I still get really mad at AI sometimes and I am not sure whether I could use AI for coding full time.

(Codex broke my git a few days ago.)

spiderfarmerMar 27, 2026, 10:01 AM
"I don't get it. Everyone else is wrong."
RazenganMar 27, 2026, 10:43 AM
"There's no such thing as astroturfing." ok

I use Codex regularly and Claude is shit in comparison, from its constant "Oops you're right!!" backtracking to its crap Electron app (if their AI is so good why can't they make a fucking native app for each OS?)

Hell right freakin now I asked it to implement something and got a weird "Something went wrong" API error

spiderfarmerMar 27, 2026, 11:54 AM
"Shit", "Crap", "Fucking", "Hell", "Freaking".

Maybe you're too easily frustrated. Or your existing code reads like your comments.

RazenganMar 27, 2026, 12:20 PM
Maybe you haven't tried any other AI product with an actual preexisting project. Or blindly trust every BS Claude feeds you.

I haven't had any such frustrations with Codex

Claude is especially annoying because of their submarining and people thinking it's the best

spiderfarmerMar 27, 2026, 12:25 PM
I use both, read what I need to read and fix small issues myself. Both Agents are pure magic and none of their issues warrant a tantrum on a public forum.
RazenganMar 27, 2026, 2:03 PM
I posted a more detailed report in case you can't see it in your thread view: https://news.ycombinator.com/item?id=47541369

and other comments further back in my history

> none of their issues warrant a tantrum on a public forum

I don't get frustrated if a problem is genuinely difficult to solve and the product creator is trying their best.

I get frustrated when a problem has been solved by other similar products but a specific creator or provider refuses to follow suit and fix their shit.

Claude's Electron app vs. Codex's native app is one such example right off the first impression of both products.

cesarvarelaMar 27, 2026, 5:26 PM
Codex desktop is Electron too. What app are you talking about?