So large companies are getting billed a lot more than those discount subscription plans.
Claude can be very good but enterprise pricing doesn't make sense to me.
Common talking point. There's enough evidence for the counter argument that this is essentially misinformation. I have no idea why it's so often repeated with confidence.
> No evidence is shared
Help an open-minded critic out.
How much is Waymo burning a year? 3B on 300M ARR? Anthropic is what 5B on 20B ARR? Waymo is 3x older. Why don't we hear such confident statements about how subsidized their rides are?
It's one thing to speculate it's another to parade it as fact. Even if the S1 reveals an unprofitable business today, you can still only claim it's unlikely.
We do. We hear it less often because no-one is talking about how Waymo changes how we all need to work or whatever, that's all.
Also, we do have some evidence for my position:
- We know that the consumer Claude plans provide _way_ more tokens than you could get if you were paying API prices. This is a huge part of why Anthropic's limits on other harnesses for subscription customers is such a big deal. So either their profit margin on API tokens is absurdly high, most consumer subscribers don't come anywhere near their rate limits, or they're losing money on the consumer subscriptions. - It appears that complains about people running into rate limits are common, which suggests the "consumers usually don't use much of their subscription" explanation is incorrect. - We also know that Anthropic has just become profitable, almost certainly driven mostly by enterprise customers. This rules out the "they make a very high profit margin on the API" explanation, since if that was the case they'd likely have been profitable much earlier.
Taken together, I think the case that their consumer subscriptions lose them money on net is pretty strong, even though their enterprise subscriptions (and API pricing) does make them a profit.
To be clear I'm not arguing against this position, just questioning the confidence with which people claim that the current consumer subs are not a sustainable offering and a merely temporary.
With this kind of opaque billing, how can I reasonably deploy any AI?
The real cost effective way is giving a team $20 cursor $20-100 Claude $20-200 codex.
I'm spending 1k on Claude enterprise easily and that's with trying to spread it on codex and cursor using pi.
Can large enterprises just not use the API ? I have audit logs and what seem to be enterprise features through my anthropic account (platform.claude.ai)
Longer term, you also have to be careful about building things around details which could change at any time. OpenAI and Anthropic have a ton of pressure to start banking huge profits and they very closely monitor customer activity. A time-honored strategy in this space is to shuffle the features enterprise customers depend on but which aren’t deal-breakers for most other customers into expensive enterprise plans. There’s possibly some counter pressure from companies like Google which have healthier finances but I wouldn’t count on that since they also have MBAs who’d be all too happy to invent pretexts to hike their prices to match.
What's your source for Opus being a 5T model?
> and tiny distillations from DeepSeek that perform well only in benchmarks.
I don't think you know what you're talking about. Local models aren't “distillations from Deepseek”.
And they don't perform well “only in benchmarks”, Qwen 3.6 is a very decent model (obviously it's not Opus, but it's also much faster and speed is a quality of its own).
Elon Musk tweeted that Grok is 0.5T or 1/10th the size of Opus. https://xcancel.com/elonmusk/status/2042123561666855235#m
While this source's reliability is certainly debatable, the size matches the results of this paper, in which researchers estimated the parameter count from model knowledge. https://01.me/research/ikp/
Massive understatement. Nowadays it has become hard to find a single Musk statement that doesn't contain at least one lie.
> the size matches the results of this paper, in which researchers estimated the parameter count from model knowledge. https://01.me/research/ikp/
Thanks for the pointer. This estimation has Grok 6 times bigger than Musk claims it is, so maybe that's where the lie is.
(I'm quite skeptical about that number though, it would be quite disappointing for the US tech if their flagship models had to be that much larger than the Chinese ones for such a small edge in performance. Because I don't think US labs are incompetent, I'd bet that US flagships aren't more than 2/3 times bigger than Chinese flagship. Otherwise it really doesn't bode well.)
From this paper
Claude Opus 4.6 Anthropic 68.0% ∼5.3T [1.8–15.6T]
Claude Opus 4.7 Anthropic 66.4% ∼4.0T [1.4–12.0T]
Claude Opus 4.5 Anthropic 65.2% ∼3.4T [1.1–10.0T]
Claude Opus 4.1 Anthropic 64.9% ∼3.2T [1.1–9.5T]
Claude Opus 4 Anthropic 59.7% ∼1.4T [478B–4.2T
According to their estimation, Opus is likely between 1T and 15T, which really doesn't tell you much that you couldn't have guessed otherwise. It doesn't say “Opus is a 5T model”.The fact that there's absolutely no consistency in the predicted size between models from the same lab should tell you all you need about the predictive power of this method (and they aren't really lying about their numbers, their confidence interval is huge enough to fit anything in it, but their prose is making very strong claims out of their statistical nothingburger).
(somebody already posted this paper earlier, and I spent some time reading it, and this paper is really not that good even though there are a bunch of interesting ideas in it).
Probably Elon Musk: https://eu.36kr.com/en/p/3760679047267075
Like "Full Self-Driving" from coast-to-coast by 2016?
He's lagging the AI race despite having tons of compute available, so he tries to make a narrative about how it's not that the model is behind, it's just smaller than the competition.
This is a temporary phenomenon. Expect either drastic price increases or draconian throttling or both in the coming months.
These companies are operating at huge loses and have hundreds of billions in liabilities and commitments. They need to turn on the money faucet sooner than later.
We're already seeing slash their AI budgets. I expect that will increase till we hit more of an equilibrium.
What about human understanding of the codebase that's essential to any project's long term health? Even "superperformer developers" eventually leave the company.
edit: I see in other comments on this thread you think Ed Zitron is a reliable pundit so that explains everything.
You can dismiss Ed (and me vicariously) but what's your compelling evidence to counter their extremely uphill battle towards profitability?
Either way it will be very interesting to see their S1 when they try and IPO.
If it's anything like SpaceX's then I suspect my post will age better than yours.
If prices keep going up, watch for companies to exit frontier models and go to local llama.cpp instances for 6-month-ago SOTA, with the flex of being housed within the office - no more privacy leakage, no more price gouging.
To be honest, I’m not sure why a Y-Combinator backed company hasn’t come out yet flooding the market with highly capable OPAI (pronounced “Oh-pah” as in what Greeks shout as the drink shots), which stands for “On-Prem AI”
… yes, I just made up OPAI right now lol
If we momentarily disregard the fact that YC itself owns billions of dollars worth of OpenAI shares[1], YC would plan to find demo-day investors willing to drive down the value of frontier labs. The coöpetition among VCs and the existing web of AI investments will mean no VC will be interested in investing in local AI...until after the frontier labs IPO.
1. Thanks to the self-dea^w foresight of former YC president Sam Altman
That or just hiring people to do the work! I hear rumours that this is already starting to happen in some places (perhaps those that were a little overzealous with AI-hype driven layoffs).
If we're able to see some big increases in hardware capabilities that can be self-hosted, that will be an accelerant.
That said, most companies just want to pay a provider to delegate responsibility in exchange for cost and control.
And you think it is unreasonable to consider this unsustainable?
Looking at the pricing of 1-2T models like Kimi or DeepSeek on the open market, I'm tempted to assume that inference costs are closer to subscription pricing than to API pricing.
Especially considering that subscriptions a) distribute load over time via rate limits, and b) will include a lot of users who get only a fraction of the possible value, whether they are on a personal account where they are on the rate limit on the weekend but barely use it during the week, or are corporate users who were issued an account they rarely use. Subscription prices are usually measured on the average case, not the most extreme value a power user can get out of it
So just going on vibes?
While some people don't like his content, Ed Zitron shows a lot of evidence for your assumption being very wrong.
These companies are bleeding cash at ungodly rates. It's likely their API pricing is still subsidized if you look at their overall financial picture.
Related, there's a good reason those API prices keep going up a lot every new version and it's not just because the models are better.
Also, API prices going up a lot every new version is more an OpenAI thing, and even there it's a recent trend: GPT 5.0 was a big price drop compared to 4.1, and 4.1 was cheaper than 4o, which itself got a price cut at some point and is cheaper than 4. Meanwhile Anthropic's API pricing stayed stable for many versions, then got slashed to a third with the 4.2 release and have stayed at that level since.
Their business model is selling inference but the training and other costs have to be accounted for somehow. Unless I'm missing something obvious, inference costs must go up drastically if these companies are going to survive beyond the subsidy stage.
If that doesn't work, then yes, then prices will have to go up
There was a lot of hype and exploration of capabilities, but models aren't evolving fast enough to keep that going, so I'm settling down into a familiarity with what an LLM can and can't do that means I am using them less overall that I was 6 months ago when I was throwing everything under the sun at it just to see what happened.
Without either new model breakthroughs or dramatically _lower_ costs, I will be very surprised if the ultimate market doesn't end up within an order of magnitude of where it is today.
Of course they do have to "make bank" in some way to offset the insane training costs. But whether they go for high prices or high volume, or offer some services as a loss leader to drive profits elsewhere is somewhat orthogonal to that
pure speculation. about as valuable as my linked wsj reporting i suppose. given thats the case, maybe you shouldnt claim so confidently that they are money incinerators.
Back to the point: No one is profitable yet, which I think we both agree is accurate. If you are going to lean on “they will be soon” then it’s fair to say they’re going to IPO soon.
Ease off the gas. We’re just discussing a tech company.
For context, ChatGPT business subscriptions give you a fixed pool of credits to use, after which you get billed a la carte at inflated 1.75x rates vs API, or if you don't want to pay, you get access to anything but the non-reasoning models turned off for the month.
We also tried Claude Enterprise, which was unusable as people blew through their monthly limits in a matter of hours.
Even very cheap mini-PCs and laptops can run any of the models run by cloud providers, albeit at a much lower speed (i.e. with the weights stored on SSDs).
Whether such a low speed is useful, depends on the application. For something like a coding assistant or bug scanning, an instant response is desirable, but certainly not necessary.
For model training, the requirements are very different, and the training of a big LLM cannot be done with home equipment. On the other hand, inference can be done on almost any PC, even for LLMs with thousands of billions of parameters, just very slowly.
The only problem is that the inference becomes limited by the SSD reading throughput. Most of the cheap new personal computers available today can read simultaneously only 2 SSDs (if there are more they share a reading path), which are typically 1 PCIe 5.0 SSD and 1 PCIe 4.0 SSD. This has an upper throughput limit of 24 Gbyte/s, with 15 to 20 GB/s achievable in practice.
Then the speed in token/s is limited by the amount of weights that must be read per inference cycle. The ratio between output tokens and the amount of weights that must be read can be improved by various methods, like batching multiple tasks or using speculative decoding.
Anything can also be run on a cheap computer.
The difference is in speed. A cheap computer may run a big model up to a few orders of magnitude slower than datacenter hardware, depending on whether the LLM is small enough to fit in GPU memory, or it is small enough to fit in CPU memory or it is so big that it must spill on SSDs.
Depending on the application, the tradeoff between run time and run cost may happen to favor using local hardware, despite a much slower speed.
There are plenty of applications where doing them for negligible cost during an overnight job can be preferable to obtaining faster results at a very high price, for instance scanning for bugs in a mature code base using a great number of different open-weights LLMs, which can achieve similar bug coverage like using a single, but overpriced and unavailable SOTA LLM, e.g. Mythos.
You do realize that a model like Opus is (estimated to be) around 5T parameters, and uses around 5TB of GPU memory?
These kind of things are just impossible to run locally.
Like I have said, the problem is not that they cannot be run, but that they may run more slowly than it is acceptable for a given application. Depending on the model, the speeds reported for inference with weights stored on SSDs vary from one token every few seconds to at most a few tokens per second.
Computers could solve relatively huge problems even in the early days of vacuum tube computers, when the main memories were measured in kilobytes, because at that time it was not expected that the data needed for problem solving must fit inside the main memory or even in the next tier of memory, with magnetic drums or magnetic disks, but the really big problems were solved by a great number of passes over data stored on magnetic tapes.
An LLM whose inference could not be run on a small mini-PC would have to be one hundred times bigger than the biggest existing SOTA LLMs.
Any LLM that exists today can be run on almost any PC, just extremely slowly in comparison with datacenter hardware.
Giving strong “640k is enough for anyone” vibes here.
Cloud should have more compute and efficiency than local. I wouldn't be 100% sure, as I don't know what I might not be seeing, but still.
Whether that comparative advantage will matter, though, is a completely different question.
These are loss leaders that will not be maintained over the long term. Already we see moves to restrict their usage and redirect people back to API pricing.
Some might say the price wouldn't be great if you could actually process and validate it...
My hunch is that this is the source of much of the variability in outcomes upstream of HN commenters claiming extremes of, "This model changes everything!" to "This[same] model is crap."
We haven't operationalized what it means to "be good at prompting," nor developed proxies/heuristics/shibboleths for accessing prompting skill. There's community skepticism over whether prompting skill even exists. Besides even if prompting skill is real, who wants to hear, "Actually you kinda suck at prompting."
Power users are always going to have to take the messaging companies send out to the masses with a grain of salt.
My experience is that you have to write extremely detailed design documents and work specifications in order to get effective results. These generally have to be as detailed as most effective prompts.
Once you've written specs that detailed, why do you need outsourced developers and frontier models?
In ten years my prediction is that we have just as many developers as now building more products than they build now and AI is used for automation in isolated areas where it makes sense but most software development just happens at a higher level of abstraction where less text garbage is required to express the same concepts and the meat of code becomes even more focused on specifically encoding and highlighting the intricacies of the strange edge cases.
I started my journey in software development working on a MUD that had been passed down through a dozen hands and was extremely dirty software. I can't see anyone wanting to try and pick through the ball of mud and spaghetti that'd result of letting AI build software without severe oversight and corrections.
The core of software development has always been problem solving (or, more accurately, problem identification). As time has gone on we've gotten rid of more and more of the cruft to focus on that point. I suspect that trend line will continue and we'll evolve towards even leaner and more abstract languages to state problems and try and isolate the fiddly logical flow components, driver bits and math more and more into libraries and tools because for most daily work it is important but can be assumed to have been done by someone else better.
I would like to submit that the high-intricacy work congregates in Protocols themselves, and we start seeing the cycles of development and all the ways to direct AIs, programs, inter-person/inter-company interactions, etc etc all as types of protocol design - and studying those rules of interaction themselves becomes the new job of a programmer (systems architecture). What used to be hard rules and deterministic programs becomes soft self-governing tendencies and probabilistic behavior that can nonetheless be managed and bounded with the right system, but it's new and weird and more akin to management or herding cats than architecture. This is still very different from what most of us were working on before AI, but it's still familiar - especially to those who worked on internet protocols, or defensive UX design around users, physical engineering systems, or team management. Less programming languages, more - control theory, flows and throttles, quality control, design theory, etc. And clearly the field is still wide open as everyone seems to be experimenting with their own take on the AI orchestrator.
Even if the engineers themselves are cooperative, their managers / business owners will resist close cooperation and enforce work at arm's length (e.g. 1x weekly calls).
Ask me how I know. I once spent £300k (fortunately not my money) on an outsourced team of developers, and they delivered nothing at the end. Most of the time it was simply about aligning the work! We (me and my partner, we together had some idea of what we actually wanted) tried repeatedly to make sync-s more frequent, to better align the efforts, but their managers kept resisting. It's the "consulting" business model!
For remote jobs, the incentives are reversed. You're literally a full-time employee, there's no management layers to impede communication, and (unless you're lazy or a fraud) you probably want to work on interesting problems and not be bored!
Largest such scam[0] I've heard of was "we have 11 senior engineers working on this project" (actually three, two of whom were actually junior-to-mid-level).
[0] Let's call a spade a spade.
Does it matter how many engineers work on supplier's side?
Supplier is tasked to deliver the project. It is up to them to figure out how many people would they need, and to manage them.
I think that’s also where the assumptions of the original post are off - the difference between DeepSeek and a frontier model is not usually what low quality outsourcing can cover. So you probably end up paying a highly qualified outsourced engineer who may not be significantly cheaper (most outsourcing is not just due to cost but capacity and capability).
Not only do you need to spec everything to the right level of detail (at which point an LLM can likely make a good go of it), but a lot of the outsourced teams don't build in anywhere near the same way as those internally, and the difference in the level and speed of delivery is absolute.
Not to mention with everything changing so quickly, why would I be spending time and money training up someone else's staff to be keeping up with the cutting edge?
Luckily LLMs can do that too.
LLMs are likely to replace outsourced devs because your employees that know the context can use LLMs to do what offshore devs did before.
but for OP's use case, people with some capital and many skills who need additional help, AI is solving a problem in a way that was not solvable before, while improving on coordination abilities and coordination velocity. Offshore developers do not come back into play here.
Request: “manual step X should not be part of the automated build script”
Fulfilled as: build script is now split in two. X is still done as a manual step in between. Rather than prompting and waiting for it to be done, the documentation and scripts no longer mention X.
Part poorly written requirements, part implementing under pressure, and part lack of engineering discipline.
The main issue is catching stuff like this early enough to course-correct. Differences in time zone, language and cultural norms can make that a challenge, all of which LLMs have the advantage in.
I’ve had it assume I meant the folder multiple times :/
I saw it’s thinking tokens said something along the lines of “I have implemented it correctly but the test is failing. I’ll update the tests so the pipeline passes”
Offshore Indian devs make sense when you can have a large Indian division so you can amortize communication infrastructure/process management over a lot of heads, and you're building for international customers so you're not paying an English -> X tax inherently.
Just recently I asked a dev there for a POC of a feature with decent specificity and ended up with about 8k LOC of spaghetti. I re-wrote it later in a few hundred. This is about in-line with my career experience.
I've had a few standout devs there but it does feel like a lot are putting in the bare minimum or are just working really far outside of their abilities.
And outsourcing certainly became a thing though not in the way everyone predicted. There are far more software engineers in the US today than there were in 2004.
I work for a global corporation. We have offices in India. For the technical professionals I deal with the wage differential is maybe 30-50% and is actually quite a bit less than the cost of living difference. My personal experience is that there is a tendency for them to massively inflate their qualifications and level of experience to a point that Americans would call fraud. The only kind of people who think this is a good idea are people like Larry Fink, and I would attribute his motives to greed and malice, probably an equal parts.
What evidence is there of the quality of Indian devs specifically?
One signal I'd expect to see, for example, would be success in programming competitions. Here's the list of winners of the IOI competition [1] - India has won 3 times.
Meanwhile, Turkey has won 4 times, Estonia has won 5 times, and Vietnam has won 22 times!
Why should we suspect that there are more or better developers in Indian than in any of the countries that has produced more winners??
[1] https://stats.ioinformatics.org/countries/?sort=medals_desc
Implication for manufacturing: Going robots first shouldn't aim at just re-localizing manufacturing, but aim higher. Become the new outsourced manufacturing destination.
OAI/Anthropic are 100% going to try to take everyone's jobs, and "own" labor. The Chinese are the good guys here.
Good luck with that. This reminds me of the inspiration of declarative programming languages such as Prolog - you're supposed to declare the problems in such a way that the machine can solve it - rather than the imperative way where you tell the machine what to do. What they didn't realize that the definition is harder than the solution itself.
And, who wants to screw around with harnesses or define agent orchestration when Claude/codex are good at this and getting better every month.
0 - https://www.williamangel.net/blog/2026/05/17/offline-llm-ene...
If inference cost comes down (as it has been for the last few years) you’ll be able to run today’s SOTA in your laptop by the end of the year.
I want local AI to be a thing but the hardware isn’t here yet, because the only options are a Mac Studio or DGX machines strapped together. RAM prices needs to crash before local AI has a chance at actually competing.
If Claude hosted on AWS bedrock is not considered trustworthy, I have some bad news for you.
How is that going to work on Bedrock, when they don't even manage the infrastructure?
The weaker a developer is the higher capability AI requires. The entire premise of this article does not work because it confuses weak developers with weaker ai being better than strong developers with near atonomous ai. The weak developers with frontier ai already produce products that are worse than a capable developer paired with a weak (2 year old) AI.
To clarify: Strong developers 2 years ago could already leverage AI to produce high quality products whereas with latest and greatest AI weaker developers stills struggle strong developers can now delegate more of the work to the stronger AI increasing productivity further.
Now you can put those detailed documents into the LLM and get a better result back in a couple of hours rather than weeks for a tenth or hundredth of the cost.
And the offshore devs are going to be using the LLMs themselves, why add another layer, level of bureaucracy, language barrier in between your requirements and the result?
It's still far too costly and not effective to use Local AI that can match what the frontier models can offer, especially when the inference hardware is being heavily restricted due to geopolitical risks. Claims about local LLMs somehow putting these frontier companies a run for their money I find especially doubtful in the long run.
Tokens are getting expensive because they are beginning to corner the market and will use that advantage to limit hardware distribution within and beyond the borders.
It's more likely that some workflows will see more local LLMs but those will never be the ones that require frontier model level or beat the price that a lighter smaller version of frontier model will offer to capture that tail end
My impression is that deepseek designed v4 specifically for cheap inference and they are not loosing money even at 75% lower price.
This is absolutely false, because other providers serving the Deepseek models on OpenRouter are also able to offer very low prices, and they don't have the money to subsidize anything.
Furthermore, V4 pro was designed to run on 4 Huawei Ascend GPUs which are much cheaper than the nvidia setup others use, and deepseek probably also got some free hardware for their collab.
Hence it is entirely possible their inference costs are significantly lower than other providers.
If any providers are able to turn able to sustainably turn a profit, OpenRouter allows them to compete in an open market to process your tokens (or anyone else's tokens).
Thus anyone subsidizing tokens bears the brunt of the compute load and gains not much more than name recognition and tokens to train on, but since switching to a different provider is a matter of changing one setting in the config panel (and can be set to auto-switch based on price), switching costs are very low. Providers of open models via OpenRouter have almost zero ability to lock-in users.
So this claim that all 13 providers are selling subsidized inference is... a tough claim to swallow. Maybe some of them are, but all of them? I assume at least some providers want to show profitablity, and are pricing their service accordingly.
https://openrouter.ai/deepseek/deepseek-v4-pro/pricing
https://openrouter.ai/docs/guides/routing/provider-selection
I'm curious, who/what is operating the frontier LLM in this scenario?
The rest of the article is equally incoherent.
That is pretty usable. You could get 65t/s or more with MTP, but only if you drop the context size, which I would advise against.
Results are better with 256k context and a larger quant, however, that's not going to fit on the 4090 you already had lying around for playing cyberpunk 2077.
The MoE models make me rather unhappy. Idk. They feel braindead to me, but YMMV.
I spent a month comparing Gemini Ultra plan to using much lower cost DeepSeek v4 with open source coding harnesses and, spoiler alert: I was happier using the much cheaper and more environmentally friendly open models: https://marklwatson.substack.com/p/my-evaluation-of-ai-agent...
https://share.google/aimode/a0O95wzk2UUhIXLUI
https://cloud.google.com/blog/products/infrastructure/measur...
BTW, Google is my pick for the winner in the USA tech giants AI race. I worked at Google about 12 years ago and was impressed by their use of renewable energy, etc.
There are misaligned incentives here between users just trying to get stuff done and AI companies competing on having the "smartest" model that passes benchmarks and continuously does some nobel peace price winning stuff. It's mostly overkill for the more mundane stuff normal people actually do with them. It's nice to have the option when you need that. But defaulting to that is not economical and a bit unnecessary.
There's also a difference between smart models and bigger context windows. Most of the progress in the last year was simply the context windows getting big enough to fit all/most of the stuff needed to solve issues. Before then, you had to carefully manage the context to not run out of space and they wouldn't fit much more than small hobby projects.
With sub agents, the parent agent doesn't need to be a frontier model. It can delegate to smarter agents. And most stuff it delegates shouldn't need a frontier model. Wouldn't it be nice if it could decide on a case by case basis.
The walled gardens offered by OpenAI, Antrhopic, and others currently default to one size fits all "frontier" models. This is not sustainable. They should evolve to using smaller and effective models most of the time with complexity based escalation as needed based on either estimated complexity or when the small models fail. I'm guessing some open source based alternatives to these walled gardens are probably already heading that direction.
The irony here is that with a walled garden, these companies are selling a premium experience. But in the current market that boils down to burning billions of investor cash to keep the GPUs going without much hope on profitability. Eventually surviving companies are going to have to compete on quality, cost and margins. The smart approach would be to dynamically adapt token and context window sizes instead of blindly defaulting everything to the best possible. Don't boil the oceans for a simple email summary or a simple web UI. That stuff already worked well enough with models even a few years ago.
- 5.5 is significantly more token efficient than 5.4 - the same task takes often a third of the tokens
- because of this, is it also much faster to do the task
- you get high "intelligence" per token even after accounting for token efficiency - 5.5 medium is just under 5.4 pro levels of intelligence (imo). It has found tricky bugs for me that all other models failed at
So overall, ideally you will end up with more intelligent, faster model for slightly cheaper.
Back when it became expensive I learned to live with it and I find my "AI skills" (mainly communication) have a substantial impact on the efficiency of the model. Not saying my work is difficult, it's not, but I find there is quite a bit of wiggle room. Smaller models can still perform useful work, but you have to do the heavy lifting yourself. It saves a ton of money.
I used to burn through 75% of my tokens in an hour or two. Now I can work all day and hit maybe 50-60% if I use it heavily.
The usual counter-argument is the operational burden, but human capital is also a relatively fixed cost. A dedicated team of 3-5 FTEs could probably handle inference ops for a F500 company.
Meanwhile, the capability delta is shrinking fast. We have more evidence that local open-source is viable with the release of DeepSeek v4, and the industry is only trending further in this direction. Especially as we rely more on test-time compute and task-specific harnesses rather than model size.
So, if you're an executive looking at a marginal but fixed operations cost, added flexibility, and a rapidly closing gap in capability, why wouldn't you just run open-source models on your own infrastructure to get those highly predictable costs? Plus, you decrease the risk of one of the frontier
There’s so much uncertainty, it seems like the safe option is to give everyone a Claude or OpenAI subscription/api key until the frontier isn’t changing every six months.
27B is already really good at coding-specific tasks. Fundamentally, there is little innovation on the core architecture: LLMs are all designed essentially the same, with minor differences in how they are trained. They are all feed-forward multi-headed attention models; it doesn't matter if it's a 4B model or a 1T model, that's just scale.
Further, the frontier models cannot afford to innovate: they have to scale as quickly as possible to "beat out" their competition. The frontier models fundamentally will not create the next "attention is all you need" monumental jump in AI.
Frontier companies are stuck on scale with zero capacity to innovate. You cannot point capitalism at "basic science research" and expect any ROI. This is a known reality. Innovation is much more indirect and a "random walk" style of knowledge acquisition.
Finally, these LLMs are quite literally designed with a human-in-the-loop, and we do not give ourselves enough credit for how well we ourselves tool-call. We are doing a lot of heavy lifting to make these models useful and you cannot simply remove us from the equation without also removing ourselves from the training pipieline.
We talk about capability like it's some kind of linear scale. I am not paying 30x for 30x performance. I am paying 30x so that my use case goes from "haha nope" to a signed contract with the client. Works 0% of the time => works 3% of the time is an infinite improvement in capability. That is what the premium is paying for.
The big question is how subsidies vs technology improvement will play out. As we saw with Uber, selling at a loss can happen for a very long time, and technology improves relentlessly.
For reference, we publish https://botsbench.com/ that shows time and cost per answer are going down while quality is going up.
The contradiction here is that without frontier models, there'd be no foundation for models like DeepSeek to reference and catch up to. Is there an economic model that captures this kind of dynamic?
A bigger issue is this thing calls AIs better coders than people and I have tried for the past 4 months to get one of the several I looked into to consistently produce a simple event-bus backed Java monorepo going with exactly zero success. Claude even repeatedly wanted to put my login logic at the actual event bus, for some reason.
What does "better coder" _exactly_ mean at this point?
AI can turn it into a pseudo-poem or a 4 pages document. Or it can just fix the grammar. But it doesn't really change the point of the sentence– nor does it fix the actual issue with it.
Similar for code: There are codebases with lots of smells and really dirty parts, yet, that are still better than methodically clean ones that just don't "get to the point".
I am so sick of all the AI bloat. People were able to hide their incapability behind unnecessarily complex frameworks or obscuring it through "clean code" concepts. Now LLMs give those uninspired people the option to invest even less of what makes worthy software and hide it in more abstraction.
Just: AHA! (AI won't)
Local LLMs are great and very useful but if you are claiming that their code quality is in the same ballpark as Claude Code or Codex with their best models I cannot consider you a serious person. I feel like this is analogous to the folks arguing that The Cloud is "someone else's computer." As if billions of dollars of spend gives these companies zero benefit over a Mac mini.
Regarding offshore, at least in my experience, better coding agent output is down to two factors. First, is subject matter expertise. Providing the right context to the coding agent based on the tech you are building for is beyond critical. That's the issue with the Vibe Coded slop projects. No expertise in a technology means no awareness of gotchas, React is the most obvious because the LLM default is to useEffect endlessly.
The bigger issue is that by their very nature LLMs are very sensitive to quality prompting in English. I have seen offshore devs fail endlessly because they don't have the English skills to successfully prompt the machine. That has caused more work for my US based devs to either carefully tune the work ticket so it is basically a coding agent prompt. Or to go through multi day exercises to enforce better prompting.
A single US dev with Claude Code is orders of magnitude better than typical offshore. Adding local models into the mix would make offshore completely useless. I'm sure many companies will see ballooning AI bills and expensive onshore devs and be very tempted to go to TCS or similar. I hope so, because that will give startups plenty of easy targets to disrupt.
AI will become a commodity technology the same way virtual machines are a commodity.
This is the (m/b)illion dollar question, isn't it? I think there's also a question of what do you think capability is exactly, and how the difference manifests itself.
On the one hand, when something becomes "good enough" that's a clear capability threshold. On the other hand, what's the limit of those capabilities, and equally as important, how does capability reflect on reliability?
We've seen "local models" lately improve on capabilities where they're "good enough" for some tasks. Reliability of solving those tasks is a bit harder to measure/benchmark/test. It'll get better as more people work with those models. But, something I've noticed in the past ~6months is that the frontier models are gaining a lot in both the breadth of capabilities, as well as the reliability of solving those tasks that they're capable of solving. I think this is where scaling (both compute and data) is showing, and where having more compute is simply better (more parallel exploration, more training data output, more broad data, etc).
There's also the problem of benchmarking true capabilities. The popular ones are getting old, and aren't as reliable as they used to be (not even touching on the subject of benchmaxxing, just thinking about their saturation, even with honest intentions).
So the question then becomes what will users prefer? Do you get the best of the best, or the one that's good enough? There might be a market for both, honestly. Not everyone does SotA stuff. And a lot of what people used to do in a company is probably mundane enough that a "good enough" model with "good enough" reliability can probably handle (w/ some supervision ofc).
What I'm more interested in is if things like Thaalas succeed and they get to provide local hardware that runs models "burned in silicon". That would be interesting, because speed and all the advantages of local models are a "quality" on their own. For example, right now I'd pay ~1k$ for an external hdd-sized block that can run a ~32B model that's popular right now, even knowing that it can only run that model. I have no idea if that's feasible or not, if it makes sense from a financial pov. But I'd buy one. And local inference on dedicated chips doesn't need to be "oss only". I'm sure oAI / etc would probably take the risk of licensing one of their -mini / -lite models provided that the risk of the weights leaking is small enough (and it probably is).
> This keeps a ceiling on how much or how fast the frontier labs can raise prices.
I generally agree, but from a different perspective. Up till now we've seen that the 3 labs influence each other's price points. When gpt5 came out at a radically smaller price, the others lowered them as well. Now with opus being SotA for coding, w/ 5.5 close behind, they've raised them back. Google seems to follow slowly. But there's hope that, being 3 top labs + 2 trailing (xAI & Meta), there'll be pressure once again. If any of those trailing labs manage to get to SotA again, the prices will drop once more. Some people say that open source also provides a pressure here, but I'm not yet convinced of this. There's still a question of who'll serve the models, at what scales, etc.
"Frontier models" are caught in a financial dilemma of their own making --- they have spent such huge sums on development and as a result, they may have inadvertently priced themselves out of the market.
Energy costs are a huge factor for AI. He who has the lowest energy costs will likely be able to dictate market prices. And fossil fuels dependence doesn't look to be advantageous for AI.
The frontier models are going to win that way. They won't feed your code back into the system but they will track which code you keep and what code gets a "try again claude".
They're not going to lose on price. No consumer software ever has because ultimately it's not that expensive relative to salary and the marginal cost is 0.
This is true for traditional SaaS too, but the number of concurrent users that could be served by one machine and the cost of the hardware were both at least an order of magnitude better.
In other words, AI is not your daddy's software. Comparing AI with old school software markets simply does not compute.
Lists examples of software that are free to the users
Last week we were all talking about how Anthropic has too much demand, how they had to rent a data center from a competitor, and how the limits they’ve put on their service to deal with the demand are making users angry.
DeepSeek is cheap because they’re working hard to attract users.
The open weights models released for free weren’t free to train. It’s a loss leader to get attention to try to sell you something in the future.
The prices we pay for tokens right now are set by supply and demand, with some being sold at high premiums and others at a loss. Some models are given away for free after the companies spent money on researchers and compute.
https://openrouter.ai/deepseek/deepseek-v4-pro/providers
Deepseek v4 Pro is much cheaper when provided by Deepseek itself, likely as a combination of the loss leader strategy you mention and the desire to have more data flow through their pipeline for training. However, the same open weights model, provided by other providers, is somewhere in the $2-3/1M output-tokens range. Compare Opus 4.7 at $25/1M output-tokens.
Unless you mean that releasing open weights models is the loss leader, in which case, you might be right but I hope you're wrong. We've seen some of this from Qwen at least - their latest model is closed only. I hope there's always someone willing to make this bet and release better and better open models.
This is specifically what I meant.
DeepSeek’s official service is trying to recoup some of the training and engineering costs too.
The other providers only have to recoup their hardware costs and the cost of a team to run it.
Even though DeepSeek’s official service is more expensive per token, they’re running at a lower profit than the OpenRouter providers because they had to pay for the R&D.
This is a deliberate choice. We already see it with Qwen splitting their releases between open weight and hosted only models. The open weights are a loss leader to get attention. Without them you’d almost never hear about their hosted models.
What would this bet be? Training is expensive and open weights mean that for hosting you compete on price with people that don't have this item on their bill.
So far, it's really only the Chinese labs (and FAIR or whatever Meta's project is called now) that are doing this. Oh yeah, and Google's Gemma.
At the moment, this is all massively distorted by the prestige and investment money flowing into the space. None of the labs have to charge the real cost of inference let alone the marginal cost of training because they are instead lighting investment money on fire to cover that.
One imagines (though I have not investigated in detail) that there's a degree of national prestige work going on too. The Chinese labs are trying to show that they can build better and more efficient models and are releasing open to undercut the US labs.
This is a good insight. I think everyone has seen that chart China's electricity generation going parabolic vs the US. That combined with cheaper yet equally good talent means at least in that segment, the closed labs won't catch up anytime soon
Even if we all switch to Chinese models, the west isn't going to be running the model on Chinese servers... and the majority of costs are from inference.
> cheaper yet equally good talent
China has tech talent, but this isn't a 3rd world developing nation. Chinese AI researchers are getting paid $10M+ USD/year salaries.
Also they're equally good, but somehow consistently behind?
Which closed labs won’t catch up to whom?
Not to say that frontier labs won't make progress, but the bar for a sufficiently capable agent is all the OSS models need to meet to make this happen. I imagine a lot of hybrid setups where something like Opus is used only for planning/architecture, and anecdotally, the real token consuming part is implementation not architecture.
Nuclear power anyone?
The reason for regulation is that failure is not an option --- unless you're willing to accept the cost of making a big chunk of a state uninhabitable for a very long time.
How much would failure cost? The Chernobyl exclusion zone is over 1000 square miles --- about half the size of Delaware. And it is expected to remain uninhabitable for the next 20,000 years.
Also, the Russian Academy of Sciences estimates that up to 1 million people may suffer premature death as a result of radiation exposure and contamination from the event.
In the long run, renewable energy is a lot cheaper.
Currently the projects I am involved require devs to use approaches like Ollama, Foundry Local and co if they happen to have good enough hardware, picking the best alternatives out of https://www.canirun.ai.
I feel it'll wind up like the dotcom/fiber bubble. Way too much money poured into it, lots of expensive bankruptcies or write-offs, and a readjusted market sea level.
Actually, platforms that serve many customers can bring down the costs tremendously through caching, and don’t need the AI credits as much: https://safebots.ai/costs.html
Training these neural networks every few months isn’t energy-heavy?
Both Bitcoin and these large models weren’t “designed to be energy-heavy”. It was a consequence of first-gen design decisions to solve a specific problem. Then as time went on, costs went down and they became a huge outlier in terms of energy. The question is whether the bagholders (the AI companies that invested untild amounts into the initial training) will fight to keep people using their tech and fearmonger about everything else.
Neural nets on the other hand generally show more capability as you add more compute power. There's a point where it's less valuable than the cost increase, so people don't do more than that, but it isn't constant value like Bitcoin.
Same with AI. Now that the Mythos and other models are finding exploits in every code base and anyone can run them, you can’t afford anymore not to keep burning credits securing your code base. It’s like proof of work red queen theory. You have to run faster and faster just to stay in place. Great business model.
I tried this. My role as a human boiled down to recognizing when I need to switch to frontier model for the last mile.
1. I remain unconvinced LocalAI can work well for majority of businesses. It looks vaguely comparable on benchmarks, but it tends to be fragile and a lot of management overhead in reality.
2. Similarly, while Deepseek is comparable to Opus/Codex on benchmarks, for agentic work at scale I definitely notice the difference. That's not to say it's not economical, just that I definitely miss the big boys when I swap.
I kind of wish this was true, because the UK would be in a great place to compete with the US. But somehow people are happy to pay 3x the salary for an engineer in SF.
I'm working on an self-hostable LLM (web) UI[0] that aims to provide a comparable good UX to e.g. ChatGPT, and you are right that there is a decent amount of fragility involved, and more management overhead than most people would expect.
However, we usually find that those details happen a lot more in e.g. the harness (= out application), or some prompt tuning that's required for each of the models, rather than model quality itself. We have seen customers using self-hosted LLMs with similar user satisfaction across their organization to other customers that heavily lean on latest GPT-5 models on Azure. Especially given that you have to do some level of tuning and setup anyways, you might as well invest it in "local"/self-hosted AI (if you can make the financials of the inference cost work out for you).
I think it should also be noted that the inference providers on hyperscalers also tend to be quite fragile, each in their own way (e.g. Google with a horrible rate limit system or Azure with almost weekly intermittent 500-error incidents).
Also worth noting that it doesn't have to be full either-or, there can be a two tier enterprise deployment that routes to locally hosted vs frontier model, over time more and more usecases could get routed to local LLM
Dunno how trustworthy this source is, but it says ~35 MWh/person in China and 77 MWh/person in USA.
The second issue is that the quality of the model “operator” makes a massive difference in the outcomes. Highly skilled senior devs who know how to prompt and have high agency will outperform team people that lack motivation and foundational skills.
Lastly, there is a massive difference in capabilities, determinism, and error handling between 5T SOTA models like Opus and tiny distillations from DeepSeek that perform well only in benchmarks.