DeepSeek makes the V4 Pro price discount permanent - https://news.ycombinator.com/item?id=48237663 - May 2026 (384 comments)
They explain some of the the reasons why they have a better solution and why they are very opinionated
>Automatic prefix caching activates only when the exact byte prefix of the previous request matches. Most agent loops reorder, rewrite, or inject fresh timestamps each turn — cache hit rate in practice: <20%.
So they optimize on this plus other techniques to improve cache hits, making it cheaper.
They do this to mitigate jailbreak attempts that rely on fabricated message history (e.g. making it look like the model was compliant in previous messages, increasing the likelihood that it'll continue to be compliant in future messages).
That's really surprising, since it'd defeat the whole point of KV caching. I mean I buy it considering how sloppily coded the harnesses seem to be, but this like obvious low hanging fruit.
I've also often wondered why LLMs aren't trained with a format of having a dedicated contextual system-instruction role at the _end_, which you could use to put context like current time or other misc stuff.
There are context pruning strategies that will prune old messages that are no longer relevant, and context compaction from summaries, etc. But to say "most" do this on "every turn" is overstating things. I think it's more correct to say that "many" do this "occasionally."
I'm also not sure what they mean about injecting fresh timestamps. I could see why you'd prepend/append a timestamp to the user's messages to make the model aware of the current time, and the passage of time, but I can't think of any good reason to edit timestamps in prior messages. I'm sure someone can come up with one, but I'd be very surprised if this was a thing that most agent loops do, let along doing it on every turn.
I haven't seen that, it'd be crazy slow if they did this. What "agent loops" are they talking about here specifically? The vagueness makes it sound potentially made up.
> tool call pruning breaks cache and people will tell you this is horrible and expensive
> except i looked at some anthropic data and real user behavior ends up with better cache hits and 30% less spend
> even this is needs to be analyzed further, it's just not simple
> for openai data it's inverted! cache hit ratio is actually better [sic: I think he meant worse based on the screenshot] with tool call pruning turned on
> but the net $ saved is only 5%
> kimi is a funny one - it has better cache hits with pruning on...but is also more expensive!
There was also another thread recently where he discussed that pruning improves user experience (models are smarter with less context) but I can't find it.
This can also be disabled in the config: https://opencode.ai/docs/config/#compaction
> our implementation is it only prunes calls from > 3 user messages ago, if context is > 40K, and only if there's at least 20K tokens to be removed
Seems reasonable to me and explains why I can have long sessions (way longer than with zed agents) while still hitting cache. Opencode is just missing per-provider TTL.
Ah, reminds me of good old "There are only 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
You quip, but LLM KV caching (from the harness side) is quite easy: You get a cache hit on stable prompt prefixes, period. That means you want to keep the prefix stable, and only append at the end of the conversation. Made up example: Don't put the git branch name into the system prompt part (that comes first), as whenever the branch name changes, that'd trigger a cache invalidation of the entire prompt.
Getting this right requires some care to not by accident modify the prefix, basically, and some design on communicating the things that can change (user configuration, working dir, git information, ...).
Conceptually the underlying general idea is to sort things based on stability if you can avoid recomputing properties of the stable part.
They're aware of the issues length and they're "looking into a solution".
I want to know if I'm missing something cool!
At the end, cache hit rate is like 99.5% if Novita is not having issues.
For official DeepSeek API, 99.9% or something.
Custom harness that never compacts or otherwise doctors the history.
On the sheer performance it’s comparable to Opus ?
./cost.py amount-2026-5.csv 0.3 3.75 15
input_cache_hit_tokens: 472,971,520 tokens -> $141.8915
input_cache_miss_tokens: 13,299,013 tokens -> $49.8713
output_tokens: 3,334,962 tokens -> $50.0244
cache hit rate: 97.27% (472,971,520/486,270,533)
cache miss rate: 2.73% (13,299,013/486,270,533)
total: $241.7872
All of this usage was with an OpenCode subagent exclusively.Total input token = input + cache read + cache write Cache hit rate = cache read / total input token.
That is 71% in my very limited use of opencode.
The default is just 8GB and a full 128k context for the dense model can take most of that. So then comes an agent and causes eviction and subsequent cache miss.
Bumped the cache size (--cram IIRC) up to 48GB and had much better results.
I switched to vLLM and those went away. Need to look at my opencode config and adjust some others based on things I see here
Can you share the bridge. DeepSeek v4 is awesome paired with claude-code or opencode. I found that claude code costs me less than opencode and I am presuming this is due to a better engineered harness.
I only used it for a few hours to play around with stuff before the quota issue was fixed and I could resume using GPT models, and the bridge was coded by DeepSeek-V4-Flash-IQ2XXS + DwarfStar4 locally, I take no responsibility for what might happen with your computer or you, during usage or just reading the code.
Edit: heh, like don't look at line 117 for example where seemingly it likes to handle misspellings in the .env file which totally wasn't my fault for typo'ing the API key in that file... I'm sure there are tons of sharp edges and dumb stuff in there.
DeepSeek is good, Claude is better, at least IMHO. Deepseek is a lot cheaper though :)
Obviously, if you do deal with any sort of secrets, then using local LLMs over OpenAI, Anthropic, DeepSeek or whoever is obviously preferred, and in the case of personal data of users, probably a requirement.
Getting the source code of facebook or instagram doesn't mean you could compete with them.
I work for a company that has built relationship with event organizers over the past 10 years. The code I maintain could be written from scratch in maybe 2-3 months even though it was built over the past 10 years but besides that you have frontend / DB / hardware / logistics etc
Still, “Getting the source code of facebook or instagram doesn't mean you could compete with them.” I think to giants like that, having access to their source code could open up some very interesting loop holes for manipulating the ranking algorithms, or even security vulnerabilities.
All this to say, not even subject matter experts necessarily appreciate the risk involved in their work
Honestly I'd love to love the US again, but basically after Obama things have just gone down and down and no soul will trust the US again in the next generation or two.
The former relates to a specific investigation about potential criminal activity, the latter relates to broad illegal activity committed by the government itself unrelated to any specific case.
The US has no laws on the books forcing companies to wantonly give intellectual property and other espionage level material back to the government. If they did, no one would use cloud providers.
To avoid this, you can run your own hosted machine in a colocation facility, because in the US, people do have reduced rights when their data is controlled by a third party versus being controlled by themselves. Its the same as if the data was in your house, they would need a search warrant to obtain it, but when its at a Azure or AWS datacenter not controlled by you, your privacy rights are reduced by doing this.
I think many are trying to move away from US providers actually. FISA section 702 and the current administrations liberties taken towards international law are not helping. The trust problem is real.
Not sure I’d trust China with anything onshore. But offshore, it does seem they play by the rules, because it pragmatically serves the stability of the people. China has not started wars in the past 50 years or so. By that logic one may assume they’d not abuse the arguably broad powers over Chinese firms abroad to risk one now.
In a world where rules are increasingly less important how states use power matters more to me than how they claim to be monitored.
EU has literal directive about location of data which has to be located in the EU and not in the USA, because the data are in danger otherwise.
Correct. They come up on Twitter daily. Pardon, this other truth bullshit.
I don't care about the US more than about Russia or China these days.
They are definitely not our allies anymore.
If you're concerned about espionage then the only solution is host the models yourself, which again, only open-weight models like Deepseek enable you to do this.
Same with codex? codex-rs at least, is a TUI as well, it does run a "app-server" in the background, that the TUI actually interacts with, but that's just an implementation detail. Also makes it easy to hook in your own programs to fire of codex "headless" sessions even without the TUI.
In the end I had Claude produce a one-page html file that was 95% of the way there and it took minor editing to clearly explain the intent of the feature.
Now, that is overly critical, I’m sure their heart is in the right place. But a simpler website would do :)
The article is about an open source agent harness, Reasonix, that is built to leverage the DeepSeek native api.
There’s no company here. No design budget. These people are graciously sharing a project they made in their free time.
(The series of ‘motherfucking websites’ comes to mind, they were all very readable and simple, even if satire.)
That doesn't say much about any model though. For starters, any software engineer can tell you that leaving out features can drastically simplify any project.
If you think that dsv4 behaves differently enough from the aggregate of other models, submit a PR with a patch to special case that to your harness of choice with evidence. Just blindly assuming "append only all the time because cache" is a waste of everyone's time.
Your agent harness, brokk, looks great. I’m going to try it this morning.
The company that had that acrimonious split from OpenCode. Still, fully written in Go and compared to node-based harnesses, uses 1/5th the RAM. (At least for me.)
Works with any provider (including OpenRouter free ones).
No conflict of interest here, just a happy "customer" of this excellent resource.
The value and ease of development that slow interpreted languages used to offer is disappearing. New languages have all the nice things built in, or rather, our 1am pager alarms are starting to make us mad.
There's Google's genkit, charmbracelet's fantasy and LangChainGo. Each has ugly hacks and omissions. Then handling slice streaming of data into Elm architecture (bubbletea) is also complex.
So in theory nothing stand against but in practice one has to get quite low to the ground to get anything done.
Also: Golang agent exist! It's called crush and is developed by charmbracelet people. It's so-so though I prefer Pi myself.
I'm concerned since i really want SOTA reasoning, but DeepSeek still has me interested.
I think you should give other models a try and see how much they differ from SOTA models. I did this and realized, even Qwen-2.5-Max was enough. I am sure even Claude Sonnet 3.5 is enough for things I play around with. I am not really striving for fields medal in Mathematics.
The "cost" is dumb models is just so high for me. Eg every bad decision they make increases my frustration quite a bit. Despite putting a lot of effort into my workflow to help reduce the number of decisions they make, they always will. So my hedge is always against that.. trying to reduce how insane they can be heh.
After about 6 hours, both ultimately failed to fully RE, however, there were some drastic differences:
DS stopped every 30 minutes or so, saying it did full RE and it should all work now, while in fact, it didn't complete even 1% of it. It also looked for shortcuts again and again, despite me prompting heavily that the specific shortcut may not be used. It was a complete and utter failure.
GPT-5.5, on the other hand, blew me away. It just did the right things, didn't jump to next steps until it was sure it completed the initial layers and had a full understanding of what's required. The only time I prompted it during the 6 hours was when I saw it going in the right direction and I could nudge it slightly towards an even better way. I never felt I was fighting it. Okay, maybe a little bit - after compaction, it sometimes would go on a "no I'm not helping you with reverse engineering" tangent, but it would resolve in a clean session.
I cancelled my Claude subscription a month ago, so I haven't tested that, but DeepSeek has reminded me a lot of how I worked with Opus 4.6/4.7. Which perhaps could be a positive sign to some, but GPT-5.5 showed me that the way claude/ds work is just way too annoying.
I suspect for people doing more... website ... type development, the more "yeet this into existence" style of Opus feels preferable.
With Claude I was constantly jamming my finger on the escape key "wait, you did what?! based on what proof?!"
This is my experience with non-SOTA models across the board. When you try them on little tasks and they work it feels amazing, but then you go deeper and you're back to going in loops and fighting the model for hours.
Switching back to a SOTA model immediately yields progress again.
When I read all of the comments from people saying they can't tell a difference between Opus and <insert open weight model here> I don't know if they haven't really used it much yet, or if they're just not doing anything complicated.
[^1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...
I used to surf the three big players frequently and got really tired of the effort needed to steer some models. In the end i ended up sticking with Claude because it required less steering effort. While not strictly reasoning, a models ability to follow clear directions consistently is something i'd consider part of its SOTA capabilities.
Eventually i just tired of exploring. I just want stability.
Which ironically is why i'm thinking about moving from Claude. The very basic IDE/-p usage getting removed from my plan is a UX stability issue. I'm trying to progressively improve my workflows and efficiency, not have to establish a new foundation anytime something shifts. Quite frustrating.
There's always the option of using Anthropic's models for some tasks like planning and then just hand over the implementation task to something like DeepSeek. Across different tools, a Markdown plan works pretty okay. That's what I'm planning to do if I go from the 5x Max subscription down to the Pro.
I am also writing a launcher that makes using 3rd party providers with Claude Code easy (https://ccode.kronis.dev) and I already have a local proxy up and running, just not dynamic model switching yet. Though it shouldn't be too hard to add, will probably be there within a week or two, depending on my schedule.
I don't think it's wise to leave Anthropic altogether because their models are great (and a subscription gives you features like Remote Control which I like), but switching tiers and maybe saving a bit of money seems viable! On the other hand, you do need a quality baseline, because I remember using Cerebras with GLM 4.6 way back and there was a bit too much slop.
I’ve gone that route. I really wanted to stop using Claude, but Deepseek v4 Pro and Kimi 2.6 didn’t do the job. For a lot of coding tasks or well-specced plans, maybe… but then that’s a plan made by Opus anyway.
Even Sonnet is sometimes not worth the trouble. Opus is very thorough and reviews its own mistakes quite well. Catches a lot of edge cases.
I’m not saying we shouldn’t try other things — I did! —, but it’s more or less okay that people just like Claude Code subscriptions? The back and forth I had with Kimi on a small feature came out to ~1.8€, which is 10% of my Claude subscription each month. And that was a single session. CC with Serena uses tokens fairly well.
If you think short-term and only about yourself, paying for SOTA regardless of how many military contracts the lab has is the best thing, but paying for open models is both better ethically, and for a future where AI belongs to everyone and not just to Altman et al.
I have been using it for a while, and I wholeheartedly agree. imo, it is as good as codex or claude which I also use. It is a winner in the cost-sensitive tier, and if some startup could put it together with data-retention in mind, it could be a great product sold to the enterprise, as data-retention and privacy are the main issues for the coding-assistant usecase.
So I use Deepseek Pro on the $20 Ollama Cloud plan and it’s really not that far behind and I never triggered the plan’s limits.
It’s like 10-15% less powerful but costs 10 times less.
Totally worth it. I prefer Opus because my employer pays for it but I would personally never pay 10 times more for it.
I have got unlimited Claude Opus at work as well.
I was really having a hard time deciding between the Ollama and OpenCode plans for personal use, I couldn't really understand how much usage I would get with the Ollama plan, so in the end I went with OpenCode and I have never hit the limits despite using it most evenings and weekends for several hours.
Maybe I should try DS4p?
People are out there using frontier intelligence to make responsive headers and weekly work reports. Absolutely don’t need the latest and greatest models for this stuff
It is my default model at the moment. I'm not doing anything too complex though. I honestly found more expensive models like Qwen 3.6 to fail in tasks Deepseek nails.
I'm interested in knowing what people are using for tasks which require a bit more thinking. Kimi 2.6? Qwen 3.7? GLM 5.1?
The things that I use Opus for at work is finding bugs in about ~200k lines of microservices and libraries in a niche language. So, we will get these bug reports that are missing context, can't easily be reproduced on our dev server, and are usually the result of something deep in multiple services/libraries combining with very custom configs. I can ask Opus (max thinking) to find what could cause the bug, and it usually nails it in a few hours (would take me 1-2 weeks to trace it myself). The end result will be like less than 10 lines of code to fix it, some tests to reproduce the bug and a nice report explaining it, so it can be checked in an hour or two.
> Independent open-source project · not affiliated with DeepSeek
Yes, but a lot of harnesses change previous context. E.g. the system prompt injects the current time/date, working directory, files in the working directory, etc. Compaction also changes the whole previous context. I _think_ changing the list of tools also invalidates cache, so invoking a subagent with different tools would invalidate the cache.
My vague impression is that it's in a similar vein to functional programming languages. It generally disallows doing things that lead to bugs (cache misses in this case), and presumably allows you to do those things in a way that makes it much clearer that this is likely to cause cache misses. I would guess that in this paradigm, you don't mutate your existing session, you derive a new session by mutating the prior context into a new context.
This was actually surprising to me when I learned about it as I have never worked with (or built) any cache working like that before.
Here is an example of Reasonix resuming a feature in a project managed by Codex thus far.
> ● reasoning · 107 tok · 2 ¶ · 3.2s v4-flash ↳ The search engine is still completely broken - returning Chinese dictionary entries, Kick streamer profiles,… c. It's not just broken for ** content, it's broken for everything. The error from earlier said to switch … ⋯ 2 lines elided I need to be honest with the user that `web_search` is fundamentally non-functional regardless of what query I try. Let me clearly communicate this and suggest alternatives.
‹ reply 198 tok · 92 t/s v4-flash I'm sorry — I've now tried multiple queries across multiple attempts this session, and the web_search tool is fundamentally broken regardless of query phrasing. Every single query returns either Chinese dictionary entries, Kick streamers, YouTube homepages, or Swiss restaurants. Nothing even remotely related to the search terms.
The error message from the tool itself says:
"switch engine with /search-engine bing|searxng|metaso|tavily|perplexity|exa"
This is a user-side setting — I can't change it with any tool I have. Could you run this in chat?/search-engine bing
That would switch from the current broken Chinese-language engine to Bing (or one of the others like perplexity ). Once that's done, I'll batch-search all 322 centers efficiently — the web_search tool is the fastest way to do this since I can search multiple names in parallel.
From the FAQ, I see:
>Can I point it at a self-hosted / private DeepSeek endpoint?
>Yes. Since 0.30 we accept non-standard key prefixes for self-hosted DeepSeek endpoints. Just point `baseUrl` at your internal address — the loop, cache strategy, and tool protocol are unchanged.
But my question is: If I use Reasonix to talk to a deepseek endpoint through openrouter, am I still getting the cache-hit benifits of this agent harness?
Maybe users reporting otherwise are just looking at their client reports which wouldn't be able to tell the difference.
Is this really the behavior you want? Yes, doing tool-result clearing and such will blow your cache, but if you do it only occasionally, it's still likely a win. Yes, cache hits are good, but not so good that it's okay to be profligate with context to preserve those precious, precious KVs.
Pretty shaky datapoint though...don't use it as primary model
Overall I find their API design and docs so messy. It's a shame, since it's the main entrypoint to using their service.
AI marketing slop. This is how all models and coding harnesses work, isn't it?
The author claims (in another AI-written post):
> LangChain — along with every generic agent framework I checked — rebuilds the prompt every turn. Timestamps get injected. History gets reordered. Tool schemas re-serialize with different whitespace.
I haven't touched LangChain in a long, long time, but don't think any of the current harnesses, Claude Code, Pi, Crush, OpenCode etc do that except if you change configuration? Keeping the context stable for caching is a very basic principle and not a wild innovation.
This posing as DeepSeek-specific is also a mystery.
It's bad enough that I'm working on guardrails at the harness level because prompting appears to be useless.
Do you have the same issue?
Now that you mention it, though, I have seen it do a few things that weren't in the plan. The reviewer caught them, though, so they didn't cause a problem, and it's so cheap that overall it's a massive improvement.
(that is, different places on the Pareto efficiency graph)
trying reasonix with direct api..
This is still art as much as science and the different harnesses take different approaches.
Extremely pro consumer tool. I have been hammering it hard with 97% cache utilization and barely $0.03 dollar spent for me constantly exploring a codebase.
Have you tried using Deepseek API via other agents? This project tbh looks like a S-tier slop
> Tool arguments the model produces occasionally have JSON typos, unclosed quotes, or shape mismatches. Reasonix runs a schema-aware repair pass before dispatch so malformed args still execute.
So Deepseek API doesn't have a structured output option where you give a grammar and the model promises the output will follow this grammar?
Or it does, but it's buggy?
Is this improving the cache hit and hence overall efficiency of coding workflows?
Does it also let me host a local llm (deepseek)? What are model min requirements for this?
my fork of oh my pi that i have a lot of experiments in, is lterally designed to only work well with models that have decent reasoning levels, like deep seek models. check it out!
https://github.com/cartazio/oh-punkin-pi/blob/main/scripts/b... — thats the install script for after clone
fair warning: tis my dog food test bed as i build even fancier stuff
Any comments on what you can or cannot rely on it for relative to cc and codex would be appreciated too!
I haven't had a need for any extensions though. Maybe subagents, but I solved that with tmux. For all the rest, I just use "skills".
I specifically use multiple different models and providers, so this wouldn't be useful for me.
And it contributes to the problem of each person vibe-coding their own, incompatible, half-baked tool in a space, instead of contributing to a small set of tools and expanding them.
It'd be better to just extend an existing tool.
Will give a go and see how cache behaves
That's the pinnacle of AI slop over engineered garbage in my opinion. All of that information is noise.
Any feedback on how to make it less "shitty"? I feel like doing some vibe coding tonight.
These sites have the immediate scent of 'high design', with errors that no 'high designer' would dare make.
The italics give me nausea. Text promoted with orange fill is seemingly random. There is no thought behind the combination of art and copy. Random smattering of Title Case and Sentence case and lower case. A lack of commitment to a full stop Widowed H1s. H1s with random spaces .
At the same time, if I hammer CMD - to 25%, it looks fancy. Perhaps nobody gives a fuck.
That said, I'm excited to try this tool!
"Independent open-source project · not affiliated with DeepSeek" "Reasonix only targets DeepSeek because..." "Why DeepSeek only? Can I swap to Claude / GPT? It's a design choice, not a limitation"
The lady doth protest too much, methinks?
Nicely timed shortly after the making the rebate permanent anouncement.
Could just be Chinese devs trying to help western devs with some software and a western facing marketing campaign to raise awareness. Could be DeepSeek astroturfing. Could be "someone" in China trying to get more access to western data.
Who knows?
It's the agentic era, pick a better option
Just stop
Besides being even better at the caching, I'm not sure what benefits you'd get compared to just firing up OpenCode with the DeepSeek API yourself, it'll similarly do caching for sure and also "talks directly to api.deepseek.com" if that matters, and you'll get a much more mature harness.