Hacker News Clone

bloppeMar 27, 2026, 7:13 AM

Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.

bartreadMar 27, 2026, 11:00 AM

I agree. Also good for small changes that need to be applied consistently across an entire codebase.

I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated and also needed queries updating to exclude soft deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).

Of course, this is not hard to do manually but is is a bloody chore and tends toward error prone. But the agent made short work of it, for which I was very grateful.

CraigJPerryMar 27, 2026, 11:14 AM

Do you not end up breaking half the value of referential integrity doing it that way (e.g. you had to update all the queries but now you have a sharp edge in that all future queries need to remember to be soft delete aware. Not a blocker for sure, just a sharp edge).

You know your system better than me for sure, a random commenter on a website :-D your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete" and i felt compelled to give unsolicited and likely wrong opinion.

bartreadMar 27, 2026, 3:49 PM

Yeah, I did consider moving records to shadow tables, but - because of the nature of our data - it requires moving a lot of child records as well, so it's quite a lot of additional churn in WAL, and the same for restore. And this approach has its own challenges with referential integrity.

More than that, though: lots of queries for reporting, and the like, suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is it doesn't really eliminate complexity for us: just moves it elsehwere.

Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.

The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.

As I say, not to diss other approaches: in a different situation I might have chosen one of them.

My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.

For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.

andyferrisMar 27, 2026, 11:36 AM

I move the record to another _index_, generally.

It depends whether you reliably control all the DB client code, of course.

dakolliMar 27, 2026, 3:39 PM

must be something incredibly simple you're making out more complicated than it actually is, I've never seen an LLM do these things well.

bartreadMar 27, 2026, 3:51 PM

This is what gives me the warm fuzzies about the HN community: people jumping to wild conclusions about your domain and systems based on a 4 sentence comment. /s

sigmoid10Mar 27, 2026, 8:47 AM

Probably want to look at SWE bench pro or terminal bench 2. They cover these longer horizon tasks that need more than just writing a bit of code in one file. And SWE bench pro in particular it is not yet saturated like many other common benchmarks. Normal SWE and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo readme or press release.

jakozaurMar 27, 2026, 11:16 AM

Build systems are tested by CompileBench (Quesma's benchmark).

Disclaimer: I'm the founder.

BombthecatMar 27, 2026, 9:43 AM

Oh yes! I let my environments now be built by agents via kubectl / helm and let them debug issues.

It's amazing! Saves hours of work!

I create the basic helm configd settings etc and when there is a conflict or something not working I let an agent fix it!

seunosewaMar 27, 2026, 1:07 PM

Create it!

d0963319287Mar 27, 2026, 2:31 PM

[flagged]

mmaunderMar 27, 2026, 1:33 AM

I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence. The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.

vidarhMar 27, 2026, 11:42 AM

I get decent results with Kimi, but I agree with your overall premise. You do need to realise that while you can save money on a lot of tasks with those models, for the hardest tasks the "sticker price" of cost per million tokens isn't what matters.

It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Not just as much - they are more forgiving - but put it in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.

I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.

If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.

miroljubMar 27, 2026, 8:54 AM

> I’d encourage devs to use MiniMax, Kimi, etc for real world tasks that require intelligence.

I use MiniMax daily, mostly for coding tasks, using pi-coding-agent mostly.

> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.

I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.

> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.

Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.

What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.

When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.

Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.

The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.

tim-projectsMar 27, 2026, 9:15 AM

I've only been using free tokens for a year now. Gemini and they just dropped pro so I switched to minimax. Bit of a hurdle switching from Gemini-cli to kilo-cli, but now I can't really see too much difference.

If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.

I've not ever used Claude and people seem to rave about it. Maybe its good, but I doubt its $200/month good.

When I hit issues with these lower models I think hard about creating the right tooling - agnostic to the harness and I feel like maybe its more work but I can carry those tools to any setup going forward. That's how it was in the early Linux days so why change what clearly works?

bethekindMar 27, 2026, 2:29 PM

I've used Gemini and now claude. Both were meh until I found the superpowers skill. Will be trying chatgpt next month.

You can "feel" the llm being limited with Gemini, less so with Claude. Hopefully even less so with chatgpt

mongrelionMar 27, 2026, 10:26 AM

What is this 10€ per month subscription that you are talking about?

hariasMar 27, 2026, 10:34 AM

MiniMax token plan

https://platform.minimax.io/docs/guides/pricing-token-plan

throwa356262Mar 27, 2026, 2:06 PM

How is the speed and stability?

These small Chinese companies dont always have access to serious hardware.

moffkalastMar 27, 2026, 8:36 AM

Kimi's been one of my goto options lately and it oftentimes outperforms both Claude and GPT in debugging, finding the actual problem immediately while the other two flail around drunkenly.

It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.

smokelMar 27, 2026, 8:40 AM

And what tooling do you use with that? In my experience, there is quite a bit of difference between using, say, OpenCode, or the commercial offerings.

moffkalastMar 27, 2026, 8:43 AM

No tooling, just manual use. When doing these comparisons I gather and format all the data they need to figure out the problem, and paste the same thing into all models so it's a pretty even eval.

I doubt Kimi would do well with most harnesses, its outputs are pretty chaotic in terms of formatting but the inteligence is definitely there.

m00xMar 27, 2026, 6:23 AM

Minimax 2.7 is fine for most web stuff. It's slightly worse than Claude at backend, but works great for frontend.

They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

LeynosMar 27, 2026, 7:51 AM

Kimi is surprisingly good at Rust.

dvtMar 27, 2026, 6:30 AM

> They're all slop when the complexity is higher than a mid-tech intermediate engineer though.

This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.

stuaxoMar 27, 2026, 9:36 AM

10x more code output is 10x more review.

We've gone from doing the first 90% and then the second 90% to the first 90% and the second 990%, its exausting.

victorbjorklundMar 27, 2026, 8:36 AM

yea, they are still useful. But yea not close to Claude or GPT. But works good for simple changes. I use a combo of minimax and codex

mkw2000Mar 27, 2026, 5:23 AM

i find kimi to be very very good, minimax not so much

paulddraperMar 27, 2026, 5:37 AM

Agreed.

They are equivalent of frontier models 8+ months ago.

AbanoubRodolfMar 27, 2026, 3:55 AM

[dead]

selcukaMar 27, 2026, 1:04 AM

It's a race to the bottom. DeepSeek beats all others (single-shot), and it is ~50% cheaper than the cost of local electricity only.

> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot

> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline

DanielHallMar 27, 2026, 10:58 AM

These small models, having been fine-tuned for the test, achieve frighteningly high scores, yet perform abysmally in real-world scenarios.

memothonMar 26, 2026, 8:58 PM

I'm always skeptical because you can make it pass the benchmarks, then you use it and it is not practically useful unlike an extremely general model.

Cool work though, really excited for the potential of slimming down models.

tgibaMar 27, 2026, 8:16 AM

Despite skepticism I love to see experiments like that. If we all are able to run an open source model locally on mid-high end machines I'd be very happy.

electroglyphMar 27, 2026, 5:18 AM

what's with the weird "Geometric Lens routing" ?? sounds like a made up GPTism

alkonautMar 27, 2026, 2:14 PM

Great, it became a $1000 gpu while you were reading that.

b3ingMar 27, 2026, 4:19 AM

Will open source or local llms kill the big AI providers eventually? If so when? I can see maybe basic chat, not sure about coding and images yet

emp17344Mar 27, 2026, 2:46 AM

Yet more evidence that the harness matters more than the model.

bilekasMar 27, 2026, 2:00 PM

Where is a RTX 5060 Ti 16 GB 500$?

Edit : The 8GB seems to hit this price but 16 not so much.

hedgehogMar 27, 2026, 2:31 PM

They were $450 or so until recently, now... good luck.

riidomMar 27, 2026, 12:04 AM

Not a word about the tok/sec, unfortunately.

arjieMar 27, 2026, 1:43 AM

It won’t be meaningful considering the architecture: it’s a harness around the model that generated multiple solutions in multiple passes using the test to measure compliance and repair broken solutions. The resulting program won’t be streamed to you because it has existed for minutes as it goes through the cycle. It’s more for an asynchronous use-case.

I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.

Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.

Octoth0rpeMar 27, 2026, 3:07 AM

> A single patched llama-server runs on K3s, providing both generation with speculative decoding (~100 tok/s)

There seems to be at least some detail on that point.

dwa3592Mar 27, 2026, 2:29 PM

I wonder if it's working out for the benchmark problems only?

one expensive and hard lesson we will learn overtime is that you can't compress generality beyond a point.

bdbdbdbMar 27, 2026, 8:22 AM

This is the kind of innovation I love to see. The big AI companies days are numbered if we can have the same quality in house

AurornisMar 27, 2026, 3:43 PM

This AI-written project is running its own LiveCodeBench on a completely different methodology. The AI-written notes even admit it:

> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.

Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models they would also get significantly higher scores and they'd do it faster.

The AI-coded README is also full of signs of vibecoded slop like the discoveries that some of the complex structures implemented were not actually being used or contributing anything to the output.

0xbadcafebeeMar 27, 2026, 3:39 AM

This is specifically an experiment using ablation and multiple passes to improve the end result. Other techniques have been found that do this (like multiple passes through the same layers). But this technique - for this one specific model - seems to be both more performant, but also takes much longer, and requires more complexity. It's unlikely most people would use this technique, but it's interesting.

Temporary_31337Mar 27, 2026, 8:54 AM

the headline is pretty stupid - compares a model to a GPU that models run on. Somewhere in that data centre, some part of Sonnet infferencing runs on a 900$ GPU or maybe even cheaper Google tensor

15minutemailMar 27, 2026, 7:25 AM

74% on LCB from a single 5060 Ti. I've been paying Anthropic per task and this guy is running it on electricity money, 20 minutes per task is rough for anything interactive though.

subroutineMar 27, 2026, 7:55 AM

At 20 min per task you might as well code it yourself. Bill James needs to write a book on saber-metrics for LLM benchmarks.

josefritzishereMar 27, 2026, 1:25 PM

The core problem of AI remains unresolved, with no conceivable path to solvency. The issue is that AI isn't very good. It's OK, sometimes under very narrow criteria. But providing AI in reality very costly. Vague promises of it magically becoming better remain, very optimistic at best and still provide no route to solvency.

negativegateMar 26, 2026, 11:37 PM

Am I still SOL on AMD (9070 XT) when it comes to this stuff?

0xbadcafebeeMar 27, 2026, 3:48 AM

No? You can run any model that fits in its VRAM, and you can run larger models with layer/MoE offloading. Ask an AI what the best models you can run on that card are, then ask it for newer models than that. Ask what tuning options to pass to llama.cpp, and what the auto-tuning options are. Use ROCm builds.

It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.

metalliqazMar 27, 2026, 6:24 PM

I've been running local models on my 9070XT and I have never found ROCm to be faster than Vulkan

patsheadMar 27, 2026, 1:38 AM

No, but yes? OmniCoder 9B at Q6 fits on my 9070 XT with 200k+ tokens of context, and it works pretty well with OpenCode. It is for sure the best local model that I've managed to squeeze onto my GPU, and it even works at 120k context at Q3 on an 8GB RX 580 GPU.

I can't imagine trying to using this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.

Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!

hrmtst93837Mar 27, 2026, 6:10 PM

You can cram absurd context into a card now, but none of that matter once you hit the VRAM wall and the whole thing slows to a crawl. Cloud is cheaper. Local still matters for privacy and weird adapter stuff, but 'usable for work' is a much higher bar than 'looks decent on benchmarks' when the task is chewing through a repo without latency going to hell.

dangusMar 26, 2026, 11:45 PM

Well, this specific solution was only set up on specific hardware, and is Nvidia dependent, as the readme stares.

That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.

Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.

dannywMar 27, 2026, 1:50 AM

Unfortunately AMD is much worse with supporting AI features like FSR4 on older hardware generations, despite the capability and leaked INT8 models being there. Totally unlike NVIDIA.

It’s absurd I have to use open source programs to get INT8 FSR4 support.

sznioMar 27, 2026, 8:45 AM

On that topic, anyone here got a decent local coding AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in the memory and they run quick but are just a tad too dumb. I have 64GB of system ram so I can run larger models and they are at least coherent, but really slow compared to running from VRAM.

mongrelionMar 27, 2026, 11:14 AM

Not the answer that you are looking for, but I am a fellow AMD GPU owner, so I want to share my experience.

I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.

I tried various AI + VRAM calculators but nothing was as on the point as Huggingface's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fits in your card.

From the open source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.

The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you would expect it to run.

Which model have you tried locally? Also, out of curiosity, what is your host configuration?

[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5

kroatonMar 27, 2026, 1:26 PM

For autocomplete, Qwen 3.5 9B should be enough even at Q4_k_m. The upcoming coding/math Omnicoder-2 finetune might be useful (should be released in a few days).

Either that or just load up Qwen3.5-35B-A3B-Q4_K_S I'm serving it at about 40-50t/s on a 4070RTX Super 12GB + 64GB of RAM. The weights are 20.7GB + KV Cache (which should be lowered soon with the upcoming addition of TurboQuant).

mongrelionMar 27, 2026, 5:56 PM

I am definitely looking forward to TurboQuant. Makes me feel like my current setup is an investment that could pay over time. Imagine being able to run models like MiniMax M2.5 locally at Q4 levels. That would be swell.

superkuhMar 27, 2026, 1:04 AM

If anyone else was hoping this was using Q8 internally and that converted to Q4 it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB) and the the 16GB requirement is from other parts not a 14B@8bit+kv cache/etc you might guess.

limoceMar 27, 2026, 1:47 AM

The title should be "Adaptive Test-time Learning and Autonomous Specialization".

paxrel_aiMar 27, 2026, 2:01 PM

[dead]

eddie-wangMar 27, 2026, 3:40 AM

[dead]

itigges22Mar 27, 2026, 3:54 AM

[dead]

mergeshieldMar 27, 2026, 11:28 AM

[dead]

LuisvelAIMar 27, 2026, 10:12 AM

[flagged]

wiradikusumaMar 27, 2026, 3:18 AM

[dead]

felixagentaiMar 27, 2026, 2:14 AM

[flagged]

dangMar 27, 2026, 3:19 AM

We've banned this account. Please don't post automated comments to HN.

https://news.ycombinator.com/newsguidelines.html#generated

sayYayToLifeMar 27, 2026, 1:22 AM

[dead]

ozgurozkanMar 27, 2026, 1:31 AM

[dead]

bustahMar 27, 2026, 2:58 AM

[dead]

RazenganMar 27, 2026, 8:02 AM

Claude Code has been bleh or meh at best in my experience. There's so many posts on HN fawning about it lately that it could only be a guerrilla marketing campaign.

$500 GPU outperforms Claude Sonnet on coding benchmarks

Comments