LLMs are designed for text completion; the chat interface is basically a fine-tuning hack that frames prompting as a natural form of text completion, giving the average user a more "intuitive" interface (I don't even want to think about how many AI "enthusiasts" don't really understand this).
But with open/local models in particular: each instruct/chat interface is slightly different. There are tools that help mitigate this, but the more closely you're working with the model, the more likely you are to make a stupid mistake because you didn't understand some detail about how the instruct interface was fine-tuned.
Once you accept that LLMs are "auto-complete on steroids," you can get much better results by programming them the way they were naturally designed to work. It also helps a lot with prompt engineering, because you can more easily understand what the model's natural tendency is and work with that to generally get better results.
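To make that concrete, here's a toy sketch of chat-as-completion, assuming a ChatML-style template (used by Qwen and others); the exact special tokens vary by model, which is exactly where those fine-tuning details bite:

```python
# A toy sketch of chat-as-completion, assuming a ChatML-style template.
# Other instruct models use different special tokens; getting these wrong
# is the "stupid mistake" mentioned above.

def chatml_prompt(system: str, user: str) -> str:
    """Wrap a system/user exchange in ChatML markers. The 'reply' is just
    the model completing the text after the final assistant marker."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("You are a terse assistant.", "What is 2 + 2?"))
# Feed this string to a plain completion endpoint and the model "answers":
# it is only ever predicting the next tokens of this document.
```

There's no chat happening anywhere in the stack; the entire conversation is one long document the model keeps extending.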
It's funny, because a good chunk of my comments on HN these days are spent combating AI hype, but man, LLMs really are fascinating to work with if you approach them with a more clear-headed perspective.
This was the very recent past! Up until we got LLM-crazy in 2021, this was the primary thing deep learning papers produced: new models meant to solve very specific tasks.
It is one of the weirdest variations of people buying into too much hype.
They cannot sell themselves without concealing reality. This is not a new thing. There were a lot of suppressed projects in the blockchain industry: everyone denied that certain projects existed, most people never heard of them, and people talked as if the best coin in existence doing a measly 4 transactions per second were state of the art. Solutions like the "Lightning Network" don't actually work, but they're pitched as revolutionary. I bet there are more people shilling Bitcoin's Lightning Network than there are people actually using it. This is the power of centralized financial incentives: everyone ends up operating on top of a shared deception, "the official truth," which may not be true at all.
I really think using small models for a lot of small tasks is the best way forward, but it's not easy to orchestrate.
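For what I mean by orchestration, here's a hypothetical sketch: a dispatcher that routes each task to a specialist small model. The model names and the call_model() stub are made up, not any real API; the hard part in practice is that every backend and every model's instruct format is slightly different.

```python
# A hypothetical sketch of "many small models, many small tasks": route
# each task to a small model tuned for it instead of one large generalist.
# The model names and call_model() helper are illustrative, not a real API.

TASK_MODELS = {
    "classify":  "tiny-classifier-0.5b",   # hypothetical model names
    "summarize": "small-summarizer-1b",
    "extract":   "small-extractor-1b",
}

def call_model(model: str, prompt: str) -> str:
    # Stand-in for whatever local inference backend you use (llama.cpp,
    # Ollama, vLLM, ...). Each backend and model formats prompts slightly
    # differently, which is exactly where the orchestration pain shows up.
    return f"[{model}] completion for: {prompt[:40]}"

def run_task(task: str, payload: str) -> str:
    model = TASK_MODELS[task]  # dispatch to the specialist for this task
    return call_model(model, payload)

print(run_task("classify", "Is this email spam or not?"))
```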
A Pi has 4 cores and 16GB of memory these days, so running Qwen3 4B on a Pi is pretty comfortable: https://leebutterman.com/2025/11/01/prompt-optimization-on-a...
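For a sense of what that looks like in practice, here's a rough sketch using llama-cpp-python with a quantized GGUF build. The file name and settings are my assumptions, not from the linked post; a 4-bit quant of a 4B model fits comfortably in 16GB.

```python
# Rough sketch: a quantized 4B model on a 4-core, 16GB board via
# llama-cpp-python. The GGUF file name below is a hypothetical local path.

from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3-4b-q4_k_m.gguf",  # hypothetical quantized model file
    n_ctx=2048,    # modest context to stay well within memory
    n_threads=4,   # one thread per core
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: the sky is blue."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```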
If you need raw compute, I totally get it. Things like compiling the Linux kernel or training local models require a high level of thermal headroom, and the chassis has to dissipate heat in a manner that prevents throttling. In cases where you want the machine to act like a portable workstation, it makes sense that the form factor would need to be a little juiced up.
That said, computing is a whole lot more than just heavy development work. There are some domains that have a tightly-scoped set of inputs and require the user to interact in a very simple way. Something like responding to an email is a good example: typing "LGTM" requires very little screen area, and it requires no physical keyboard or active cooling. Checking the weather is similar: you don't need 16 inches of screen real estate to go from wondering if it's raining to seeing a cloud icon.
I say all this because portability is expensive. It's expensive not only in terms of back pain: maintaining the ecosystem required to run these machines gets pretty complicated. You either end up shelling out money for specialized backpacks or fighting for outlet space at a coffee shop just to keep the thing running. In either case, you're paying big money (and calorie) costs every time a user types "remind me to eat a sandwich."
I think the future will be full of much smaller devices. Some hardware to build these already exists, and you can even fit them in your pocket. This mode of deployment is inspiring to me, and I’m optimistic about a future where 6.1 inches is all you need.
Nobody actually wants more weights in their LLMs, right? They want the things to be “smarter” in some sense.
2023: My model is too big