If I'm using an MCMC algorithm to sample a probability distribution, I need to wait for my Markov chain to converge to a stationary distribution before sampling, sure.
But in no way is 'a good answer' a stationary state in the LLM Markov chain. If I continue running next-token prediction, I'm not going to start looping.
So for language, when I say "Bob has three apples, Jane gives him four, and Judy takes two; how many apples does Bob have?", we're actually pretty far from the part of the linguistic manifold where the correct answer is likely to be. As the chain wanders this space it gets closer, until it finally statistically follows a path like "the answer is...", and when it's sampling from that path it's in a much more likely neighborhood of the correct answer. That is, after wandering a bit, more and more of the possible paths land closer to where the actual answer lies than they would if we had just forced the model to choose early.
edit: Michael Betancourt has a great introduction to HMC which covers warm-up and the typical set https://arxiv.org/pdf/1701.02434 (he has a ton more content that dives much more deeply into the specifics)
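If you want to poke at this yourself, here's a rough sketch of the comparison (using Hugging Face transformers; gpt2 is just a stand-in model, and the hand-written intermediate sentence is mine, not something a model produced): score the correct answer token when the model is forced to answer immediately versus after some intermediate tokens.

    # Hedged sketch: compare P(answer token | question) against
    # P(answer token | question + some intermediate "wandering" text).
    # gpt2 is only a placeholder; the point is the comparison, not the numbers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def prob_of_next(prefix, target):
        ids = tok(prefix, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]          # logits for the very next token
        probs = torch.softmax(logits, dim=-1)
        target_id = tok(target, add_special_tokens=False).input_ids[0]
        return probs[target_id].item()

    q = "Bob has three apples, Jane gives him four and Judy takes two. How many apples does Bob have?"
    print(prob_of_next(q + " Answer:", " 5"))          # forced to choose early
    print(prob_of_next(q + " Three plus four is seven, minus two is five, so the answer is", " 5"))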
All of this burn-in stuff is designed to get your Markov chain to forget where it started.
But I don’t want to get from “how many apples does Bob have?” to a state where Bob and the apples are forgotten. I want to remember that state, and I probably want to stay close to it — not far away in the “typical set” of all language.
Are you implicitly conditioning the probability distribution or otherwise somehow cutting the manifold down? Then the analogy would be plausible to me, but I don’t understand what conditioning we’re doing and how the LLM respects that.
Or are you claiming that we want to travel to the “closest” high probability region somehow? So we’re not really doing burn-in but something a little more delicate?
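For reference, this is all burn-in is in a vanilla Metropolis-Hastings sampler (toy target and starting point made up); the discarded prefix exists purely so the kept samples stop depending on where the chain started:

    # Minimal Metropolis-Hastings sketch with a standard-normal target, just to
    # show what burn-in does: the first chunk of the chain is thrown away so the
    # kept samples don't remember the (deliberately bad) starting point.
    import math, random

    def log_target(x):
        return -0.5 * x * x              # unnormalized log-density of N(0, 1)

    x = 50.0                             # start far from the typical set
    samples = []
    for _ in range(20_000):
        proposal = x + random.gauss(0.0, 1.0)
        if math.log(random.random()) < log_target(proposal) - log_target(x):
            x = proposal                 # accept the move
        samples.append(x)

    burn_in = 2_000
    kept = samples[burn_in:]             # only these count as draws from the target
    print(sum(kept) / len(kept))         # close to 0 despite starting at 50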
A way to look at it is that you effectively have two model "heads" inside the LLM: one that generates, one that biases/steers.
The chain is initialised from your prompt: the generator part samples from the language distribution it has learned, while the sharpening/filtering part biases towards stuff that is likely to have this chain give high rewards in the end. So the model regurgitates all the context that is deemed possibly relevant based on traces from the training data (including "tool use", which then injects additional context), and all those tokens shift the latent state into something that is more and more typical of your query.
Importantly, attention acts as a selector and has multiple heads, and these specialize, so (simplified) one head can maintain focus on your query and "judge" the latent state, while the rest follow that Markov chain until some subset of the generated and tool-injected tokens gives enough signal to the "answer now" gate that the model flips into "summarizing" mode, which then uses the latent state of all of those tokens to actually generate the answer.
So you very much can think of it as repeatedly sampling from an MCMC with a bias and a learned stopping rule, and then having a model create the best possible combination of the traces, except that all this machinery is encoded in the same model weights, which get to reuse features between one another, for all the benefits and drawbacks that yields.
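A toy of that generator-plus-steering picture, with the two parts written as explicitly separate vectors (in the real thing they're entangled in the same weights, so treat this purely as an illustration; all numbers are made up):

    # Toy "generator + steering" combination at the logit level: the generator
    # says what's linguistically likely next, the steering score up-weights
    # tokens judged useful for eventually answering the query.
    import numpy as np

    vocab = ["the", "apples", "seven", "five", "banana"]
    gen_logits = np.array([2.0, 1.5, 0.2, 0.4, 0.1])    # language model alone
    steer_score = np.array([0.0, 0.5, 0.3, 2.0, -1.0])  # "does this help answer the query?"

    beta = 1.0                                           # strength of the steering part
    combined = gen_logits + beta * steer_score
    probs = np.exp(combined) / np.exp(combined).sum()    # softmax over the biased logits
    print(dict(zip(vocab, probs.round(3))))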
There was a paper close to when OF became a thing that showed that instead of doing CoT, you could just spend that token budget on K parallel shorter queries (by injecting something like "ok, to summarize" and "actually" to force completion) and pick the best one / majority vote. Since then RLHF has made longer traces more in-distribution (although there's another paper showing that, as of early 2025, you were trading reduced variance and peak performance, as well as loss of edge cases, for higher performance on common cases, though this might be ameliorated by now), but that's about the way it broke down in 2024-2025.
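In code, the K-shorter-samples-plus-vote idea looks roughly like this (self-consistency-style decoding; gpt2, the sampling settings, and the naive string vote are placeholders, not that paper's exact setup):

    # Hedged sketch: spend the token budget on K short sampled completions and
    # majority-vote instead of one long chain of thought. In practice you'd
    # parse the final answer out of each completion before voting.
    from collections import Counter
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Q: Bob has three apples, Jane gives him four and Judy takes two. How many does he have? A:"
    ids = tok(prompt, return_tensors="pt").input_ids

    outputs = model.generate(
        ids,
        do_sample=True,             # K independent short traces...
        temperature=0.8,
        max_new_tokens=32,
        num_return_sequences=8,
        pad_token_id=tok.eos_token_id,
    )
    answers = [tok.decode(o[ids.shape[1]:], skip_special_tokens=True).strip() for o in outputs]
    print(Counter(answers).most_common(1))   # ...then take the most common answer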
There is some work on using MCMC to sample from higher-probability regions of an LLM distribution [1], but that's a separate thing. Nobody doubts that an LLM is sampling from its target distribution from the first token it outputs.
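Not claiming this is what [1] does, but the generic flavor of MCMC over LLM outputs is Metropolis-Hastings in sequence space with the model's log-probability as the target; everything below (the scorer, the proposal, the vocabulary) is a stand-in:

    # Generic sketch of MCMC over whole sequences: propose a local edit, accept
    # with the Metropolis ratio computed from the (stand-in) LM log-probability.
    import math, random

    def lm_logprob(tokens):
        # placeholder for summing a real model's per-token log-probs
        return -float(len(set(tokens)))

    def propose(tokens):
        # symmetric proposal: resample one position from a toy vocabulary
        i = random.randrange(len(tokens))
        return tokens[:i] + [random.choice("abcde")] + tokens[i + 1:]

    current = list("edcba")
    for _ in range(1_000):
        candidate = propose(current)
        if math.log(random.random()) < lm_logprob(candidate) - lm_logprob(current):
            current = candidate          # drift toward higher-probability sequences
    print("".join(current))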
And how does that apply to LLMs, given that they don't do MCMC?
I don't see the equivalence to MCMC. It's not like we have a complex probability function that we are trying to sample from using a chain.
It's just logistic regression at each step.
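Concretely: given the hidden state for the current prefix, the next-token distribution is one linear map plus a softmax, i.e. a multinomial logistic regression over the vocabulary (toy sizes and random weights below, obviously):

    # Per-step view of next-token prediction as multinomial logistic regression.
    import numpy as np

    hidden = np.random.randn(16)        # final hidden state for the current prefix
    W = np.random.randn(100, 16)        # "unembedding" matrix (vocab of 100 here)
    b = np.zeros(100)

    logits = W @ hidden + b             # one linear map...
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                # ...plus a softmax
    next_token = np.random.choice(100, p=probs)
    print(next_token)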
Every response from an LLM is essentially the sampling of a Markov chain.
You're not just sampling from them like some MC cases.
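To make the Markov-chain reading concrete: take the state to be the whole prefix (up to the context window), and each next-token sample is one transition on that state space. The next_token_probs function below is just a stand-in for a real model's per-step output:

    # Autoregressive sampling as a Markov chain whose state is the full prefix.
    import random

    def next_token_probs(prefix):
        # placeholder for a real LLM's conditional next-token distribution
        return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

    state = ("<bos>",)                  # the chain's state: the entire prefix so far
    while state[-1] != "<eos>" and len(state) < 50:
        probs = next_token_probs(state)
        token = random.choices(list(probs), weights=list(probs.values()))[0]
        state = state + (token,)        # transition depends only on the current state
    print(" ".join(state))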
> If you let the model run a bit longer it enters a region close to the typical set and when it's ready to answer you have a high probability of getting a good answer.
What does "let the model run a bit longer" even mean in this context?
Well, no, it proves that Messi can reason efficiently without inner speech.
Humans are the only organism known to do System 2 thinking (which doesn't mean we're the only ones that do it, just that we don't know if whales do it), but System 2 is what the author is talking about when they refer to Chains of Thought.
System 1 is what they're referring to when they talk about Messi reacting to an unusual situation on the field.
Related anecdote: I tested myself for ADHD by taking amphetamines. I normally think by intuitive leaps from point to point, without doing the intermediate steps consciously. I found that during this experience my System 2 thinking was fast enough to follow and I actually experienced proper chains of thought. Or I was whizzing my tits off and hallucinated the whole thing. Not sure yet. I should repeat the experiment.
You can't test yourself for ADHD by taking amphetamines because they have profound effects on everyone - there's a reason why stimulants are some of the most popular recreational drugs. They can make everyone feel smarter, more productive, like you're on top of the world, especially at recreational doses.
When it comes to therapeutic doses for treating ADHD, the effects of the medication should be very subtle, you shouldn't "feel it working" like you've just taken that pill from Limitless. It can produce some euphoria or other recreational effects initially but they're supposed to disappear within a couple of days or weeks, otherwise your dosage is considered to be too high.
You cannot test yourself for ADHD - you need to do the actual work that is required for a proper diagnosis. And you cannot test yourself for anything at all by taking amphetamines. Seriously. You were using them recreationally. When you post stuff like this online, people who believe what you write could get hurt.
Priming has issues with repeatability in experiments.
To believe that "debunks" the entirety of heuristics and bias is completely absurd. The author literally won a Nobel prize for Prospect Theory.
I think that book sucks personally. Just an incredibly dull book to read. The online social contagion that it is a work of fiction might actually be more interesting.
That's, by the way, something LLMs are very much not good at. They possess a superhuman amount of knowledge covering all areas of academia, including math, science, philosophy, engineering, computer science, social sciences and so on, but that doesn't cause them to come up with novel hypotheses and theories. Something that would be easy for a smart human even with a fraction of the academic knowledge of an LLM.
Pedantic maybe -- but does this need two plurals?
Sort of like the painters who came after the "Realism" era.
If true, this is somehow meant to be taken as a rule/law that language and thoughts are fundamentally different things.
No wonder why he got sacked, lol.
I have spent a lot of time experimenting with Chain of Thought professionally and I have yet to see any evidence to suggest that what's happening with CoT is any more (or less) than this. If you let the model run a bit longer it enters a region close to the typical set and when it's ready to answer you have a high probability of getting a good answer.
There's absolutely no "reasoning" going on here, except that sometimes sampling from the typical set near the region of your answer is going to look very similar to how humans reason before coming up with an answer.