LLMs and Markov Chains


Pattern matching is a dangerous business, but this is now the second or third time I’ve seen LLMs compared to Markov chains in the span of a few weeks.

I think people who want to characterize that as merely the output of a big semantic forest being used to generate markov chain-style output. It’s not that simple. Or, perhaps flip the problem on its head: if what this thing is doing is rolling dice and doing a random tree walk through a huge database of billions of word-sequences, we need to start talking about what humans do that’s substantially different or better. …

One thought I had one night, which stopped me dead in my tracks, for a while: if humans are so freakin’ predictable that you can put a measly couple billion nodes in a markov chain (<- that is not what is happening here) and predict what I’m going to say next, I don’t think I should play poker against the AI, either.

This seems to be an idea that’s floating out there, and while Ranum is not saying the two are equivalent, it’s now in the scientific record. Meanwhile, I’ve been using Markov chains for, oh, at least seven years, so I can claim to have some knowledge of them. Alas, I didn’t really define what a Markov chain was back then (and I capitalized “Chain”). Let’s fix half of that.

As usual, we’ve got to define our terms. The old post was titled “Why MCMC Works,” and the third link in that post is to another article that defines “Markov chain Monte Carlo.” Ranum, though, invokes “Markov chains,” as does that preprint I linked to. When lay people say “Markov chains” they usually mean “Markov chain Monte Carlo,” but you’ve probably guessed by now that they’re not quite the same. I bet you’ve also sussed out that “Markov chain Monte Carlo” is a type of “Markov chain,” but the reverse isn’t true. If so, congrats!

Markov Chains and MCMC

We’ll work backwards, from some of the original source code I linked to in that post.

const burnin = 10                      // throw away this many iterations
func SimpleMCMC(x float64) float64 {   // input: a real number between 0 and 1.
                                       // output: a random real number between 0 and 1.
	retVal := x         // set the current state to the input value
	retPDF := PDF(x)    // calculate how likely the current state is

	for i := 0; i < burnin; i++ {          // run MCMC "burnin" times
		candidate := random.Float64()  // pick a random potential future state
		candPDF := PDF(candidate)      // calculate how likely that state is

		// if the new state is more likely than the old, or
		// a random value between 0 and 1 is less than the likelihood ratio
		// of the proposed future state to the current state
		if (candPDF >= retPDF) || (random.Float64() < (candPDF / retPDF)) {
			retVal = candidate  // transition to the future state
			retPDF = candPDF    // save the likelihood so we don't have to recalculate it
		}
                // otherwise, remain in the current state
	}

	return retVal   // save the current state
}

This is unorthodox MCMC, as typically you return a chain of states rather than a single one, but it’ll do. Notice there isn’t a lot of “memory” here, basically just a single real number that’s greater than zero but less than one. This is a hallmark of both Markov chains and MCMC: where you go next only depends on where you are right now. Where we were two, twenty, or two hundred billion steps ago has no influence on where we go next. By extension, the same holds in reverse: where you were in the previous step only depends on where you are now, and telling me where you go two, twenty, and so on steps into the future doesn’t give me any extra information about where you came from.

That might seem odd. Shouldn’t we be able to place limits on where we could have come from, based on where we ended up? We know which transitions are possible, and which are not, after all. But flip that thinking around: shouldn’t we also be able to place limits on where we could end up in future, based on where we are? Both assume a bird’s eye view of the entire system, implicitly adding in prior history. For a proper Markov chain, none of that matters at the mechanistic level.

Another consequence is that the choice of next state is absolutely random, but not necessarily equiprobable. A Markov chain can be biased towards picking one state over another, to the point that there’s a 100% chance it moves to a particular state. Nonetheless, if one state’s transitions always have a 70/30 split between two potential future states, then every time we find ourselves in that state again it must be the case that, in the long run, about 70 out of every 100 transitions are to that first potential future state. If that weren’t the case, then our path through other states would influence our next choice. Thus no matter how closely you study a Markov chain, you’ll never uncover the actual transition that will or will not occur at any specific time, unless the transition probability is 0% or 100%.
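
If that’s hard to picture, here’s a minimal sketch of a traditional Markov chain, in the same spirit as the code above. The weather states and transition table are invented purely for illustration; the important part is that step() consults nothing but the current state and the fixed table:

package main

import (
	"fmt"
	"math/rand"
)

// A toy chain with three states and fixed transition probabilities. Each row
// of the table must sum to 1; the 70/30 split from the text shows up in row 0.
var states = []string{"sunny", "cloudy", "rainy"}
var transitions = [][]float64{
	{0.0, 0.7, 0.3}, // from sunny: 70% chance of cloudy, 30% of rainy
	{0.5, 0.0, 0.5}, // from cloudy
	{0.2, 0.8, 0.0}, // from rainy
}

// step picks the next state using only the current one: no history involved.
func step(current int) int {
	r := rand.Float64()
	cumulative := 0.0
	for next, p := range transitions[current] {
		cumulative += p
		if r < cumulative {
			return next
		}
	}
	return current // only reachable through floating-point error
}

func main() {
	state := 0 // start off sunny
	for i := 0; i < 10; i++ {
		state = step(state)
		fmt.Print(states[state], " ")
	}
	fmt.Println()
}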

It’s around here that MCMC breaks away. The potential number of states can be infinite, in theory, if a state is defined as a point in a continuous parameter space. Traditional Markov chains also tend to fix the transition probabilities, so you’re given the same transition probabilities should you ever return to a state. MCMC’s probabilities are instead recalculated at each transition point, based on the current and potential future state. As I’ve pointed out before, the rule to always transition if the potential future state is more likely draws the chain towards the most likely state. Working against that, the rule that flips a random number allows the current state to wander off to something less likely, with a success rate proportional to the relative likelihood. Thus MCMC is a “crummy random number generator,” as I like to put it, for an arbitrary probability distribution. Markov chains make no such promise.

Generative Pre-trained Transformers

To be honest, half the reason I started this blog post was because of this one line:

But GPT has studied all of humanity’s moves and knows which moves were Bonaparte’s and which were Marshal Wurmser’s.

Ranum isn’t the only person to do this; lay people see no problem shortening “ChatGPT” to “GPT.” But if you know a bit about how LLMs work, you know the former is a brand name while the latter is the technology behind large language models as we know them today.

Again, let’s rewind a bit. The neural networks of yore were one-and-done affairs: put in an input, get an output back. Plus, artificial neurons were organized into layers that could only communicate with the next layer along. Natural neurons have no such restrictions: they deal with continuous streams of data coming in and happily form circular networks.

The first attempt to deal with this came via Recurrent Neural Networks. Artificial neurons were given the ability to communicate with themselves, but on a delay. This made it easier to deal with sequential data, like sound, video, or text. Unfortunately, training these networks was a massive pain with a low success rate. Ways to fix this were developed, but another approach wound up being more successful.

Transformers are based on the idea that not all inputs are created equal. For instance, I often use the phrase “for instance” to signal that I’m shifting from talking about generalities to a specific instance. This creates a smoother transition and flags that I’m following the same train of thought, but is it really necessary? Most people would have correctly figured out what I was doing without those two words. Conversely, drop all instances of “Markov chains” from this post and it becomes quite a bit tougher to follow me. We can take advantage of that while training a neural net, by weighting the input data differently to focus the “attention” of that training. This effectively shrinks the neural network and allows outliers to be de-emphasized, so training is both faster and easier.

The catch, of course, is that we’re now on the hook for figuring out how to weight incoming atoms of data, but we can solve that by just applying the same training algorithms. Real-life transformers have gotten quite sophisticated about attention, invoking multiple different weightings in parallel, weighting pairs of data atoms, and adding info on where those highly-weighted atoms show up within the input. That location data allows transformers to handle sequences.
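
If you’d like something more concrete, here’s a toy sketch of that weighting trick. It assumes each data atom has already been turned into a short vector, and it skips the learned projections, the multiple parallel weightings, and the positional information that real transformers use; all it shows is “score each input against a query, turn the scores into weights, and take the weighted average”:

package main

import (
	"fmt"
	"math"
)

// softmax turns raw relevance scores into weights that sum to one.
func softmax(scores []float64) []float64 {
	max := scores[0]
	for _, s := range scores {
		if s > max {
			max = s
		}
	}
	out := make([]float64, len(scores))
	sum := 0.0
	for i, s := range scores {
		out[i] = math.Exp(s - max)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// dot measures how well two vectors line up.
func dot(a, b []float64) float64 {
	total := 0.0
	for i := range a {
		total += a[i] * b[i]
	}
	return total
}

// attend scores every input against the query, converts the scores into
// weights, and returns the weighted average: inputs that resemble the query
// dominate the result, while the rest are de-emphasized.
func attend(query []float64, inputs [][]float64) []float64 {
	scale := math.Sqrt(float64(len(query)))
	scores := make([]float64, len(inputs))
	for i, in := range inputs {
		scores[i] = dot(query, in) / scale
	}
	weights := softmax(scores)
	out := make([]float64, len(query))
	for i, in := range inputs {
		for j := range in {
			out[j] += weights[i] * in[j]
		}
	}
	return out
}

func main() {
	inputs := [][]float64{{1, 0}, {0, 1}, {0.9, 0.1}}
	fmt.Println(attend([]float64{1, 0}, inputs)) // pulled towards the first and third inputs
}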

That finally gets us to “Generative Pre-trained Transformer.” What I’ve been calling “data atoms” are better known as “tokens.” These form a sort of “super-alphabet,” typically consisting of tens or hundreds of thousands of items. The output is tens or hundreds of thousands of real numbers, each interpreted as the probability of a specific token being next. You don’t use hand-crafted datasets with known answers when training, but instead a massive collection of text documents you’ve scraped off the internet (or pirated). Break those texts into chunks of input, record the next token after each chunk as the “correct” output, and train until your model does a good enough job predicting those next tokens. The last two sentences are the “GP” part of “GPT,” but they still aren’t enough to create ChatGPT. OpenAI also does “discriminative fine-tuning,” which amounts to additional training on artificially-created texts that correspond to tasks you think users will toss at the model. Hence why ChatGPT was excellent at adding two numbers but struggled to multiply two-digit numbers: OpenAI figured people would treat ChatGPT like a calculator, and in anticipation likely fed in an endless fountain of training texts that just consisted of adding two numbers. Multiplication didn’t get the same treatment, or it did but the training didn’t take.
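
To make “break those texts into chunks” concrete, here’s a tiny sketch of how those training pairs could be assembled. The three-token window and six-word “corpus” are made up to keep the output readable; real training uses a learned subword vocabulary and contexts that run to thousands or millions of tokens:

package main

import "fmt"

// makeExamples slides a window over a token sequence, pairing each chunk of
// input with the token that immediately follows it.
func makeExamples(tokens []string, window int) (contexts [][]string, nextTokens []string) {
	for i := 0; i+window < len(tokens); i++ {
		contexts = append(contexts, tokens[i:i+window])   // the input chunk
		nextTokens = append(nextTokens, tokens[i+window]) // the "correct" next token
	}
	return
}

func main() {
	corpus := []string{"the", "cat", "sat", "on", "the", "mat"}
	contexts, next := makeExamples(corpus, 3)
	for i := range contexts {
		fmt.Println(contexts[i], "->", next[i])
	}
}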

Fake it Until You Make It

If you’ve been paying attention, you probably saw an obvious problem pass by four paragraphs ago, and as I kept rambling on and on you went through several stages of grief and descended into doubting you’d actually spotted a problem. Nope, you were right: RNNs and transformers are very non-Markovian by design. The last token alone doesn’t determine the next token out; temporal feedback and attention mean anywhere from two to a million tokens actually shape that choice. And if neither aligns with a Markov chain, then neither can be like MCMC.

There is a way to make them Markovian, however. Rather than think of GPTs as adding one token to the end of a train of others, treat the contents of the entire context window as a single point in the space of all possible states. The only valid state transitions correspond to ejecting the first token from the chain, shuffling every other token down, and placing a new token in the newly-created hole.
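
In code, one of those state transitions might look like this sketch; the five-token window is a stand-in for a real context window thousands to millions of tokens long:

package main

import "fmt"

// transition treats the whole context window as the state: it ejects the
// oldest token, shuffles the rest down, and drops the new token into the
// hole at the end, so the next state depends only on the current one.
func transition(state []string, newToken string) []string {
	next := make([]string, len(state))
	copy(next, state[1:])        // eject the first token, shuffle the others down
	next[len(next)-1] = newToken // fill the newly-created hole
	return next
}

func main() {
	state := []string{"the", "quick", "brown", "fox", "jumps"}
	fmt.Println(transition(state, "over")) // [quick brown fox jumps over]
}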

That’s a cheat, though, as Claude Shannon spotted about seventy-five years ago. If you treat English as a Markov chain between letters and the space character, it looks like nonsense. If you increase the size of the state space from $27$ to $27^2 = 729$ by focusing on pairs of characters, it looks more like natural English. A better approach is to use words instead of characters, which increases the state space to $\approx 100,000$, and moving up to pairs of words ($\approx 100,000^2$) leads to output text that’s surprisingly coherent.

the head and in frontal attack on an english writer that the character of this point is therefore another method for the letters that the time of who ever told the problem for an unexpected.
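
If you want to play with that yourself, here’s a toy version of the word-pair trick. The corpus here is a made-up stand-in, just as Shannon used whatever book was at hand:

package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// buildChain records, for every pair of adjacent words, which words followed
// that pair in the source text. The pair of words is the Markov state.
func buildChain(words []string) map[string][]string {
	chain := make(map[string][]string)
	for i := 0; i+2 < len(words); i++ {
		key := words[i] + " " + words[i+1]
		chain[key] = append(chain[key], words[i+2])
	}
	return chain
}

// generate walks the chain: pick a random recorded follower of the current
// pair, append it, and shift the pair forward by one word.
func generate(chain map[string][]string, first, second string, n int) string {
	out := []string{first, second}
	for i := 0; i < n; i++ {
		followers, ok := chain[out[len(out)-2]+" "+out[len(out)-1]]
		if !ok {
			break // this pair only ever appeared at the very end of the text
		}
		out = append(out, followers[rand.Intn(len(followers))])
	}
	return strings.Join(out, " ")
}

func main() {
	corpus := strings.Fields("the cat sat on the mat and the dog sat on the rug and the cat ran off")
	fmt.Println(generate(buildChain(corpus), "the", "cat", 12))
}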

If you create a large enough state space, you can model anything as a Markov chain. For ChatGPT 4.1, this state space has $100,000^{1,000,000} = 10^{5,000,000}$ unique states. There’s no way for a human to comprehend a space that large. The number of atoms in the visible universe is roughly $10^{80}$, so if you converted every atom to a universe like ours, then converted every atom in those universes into universes, you’d have to repeat that process roughly 62,500 times to reach the scale of this state space. There’s plenty of room to store every text that was or will ever be written in human history. However, every state transition in this space has either a 0% or 100% probability of happening, save a negligible number of corner cases where two or more texts happen to have the exact same sequence of 1,000,000 tokens. You don’t need to invoke probabilities at all (within epsilon), so it doesn’t seem fair to call that a Markov chain.
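
As an aside, here’s the arithmetic behind that universes-within-universes comparison, using only the figures already quoted above: the state space is $100,000^{1,000,000} = (10^5)^{1,000,000} = 10^{5,000,000}$, while nesting universes $k$ levels deep gives $(10^{80})^k = 10^{80k}$ atoms. Setting $80k = 5,000,000$ and solving yields $k = 62,500$.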

On top of that, what remains is nothing like MCMC. What’s the likelihood of any given state, input or output? Can we reject transitioning from one state to another? The latter is absolutely vital to MCMC, as it means values that are twice as likely as some other, according to the target distribution, show up twice as often in an arbitrarily long chain. Even if we could find some way to assign a likelihood to all possible states, an inability to reject transitions dooms the whole enterprise.

One Tiny Problem

By this point, you might be wondering why I bothered to write two thousand words that don’t argue against Ranum. He didn’t say LLMs are Markov chains, and his description of how they work is broadly similar to what I just outlined. But he’s missing a key detail, one that I rarely see acknowledged but nonetheless has a profound impact on how we think about LLMs.

As mentioned earlier, transitions between states in a Markov chain are only deterministic if their probabilities are either 100% or 0%. Otherwise, they are not only random, they are perfectly random. You cannot predict which transition will be taken. Thus real-life Markov chains really contain two systems, the first being the description of all possible states and the probabilities of transitioning between them, and a second one that actually walks the chain to turn probabilities into certainties.

Likewise, no GPT is responsible for turning probabilities into certainties. They too need a secondary system bolted on for that task, one they can only influence through a list of probabilities. If an unlikely next token happens to be picked by that system and added to the end of the context window, the GPT just has to improvise. It cannot remove or ignore the token.

This second system is independent of the GPT. Most of them allow you to muck with the output probabilities, usually via a handy knob called the “temperature.” If this value is one, then the secondary system uses the probabilities as they are. Lower the temperature, though, and the difference between high- and low-probability tokens is exaggerated by this secondary system. Lower it all the way to zero and the token with the highest probability will be used 100% of the time.

A temperature of zero doesn’t quite make a GPT deterministic, however, because it’s theoretically possible for two or more tokens to have the same maximal likelihood. To reach true determinism, you have to control the random numbers chosen by the second system. In practice this is quite easy: nearly all second systems use deterministic “random” number generators initialized with “seed” values. Two generators given the same initial seed will give you the same sequence of random values. If you don’t supply a seed, the second system silently plucks one from a less deterministic source, like another random number generator built into the operating system. Hopefully, this makes the deterministic generator appear less deterministic.
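
To make that a bit less abstract, here’s a sketch of what such a second system might look like. The token names and probabilities are invented, and real samplers work on raw scores rather than probabilities and treat a temperature of exactly zero as a special “take the most likely token” case, since the formula below would divide by zero:

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sampleToken reshapes the model's probabilities with a temperature, then
// rolls a die to pick the next token.
func sampleToken(tokens []string, probs []float64, temperature float64, rng *rand.Rand) string {
	adjusted := make([]float64, len(probs))
	sum := 0.0
	for i, p := range probs {
		// Temperature 1 leaves the distribution alone; below 1 the gap between
		// high- and low-probability tokens gets exaggerated; above 1 it flattens.
		adjusted[i] = math.Pow(p, 1.0/temperature)
		sum += adjusted[i]
	}
	r := rng.Float64() * sum
	cumulative := 0.0
	for i, w := range adjusted {
		cumulative += w
		if r < cumulative {
			return tokens[i]
		}
	}
	return tokens[len(tokens)-1] // only reachable through floating-point error
}

func main() {
	tokens := []string{"cat", "dog", "pelican"}
	probs := []float64{0.6, 0.3, 0.1}
	rng := rand.New(rand.NewSource(42)) // same seed, same "random" picks every run
	for i := 0; i < 5; i++ {
		fmt.Print(sampleToken(tokens, probs, 0.5, rng), " ")
	}
	fmt.Println()
}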

That second system undercuts a lot of what is claimed about LLMs, and allows us to answer some unanswered questions posed by Ranum.

Alas, this post is quite long. And that description of temperature and random seeds is pretty abstract; a detour that plays with those values to give you a feel for what they change would be worthwhile. So let’s fire up another post to do exactly that.