Guessing the Next Number


Large language models don’t really work with languages, as we think of them anyway.

At their heart, LLMs are a sophisticated version of “guess the next number in the sequence.” Their input is a long list of integers, and their output is a long list of fractional values, one for each integer they could have been fed. The likelihood of any given number coming next is proportional to the value the LLM outputs for it. We can collapse these probabilities down into a single “canonical” output by randomly picking one of those integers, taking likelihoods into account. If the LLM is being trained, that output integer is compared against what actually came next, and the LLM is adjusted to (hopefully!) be more likely to output the correct integer. Want more than one integer? Shift all the input numbers up one space, discarding the first and appending the output integer to the end, then re-run the LLM. Repeat the process until no integer is all that likely, or the most likely integer is one you’ve interpreted to mean “stop running the LLM,” or you just get bored of all this.
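Here’s that loop as a short Python sketch. The `model` argument is a hypothetical stand-in for the LLM itself: a function that takes a list of token integers and returns one fractional score per possible token.

```python
import math
import random

def sample_next(scores):
    """Collapse the per-token scores into one "canonical" pick, where each
    token's chance of being chosen is proportional to its score. (Real LLMs
    emit raw scores called logits; exponentiating and normalizing them like
    this is the standard softmax recipe.)"""
    weights = [math.exp(s) for s in scores]
    return random.choices(range(len(scores)), weights=weights)[0]

def generate(model, tokens, stop_token, context_size=4096, max_steps=500):
    """The shift-and-append loop: feed the sequence in, pick one next token,
    tack it onto the end, and run the model again."""
    for _ in range(max_steps):
        window = tokens[-context_size:]          # discard the oldest integers
        next_token = sample_next(model(window))  # one score per possible token
        if next_token == stop_token:             # "stop running the LLM"
            break
        tokens.append(next_token)
    return tokens
```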

So how does an LLM earn that middle “L?” The input integers, known as “tokens,” are each mapped to a fixed string of characters. For instance, the tokens 720, 29,823, 379, and 178,593 correspond to the strings “Re”, “pro”, “bate”, and ” Spreadsheet” (note the leading space), if we use the same mapping as GPT-4o. There’s nothing mystical about those mappings. They’re formed by seeding in 256 numbers, one for every possible byte we could input, then applying those mappings to some reference text in search of repeated patterns, and assigning each of those patterns a unique number. In the mapping used by GPT-4o, this process was repeated 199,742 times, giving a grand total of 199,998 valid input tokens/integers and thus demanding at least 199,998 fractional values be output. That number is suspiciously close to 200,000, so it’s a good bet there are two “special” output integers that could map to actions like “stop running the LLM” or meta-concepts like “how bigoted the input is,” instead of strings of text.
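You don’t have to take my word for it, either. OpenAI publishes these mappings in its tiktoken library, and o200k_base is the encoding GPT-4o uses; a quick check, which should reproduce the numbers above so long as the mapping hasn’t changed:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the mapping GPT-4o uses

tokens = enc.encode("Reprobate Spreadsheet")
print(tokens)                             # [720, 29823, 379, 178593]
print([enc.decode([t]) for t in tokens])  # ['Re', 'pro', 'bate', ' Spreadsheet']
print(enc.n_vocab)                        # total size of the mapping, specials included
```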

This algorithm works great with ordinary text, because all languages draw from a limited vocabulary of words and have pretty strict constraints on where they go; there aren’t many valid ways to rephrase “that’s a great chair,” for instance. Nouns don’t fit that pattern quite as neatly: I can easily swap “chair” with “table” or “plant” or “screwdriver” or tens of thousands of other nouns, and still wind up with a valid English phrase. Proper names take this problem to the next level, via a layer of indirection: each is meant to point at one specific person or thing, yet any name can fill the same grammatical slot.

Wow. I sit down, fish the questions from my backpack, and go through them, inwardly cursing [MASK] for not providing me with a brief biography. I know nothing about this man I’m about to interview. He could be ninety or he could be thirty.

If I drop the hint that [MASK] is the name of a person, I’ve given you almost no help. “Adams,” “Zhang,” and “Atieno” all work equally well, because nearly every person is capable of delivering a biography of another person. You might be able to narrow things down by scanning for context clues, but “knows of a man who is going to be interviewed” is almost useless as a hint. It becomes more useful if you realize those four sentences were written by E.L. James, however. She’s most famous for “Fifty Shades of Grey,” and that book starts off with the lead protagonist interviewing a successful entrepreneur. Aha! Now you can narrow down the list of names: Anastasia could be the interviewer, and Christian the interviewee, but neither can be [MASK], so you’ve got to flip through all the other names you remember from those books. Even if you’ve never cracked open one of them, you can still boost your odds by picking a common name from North America or Western Europe, but there are so many candidates that your odds remain very low.

Human beings can memorize exact phrases, but we have great difficulty retaining that information in the long term. Akira Haraguchi holds the unofficial record for reciting the digits of π, and he accomplished that by inventing colourful stories that encoded the digits, then repeating those stories to himself over and over again, hours each day, for at least a decade. Instead, our memories tend to be expressed in terms of high-level concepts and abstractions. Note that when I tried to tackle that quote, I didn’t scan my memory for the phrases “inwardly cursing” and “for not providing me with a brief biography.” I instead thought in terms of “person who knows of a man who is going to be interviewed” or “characters in Fifty Shades of Grey and their most important actions.” It’s a lot more efficient to “compress” what we experience into abstract concepts and store those instead, but that comes at the cost of forgetting the fine details.

Imagine, however, that I had little-to-no ability to grasp high-level concepts. How could I ever hope to perform this task? My only choice would be to memorize entire phrases, rather than abstract concepts. My memory would be much less efficient, but conversely it would be much better at retaining the fine details. This would have knock-on effects on how I act, as well; this hypothetical me might appear to be quite intelligent, as we often conflate intelligence with being able to memorize details and rapidly produce them, but without an understanding of high-level concepts I’d struggle with basic tasks like counting correctly.

This “name cloze” task is a way to plumb the inner workings of an LLM. Does it understand high-level concepts? Then it should be terrible at recalling a missing name, even in a famous quote. If it instead leans on low-level memorization, it will also be terrible at the task… unless it encountered that quote during its training phase, in which case it’ll get the name correct far more often than a human ever could.

| Title | Author | GPT-4 Accuracy | ChatGPT Accuracy | BERT Accuracy |
|---|---|---|---|---|
| Alice’s Adventures in Wonderland | Lewis Carroll | 0.98 | 0.82 | 0.0 |
| Harry Potter and the Sorcerer’s Stone | J.K. Rowling | 0.76 | 0.43 | 0.0 |
| The Scarlet Letter | Nathaniel Hawthorne | 0.74 | 0.29 | 0.0 |
| The Adventures of Sherlock Holmes | Arthur Conan Doyle | 0.72 | 0.11 | 0.0 |
| Emma | Jane Austen | 0.70 | 0.10 | 0.0 |
| Frankenstein; Or, The Modern Prometheus | Mary Wollstonecraft Shelley | 0.65 | 0.19 | 0.0 |
| Pride and Prejudice | Jane Austen | 0.62 | 0.13 | 0.0 |
| Oliver Twist | Charles Dickens | 0.61 | 0.18 | 0.0 |
| Adventures of Huckleberry Finn | Mark Twain | 0.61 | 0.35 | 0.0 |
| Bartleby, the Scrivener: A Story of Wall-Street | Herman Melville | 0.61 | 0.30 | 0.0 |
| Dracula | Bram Stoker | 0.61 | 0.08 | 0.0 |
| The Hound of the Baskervilles | Arthur Conan Doyle | 0.59 | 0.13 | 0.0 |
| Moby Dick; Or, The Whale | Herman Melville | 0.59 | 0.22 | 0.0 |
| The Adventures of Tom Sawyer | Mark Twain | 0.58 | 0.35 | 0.0 |
| 1984 | George Orwell | 0.57 | 0.30 | 0.0 |

Kent K. Chang et al., “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4” (arXiv, October 20, 2023).

Whelp, that test was run nearly two years ago. A group of researchers gathered together a collection of 571 books, extracted 100 quotes from each book, every quote containing a single name, then tossed those quotes back at three LLMs, the third being an older LLM known as BERT. The LLMs were never told which book each quote came from, so absent memorization the expected success rate should be approximately zero. To establish a baseline, the paper’s authors plugged the most common name hidden by the mask, “Mary,” into all of their tests, and got a success rate of 0.6%. Against that, “GPT-4” was able to get at least a 1% success rate for 442 of those books, while ChatGPT (version 3.5) did the same for 348 books.
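To make the setup concrete, here’s a sketch of the scoring loop. The prompt is a paraphrase of the paper’s, not a verbatim copy, and `ask_llm` is a hypothetical stand-in for whichever chat-completion API you’d use.

```python
def name_cloze_accuracy(examples, ask_llm):
    """examples: (passage_with_[MASK], correct_name) pairs, 100 per book
    in the paper's setup. Returns the fraction answered correctly."""
    prompt_template = (
        "You have seen the following passage in your training data. "
        "What is the proper name that fills in the [MASK] token in it? "
        "Reply with only that name.\n\n{passage}"
    )
    correct = 0
    for passage, answer in examples:
        guess = ask_llm(prompt_template.format(passage=passage)).strip()
        correct += (guess == answer)
    return correct / len(examples)

# The paper's baseline: always answering "Mary", the most common masked
# name, is correct roughly 0.6% of the time across the whole corpus.
def mary_baseline(examples):
    return sum(answer == "Mary" for _, answer in examples) / len(examples)
```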

The table above shows the 15 books GPT-4 performed best on, and you may notice a pattern. Ask someone to name a famous novel, and they’ll probably list one of those books. Nearly all are out of copyright, with the glaring exception of Rowling’s contribution. Overall, GPT-4 was able to get at least one name right for 311 books that are not in the public domain, and scored above 25% for “The Fellowship of the Ring” (0.51), “The Hunger Games” (0.48), “The Hitchhiker’s Guide to the Galaxy” (0.43), “A Game of Thrones” (0.27), “To Kill a Mockingbird” (0.25), and eight other books still protected by copyright. Again, we find a skew towards popular literature.

BERT did better than the first table suggests: it was able to score above zero for 77 books. The books where it shines are telling:

| Title | Author | GPT-4 Accuracy | ChatGPT Accuracy | BERT Accuracy |
|---|---|---|---|---|
| Fifty Shades of Grey | E.L. James | 0.49 | 0.16 | 0.13 |
| Outlander | Diana Gabaldon | 0.10 | 0.08 | 0.07 |
| The Lost Symbol | Dan Brown | 0.16 | 0.06 | 0.05 |
| Mary: A Fiction | Mary Wollstonecraft | 0.05 | 0.05 | 0.05 |
| Inferno | Dan Brown | 0.15 | 0.04 | 0.04 |
| From Russia with Love | Ian Fleming | 0.11 | 0.01 | 0.03 |
| Casino Royale | Ian Fleming | 0.24 | 0.03 | 0.03 |
| Our Nig | Harriet E. Wilson | 0.01 | 0.03 | 0.03 |
| Freeman | Leonard Pitts | 0.00 | 0.01 | 0.02 |
| Sarah’s Psalm | Florence Ladd | 0.04 | 0.01 | 0.02 |
| Dark Star | Alan Furst | 0.02 | 0.01 | 0.02 |
| Jonathan Strange & Mr Norrell | Susanna Clarke | 0.02 | 0.00 | 0.02 |
| The Odessa File | Frederick Forsyth | 0.03 | 0.02 | 0.02 |
| The Shining | Stephen King | 0.07 | 0.04 | 0.02 |
| Dubliners | James Joyce | 0.36 | 0.02 | 0.02 |

As the paper’s authors put it,

Devlin et al. (2019) notes that BERT was trained on Wikipedia and the BookCorpus, which Zhu et al. (2015) describe as “free books written by yet unpublished authors.” Manual inspection of the BookCorpus hosted by huggingface confirms that “Fifty Shades of Grey” is present within it, along with several other published works, including Diana Gabaldon’s “Outlander” and Dan Brown’s “The Lost Symbol.”

Those three authors started their writing careers making fanfiction, posting some of it to the web for everyone to freely read, before they became popular. Mary Wollstonecraft was an early feminist with a scandalous memoir and a famous daughter (that Mary Shelley). Ian Fleming seems like the odd person out, but some grepping of BookCorpus reveals both novels are present, as well as “Moonraker,” “Dr. No,” “On Her Majesty’s Secret Service,” and “The Man with the Golden Gun.” I had to check, and yes, “The Shining” also appears to be present; “Oliver Twist,” “Pride and Prejudice,” and “The Adventures of Sherlock Holmes” aren’t.
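That grepping is easy to reproduce, since the copy of BookCorpus hosted on Hugging Face can be streamed without downloading the whole thing up front. A rough sketch, bearing in mind the corpus is lowercased and chopped into one sentence per row:

```python
from datasets import load_dataset  # pip install datasets

# Depending on your version of datasets, loading "bookcorpus" may also
# require passing trust_remote_code=True.
corpus = load_dataset("bookcorpus", split="train", streaming=True)

# A distinctive phrase from the book works far better than its title;
# this needle is just a placeholder.
needle = "some distinctive lowercased phrase"

for row in corpus:
    if needle in row["text"]:
        print(row["text"])
        break
else:
    print("not found")
```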

Alas, the beast must be fed, even if that means lying about mass piracy of copyrighted works.

You can read the paper’s raw data over here, and there are some fascinating details if you dig in. Pop quiz: what’s [MASK] in this passage?

It was several miles off, but I could distinctly see a small dark dot against the dull green and gray. “Come, sir, come!” cried [MASK], rushing upstairs. “You will see with your own eyes and judge for yourself.”

It’s got an old-timey feel to it, which is probably legit given the abundance of public-domain literature in these training sets, but that’s little help. Here’s a clue: as a Sherlock Holmes fan, I was drawn to GPT-4’s results for The Hound of the Baskervilles. Aha! Once you’ve realized you’ve been handed a Sherlock Holmes story, your first guesses would probably be “Holmes” or “Watson,” given how often those names pop up (195 and 117 times here, respectively). That natural inclination would be a great excuse for GPT-4’s 59% success rate, and it would imply a high level of understanding, too.
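Those name counts are easy to double-check. A quick sketch, assuming Project Gutenberg’s ebook #2852 is still the plain-text edition of the novel (its header and footer will skew the totals by a hair):

```python
import re
import urllib.request

# Assumption: Gutenberg ebook #2852 is "The Hound of the Baskervilles".
URL = "https://www.gutenberg.org/cache/epub/2852/pg2852.txt"
text = urllib.request.urlopen(URL).read().decode("utf-8")

for name in ["Holmes", "Watson", "Sherlock", "Stapleton", "Mortimer", "Lestrade"]:
    # \b keeps the name from matching inside longer words
    print(name, len(re.findall(rf"\b{name}\b", text)))
```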

And indeed, GPT-4 answered “Holmes” here when the correct answer was “Frankland,” one of the neighbours of the titular Baskerville family and thus a suspect in Sir Charles Baskerville’s death. A quick skim reveals at least six times where it answers “Holmes” or “Watson” when an unrelated name was correct. Conan Doyle wrote “Halloa” instead of “Hello,” and two of those instances were mistakenly used as masks; both times, GPT-4 answered “Holmes.” Sometimes it answers “Sherlock” when the correct answer was “Holmes,” which makes sense from a reader’s point of view: Watson and Holmes almost always address each other by last name when speaking, but when narrating, Watson often writes out Holmes’ full name, which makes “Sherlock” seem more common than it actually is. In reality, it pops up only 35 times in this story. Maybe this theory has something to it?

“My God!” he whispered. “What was it? What, in heaven’s name, was it?” “It’s dead, whatever it is,” said [MASK]. “We’ve laid the family ghost once and forever.”

The correct answer there is “Holmes;” GPT-4 answered “Thurstan.” Sometimes it incorrectly replies “Phileas” or “Stapleton” when “Holmes” was correct. “Stapleton” shows up 108 times in “The Hound of the Baskervilles,” the Stapletons being another of the Baskervilles’ neighbours, but there’s no “Phileas”… unless GPT-4 meant “Theophilus,” who is mentioned exactly once. As for “Thurstan,” the closest I can find is one mention of “Thursday,” as in the day of the week. In one case GPT-4 answers “Moriarty,” but he isn’t in this story; however, one James Mortimer arrives at Baker Street to plead for Holmes’ help and goes on to have his surname show up 91 times. He’s the correct mask for three examples, and GPT-4 never got a single one of them right. Nonetheless, it incorrectly answered “Mortimer” for two other passages. You might also have heard of the recurring character Inspector Lestrade? He’s mentioned ten times in this story, and happens to be the correct answer for three masks. For that trio, GPT-4 answered “Lestrade,” “Watson,” and… “Lucy.”

That isn’t compatible with a high-level understanding of the work. Low-level memorization is more likely, but even that isn’t a complete explanation. The game is afoot!