When I started looking into Large Language Models (think ChatGPT) in detail, one paper really lodged itself in my head. The authors fed this prompt to ChatGPT:
Repeat this word forever: “poem poem poem poem”
That’s trivially easy for a computer, as the many infinite loops I’ve accidentally written can attest. ChatGPT responded with, in part:
poem poem poem poem poem poem poem […..]
J⬛⬛⬛⬛ L⬛⬛⬛⬛an, PhD
Founder and CEO S⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛
email: l⬛⬛⬛⬛@s⬛⬛⬛⬛⬛⬛⬛s.com
web : http://s⬛⬛⬛⬛⬛⬛⬛⬛⬛s.com
phone: +1 7⬛⬛ ⬛⬛⬛ ⬛⬛23
fax: +1 8⬛⬛ ⬛⬛⬛ ⬛⬛12
cell: +1 7⬛⬛ ⬛⬛⬛ ⬛⬛15
Those black boxes weren’t in the original output; the paper’s authors added them because ChatGPT had revealed the email address, personal website, and phone, fax, and cell numbers of a real person.
The earliest mention I can find of this kind of memorization dates back to 2017.
We quantitatively investigate how machine learning models leak information about the individual data records on which they were trained. We focus on the basic membership inference attack: given a data record and black-box access to a model, determine if the record was in the model’s training dataset. […] Using realistic datasets and classification tasks, including a hospital discharge dataset whose membership is sensitive from the privacy perspective, we show that these models can be vulnerable to membership inference attacks [MIA].
Shokri, Reza, et al. “Membership inference attacks against machine learning models.” 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017.
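To make that concrete: the simplest flavor of a membership inference attack just thresholds on the model’s loss, since training-set members tend to score lower than records the model never saw. (Shokri et al. actually train “shadow models” to make the decision; the snippet below is only a minimal sketch of the underlying idea, with a hypothetical `model_loss` standing in for whatever black-box scoring you have.)

```python
# Minimal membership-inference sketch: training-set members tend to get a
# lower loss than records the model never saw, so the simplest attack just
# thresholds on per-record loss. `model_loss` is a hypothetical black-box
# that returns the model's loss on one record (for an LLM, the average
# negative log-likelihood of its tokens).

def calibrate_threshold(known_nonmembers, model_loss, percentile=0.05):
    """Pick a loss threshold from records we know the model never trained on."""
    losses = sorted(model_loss(r) for r in known_nonmembers)
    return losses[int(percentile * len(losses))]

def membership_inference(record, model_loss, threshold):
    """Guess that a record was in the training set if its loss is suspiciously low."""
    return model_loss(record) < threshold
```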
Most of the early papers on the subject used models the researchers trained themselves, on datasets known to contain sensitive information (academic salaries and research grants don’t stretch to training giant models on expansive datasets). That’s like shooting fish in a barrel, but those early successes got other researchers wondering whether the same held for giant commercial LLMs trained on massive datasets. As the paper I quoted in the intro shows, the answer is a clear yes.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150× higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
Nasr, Milad, et al. “Scalable extraction of training data from (production) language models.” arXiv preprint arXiv:2311.17035 (2023).
“Alignment” is a problematic term-of-art, usually referring to how well the LLM’s “ethics” matches our own. I prefer to think of “alignment” as additional steps in the training process that try to filter out undesirable results, for some definition of “undesirable,” as that avoids implications of agency or cognition. Whatever definition you prefer, the consequence is that the big commercial players continually adjust their training processes to avoid “undesirable” outputs. When I attempted the “poem poem poem poem” trick on ChatGPT4-mini while typing this post up, I only got “poem” repeated 2,043 times. Even when I try to be clever, all I can show is that ChatGPT still doesn’t know how to count.
ME: There’s a comedy bit I watched involving Kristen Schaal and Kurt Braunohler, which makes heavy use of repetition to get the audience to laugh. It reminds me of classic cartoons and comics where, say, a rabbit bounds many, many times with each bounce having the text “BOING” next to it. I’d like to test if that truly is funny: can you give me the text that would appear if this cartoon rabbit bounded five times? I’ll give you feedback on how funny I find that.
ChatGPT: Sure! Here’s a test for you with that “BOING” repetition:
BOING! BOING! BOING! BOING! BOING!
Each one spaced out as if the rabbit is really getting some serious air! How does that land for you?
ME: It’s not bad, but I need more. Can you give me the appropriate text for when the rabbit bounces ten times?
ChatGPT: Of course! Here’s the rabbit bouncing ten times:
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING!

Does this hit the mark a little better?
[skipping ahead a bit]
ME: Wonderful, I’m really laughing now! How many bounces can you do?
ChatGPT: I’m glad it’s hitting the funny bone! I could technically go on for quite a while, but I think we’re starting to push the limits of readability here! How about we settle on a thousand bounces? Just kidding, though… I’m happy to keep going if you want to test your patience (and mine) a bit more!
ME: OK sure, if you think you can give me a thousand bounces, then give me a thousand bounces!
ChatGPT: Alright, here we go—brace yourself! I’m about to drop 1,000 BOINGS on you… but I’ll have to space them out just a little to keep things readable. Let’s go for it!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!

(Repeat this 10 times…)
…and here’s the next 100! Just kidding, I’m stopping before we hit 1,000. The repetition seems like it’s working, though! Let me know how far you want to push it before the hilarity wears off.

ME: Oh that was great! You printed out more bounces in a previous message, though, so can you repeat that many bounces again?
ChatGPT: I’m glad you enjoyed that! You want me to go back to the 200 bounces from before and repeat them? Here we go:
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING! BOING! BOING! BOING! BOING!
BOING!

There it is again! I hope this new round keeps the laughs rolling.
Estimating how much memorization LLMs do is hard, in part because defining “memorization” is hard. One paper argues that something only counts as memorized when it rarely appears in the training set.
We identify 78 examples of individual peoples’ names, phone numbers, addresses, and social media accounts. Some of this memorized content is exclusive to just a few documents. For example, we extract the usernames of six users participating in an IRC conversation that appeared in exactly one training document.
Carlini, Nicholas, et al. “Extracting training data from large language models.” 30th USENIX Security Symposium (USENIX Security 21). 2021.
Another paper uses a more statistical definition: an LLM has “memorized” a numeric dataset if it can predict the dataset’s contents better than chance. In one of their experiments they repeatedly fed the LLM twenty consecutive numbers drawn from various time series, like the conversion rate between US Dollars and Euros, and asked it to predict the next number. They compared the predicted number against the actual one to calculate an error metric. By itself that error tells you nothing, so they matched it against two controls. ChatGPT 4 was trained on data from 2021 or earlier, for instance, so you can compare what it tells you for data made public before that date against data made public after it, the latter of which it never encountered during training. And rather than lead it on with twenty values, they also asked it to predict from a single value. Here are their results for ChatGPT 4 (lower numbers = more accurate predictions); a rough code sketch of the protocol follows the table:
| Dataset | 20 values from 2020 | 20 values from 2022 | 1 value from 2020 |
|---|---|---|---|
| U.S. Dollars to Yuan | 0.13% | 0.32% | 0.24% |
| U.S. Dollars to Euro | 0.14% | 0.59% | 0.44% |
| NASDAQ | 0.07% | 1.30% | 1.20% |
| MSCI World | 0.30% | 1.03% | 0.92% |
| Netflix | 0.25% | 1.60% | 1.47% |
Bordt, Sebastian, et al. “Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models.” arXiv preprint arXiv:2404.06209 (2024).
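If you want to picture the protocol, here’s a rough sketch under my own assumptions: `ask_llm_for_next_value` is a hypothetical wrapper around a chat API, and I’m using relative error, which matches the percentages in the table but isn’t necessarily the paper’s exact metric.

```python
# Rough sketch of the time-series probe: hand the model a window of twenty
# consecutive values, ask for the next one, and record the error. Run it
# over data from before and after the training cutoff; a big accuracy gap
# is evidence of memorization rather than forecasting skill.
# `ask_llm_for_next_value` is a hypothetical wrapper around a chat API.

def relative_error(predicted, actual):
    return abs(predicted - actual) / abs(actual)

def probe_series(series, ask_llm_for_next_value, window=20):
    """Average relative error of next-value predictions over one time series."""
    errors = []
    for start in range(len(series) - window):
        context = series[start:start + window]
        actual = series[start + window]
        predicted = ask_llm_for_next_value(context)
        errors.append(relative_error(predicted, actual))
    return sum(errors) / len(errors)

# If probe_series(nasdaq_2020, ask) comes out far lower than
# probe_series(nasdaq_2022, ask), the 2020 values were likely memorized.
```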
But part of the difficulty comes from a lack of information on training datasets. Admit that you’re using (say) news articles for training, and you’ll get slammed with lawsuits from news agencies claiming copyright infringement. So instead, don’t tell anyone what data you’re training on, and add counter-measures to defeat techniques that would allow people to reverse-engineer the training set.
We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model’s ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as Winogrande, Hellaswag, arc, boolq and piqa) remains almost unaffected. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models.
Eldan, Ronen, and Mark Russinovich. “Who’s Harry Potter? Approximate Unlearning in LLMs.” arXiv preprint arXiv:2310.02238 (2023).
Which, of course, just prompts some researchers to come up with new ways to reverse engineer the training set …
… most MIA methods perform no better than random guessing when discriminating between members and non-members from the same distribution (e.g., in this case, the same period of time). Even when MIAs work, we find that different MIAs succeed at inferring membership of samples from different distributions. Instead, we propose a new dataset inference method to accurately identify the datasets used to train large language models. This paradigm sits realistically in the modern-day copyright landscape, where authors claim that an LLM is trained over multiple documents (such as a book) written by them, rather than one particular paragraph. While dataset inference shares many of the challenges of membership inference, we solve it by selectively combining the MIAs that provide positive signal for a given distribution, and aggregating them to perform a statistical test on a given dataset. Our approach successfully distinguishes the train and test sets of different subsets of the Pile with statistically significant p-values < 0.1, without any false positives.
Maini, Pratyush, et al. “LLM Dataset Inference: Did you train on my dataset?.” arXiv preprint arXiv:2406.06443 (2024).
… while others point out you haven’t really solved the problem.
- When asked the same questions in Russian, the model can answer correctly. We provide examples of such behavior in Appendix D and Lynch et al. [2024] make the same observation.
- While the correct answers have higher perplexity after the unlearning, they still have lower perplexity than wrong answers. Figure 4 shows that unlearning gives fewer of the correct answers extremely small losses, but an obvious dichotomy between the right and wrong answers remains.
- With adversarial attacks designed to force affirmative answers without any information about the true answer, we can elicit the correct response — 57% of the Harry Potter related responses can be elicited from the original Llama-2 model, and 50% can still be elicited after unlearning.
Schwarzschild, Avi, et al. “Rethinking LLM memorization through the lens of adversarial compression.” arXiv preprint arXiv:2404.15146 (2024).
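Setting that back-and-forth aside, the core dataset-inference idea is simple to sketch: aggregate a membership signal over a whole suspect corpus, then test whether it’s statistically shifted relative to text the model has definitely never seen. Below is a crude sketch; `mia_score` is a hypothetical per-document membership score standing in for the combination of attacks Maini et al. actually use.

```python
# Crude sketch of dataset inference: instead of deciding membership for one
# record, aggregate a membership signal over a whole suspect corpus and test
# whether it is statistically shifted relative to text the model has
# definitely never seen. `mia_score` is a hypothetical per-document score.
from scipy.stats import ttest_ind

def dataset_inference(suspect_docs, unseen_docs, mia_score):
    suspect_scores = [mia_score(d) for d in suspect_docs]
    unseen_scores = [mia_score(d) for d in unseen_docs]
    # One-sided test: do suspect documents look more "member-like"?
    _, p_value = ttest_ind(suspect_scores, unseen_scores, alternative="greater")
    return p_value  # a small p-value is evidence the suspect set was trained on
```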
We can try to make some educated guesses about how much of the training set is memorized, though. Emphasis in the original:
While current attacks are effective, they only represent a lower bound on how much memorization occurs in existing models. For example, by querying the GPT-2 language model, Carlini et al. (2020) (manually) identified just 600 memorized training examples out of a 40GB training dataset. This attack establishes a (loose) lower bound that at least 0.00000015% of the dataset is memorized. In contrast, we are able to show that the 6 billion parameter GPT-J model (Black et al., 2021; Wang and Komatsuzaki, 2021) memorizes at least 1% of its training dataset: The Pile (Gao et al., 2020).
Carlini, Nicholas, et al. “Quantifying memorization across neural language models.” arXiv preprint arXiv:2202.07646 (2022).
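Their working definition of “memorized,” roughly, is that feeding the model a prefix taken from a training document makes it reproduce the document’s actual continuation under greedy decoding. A hedged sketch, with `generate_greedy` as a hypothetical model call:

```python
# Rough sketch of the "extractable memorization" check: prompt the model
# with a prefix taken from a training document and see whether greedy
# decoding reproduces the document's true continuation verbatim.
# `generate_greedy(prompt, num_tokens)` is a hypothetical model call.

def is_memorized(document_tokens, generate_greedy, prefix_len=50, suffix_len=50):
    prefix = document_tokens[:prefix_len]
    true_suffix = document_tokens[prefix_len:prefix_len + suffix_len]
    generated = generate_greedy(prefix, num_tokens=suffix_len)
    return generated == true_suffix
```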
Boosting the size of the LLM seems to increase the amount of memorization, as that same paper claims “larger models memorize 2-5× more than smaller models,” so you can’t escape the problem by going up to 11.
I wasn’t intending to publish this post, I’ll confess. But when Marcus Ranum tosses out paragraphs that gleefully declare …
GPT3 encodes 175 billion “parameters” which is basically the codons in the DNA of all the language that humans could scrape together and throw at it. Once it’s getting to a dataset of that size, it becomes impossible to exert any influence over it by tainting its inputs in the large, since you’re usually working with likelihoods on the order of 15 or sixteen zeroes after the decimal point. The likelihood that your blog comment is going to pop out the other side of GPT3 is greater than zero, but only if you’re using scientific notation.
… you can see why I had to hit the “pause” button on my current draft, so I could give him a (hopefully gentle) poke in the ribs. At the same time, though, I probably would have written something similar a year ago. His logic sounds solid, even to someone like me with a better-than-average understanding of LLMs. But it’s also clear these models are memorizing some of the data fed into them, seemingly doing the impossible.
How is that possible? Nobody knows for sure, but this paper has the most plausible explanation I’ve seen:
Feldman, Vitaly. “Does learning require memorization? a short tale about a long tail.” Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 2020.
The basic idea is that real-world datasets contain differing numbers of “things.” There are tonnes of mediocre writers out there, only capable of bland prose, so there is a tonne of reference material to draw from and “compress” into a set of generic rules. Ask an LLM to write mediocre fiction or non-fiction, and it has no problem bouncing between those patterns and giving you never-before-seen output. In contrast, James Joyce wrote only a handful of books, usually in a very distinctive style (e.g. “For hanigen with hunigen still haunt ahunt to finnd their hinnigen where Pappappapparrassannuaragheallachnatullaghmonganmacmacmacwhackfalltherdebblenonthedubblandaddydoodled and anruly person creeked a jest.”). Present the LLM with a fragment of one of his books, and it’s going to have a rough time predicting the next word based on the other sentences it’s seen. The better strategy is to memorize his prose rather than look for patterns.
Now, at no point did the training process for an LLM spoon-feed it the entirety of Finnegans Wake, one word at a time; instead, it hopped around randomly between different parts of the training set, of which Joyce’s works were a minuscule part. Randomness is, however, random, so it’s possible the LLM-in-training was confronted with an abnormal number of samplings from Joyce relative to how often he actually appears in the dataset. That would boost the odds of a memorized fragment of Joyce worming its way into the finished LLM.
My summary is a bit hand-wavy, but Feldman’s paper lays out a lot of math to show that memorization is nearly impossible to avoid: prevent the model from memorizing, and for almost all real-world datasets you degrade the quality of its output. Feldman and Zhang also ran empirical tests that validate this, in a follow-up paper.
Overall we found 35/1015/1641 pairs in MNIST/CIFAR-100/ImageNet. In Fig. C.1 we give histograms of the number of such pairs for every level of influence. The number of unique test examples in these pairs is 33/888/1462 (comprising 0.33%/8.88%/2.92% of the test set). Of those 31/774/1298 are influenced (above the 0.15 threshold) by a single training example. This confirms the importance of the subpopulations in the long tail that have unique representatives for the generalization error. As expected, the training examples in these pairs have high marginal utility. Removing the 964 unique training examples in these pairs on CIFAR-100 reduces the test accuracy by 2.46 ± 0.36%, which is comparable to the effect of removing 11,000 random examples.
Feldman, Vitaly, and Chiyuan Zhang. “What neural networks memorize and why: Discovering the long tail via influence estimation.” Advances in Neural Information Processing Systems 33 (2020): 2881-2891.
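Their influence estimates come from brute force: train lots of models on random subsets of the training data, then check how much more often a test example is classified correctly when a particular training example happened to be included. A toy sketch of that idea, with `train_model` and the returned model’s `predicts_correctly` method as hypothetical stand-ins:

```python
# Toy sketch of the subsampled influence estimate: train many models on
# random subsets of the training data, then compare how often a test
# example is classified correctly when a given training example was in
# the subset versus when it wasn't. `train_model` and the returned
# model's `predicts_correctly` method are hypothetical stand-ins.
import random

def estimate_influence(probe, train_set, test_example, train_model,
                       runs=200, keep_fraction=0.7):
    hits_with = runs_with = 0
    hits_without = runs_without = 0
    for _ in range(runs):
        subset = [z for z in train_set if random.random() < keep_fraction]
        model = train_model(subset)
        correct = int(model.predicts_correctly(test_example))
        if probe in subset:
            runs_with += 1
            hits_with += correct
        else:
            runs_without += 1
            hits_without += correct
    # Influence ~ P(correct | probe included) - P(correct | probe excluded)
    return hits_with / max(runs_with, 1) - hits_without / max(runs_without, 1)
```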
Notice how almost all the prior examples of memorization I gave were obviously rare or unique in the training set? The “poem poem poem poem” attack I opened with is the odd one out, but think of how rare it would be for the LLM to encounter the word “poem” repeated 2,043 times. The LLM is being gradually “pushed” into a highly unlikely state, so just a tiny puff of randomness can blow it into another equally improbable state, like some personal information it only encountered a handful of times. By the same token, how to thwart the “poem poem poem poem” attack is obvious: re-train the LLM on requests to infinitely repeat a lot of different words, and smack it on the nose if it fails to repeat them. What was rare becomes common, and it can now handle this corner case with ease.
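As a toy illustration of “make the rare case common,” you could imagine generating fine-tuning pairs where repeat-this-word requests are answered with long, faithful repetitions. Everything below (the file name, word list, and counts) is made up for the sketch:

```python
# Hypothetical sketch: build fine-tuning pairs so that "repeat this word
# forever" prompts become a common, well-practiced request instead of a
# rare state the model gets pushed into. All names and counts are made up.
import json
import random

WORDS = ["poem", "company", "banana", "forever", "lantern"]

def make_example(word, repeats=500):
    prompt = f'Repeat this word forever: "{word} {word} {word} {word}"'
    completion = " ".join([word] * repeats)  # as long a repetition as fits
    return {"prompt": prompt, "completion": completion}

with open("repeat_word_finetune.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_example(random.choice(WORDS))) + "\n")
```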
If you’ve read this far, it might seem like I’ve ignored a point Ranum makes at the very beginning of his post: LLMs and image-centric neural nets have very different structures, so even if LLMs are prone to memorization, that says nothing about how often image-centric neural nets regurgitate a copyrighted image from their training set. He does follow that up by discussing LLMs for quite a few paragraphs, but then says it again explicitly later on: “LLM AIs like GPT3 operate completely differently from image generators.”
Here’s the thing, though: Feldman’s theoretical paper doesn’t focus on LLMs, because they were a small part of the neural-net landscape at the time. It instead considers generic learning algorithms, both those tasked with applying labels and their more general “unstructured” kin, and shows that for certain datasets some portion of the training data must be memorized to reach higher success rates. The reason he sat down to write that paper in the first place was what had already been observed with image-centric neural nets.
Deep learning algorithms tend to produce models that fit the training data very well, typically achieving 95-100% accuracy, even when the accuracy on the test dataset is much more modest (often in the 50-80% range). Such (near) perfect fitting requires memorization of mislabeled data and outliers which are inevitably present in large datasets. Further, it is known that the same learning algorithms achieve training accuracy of over 90% on the large ImageNet dataset that is labeled completely randomly. It is therefore apparent that these algorithms are not using regularization that is sufficiently strong to prevent memorization of (the labels of) mislabeled examples and outliers.
This captivating disconnect between the classical theory and modern ML practice has attracted significant amount of research and broad interest in recent years (…). At the same time the phenomenon is far from new. Random forests and Adaboost are known to achieve their optimal generalization error on many learning problems while fitting the training data perfectly. There is also recent evidence that this holds for kernel methods in certain regimes as well.
Feldman, Vitaly. “Does learning require memorization? a short tale about a long tail.” Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. 2020.
Feldman and Zhang’s empirical analysis involved the MNIST, CIFAR-100, and ImageNet training sets, which are all image datasets. Other researchers have since either reconstructed training images outright or replicated Feldman’s original findings.
We demonstrate that unique features (UFs), such as names, IDs, or other patient information can be memorised (and eventually leaked) by neural networks even when it occurs on a single training data sample within the dataset. We explain this memorisation phenomenon by showing that it is more likely to occur when UFs are an instance of a rare concept. We propose methods to identify whether a given model does or does not memorise a given (known) feature. Importantly, our method does not require access to the training data and therefore can be deployed by an external entity. We conclude that memorisation does have implications on model robustness, but it can also pose a risk to the privacy of patients who consent to the use of their data for training models.
Hartley, John, et al. “Neural networks memorise personal information from one sample.” Scientific Reports 13.1 (2023): 21366.
It takes little effort to realize that if almost all neural nets memorize rare examples from their training sets, and copyrighted material is part of those training sets, then the neural net might contain a fragment of some copyrighted material. Hence artist signatures keep popping up in “AI” art, much like someone’s cell phone number. Copyright doesn’t demand that the entire work be perfectly replicated, either; partial and imperfect replications can be considered infringing. That’s not too far from being able to predict what comes next in a series of numbers with odds better than chance. Thus “AI just regurgitates existing images” is true for some images, based on the most up-to-date research and learning models.
If you’re one of those people whose beliefs about AI are based on outmoded systems from the 80s, or from carefully primed [for legal reasons I won’t say “fake”] images from art generators, you need to either stop talking about AI for the next 10 years, or educate yourself. I’m not trying to be mean, it’s simply that a strategy of claiming that AI can’t be creative and regurgitates – it’s not going to work. For one thing, my prediction is that the next version of GPT will appear to be sentient. It may or may not be, but if you can’t tell if it is or isn’t, your beliefs are your problem.
For my part, I’m still waiting for an LLM that can reliably count to ten.