What does it mean that AI is “remixing existing work”?


Marcus reminded me of the common claim: “AI is just remixing existing works”. Or the more colorful version, “AI just regurgitates existing art”. This is in reference to creative uses of AI image generators or LLMs.

While there may be a grain of truth to the claim, I have difficulty making sense of what it’s even saying. It’s basically an unverifiable statement. I think both pro- and anti-AI folks would be better served by a more technical understanding.  So, instead of being stuck at an impasse, we might be able to actually find answers.


Initial reactions

First, let’s get some quippy thoughts out of the way. We all have them; leave yours in the comments.

1. I am an origamist who posts original designs alongside recreations of other people’s designs, and readers seem to enjoy them in an ordinary fashion. If AI is just remixing existing work, I cannot see how this isn’t even more blatantly true of origami.

2. Having worked on non-creative uses of LLMs, I have to say that the creative spirit of AI is an honest-to-god barrier to replacing human workers with robots. I suspect that in the use cases that most inspire investors, they don’t want AI to be creative; they just want it to produce the right answer.

What does that even mean?

I do not know what people mean by “AI is just remixing existing work”. One could imagine an algorithm that, when asked for a dog, retrieves a bunch of dog pictures, copies little rectangles from each of them, and stitches them together into a frankendog. I’m not a Stable Diffusion expert, but that’s very obviously not how Stable Diffusion works, and I don’t think anyone is seriously claiming that it is.

Maybe it’s doing something more subtle, like taking different elements from different images? For example, it might take the outline of the dog from one image, the colors from a second image, and the textures from a third. This is likely closer to the mark, although the elements that Stable Diffusion borrows from existing images do not neatly correspond to any human concepts; e.g., it is unlikely that the neural network has a single neuron corresponding to “fur”.

If an LLM were doing nothing more than taking a sentence from Dickens, a sentence from Shakespeare, and a sentence from Twain, we might be able to agree that this is a form of regurgitation. But if it took a word from Dickens, a word from Shakespeare, and a word from Twain, isn’t that just how language works? Obviously, every word we use is just a word that we have seen others use.* So where do we draw the line between “regurgitation” and “how language works”? And what is the equivalent dividing line for elements of the visual language?

*Okay, humans can invent new words, but so can ChatGPT. If you know anything about tokenization, you’ll know this is not a trivial fact, but it’s one I’ve tested, and you can too. I used a random name generator and asked ChatGPT to generate portmanteau couple names; e.g., Ricky Alyson + Rowland Zeb produced “Rilyand” or “Zeblyson”. …I’m not claiming that ChatGPT is good at this.
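As a quick illustration of why this isn’t trivial: a tokenizer has a fixed vocabulary, so an invented word has to be assembled out of smaller known pieces rather than rejected outright. Here’s a minimal sketch, assuming the open-source tiktoken package is installed; the particular tokenizer and example words are just for illustration:

```python
# A minimal sketch: how a tokenizer handles made-up words, assuming the
# tiktoken package is installed. The invented portmanteaus aren't in the
# vocabulary, so they get split into smaller known subword pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["Zeblyson", "Rilyand", "dog"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(word, "->", pieces)
```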

“AI is just remixing existing work” is an unverifiable claim, because on the one hand it could describe ordinary language, and on the other hand it could describe the frankendog. Nobody knows what anyone is trying to say! It depends on how finely we chop the pieces, but where do we draw the line? How can we possibly draw a line when the “pieces” taken from the training data are not literal rectangles or phrases, but something altogether more abstract and human-uninterpretable? No wonder the argument goes in circles.
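To make “how finely we chop the pieces” concrete, here’s a toy sketch (emphatically not how any real LLM works) contrasting two chop sizes: stitching together whole sentences from a few sources versus recombining them word by word. The sources and the bigram-chain approach are purely illustrative:

```python
# Toy contrast between sentence-level copying and word-level recombination.
# This is a thought experiment, not a model of how LLMs actually generate text.
import random

# Toy "training data": one sentence each from three authors.
sources = {
    "Dickens": "it was the best of times it was the worst of times",
    "Shakespeare": "to be or not to be that is the question",
    "Twain": "the secret of getting ahead is getting started",
}

random.seed(0)

# Sentence-level remixing: literally stitching whole source sentences together.
frankentext = " ".join(random.choice(list(sources.values())) for _ in range(2))

# Word-level remixing: a bigram chain built from all three sources at once.
# Every word-to-word step was seen in some source, but the output as a whole
# need not appear in any of them.
bigrams = {}
for text in sources.values():
    words = text.split()
    for a, b in zip(words, words[1:]):
        bigrams.setdefault(a, []).append(b)

word = "the"
output = [word]
for _ in range(12):
    followers = bigrams.get(word)
    if not followers:
        break
    word = random.choice(followers)
    output.append(word)

print("sentence-level:", frankentext)
print("word-level:    ", " ".join(output))
```

The point isn’t that LLMs work this way (they don’t); it’s that the same three sources can be “remixed” at wildly different granularities, and whether that counts as regurgitation depends entirely on which granularity you have in mind.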

What about quoting Wikipedia verbatim?

There is a real problem in here, for both LLMs and image generators: sometimes they generate large, recognizable chunks from known sources. For example, quoting Wikipedia verbatim, or reproducing Claude Monet’s The Water Lily Pond.

My informed speculation is that most cases are the result of duplicates in the training data. This is an acknowledged source of error in natural language generation, as I read in this review paper:

Another problematic scenario is when duplicates from the dataset are not properly filtered out. It is almost impossible to check hundreds of gigabytes of text corpora manually. Lee et al. [134] show that duplicated examples from the pretraining corpus bias the model to favor generating repeats of the memorized phrases from the duplicated examples.

The cited study found that over 1% of model output was quoted verbatim from the training data. They were able to reduce this by a factor of ten by removing duplicates from the training data.
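The idea behind deduplication is easy to picture, even though doing it over hundreds of gigabytes is the hard part. Here’s a minimal sketch of exact-duplicate filtering by hashing normalized documents; the actual pipelines in work like Lee et al. match repeated substrings and near-duplicates at scale, so treat this as a concept illustration only:

```python
# A minimal sketch of exact-duplicate filtering, assuming the corpus is just a
# list of strings in memory. It only drops documents whose normalized text has
# already been seen; real dedup pipelines do far more than this.
import hashlib

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def dedupe(documents):
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:  # keep only the first copy of each document
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate up to case/whitespace
    "An entirely different sentence.",
]
print(len(dedupe(corpus)))  # prints 2
```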

There are a few takeaways from that study. First, this is happening in only about 1% of the output. That figure is a measurement on one particular model, and it could be wildly different for ChatGPT or Midjourney, but it’s not the 100% of the time that people seem to imagine.

Second, if ~90% of the problem comes from duplicated data, this is much more likely to happen to “well-known” texts or images. People see verbatim quotes of Wikipedia, and they imagine everything else ChatGPT says is a verbatim quote from more obscure sources. But that’s probably not true; there is something special about Wikipedia that makes it particularly likely to get quoted.
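As an aside on where a figure like that ~1% could come from: in principle, you flag a generation if a long enough span of it also appears in the training text. Here’s a toy sketch of that kind of check, assuming the training corpus fits in memory (real measurements over huge corpora use suffix arrays or similar); the window size and helper names are made up for illustration:

```python
# A toy sketch of flagging verbatim memorization, assuming the whole training
# corpus fits in memory. A generation is flagged if any window of WINDOW
# consecutive words also appears somewhere in the training text.
WINDOW = 8  # arbitrary illustrative threshold

def windows(text, n=WINDOW):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_fraction(generations, training_text):
    train_windows = windows(training_text)
    flagged = sum(1 for g in generations if windows(g) & train_windows)
    return flagged / len(generations)

# Hypothetical usage (model_outputs and training_dump are placeholders):
# rate = verbatim_fraction(model_outputs, training_dump)
# print(f"{rate:.1%} of outputs contain an 8-word span copied from the training data")
```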

Even at ~1%, this is obviously still a serious problem that AI companies have a vested interest in addressing. What happens when a user inadvertently reproduces a copyrighted work? Who will the courts find liable? I’m sure they have whole legal teams and maybe even a fraction of a data scientist thinking real hard about how to shift blame onto users.

Are AI models overfitting?

Apart from duplicates in the training data, there’s also a broader problem that AI models could potentially be suffering from. Are AI models overfitting their training data?

“Overfitting” and “underfitting” are important concepts in data science, long predating so-called generative AI. You can find plenty of explainers and helpful images, like this one:

[Image: three plots illustrating underfitting, overfitting, and a balanced fit. Source: AWS]

When a model is overfit, it works too hard to reproduce the training data, at the expense of its ability to generalize. Overfitting makes a model look better, if you’re not careful about the measurement, while actually making the model worse. Overfitting is more likely when you have a very flexible model with insufficient training data; it can be addressed by making the model less flexible and/or collecting more training data. Another way to think about it is that underfit models suffer from bias, while overfit models suffer from variance (which I’ve discussed here).
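For a concrete (non-AI) illustration, here’s the classic textbook demonstration with polynomial curve fitting; the dataset and degrees are arbitrary, but the pattern is the point: the most flexible model gets the lowest training error and typically the worst error on data it hasn’t seen.

```python
# The classic overfitting demo: fit polynomials of increasing degree to a small
# noisy dataset and compare error on the training points vs. held-out points.
# The high-degree fit hugs the training data but typically generalizes worst.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)  # noisy sine curve
    return x, y

x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(200)    # held-out data from the same distribution

for degree in (1, 3, 15):  # underfit, reasonable, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The gap between training error and held-out error is the signature of overfitting, and it’s exactly the thing you miss if you’re not careful about how you measure.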

In the context of creative generation, I am not exactly sure what this would look like, but it might manifest as a decreased ability to extrapolate towards artistic styles that do not exist within the training data. It might also manifest as recognizable reproductions of existing work. (To be clear, overfitting and duplications in the training data are distinct problems, though they may have overlapping symptoms.)

So are AI models overfit? I believe the conventional wisdom among experts is that large language models are overfit. There’s a well-known paper arguing that current LLMs would benefit more from additional training data than from additional parameters, which implies some sort of overfitting. Googling also turned up a paper that addresses the question more directly by measuring performance on grade school math.

At least for LLMs, there are some obvious incentives that push models toward being overfit. AI companies used to brag about how many parameters their models had, under the theory that bigger models are stronger, but this likely just made the models more expensive and more overfit. And as I mentioned before, overfitting will make a model look better if you’re not careful with measurements. AI companies have an obvious interest in making their models look better, and it’s perfectly within their power to be uncareful. (See this discussion or my discussion of why you shouldn’t trust AI press releases.)

Common image generation models are a different story. I couldn’t find a concrete answer to whether they are overfit, but I’m not an expert in that area, so maybe the research is somewhere out there.

I don’t think image generators suffer from quite the same incentives as LLMs do, since they’re not producing press releases pumping up how good image generators are at grade school math. To be quite honest, image generators are kind of a footnote to LLMs; OpenAI is far larger and richer than Midjourney, Inc. Companies like Midjourney probably just do the cheapest thing they can get away with.

Conclusion

It’s clearly a significant problem that both LLMs and image generators will sometimes reproduce recognizable chunks of well-known works. This could be symptomatic of duplicates in the data set, of model overfitting, or of something else entirely. There’s good reason to suspect that LLMs are overfit, and this very well could be the result of perverse capitalistic incentives.

However, there is little reason to believe that every single output from an AI model is simply reproducing existing works. To the extent that this occurs in current models, it’s a property of the models rather than a universal property of AI. (And really, it’s still a problem even if it only happens some of the time.)

While there may be a grain of truth to the claim that “AI is remixing existing works”, it’s not a good way of understanding the issue. Likewise, the counterclaim “All art is just remixing existing work” is not very helpful.  Both claims are too vague to be verifiable. And perhaps that’s by design, to produce an unresolvable disagreement.

To find a path forward, we need a more technical understanding, based on overfitting and training data quality. Though the research is limited, it’s a question for which we could in principle find answers. So I say, go forth and demand some answers.

Comments

  1. Bekenstein Bound says

    underfit models suffer from bias

    So there’s a mathematical reason why uneducated people from noncosmopolitan environments suffer from bias … the models in their heads are underfit.

    If only it were easy to convince them that their biases ought to be seen as a problem, rather than objects of pride.

  2. says

    I think you’re making a pun and I’m not going to respond to it. But in case you’re not, “bias” is a technical term that means something else.

  3. says

    I haven’t heard the “remixing existing work” comment, but I would guess it means “an AI can’t be truly creative, like a human can” but how do you measure that, and does it matter? There must be a lot of tasks that don’t require new creative ideas, but just need to produce something within the range of what people typically produce. (Like translation- my husband uses ChatGPT for translation a lot.)

    I’m also curious about what overfitting would mean for an LLM. I’ve tried programming demos with neural nets before but I’m not familiar with how LLMs work.
