After writing about LLM error rates, I wanted to talk about a specific kind of error: the hallucination. I am aware that there is a lot of research into this subject, so I decided to read a scholarly review:
“Survey of Hallucination in Natural Language Generation” by Ziwei Ji et al. (2023), publicly accessible on arXiv.
I’m not aiming to summarize the entire subject, but rather to answer a specific question: Are hallucinations an effectively solvable problem, or are they here to stay?
What is a hallucination?
“Hallucination” is a term used in the technical literature on AI, but it’s also entered popular usage. I’ve noticed some differences, and I’d like to put the two definitions in dialogue with each other.
The review defines a hallucination as an AI response that is not faithful to the source material. For example, if asked to summarize a paragraph about COVID-19 clinical trials, the AI might claim that China has started clinical trials even though the paragraph never mentions China. Now, it might actually be true that China has started clinical trials–but if the paragraph didn’t say so, then the AI shouldn’t include the claim in its summary. The review notes the distinction between faithfulness (i.e. to a provided source) and factuality (i.e. aligning with the real world).
As for the public understanding, I think of all the interactions people have with ChatGPT, where it provides a wildly inaccurate answer with complete confidence. For example, my husband asked it for tips on Zelda: Tears of the Kingdom, and it provided a listicle full of actions that are not possible within the game.
One example that garnered public attention was when someone asked Google what to do when cheese falls off a pizza. Google’s AI Overviews suggested that they “add some glue”. It was summarizing a real answer that someone provided (in jest) in a Reddit thread. This was characterized as a hallucination in the media, although I would ask whether it really counts by the technical definition. It is, after all, being faithful to a source.
Here are three major differences between the public definition and the scholarly review’s definition:
- The public does not distinguish between factuality and faithfulness. It is assumed that the source is factually correct, or else that it is the LLM’s responsibility to provide factual information even when the provided source is not factual. (The review observes that researchers also often fail to distinguish factuality from faithfulness. Researchers often make the dubious assumption that sources are factual.)
- The public talks about models being overconfident. It’s not just about models saying something wrong, it’s about the way they say it. People are used to expressing some degree of uncertainty when they feel uncertain, and used to picking up uncertainty in other people. AI models often lack these signs of uncertainty, and this can be a problem in natural conversation. However, this subject is not discussed at all in the review, and so it appears not to be a major research area.
- When we speak of faithfulness to a “source”, what is the source? We can imagine providing the AI with a paragraph and asking it to summarize that paragraph–here, the paragraph is the source. However, what if I ask it for general information without providing any source? Here, the source is the AI’s “memory”. That is to say, we expect the AI to retain information that it learned from training data. Research on hallucinations has primarily looked at the first kind of source, while public discussion has largely centered on the second kind.
Processing machines and knowledge machines
That last distinction is so important, because it gets to the heart of what LLMs are even for. Do we view an LLM as a processing machine, able to process information that is immediately in front of it? Or do we view an LLM as a knowledge machine, dispensing knowledge from its memory banks?
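To make the distinction concrete, here is a minimal sketch. The `query_llm` function and the example text are hypothetical placeholders, not any particular model or API:

```python
# Hypothetical stand-in for whatever model or API you actually use.
def query_llm(prompt: str) -> str:
    return "<model response>"  # placeholder

# Processing task: the source is right there in the prompt.
# A hallucination here means being unfaithful to `document`.
document = "The trial enrolled 120 patients across three hospitals..."
processing_prompt = f"Summarize the following paragraph:\n\n{document}"

# Knowledge task: no source is provided, so the only "source" is the
# model's parametric memory from training.
knowledge_prompt = "Which countries have started COVID-19 clinical trials?"

print(query_llm(processing_prompt))
print(query_llm(knowledge_prompt))
```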
The scholarly review discusses hallucinations almost entirely in the context of processing tasks: for example, summarizing a paragraph, answering a question about a paragraph, summarizing data in a table, or translating a sentence into another language.
However, most of the review is about the broader topic of natural language generation. When it shifts focus to the narrower topic of large language models, it also shifts focus from processing tasks to knowledge tasks. By virtue of their size, LLMs have a much greater capacity for parametric knowledge–knowledge that is stored in the model parameters during training. So when researchers explore hallucinations in LLMs, they focus on models’ faithfulness to parametric knowledge, rather than faithfulness to provided documents.
In my opinion, there are two completely distinct research questions: Do LLMs perform well as processing machines? Do LLMs perform well as knowledge machines?
Researchers seem to be aware of the distinction, but I think they put far too little emphasis on it. In a more recent paper titled “AI Hallucinations: a Misnomer Worth Clarifying”, the authors survey the definition of “hallucination” across the literature and conclude that the term is too vague and may stigmatize mental illness. I don’t disagree with those points, but it’s telling that the authors don’t even mention the distinction between hallucinating from a document and hallucinating from parametric knowledge. I would go so far as to say that AI researchers are dropping the ball.
This also has important consequences for communicating with the public. For example, let’s look at the Wikipedia article on AI hallucinations. Wikipedia describes several possible causes (drawing from the very same scholarly review that I am reading). Did you know, one of the listed causes of hallucinations is that the AI places too much emphasis on parametric knowledge? This only makes sense once we realize that it’s talking about hallucinations that occur during processing tasks, not during parametric knowledge recall. But the general public largely reads the article through the lens of parametric knowledge hallucinations, so it comes across as confusing and misleading.
Can we rely on parametric knowledge?
Let’s return to my opening question: Are hallucinations an effectively solvable problem, or are they here to stay? More specifically, is it possible to solve parametric knowledge hallucinations? Because if not, then maybe we should focus on using LLMs as processing machines rather than knowledge machines. We should be educating the general public about this. We should be educating the CEOs who are deciding how to invest in AI.
I have good reason to ask the question. Fundamentally, parametric knowledge seems like an inefficient way to store knowledge. Imagine freezing and compressing all of human knowledge into a few hundred gigabytes of matrices. Imagine that whenever we want to look up one little thing, no matter how small, we need to load up all those hundreds of gigabytes and do some matrix multiplication with them. That’s effectively what LLMs are doing! Surely it would be more efficient to use a conventional database.
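As a rough back-of-the-envelope illustration (the model size and precision below are assumptions chosen for the arithmetic, not a claim about any particular model):

```python
# Rough arithmetic: how much storage does parametric knowledge occupy?
params = 70e9        # assume a 70-billion-parameter model
bytes_per_param = 2  # assume 16-bit weights

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~140 GB

# Generating each token involves arithmetic over essentially all of
# those weights, whereas a conventional database lookup touches only
# a handful of index pages.
```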
So what does the review say about it? There’s a short section discussing methods to mitigate hallucination in large language models. Most methods are variants on the theme of “fix the training data”: getting rid of low-quality data, getting rid of duplicate data, penalizing answers that users mark as non-factual. Is it enough? I can’t say.
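For illustration, here is a minimal sketch of the “fix the training data” theme. Real pipelines use near-duplicate detection and learned quality classifiers; the length cutoff below is a made-up stand-in for a quality filter:

```python
import hashlib

def deduplicate_and_filter(documents, min_length=200):
    """Drop exact duplicates and very short (likely low-quality) documents."""
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        if len(doc) < min_length:
            continue  # crude stand-in for a quality filter
        seen.add(digest)
        kept.append(doc)
    return kept
```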
And then there’s retrieval-augmented generation (RAG). This is a technique where responses are enhanced by an external source of information. For example, if you query Google’s Gemini model, it may run a conventional Google search and then summarize the results. Effectively, this transforms the task from a knowledge task into a processing task. Of course, even processing tasks are still plagued by hallucinations. And then we have the problem of unreliable and inconsistent information–the metaphorical glue on our pizza.
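Here is a minimal sketch of the RAG pattern. The `search` and `query_llm` functions are hypothetical, standing in for a real retriever (a web search, a vector database) and a real model:

```python
def search(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever: a web search, a vector database, etc."""
    return ["<retrieved document 1>", "<retrieved document 2>", "<retrieved document 3>"][:k]

def query_llm(prompt: str) -> str:
    """Placeholder for a real model or API call."""
    return "<model response>"

def rag_answer(question: str) -> str:
    # 1. Retrieve external documents relevant to the question.
    documents = search(question)
    # 2. Turn the knowledge task into a processing task: the model answers
    #    from the provided documents, not (only) from parametric memory.
    context = "\n\n".join(documents)
    prompt = (
        f"Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return query_llm(prompt)

print(rag_answer("What should I do when cheese falls off my pizza?"))
```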
So, I clearly have my prejudices: I tend to think RAG-type solutions are the best. But I cannot say that my viewpoint is the one exclusively supported by the literature. RAG is just one of the methods researchers have used to mitigate hallucinations, and it’s a solution that introduces problems of its own. I still think it’s the way to go.
This all leads back to our chronic question: what are LLMs good for? When I consider a potential use case, there are two main questions I ask myself. First, how error tolerant is the task? Second, is it a processing task or is it a knowledge task?
John Morales says
State of the art, at least for free.
Early days. S-Curve.
—
I found this to be an amusing story, and relevant to your post:
https://www.theguardian.com/australia-news/2024/nov/12/real-estate-listing-gaffe-exposes-widespread-use-of-ai-in-australian-industry-and-potential-risks