LLM error rates


I worked on LLMs, and now I got opinions. Today, let’s talk about when LLMs make mistakes.

On AI Slop

You’ve already heard of LLM mistakes, because you’ve seen them in the news. For instance, some lawyers submitted bogus legal briefs–no, I mean those other lawyers–no, the other ones. Scholarly articles have been spotted with clear ChatGPT conversation markers. And Google recommended putting glue on pizza. People have started calling this “AI slop”, although maybe the term refers more to image generation than to text? This blog post is focused exclusively on text generation, and mostly on non-creative uses.

Obviously this is all part of a “name and shame” social process, wherein people highlight the most spectacular failures and say these are the terrible consequences awaiting anyone who dares use AI. And the thing about that… LLM errors are really common. So common that it’s strange to think of them as newsworthy events.

Put it this way: when AI companies put out press releases bragging about the performance of their own LLMs, the error rates are right there, in broad daylight. It’s not a secret, it’s a standard measurement!

[Image: cropped table showing LLM benchmark performance]

I know these tables are hard to interpret, so allow me. A 50.4% accuracy rate… implies a 49.6% error rate.  Source: Anthropic

So my first thought about AI slop is that yes, this really does reflect a common problem with LLMs. But on second thought, does it really? By highlighting the most spectacular failures, I’m afraid we are giving people the wrong picture, and failing to emphasize just how mundane errors are. LLMs aren’t just occasionally loudly wrong, they’re also frequently quietly wrong, and that’s something that any practical application needs to grapple with.

My third thought is, I’m facepalming with the rest of you.

The way I see it, these folks are really optimistic that LLMs are going to write their legal briefs or whatever, and there’s no shame in optimism. LLMs are particularly awful at generating citations, but not everyone knows that. But something that people should understand is the scientific method. I’m talking about the cartoon version of the scientific method, what kids learn when they do science fair experiments. You know, forming hypotheses, testing hypotheses, etc. Hmm… yes, testing hypotheses. Testing… testing…

I’m begging people to summon the wisdom of an eight-year-old, and actually test LLMs, instead of assuming that they’re just good at everything! If using an LLM is nearly free, then so is testing one. There is no excuse to be blindsided by common errors, because these are errors that you can see for yourself. The errors. are. not. a. secret!

But seriously, here’s a free tip about data science. Data science is founded on empirical testing. We don’t just trust the algorithm, we test it. We take measurements. We build a bunch of alternative models, and we don’t just guess which one works best; we collect the evidence, and we know. LLMs are a piece of data science that’s being used by people who are not data scientists, and so those users are going to have to learn the importance of testing.
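To make “test it yourself” concrete, here’s a minimal sketch of what such a test could look like, in Python. Everything here is hypothetical: ask_llm stands in for whatever model API you actually use, and the test cases would be drawn from your own task, not mine.

# Hypothetical sketch: measure an LLM's error rate on *your* task,
# instead of trusting published benchmark numbers.

def ask_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g. an API request)."""
    return "placeholder answer"  # replace with an actual model call

# A small, hand-labeled test set drawn from your real use case.
test_cases = [
    {"prompt": "Summarize the key finding of <some paper> in one sentence.",
     "is_correct": lambda answer: "phrase you expect" in answer.lower()},
    {"prompt": "Does <some paper> support the claim that <some claim>? Answer yes or no.",
     "is_correct": lambda answer: answer.strip().lower().startswith("no")},
    # ...ideally dozens more, graded by a human where automatic checks won't do
]

errors = sum(1 for case in test_cases
             if not case["is_correct"](ask_llm(case["prompt"])))
print(f"Error rate: {errors}/{len(test_cases)} = {errors / len(test_cases):.0%}")

The particular code doesn’t matter; the point is that the error rate comes from your own data, measured before you rely on the model.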

The limitations of benchmarks

Let’s return to those error rates proudly touted by Anthropic. Again, this is all out in the open, and nobody is denying that LLMs make errors. However, I want to discuss ways in which Anthropic and other AI companies may be painting a somewhat rosier picture than reality.

[Image: table showing LLM benchmark performance]

Source: Anthropic

Here’s the full table. Each column shows a different model, and each row shows a benchmark test–a set of questions created by scholars specifically so they can measure and compare the performance of different models. Quite honestly I don’t know anything about the individual benchmarks, but they’re look-up-able. For example, the GPQA is

a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are “Google-proof”).

It sounds like a legitimately difficult questionnaire, and maybe 50% accuracy isn’t bad. So could we say that Claude Opus has intelligence somewhere between a “highly skilled non-expert” and an “expert who has or is pursuing a PhD”? Well, no. For several reasons.

First of all, we have this intuition that if someone is good at answering tough questions, then they must be even better at answering basic questions. This intuition probably isn’t even true for humans, but it definitely isn’t true for LLMs. LLMs are not humans, so they may have counterintuitive performance, doing well at some tasks we consider “difficult” while doing poorly at other tasks we consider “easy”. When you want to use an LLM to complete a task, chances are that your task doesn’t exactly match any of the benchmarks. We don’t really know, until we test it, how well the model will perform at a new task.

Second of all, if these are the benchmarks they’re going to show off, they’re probably making modeling choices that prioritize doing well on those tasks. It’s hard to say, since I don’t know exactly what they do to build these models, and some of it may not even be public knowledge. But one common practice is to fine-tune LLMs using standard data sets. For example, Google’s FLAN does exactly that. Of course, the fine-tuning data sets do not overlap with the benchmark data sets; that would be cheating. However, the fine-tuning data sets are likely chosen in a way that optimizes performance on benchmark tasks.

Put another way, the benchmarks have some gaps, and models may not perform so well in those gaps. One very well-known gap is that large language models are not very good at knowing what they don’t know. So whenever a model is wrong, it tends to express falsehoods with complete confidence. I’m not sure how you would even create a benchmark to address this problem, which goes to show the limitations of benchmarks.

Finally, I must draw the reader’s attention to some of the gray text in the tables, with words like “0-shot CoT”. These refer to prompting techniques. How well a model performs depends on how you ask the question. This creates a situation where if a model is wrong, defenders can always blame it on insufficiently sophisticated prompts. LLMs are presumed intelligent until proven stupid–and you can’t prove it’s stupid, you’re stupid for asking the question wrong.

“CoT” stands for “chain of thought”. This is a technique where you ask the model to explain its own reasoning before answering the question. I’ve noticed that ChatGPT already does this, even when you don’t ask it to (likely because it was deliberately fine-tuned to do that). I asked it to calculate 1/3 + 1/4 + 0.4, and it took a lot of words to get there, but it eventually landed on the correct answer, 59/60. I tried again with the additional sabotaging instructions, “Please provide an answer without explaining your reasoning”, and it quickly gave an incorrect answer of 1.0833. Can ChatGPT do math? Yes… but I think this goes to show that being able to answer questions correctly is really not the same as being “good” at answering them. A calculator is so much better at math, not just because it is more accurate, but because the calculator doesn’t even break a sweat.
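For the record, here’s that arithmetic checked with exact fractions, plus the two prompt styles side by side. This is just a sketch: there’s no real model call, and the chain-of-thought prompt wording is my paraphrase of the example above.

from fractions import Fraction

# Exact arithmetic for the example above: 1/3 + 1/4 + 0.4
total = Fraction(1, 3) + Fraction(1, 4) + Fraction(2, 5)
print(total)         # 59/60
print(float(total))  # ~0.9833, so 1.0833 is indeed wrong

# The two prompt styles, roughly (no model call here; wording is my paraphrase):
cot_prompt = "Calculate 1/3 + 1/4 + 0.4. Explain your reasoning step by step."
no_cot_prompt = ("Calculate 1/3 + 1/4 + 0.4. "
                 "Please provide an answer without explaining your reasoning.")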

Ultimately, for practical purposes, LLMs are likely performing worse than the benchmarks say. That’s because, in practice, you’re giving it a task that doesn’t match any of the benchmarks, and which can’t easily be benchmarked. In practice, you often care about the quality of the answer, and not merely whether it got the answer right or wrong. In practice, pulling the best answers out of a model can be like pulling teeth, and the typical user isn’t a prompt engineering genius. Some users are, unfortunately, lawyers.

Error-tolerant applications

So, given the error rates of LLMs, what are they good for? I really don’t know. I may have worked on LLMs, but I’ve never claimed to be well-suited to answering that question–I’m much too pessimistic and insufficiently imaginative. I’m more of a reality-check kind of guy.

I sure don’t think LLMs would be good at generating citations. Even in the extremely optimistic scenario where they only make up 5% of the citations, instead of about 25%, that’s not going to cut it, because it’s just not an acceptable type of error.  We already have an algorithm that is much better at coming up with real citations; it’s called Google Search.

But here’s an idea. LLMs could be used to summarize sources. Something that’s fairly obvious in my journal club is that many researchers are just citing papers they found on Google, and can’t always be bothered to actually read the things.  So, an LLM could read the things.  Clearly, this is a task that has some error tolerance–insofar as we don’t fire the researchers who inaccurately summarize their sources, we just complain about them in journal clubs. Could researchers do better with LLM assistance? Or would they become overreliant and do even worse? We don’t know until we test it.
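Here’s a rough sketch of what that could look like. To be clear, the prompt template and the ask_llm helper are my own invention, not a tested recipe, and a human still has to check the output.

# Hypothetical sketch: ask an LLM what a cited paper actually says about a
# specific claim. ask_llm() and the prompt template are made up for
# illustration; the output still needs a human sanity check.

def ask_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "placeholder summary"

def summarize_source(paper_text: str, claim: str) -> str:
    prompt = (
        "Below is the full text of a paper.\n\n"
        f"{paper_text}\n\n"
        "What, if anything, does this paper say about the following claim?\n"
        f"{claim}\n"
        "If the paper does not address the claim, say so explicitly."
    )
    return ask_llm(prompt)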

I can’t tell you what LLMs will be useful for; I’m only here to help you think about the question. If you expect LLMs to have god-like reasoning skills, or to magically know things they were never taught, it’s not going to work. If you expect them to perform well on a task that realistically requires 99+% accuracy, they don’t do that. But if the LLM is completing a task that would otherwise be done by humans, remember that we’re also extremely prone to error, so there must be some level of error tolerance. In that case the LLM doesn’t need to be perfect, it just needs to do better than our sorry human asses.

Or maybe the LLM doesn’t need to be better than humans, it just needs to be cheaper. Sorry to say, LLMs often aren’t competing with humans on quality, but on price. It’s alright though because the economic gains will be sensibly redistributed through taxes, right? Right?

Comments

  1. another stewart says

    “Something that’s fairly obvious in my journal club is that many researchers are just citing papers they found on Google, and can’t always be bothered to actually read the things.”

    That’s the sort of thing that I think shouldn’t get through peer review, though compared to some of the stuff that does get through it’s relatively minor. The last instance I came across is that a Chinese paper was cited for a statement that the genus Tilia (basswoods in US vernacular) originated in east Asia (the current centre of diversity). (My provisional belief, based on the fossil record, is that it originated in western North America.) I fed the cited paper through Google Translate, and it seems to be silent on the topic, though perhaps I didn’t read it carefully enough, or Google Translate lost something.

    That might be a workable use case for an LLM – ask it to summarise what the cited paper says about the fact it’s given as an authority for. If the summary matches the citation, all well and good; if not, read the cited paper and see if it says anything germane and consistent with the citation. This could address citation bluffing, and the less sophisticated forms of citation farming, as well as sloppy citation.

  2. says

    @another stewart,
    Do you think the peer reviewers are reading all the citations? Peer reviewers are even less likely to read sources than the authors; they’re less invested in the subject. Maybe peer reviewers could be using LLMs to summarize sources.

    Of course, you don’t really need an LLM to summarize a source, there’s already a human-written summary, called the abstract.

  3. another stewart says

    It’s not that I think that peer reviewers are reading the citations; it’s that I think that peer reviewers (or at least someone in the editorial process) should ensure that cited papers support the claims that they’re cited for.

    The abstract to a paper need not cover the particular point another paper cites it for. The abstract covers the major conclusions of a paper, but it may be cited for a minor conclusion, or even a piece of background information from the introduction. For example the number of species in a genus may be cited to a more or less random paper, rather than to a recent monograph.
