Normally, in the introduction to an article, I would provide a “hook”, explaining my interest in the topic, and why you should be too. But my usual approach felt wrong here, since I cannot justify my own interest, and arguably if you’re reading this rather than scrolling past the title, you should be less interested than you currently are.
So, review scores. WTF are they? I don’t have the answers, but I sure have some questions. Why is 0/10 bad, 10/10 good, and 5/10… also bad? What goals do people have in assigning a score, and do they align with the goals of people reading the same score? What does it mean to take the average of many review scores? And why do we expect review scores to be normally distributed?
Review scores are intuitively understood as a measure of the quality of a work (such as a video game, movie, book, or LP), or perhaps as a measure of our enjoyment of the work. Already we have this question: is it quality, or is it enjoyment, or are those two concepts the same? But we must leave that question hanging, because there are more existentially pressing questions to come. Review scores do more than just express quality/enjoyment: they assign a number. And numbers are quite the loaded concept.
First, numbers are totally ordered. Is quality/enjoyment ordered? Could we really take any two works, or any two experiences, and judge whether one is better than the other? Can two works be equal to one another?
Second, perhaps more troublingly, numbers can be added and subtracted. What would it even mean to add or subtract two review scores? Could we say that the “difference” between a 1/10 and a 4/10 is equal to the “difference” between a 4/10 and 7/10? I dunno about that. My intuition is that 7/10 might be something I would enjoy, but both a 1/10 and 4/10 likely represent something I wouldn’t enjoy, so the distance between 1/10 and 4/10 is relatively small.
Now you could say that just because review scores are represented with numbers, and just because it makes sense to add and subtract numbers, does not therefore mean that it makes sense to add and subtract review scores. And I agree! However, in that case, numbers seem like the wrong mathematical metaphor. If review scores are ordered, but cannot be added or subtracted, then review scores are more properly described as ordinals, rather than numbers.
The thing is, we do not treat review scores like ordinals. We like taking averages of our review scores. An average, importantly, involves adding multiple review scores, and then dividing by the number of scores. For instance, if one person gives a 1/10 rating, and another gives a 7/10 rating, the average score is a 4/10. Our method of taking averages implies that the difference between a 1/10 and 4/10 is equal to the difference between a 4/10 and 7/10, intuition be damned.
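To make this concrete, here’s a quick Python sketch (the scores and the relabeling are made up) showing that an average depends on the gaps between scores, not just their order:

```python
from statistics import mean

# Two hypothetical sets of reviews with the same simple average.
scores_a = [1, 7]
scores_b = [4, 4]
assert mean(scores_a) == mean(scores_b) == 4

# An order-preserving relabeling, reflecting the intuition that
# 1/10 and 4/10 are "close" while 7/10 is far from both.
relabel = {1: 3, 4: 4, 7: 9}

mean_a = mean(relabel[s] for s in scores_a)  # 6.0
mean_b = mean(relabel[s] for s in scores_b)  # 4.0
# Every individual score keeps its place in the ordering, yet the
# averages now disagree. Averaging is sensitive to the gaps, so it
# treats scores as numbers, not ordinals.
```

If review scores were purely ordinal, any order-preserving relabeling would be equally valid, and the average would have no stable meaning.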
My understanding is that websites that aggregate review scores typically do not take simple averages, perhaps because they are aware of these very issues. I do not know how they compute averages, and they probably don’t want anyone to know, lest people game the system. So, in practice, are we treating review scores as numbers or as ordinals? We don’t even know! How do we sleep at night, not knowing if the sheep are counted or merely arranged in order?
To be reductive, the purpose of a review score is to inform a buy/no-buy decision. To be slightly less reductive, a review score also might inform you whether to buy now or wait, how excited to be, or how much trust to put into the work.
Informing a buy/no-buy decision is not quite the same as just telling you whether or not to buy. A review score is just one of many inputs into our brains’ algorithms. For instance, I might buy a game if it’s in the “puzzle adventure” genre, is praised for its story, and has review scores of at least 6/10. But I may not buy a first person shooter regardless of review scores.
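As a toy illustration of “one input among many” (the genres, the threshold, and the logic here are all hypothetical, not anyone’s real decision procedure):

```python
# A made-up decision rule: the score only matters once genre and
# story praise have already cleared their own bars.
def should_buy(genre: str, story_praised: bool, score: float) -> bool:
    if genre == "first person shooter":
        return False  # no score rescues a genre I don't enjoy
    if genre == "puzzle adventure":
        return story_praised and score >= 6
    return False  # other genres: a score alone isn't enough

should_buy("puzzle adventure", True, 7)        # True
should_buy("first person shooter", True, 10)   # False
```

Note that the score never acts alone: a 10/10 shooter still loses to a 6/10 puzzle adventure under this rule.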
If we’re being less reductive, we might say that the purpose of the review score is to express a certain level of enjoyment on the part of the reviewer. So, if I were to give these games ratings based purely on my own enjoyment, I would have to give categorically higher scores to puzzle adventure games, and categorically lower scores to first person shooters. But would this actually be useful to someone trying to make the buy/no-buy decision? It seems that rather than hearing about my genre preferences, you’re better off just having an understanding of your own genre preferences, and finding review scores that reflect the quality of a game within its own genre.
So on the one hand, we have this idea that review scores should reflect the internal enjoyment of the reviewer. On the other hand, we have this idea that review scores should be useful for making decisions. These two goals are at odds!
We almost always just ignore this problem by finding the right reviewers—reviewers whose personal tastes just so happen to align with what is useful to the consumer. You want a review of a romance novel, you get it from someone who likes romance novels, that’s just common sense, right? You certainly wouldn’t get it from me, because I’d just write some meta about my issues with the romance genre as a whole (yes I have done this).
Something I’m circling around is the so-called “objective review” commonly demanded by gamers. The philosophical problems with the “objective review” are too glaring for me to waste breath on—and I say this after having wasted a bunch of breath on numbers vs ordinals.
But even though reviews are not in any sense “objective”, neither do they represent unfiltered subjective opinion. We filter review scores by selecting “critics” whose opinions are somehow more valuable than the rest of us genre-picky riff-raff. Or, in the case of websites with user review scores, we are selecting reviewers who are sufficiently engaged to leave ratings, and putting these scores through an averaging algorithm nobody quite understands. So what even is a review score? Stare at this mystery long enough, and you might just go mad, maybe become one of those gamers who demands “objectivity”.
In the process of writing this essay, I did a bit of “research”, looking to see what the score distributions are on sites like Goodreads, IMDB, or Metacritic. A few interesting things came up. First, I found an interesting academic article comparing review score distributions on Goodreads vs Amazon. The difference between the two was explained as the result of Amazon being more directly tied to sales. So for instance, people give more 1 and 5 star reviews on Amazon, because that’s the best strategy if you’re trying to maximize your influence on buy decisions.
The second article I found was by a data scientist trying to select the “best” movie review website, purely on the basis of how closely their score distributions resemble a centered normal distribution. Along similar lines, there’s a FiveThirtyEight article criticizing Fandango on the basis of its skewed score distribution. Both of these articles are deeply misguided.
FiveThirtyEight complained that 98% of all movies on Fandango are rated between 3 and 5 stars. It tells a story of someone being persuaded to watch a terrible movie on the basis of a 3-star score on Fandango. Well, nobody ever told you that 3 stars means “good”; that was just your assumption.
And I have to point out… if your complaint is that the distribution is asymmetrical, there is an easy way to make it symmetrical again. Change 3 stars to 1 star; 3.5 stars to 2 stars; 4 stars to 3 stars; 4.5 stars to 4 stars. Now you’ve got yourself a nice centered normal distribution, which by your own standards means it’s the perfect movie review website. If a review score distribution is mathematically isomorphic to the perfect review score distribution, I’m forced to conclude that the score distribution was perfect to begin with. At worst, we could say that the scores were communicated poorly.
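For the skeptical, here’s a quick Python sketch of that relabeling (I’ve assumed 5 stars stays at 5 stars), confirming it preserves the order of every score:

```python
# The order-preserving relabel proposed above: squash Fandango's
# 3-to-5-star range down so the distribution re-centers.
relabel = {3.0: 1, 3.5: 2, 4.0: 3, 4.5: 4, 5.0: 5}

# Any (made-up) list of Fandango scores keeps its exact ordering...
scores = [3.0, 4.5, 3.5, 5.0, 4.0]
relabeled = [relabel[s] for s in scores]
assert sorted(range(len(scores)), key=lambda i: scores[i]) == \
       sorted(range(len(scores)), key=lambda i: relabeled[i])
# ...so as ordinals the two scales carry identical information;
# only the labels, and hence the distribution's shape, differ.
```

That is the sense in which the relabeled scale is “mathematically isomorphic” to the original: an order-preserving bijection between the labels.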
Both of these articles express the assumption that the “ideal” center of the distribution is 3/5, or alternatively 5/10. But where does this assumption even come from, and is it consistent with our other ideas about review scores? For example, many people think 5/10 describes something you neither liked nor disliked. But what if most people like most movies? Wouldn’t you expect the review score distribution to be centered above 5/10?
Or maybe 5/10 means good, but only good enough to match the experience of doing something else besides watching a movie. Maybe 5/10 means that even people who love movies enough to rate them on Fandango could take or leave this particular movie, so don’t bother unless you love movies even more than that. I don’t know!
Personally, I just guess at the meaning of review scores based on past experience. In the realm of music, a 10/10 means “worth trying first 20 seconds”. In the realm of video games, a 7/10 means “it is a game that someone thought was worth reviewing”. On Goodreads, a 4.01 means “amazing”, and 3.98 means “totally unreadable”. So on and so forth.
Another assumption the data science article made was that review score distributions ought to be normal, peaked distributions. The author went so far as to dismiss Rotten Tomatoes, which had a uniform, unpeaked distribution. Why though?
Normal distributions make some sense, insofar as movies are the sum of many small parts, which randomly improve your experience or detract from it. On the other hand, uniform distributions maximize entropy. In plainer terms: uniform distributions give you greater power to differentiate between movies. Uniform distributions also lend themselves to a natural interpretation, that of percentiles.
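If it helps, the percentile interpretation is simple to compute; here’s a minimal sketch (with made-up scores) where each score becomes the fraction of scores at or below it:

```python
# Percentile ranks: a score of 0.8 means "at or above 80% of the
# rated movies" — a uniform-distribution reading of the scale.
def percentile_ranks(scores):
    n = len(scores)
    return [sum(t <= s for t in scores) / n for s in scores]

ranks = percentile_ranks([6.1, 7.4, 8.8, 5.0, 7.4])
# ranks == [0.4, 0.8, 1.0, 0.2, 0.8]
```

By construction, applying this transform to any set of scores yields (roughly, up to ties) a uniform distribution, regardless of the shape you started with.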
So it’s funny. On the one hand, the data science article assumes that the normal distribution is best, and dismisses websites with uniform distributions. On the other hand, I found another analysis that translates IMDB scores into percentiles, essentially forcing a uniform distribution. Clearly we have some fundamental disagreements over what we want in review scores.
So, to summarize my conclusions. WTF are review scores? Are they numbers or ordinals? What purposes do they serve? What is their appropriate distribution? How does anyone else get by without wondering these things?