If you took a rating website, say IMDB or Goodreads, and sorted items purely by review scores, the stuff floating to the top would be pretty obscure. That’s because the easiest way to maintain a perfect score is to have a very small sample size.
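To make that concrete, here’s a toy sketch (with made-up ratings) of what sorting purely by the raw mean does: a single perfect review outranks a large pile of near-perfect ones.

```python
# Hypothetical data: one obscure item with a single 5-star review,
# one popular item with many reviews averaging just under 5.
items = [
    ("Obscure book", [5]),                    # one perfect review
    ("Popular book", [5, 5, 4, 5, 4, 5, 5]),  # seven reviews, mean ~4.71
]

# Sort purely by mean rating, highest first.
by_mean = sorted(items, key=lambda x: sum(x[1]) / len(x[1]), reverse=True)
for name, ratings in by_mean:
    print(f"{name}: mean={sum(ratings)/len(ratings):.2f}, n={len(ratings)}")
# "Obscure book" wins with a perfect 5.00 off a sample size of one.
```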
So, a math question: what is the statistically “correct” way to handle this?
In this analysis, I will assume there exists a “true” average review score, and we are trying to estimate it. The “true” average is the average that would be attained with a sufficiently large sample of reviewers. We’re not imagining that everyone in the world is reviewing the same book (for example, we don’t expect book reviews to reflect the opinions of people who don’t like reading books, period). But we could imagine a billion identical yet statistically independent Earths, and averaging the review scores across all of them. Obviously it’s very hard to come across a billion identical yet statistically independent Earths, and that’s why we use math instead.
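Here’s a small simulation of that thought experiment, using an assumed (entirely made-up) rating distribution: a huge sample pins down the “true” mean, while a handful of reviews can land well away from it.

```python
import random

random.seed(0)

scores = [1, 2, 3, 4, 5]
weights = [0.05, 0.05, 0.15, 0.35, 0.40]  # hypothetical "true" preferences

def sample_mean(n):
    """Mean rating from n statistically independent reviewers."""
    return sum(random.choices(scores, weights, k=n)) / n

# The "true" average is the expectation under the assumed distribution.
true_mean = sum(s * w for s, w in zip(scores, weights))
print(f"true mean:   {true_mean:.3f}")          # 4.000
print(f"n = 5:       {sample_mean(5):.3f}")     # noisy, can be far off
print(f"n = 1000000: {sample_mean(10**6):.3f}") # very close to 4.000
```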
This premise may fairly be questioned. I once discussed the philosophical problems with review scores, including questioning the very idea of taking averages. But here, I’m just focusing on the math for math’s sake. And I really mean it: it’s hardcore math. If you don’t want math, just skip to the last section, I guess.

