In “Review scores: a philosophical investigation”, I pointed out all the weird things about review scores that we tend to take for granted. More broadly, I have questions not just about review scores, but also ratings and rankings. What do they mean, what purpose do they serve, and how do we produce them? Here I examine a case study: the Dominion fan community’s many ways of rating and ranking Dominion cards.
In case any readers aren’t familiar with Dominion but are still interested in this article, I’ll provide a bit of basic background. Dominion is a “deckbuilding” game, the first of its kind when it was released in 2008. In contrast to trading card games (such as Magic: The Gathering), you don’t build a deck in preparation to play; instead, every game you start with a basic deck and build it as you play. Dominion has inspired a whole genre of board games and video games, perhaps most notably Slay the Spire.
In each game of Dominion, there are ten “kingdom” cards that are available to put into your deck, usually by purchasing with in-game resources. The kingdom cards are chosen randomly at the beginning of each game. There are also “base” cards that are available for purchase in every game. Currently, among all the Dominion expansions, there are 366 distinct kingdom cards, which is quite a lot to rank, but it’s at least feasible. Contrast with Magic: The Gathering, which has tens of thousands of unique cards, far too many to rank all at once.
Most cards have a particular price point, bought with in-game currency. But as more expansions have been released, there are increasing complications. Some cards have unusual costs, or variable costs. Some kingdom cards are actually piles consisting of multiple distinct cards with distinct costs. There are also additional cards that can’t be bought directly, but are gained through other cards. There are boons and hexes, which are randomized bonuses/penalties granted by certain cards. There are events and projects, which can be bought for a particular price but are never added to your deck. And finally there are ways and landmarks, which cause global rules changes.
The original community rankings were created by user Qvist, and ran every year from 2011 to 2018. Users were provided with an interface to rank or rate cards. In the years I participated, it was also possible to rank cards by “duel”: the interface would prompt you to compare two cards, and if you answered enough times it would create a ranking from your choices. Regardless of how users input their choices, the responses would be interpreted as percentile scores. When all responses were collected, the average percentile would be calculated for each card, and the final rankings would be announced and discussed in the forums.
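To make the aggregation concrete, here is a minimal sketch of percentile averaging as I understand it (the card names and exact percentile formula are my own illustration, not the poll's actual code):

```python
from collections import defaultdict

def percentile_scores(ranking):
    """Convert one user's ranking (best card first) into percentile scores.

    The top card gets 100, the bottom card 0, and the rest are spaced evenly.
    """
    n = len(ranking)
    return {card: 100 * (n - 1 - i) / (n - 1) for i, card in enumerate(ranking)}

def aggregate(rankings):
    """Average each card's percentile score across all submitted rankings,
    then return the cards sorted by average percentile, best first."""
    scores = defaultdict(list)
    for ranking in rankings:
        for card, score in percentile_scores(ranking).items():
            scores[card].append(score)
    averages = {card: sum(s) / len(s) for card, s in scores.items()}
    return sorted(averages, key=averages.get, reverse=True)

# Three hypothetical voters rank the same three cards.
votes = [
    ["Chapel", "Village", "Moat"],
    ["Chapel", "Moat", "Village"],
    ["Village", "Chapel", "Moat"],
]
print(aggregate(votes))  # ['Chapel', 'Village', 'Moat']
```

Note that because every response is normalized to the same 0–100 scale before averaging, each voter contributes equally regardless of how they used the input interface.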
Cards would be categorized based on cost, and ranked only within their own cost category. The idea is that usually you’ve hit a particular price point, and you have a choice among all the cards at that price point. Comparing cards with different costs is harder, because sometimes you can only afford one of those cards and not the other, so are they even comparable? Perhaps more importantly, there are a lot of cards to rank, and placing them into smaller categories reduces the amount of computation.
The purpose of the ranking was to summarize the collective wisdom of Dominion players, especially top players. In fact, the percentile averages would give more weight to players who did well in the online leaderboards.
Here’s a question: Why percentiles? Why not ratings? In a way, rankings are more “objective”. If players assigned ratings from 0 to 10, one player might have minimal impact on the results because they give almost everything a score between 6 and 9, while another player might have a disproportionately high impact by using the full range from 0 to 10. On the other hand, rankings also lose some information. If you think the top card is far superior to the 2nd and 3rd cards, which are about the same, you can express that opinion through ratings, but not through percentiles. In the end, users didn’t entirely agree on this question, so the poll would allow players to input ratings or rankings, even though the responses were ultimately encoded as percentiles regardless.
The Qvist rankings suffered from some problems that became worse over time. While the cost-based categorization scheme initially made sense, it was difficult to deal with all the exceptions introduced by newer expansions. New ad hoc categories were created, but often didn’t provide much value because they were too small. And people disagreed about how to categorize some cards. But I believe that the real downfall of the Qvist rankings was the computational complexity. Producing a full ranking is O(N log(N)) in the number of cards, and by the end the largest category had 127 distinct cards. For this reason, participation declined in later years.
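The burden on participants can be quantified. Any comparison-based sort needs at least log2(N!) pairwise judgments, roughly N log2 N, so the number of decisions grows faster than the number of cards. A quick sketch:

```python
import math

def min_comparisons(n):
    """Information-theoretic lower bound on the number of pairwise
    comparisons needed to fully sort n items: ceil(log2(n!))."""
    return math.ceil(math.log2(math.factorial(n)))

print(min_comparisons(25))   # roughly an expansion-sized category: 84
print(min_comparisons(127))  # the largest Qvist category
```

Sorting 127 cards takes hundreds of judgments per participant, which helps explain why filling out the poll became a chore.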
Starting in 2017, Adam Horton would distribute a poll asking people to rate cards on a scale from 0 to 10. To keep people on the same page, the poll would provide loose guidelines for what each rating meant.
The philosophy behind this poll contrasted with that of the Qvist rankings, in that the results were emphatically ratings, not rankings. When summarizing results, Horton would deliberately avoid ever sorting or comparing cards, even though there is nothing to stop other people from doing so. Instead, the main comparison of interest was between different ratings of the same card. For instance, which cards had high variance in ratings, and which had low variance? How do initial impressions compare to the impressions after trying many games with the card?
Also in contrast to Qvist, the purpose wasn’t to summarize expert opinion, but just to gauge community impressions. A greater effort was made to poll the broader Dominion community, such as the subreddit. No additional weight was given to experts.
Horton’s ratings don’t suffer from many of the problems of the Qvist rankings. The computational complexity is O(N) instead of O(N log(N)). And there is no need to argue about how to define subcategories.
ThunderDominion is a ranking system created by the Dominion Discord community starting in 2018. It was created as a more casual form of ranking, where the main goal is not to come up with the best or most comprehensive ranking, but to provide a structure for people to endlessly discuss cards.
ThunderDominion is structured as a series of public votes. The host would establish the thing to be voted on, people would argue about it, place their votes, and sometimes change their votes. The vote would determine part of the ranking, and then the next vote would begin. This format only works with a limited number of cards at a time, so generally they would group cards by expansion, ranking about 25 cards at a time.
The particular decision in each vote changed over the years. Initially it used a selection sort, where people would vote on the top and bottom remaining cards. Later they switched to an insertion sort, voting on where to insert each card within the ranking so far. Then they kept the insertion sort, but let the median vote win instead of the plurality vote. And in the latest edition, they used quicksort.
My own experience with ThunderDominion is very limited, but my understanding is that the particular choice of voting system has surprising and often perverse effects. Plurality voting means you have to vote strategically, sometimes teaming up with other people to get satisfactory if not optimal results. There are also considerations of the kind of discussion you want to generate. Insertion sort, for instance, leads to discussion more focused on particular cards than selection sort does. But insertion sort also has the issue that initial votes are trivial and later votes are very difficult. While these problems might also exist in other kinds of polls, they’re completely exposed in the ThunderDominion format, resulting in a ranking that nobody believes is perfect, but which is still fun to create.
Dominion Card Glicko
The Dominion Card Glicko is an ongoing ranking system based on card comparisons. Rather than sorting everything all at once, users are asked to compare ten pairs of cards, indicating whether one is stronger than the other, or if they’re too similar to call. Everyone’s comparisons are aggregated together using the Glicko-2 ranking system, which I understand to be conceptually similar to the Elo rating used to rank chess players.
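Glicko-2 itself also tracks a rating deviation and volatility for each card, but the core idea is the same as Elo: each pairwise comparison is treated as a match, and the "winner" gains rating in proportion to how surprising the result was. A simplified Elo-style sketch (starting ratings, K-factor, and card matchup are my own assumptions):

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo-style update after a comparison 'won' by the first card.

    `expected_win` is the predicted probability that the higher-rated
    card wins; an upset moves both ratings further than an expected result.
    """
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Two cards start at 1500; a string of "stronger" votes separates them,
# with each successive win moving the ratings less than the last.
a, b = 1500.0, 1500.0
for _ in range(10):
    a, b = elo_update(a, b)
print(round(a), round(b))
```

The appeal for card ranking is that every comparison updates a single global scale, so cards end up comparable even if no participant ever compared that exact pair directly.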
This ranking system provides a more serious alternative to ThunderDominion, without many of the drawbacks of the Qvist ranking. There’s no concern about computational complexity, because you’re just contributing however much you want to a communal ranking. And though the rankings are reported by the same categories used by Qvist, in principle the scores are comparable across categories, so there’s no need to argue about category definitions. If an individual does not think that two objects are comparable, they can skip the question and the ranking will rely on other participants’ opinions.
There are still a few drawbacks though. The ranking gives more weight to the people most willing to put time into it. And as far as I know all comparisons are aggregated together without regard to timestamp, so it’s unclear how well the rankings adjust to changing opinions. I believe the idea is to periodically take snapshots of the rankings and reset, although there is only one snapshot so far.
Although each of these ranking/rating systems is nominally attempting the same project, I find it fascinating how they each have different goals and different methods. Is the purpose of a ranking/rating to give strategy advice, to describe current community opinion, or to provide a vehicle for discussion? The method of ranking is also quite significant, and depends on practical considerations like what discussion it generates, how much of a burden it places on participants, and how it addresses philosophical disagreements among players.