On Wishlists, homemade Elo rankings

Or: The Steam Tournament; Chapter 1

`WARNING: Contains mathematics. And your opinion of me might worsen after learning what I do to amuse myself.`

The Steam Summer Sale is still on. Even though I have more than enough games to keep me entertained for *years* to come, I still have a wishlist. New games come out, some of them interest me, and maybe I’ll play them at some point in the future. The wishlist is still useful…

…up to a certain point. Just like my `todo` file, the wishlist grows *almost* monotonically^{1} in two major axes:

- Cardinality, and
- Time-distance.

In other words, not only are there more and more items on the wishlist, but the delta between the first and last added items grows constantly. Some games were added yesterday, some were added 6 years ago.

Both of these run contrary to the very function of a wishlist (to create a subset of all games to track interest and prices) so something must be done.

**Buying games** is a way to decrease the wishlist size, but it’s not enough to stabilize it. Even when I had a regular job and disposable income, the wishlist always grew faster than I could buy games.
**Pruning**—the other big strategy—is easy in theory, but not in practice. Sure, some games have demonstrated that the initial hype was only that and my interests change over time, but those are a minority. A fun game is still fun years later, and older games tend to be cheaper, so that their price-per-hour ratio tends to decrease while their fun-per-price ratio tends to increase. This means that age alone is not a good criterion for pruning.^{2}

Enter another criterion, more or less built into Steam itself: **Wishlist rankings**, a user-defined order from 1 to `n`. Even though the actual implementation in Steam is quite clunky^{3}, the mere idea of ranking the games is—in theory—a good approach to aid in wishlist management.

I have no hope that Valve will ever address this.

Having some sort of ranking could—in theory—aid in decision management. For example, my Wishlist currently has a bit under 300 elements. If they were ranked, I could set some rules, like:

- Only buy games ranked 1 through *n*, for some value of *n*
- Create some sort of mathematical index, like, say, `price × ranking`, and buy only when that score falls under some threshold value^{4}
- Regularly switch the top and bottom halves of the rankings for the hell of it.
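As a toy illustration of the second rule (the game names, prices and the `threshold` value here are all made up):

```python
# Hypothetical wishlist entries: (name, price in USD, ranking)
wishlist = [
    ("Game A", 59.99, 3),
    ("Game B", 14.99, 5),
    ("Game C", 4.99, 120),
]

threshold = 100.0  # arbitrary cutoff

# Buy only when price × ranking falls under the threshold
buyable = [name for name, price, rank in wishlist if price * rank < threshold]
print(buyable)  # → ['Game B']
```

Note how the index naturally favors cheap, highly-ranked games, which is more or less the point.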

Now, this rests on the *assumption* that the games are already ranked. But what if they aren’t?

## Rankings and tournaments

So, the problem is that

- There’s 200+ elements of a set, that
- should be ranked more or less uniquely,
- according to *subjective* and *loosely-defined* criteria,
- in a time-efficient way, and
- using an algorithm that should be repeatable (given that there will be newcomers).

Of course, I’m not alone in this problem. Other, more intelligent giants of the past have dealt with similar troubles. After some reading I came to the following premature conclusion:

In an **ideal** world, I’d program some sort of ranking that would choose a Condorcet winner, then remove that winner from the pool as #1 rank and repeat the process until all games have been sorted.^{5}

Of course, this is untenable, mostly from a computational perspective. But there’s real-world alternatives to my ideal solution.

### Round Robin

The most naive way would be to hold a round-robin tournament: every game faces every other game, and for each pair of games I decide which one I like best.

There’s some advantages to this. I’m not entirely sure *what* is being ranked here, so making a comparison among 200+ elements is next to impossible; but whatever it is, it’s easy(-er) to do when there’s only two. In theory, if I held a round-robin tournament I could somehow count the wins of every game and declare a winner.

Astute readers have already dismissed this, of course. Round-robins are fine for small sets, but grow quite fast. With only 50 participants, there’s 1225 individual comparisons to make^{6}. 200 participants would require 19,900 comparisons, which would take just over 5 and a half hours, assuming 1 second per comparison and no breaks.
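The counts come straight from the handshake formula, `n * (n - 1) / 2`; a quick sanity check:

```python
def round_robin_matches(n):
    """Number of pairwise matches among n participants: n choose 2."""
    return n * (n - 1) // 2

print(round_robin_matches(50))   # → 1225
print(round_robin_matches(200))  # → 19900

# At one second per comparison, how many hours is a 200-game round robin?
print(round_robin_matches(200) / 3600)
```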

The bottleneck is obvious now. Just as in sorting, the number of comparisons must be as low as possible, mostly because they have to be made by a meat-based calculator.

## Elimination-style tournaments

What about single- and double-elimination tournaments? They need considerably fewer matches. The main disadvantage of this approach is that only the “top” of the list is more or less properly ranked.

Consider a regular single-elimination tournament among 8 participants. The matchup is something like this:

```
A---|
    A---|
B---|   |
        D---|
C---|   |   |
    D---|   |
D---|       |
            D
E---|       |
    E---|   |
F---|   |   |
        H---|
G---|   |
    H---|
H---|
```

After only 7 matches it’s obvious who is first and second place, but further rankings are fuzzy. There’s two possible approaches:

Hold extra matches, as is usually done in sports. The 3rd and 4th place can be determined by a match between the losers of the semifinals (`A` & `E`), taking the total to 8 matches. This is enough for sports where they only hand out 3 medals, but if the goal is ranking *all* players, this logic dictates that a new mini-tournament is needed to sort the 5th through 8th places. Such a new tournament would add 4 extra matches^{7}, taking the total to 12. This is still better than 28 in a round robin, but it hinders a bit our goal of simplicity.^{8}

Construct some sort of ‘categories’ for the games. For instance, in the tournament above there’s 4 losers in the first round, 2 losers in the second, 1 loser in the third, and one winner in the third. Then, the ranking is more or less:

- `D`
- `H`
- {`A`, `E`}
- {`B`, `C`, `F`, `G`}

This maximizes simplicity of procedure, and could be a better representation of what’s actually happening in my inner, subjective ranking: the deeper a game is in the ranking, the more difficult it is to meaningfully differentiate it from its neighbors. Maybe it’s true in this example that `B`, `C`, `F` and `G` are all equally “valuable” in my subjective, inner ranking. However, I fear this simplification can “punish” games by not giving them a fair stand against the rest. Maybe losing one or two matches can drastically alter the *perceived* ranking of a game.

## Elo rankings and Not-a-real-ladder

I figured some kind of scoring could help capture the nuances between similarly ranked games while also providing an objective ranking. The first thing that came to mind was to use Elo ratings. Lo and behold, I started coding because at this point I was craving some kind of result before binning the project altogether.

Elo is simple to code. In the end it’s a function that depends on four pieces of information:

- Player A’s score,
- Player B’s score,
- The Development Coefficient *K*, and
- The actual outcome of the game.

Once I had this function in place, I wanted to try it out, so I exported the wishlist and assigned all games a starting score of 1500 and set *K* = 40. Why? No reason in particular, maybe I’ll need to adjust the values later on.
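For reference, a minimal version of such a function in Python, using the standard logistic Elo formulas (the names and the `k=40` default mirror my choices above, but this is a sketch, not the actual script):

```python
def elo_update(rating_a, rating_b, score_a, k=40):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k is the development coefficient.
    """
    # Expected score of A under the logistic Elo model (400-point scale)
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two fresh games at 1500; the winner takes K/2 = 20 points from the loser
print(elo_update(1500, 1500, 1.0))  # → (1520.0, 1480.0)
```

Note that whatever one player gains, the other loses; the total number of points in the pool is conserved.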

But I wasn’t sure how to start matching the games against each other. For simplicity, I decided to create a pseudo-ladder tournament that goes like this:

1. Take the first two players on the list, match them and assign a win, loss or tie.
2. Update their respective scores.
3. Move on to the next pair (players 3 and 4 on the list), match them and assign a win, loss or tie.
4. Update their respective scores.
5. Repeat steps 3–4 until the end of the list.
6. Sort the list by Elo rating, descending.
7. Save the list. End of round.
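The whole round fits in a few lines. A sketch (here the outcome is drawn at random as a stand-in for my manual judgment; in reality I answer each matchup myself):

```python
import random

def play_round(players, k=40):
    """One pass of the pseudo-ladder.

    players: list of (name, rating) tuples, sorted by rating descending.
    Adjacent pairs play, ratings update, and the re-sorted list is returned.
    """
    updated = []
    for i in range(0, len(players) - 1, 2):
        (name_a, ra), (name_b, rb) = players[i], players[i + 1]
        score_a = random.choice([1.0, 0.5, 0.0])  # stand-in for my verdict
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
        delta = k * (score_a - expected_a)
        updated.append((name_a, ra + delta))
        updated.append((name_b, rb - delta))
    if len(players) % 2:
        updated.append(players[-1])  # the odd game out sits this round
    return sorted(updated, key=lambda p: p[1], reverse=True)

ladder = [(f"game{i}", 1500.0) for i in range(7)]
ladder = play_round(ladder)
```

Because every match moves points from loser to winner, the total stays constant; only the distribution changes from round to round.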

The idea behind this is that a single pass on the list starts dividing the participants. With only *n*/2 matches there’s a top and bottom half. After several passes, some sort of structure should arise.

I haven’t *seen* any major disadvantages to this pseudo-ladder, but I suspect there might be some. With this algorithm it’s very unlikely that any game from the bottom half will ever face someone from the top. In some sports this is desirable so as to avoid matches with large skill gaps that might negatively affect the weaker player. But these aren’t people and I don’t care for their well-being. I’d like to have large gaps if doing so helps me create a better ranking.

This pseudo-ladder will only face participants against games of similar strength, and maybe that means that scores are *hard* to change significantly. If game `x` lost its first match and then went on to win 49 more matches in a row, its score will improve, for sure, but how fast? Will it be enough to reflect its status as an almost-undefeated player?
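That question can at least be simulated, under the simplifying assumption (mine, not how the ladder actually behaves) that every opponent sits at a fixed 1500 rating:

```python
def streak_rating(k=40, opponent=1500.0, losses=1, wins=49):
    """Rating of a game that starts at 1500, loses `losses` matches,
    then wins `wins` in a row, always against a 1500-rated opponent."""
    rating = 1500.0
    for outcome in [0.0] * losses + [1.0] * wins:
        expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
        rating += k * (outcome - expected)
    return rating

print(streak_rating())
```

The gains shrink as the gap widens (the expected score approaches 1), so even a long streak climbs slower and slower; how fast it climbs at all is governed by *K*.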

I don’t know. I’ve run only a few rounds and the score gap is 1587.9136762730113 − 1412.0863237269887 = 175.8273525460226. Not even 200 points. Is this normal? I don’t know, I started with the naive assumption of giving everyone the same initial Elo ranking of 1500.

More information will come.

1. Strictly speaking, it’s not; but the two major strategies to decrease its size are overshadowed by the regular wishlist growth. More on this later.

2. And, while we’re at it, neither are the game’s reviews past a certain point. Sure, a game with 10% positive reviews is almost surely a waste of time and money, but between the *Mixed* and the *Overwhelmingly Positive* games, there’s a lot of grey area. I’ve enjoyed games that have lukewarm reviews and have hated games that were critically acclaimed. Reviews are a rabbit hole in and of themselves and I won’t touch them here.

3. My main issues with this system are:
   - As soon as games are added to the wishlist, they are given a ranking, but I’ve yet to figure out how this number is calculated. In recent times they seem to be added to the bottom, but I’ve seen all sorts of strange behaviors;
   - There’s no easy way to re-assign a game’s ranking. Or rather, there’s no easy way to reassign *all of your games’* rankings. The drag-and-drop system is not the best, for it’s prone to system speed, misclicks and the window size itself.

4. I confess I came up with this example on the fly, but it seems interesting, although problematic. Some games will *never* fall under a certain price (many AAA games come to mind) and some are consistently interesting to me, which skews their ranking to the top values. Maybe the index itself could include a `lowestprice / retail_price` term… I can certainly nerd-snipe myself.

5. As I was discussing this problem with a friend, she suggested some kind of ranked voting to repeatedly “drop” the last participant from the pool. In theory, that should produce the same result, but in this case it’s more important to know the top of the ranking rather than the bottom, and all else being equal I’d rather have an algorithm that could be stopped at any point and spit out partial results.

6. The progression is the same as the triangular numbers.

7. Say, `B` vs `C`; `F` vs `G`; `C` vs `F` and `B` vs `G`.

8. I’m looking for a proper way to count or calculate *exactly* how many matches are needed to rank 2^{k} participants via elimination tournaments. There’s a few candidates in the OEIS that fit the first few entries, but I don’t know yet which one of those reflects this problem, if any one does.