ranking by pairwise comparison

published on 2019-02-01

You may think that ranking by pairwise comparison is a fancy way of describing sorting, and in a way you'd be right: sorting is exactly that. But what we intend to cover here is more general in two ways. Firstly, sorting presumes that comparisons between elements can be done cheaply and quickly on demand. For integers and strings of text, this is certainly true. But when trying to find the best performing variant of your product through A/B testing, comparison may take weeks! Secondly, sorting algorithms generally depend on comparisons being deterministic, or perfectly predictable. FC Barcelona is currently considered to be one of the most competent football clubs but that doesn't mean that they win every single match, not even against teams that are considered much weaker. To properly rank football teams by their skill level, we need a scheme that is robust to stochastic noise.

There are many ways to determine the order of things. Say we want to determine a league's best football club. Commonly, teams are matched up in a tournament, a system where each team plays every other team a fixed number of times. Based on the outcome of each match, the teams involved are awarded a number of points. At the end of the season we rank the teams on the sum of their awarded points. We have ordered them.

For football, and sports in general, this is a sensible approach. It's fair in the sense every team plays the exact same opponents as every other team. This also means that a team should be well-rounded in the sense that it must do well against most teams and can't rely on strategies to exploit a single other team. Perhaps most importantly, since football is a form of entertainment, the format allows predictable scheduling and a fairly large number of games for the fans to enjoy.

From a mathematical perspective however, the tournament approach seems suboptimal. Letting a bottom table regional team play against Barcelona is very unlikely to provide much new information, whereas a match between two teams that are nearly tied is much more likely to. Further, this setup is very restrictive: for countries' international teams, it is often infeasible to play on a fixed schedule since the team's players are often scattered all over the world. Such leagues therefore often use different ranking systems.

ELO rating

One system that grants more flexibility is the ELO rating system. A rating system is a special class of ranking systems that assigns points to players or teams (rating) and then orders them based on these points (ranking). The ELO system was initially developed by Arped Elo to more accurately track the skill level of chess players.

The ELO system allows for competitive environments where pairing isn't strictly predetermined. Where the fixed reward system of common leagues football would allow for easy gaming by only playing weaker teams, ELO awards points roughly inversely proportional to the expected win rate. When the ELO ratings are accurate, the expected outcome would be that neither team will gain or lose any points (in the long run). If a team's ELO rating is too low for their actual skill, it should increase over time.

The core assumption behind ELO is that a player's performance in each game is a normally distributed random variable (or a closely related distribution like the logistic distribution). Rating players is then effectively reduced to the statistical estimation of the mean of their personal distributions through comparisons to other players. This still leaves many freedoms: should we presume all players' variance to be equal? (ELO says yes) What factor should we use to scale shifts in rating?

Especially that last question has proven difficult to answer. Choose a so-called K-factor that is too high, and the ratings fluctuate too much and overrepresent recent performance. Set it too low and the ratings won't be able to keep up with changes in actual performance. This is a trade-off between stability and adaptability that all ranking systems must make. ELO tends to the safe side; a chess outsider without ELO rating beating Magnus Carlson in a tournament setting would instantly elevate that person to world-class status. However, their rating after such a tournament would not exceed that of a local club player.

Y ~ N(1880, 102)X ~ N(1700, 102)P(X > Y) ≈ 0.106

ELO ratings have the neat property of being easy to compute. This was more important in the 1960's, when it was first introduced in chess, than it is now, but it still adds to its elegance. In the example above, with K=32K = 32, player X's updated rating will be

1700+32(outcome0.106),1700 + 32 , (\text{outcome} - 0.106),

with outcome\text{outcome} being 1 for a win, 0.5 for a draw and 0 for a loss. That means that X stands to gain 28.6 points in case of a win or 12.6 for a draw and stands to lose only 3.4 points in the (likely) case of a loss.

eigenvector centrality

Another option for ordering things is by way of eigenvector centrality. Though it may sound intimidating, the idea behind it is relatively simple. Let's perform the following thought experiment: consider the end of a football league season. Instead of determining the ranking by awarding points for every win (or draw) and awarding a cup to the team with the most accumulated points, award the cup to a random team. Then, have that team pass on the cup to a random team that they lost to during the season. Have that team do the same, and then the next team, and so on. If we keep repeating this a very large number of times, we'd expect the teams that did well during the season to receive the cup relatively often since they regularly end up in the cup selection pool. Teams that did not win much aren't picked for the cup selection pool as frequently and are hence expected to receive the cup less often. We can then rank the teams on the proportion of time they held onto the cup during this process.

So why is this concept called eigenvector centrality? Centrality is a term in graph theory which measures the importance of nodes in a network. The league above can be considered as a network when we regard every team as a node, and add arcs from losing teams to winning teams. Centrality in this context is the fraction of time you expect to visit a node during an infinite random walk on the network — which is how we model the cup passing scheme above.

To make sense of the eigenvector part of the name, we turn to a special property of this so-called long-run distribution. It turns out that if you award the cup randomly to a team according to this distribution and then have that team pass on the cup one more time, the distribution does not change. Intuitively, this may make sense: if we've already passed on the cup infinitely many times, what's one more going to do? This can be expressed mathematically as follows: let rr be the long-run distribution expressed as a column vector of size nn, where nn is the number of teams. Let AA be the n×nn \times n transition matrix, which describes the probability of passing the cup from one team to another. For teams any two teams ii and jj, we set

Aij:=w(i,j)w(i),A_{ij} := \frac{w(i, j)}{w(i)},

with w(i,j)w(i, j) being the number of times team ii beat team jj and w(i)w(i) the number of times team ii won in total. Since this is only defined for teams that had wins, we disregard any teams violating this requirement and place them at the bottom of the table. Then there exists a unique scalar value λ\lambda such that

Ar=λr.A r = \lambda r.

In linear algebra, rr is known as an eigenvector of AA. The value λ\lambda is the corresponding eigenvalue.

Solving this equation provides an alternative method of computing the long-run distribution. Eigenvectors and their corresponding eigenvalues are generally not unique for a given matrix, but with the restriction that all its components are non-negative, which is required for it to be a probability distribution, a unique solution is guaranteed.

Note that this method of ranking is not just some mathematical curiosity, but is actually used extensively for many practical applications. In fact, Google was built on the PageRank algorithm, which is a very closely related cousin to the method described above. In PageRank, the graph represents not teams and their wins, but web pages and their links to other pages. One of the reasons it has worked so well for them (other than its computational feasibility) is that it generally only matters who links to your page, not what other pages your page links to. This makes sense for the web. Linking to unknown or unpopular pages should intuitively not detract from a page's score, whereas having a website like Wikipedia link to a page is a strong signal of quality or relevance. This same property probably makes it a bad algorithm for scoring football teams, because the strength of the teams you lose to is very relevant for determining your own strength.

poset methods

One drawback of all the previously discussed schemes is that they always work, even when they shouldn't. For example, consider two chess competitions using ELO ratings. One is held in Japan, the other in Belgium. After a few years of play, ratings for most players will have stabilized and an order will emerge within each competition. How would you then compare players from the Belgian competition and the Japanese competition? Even if their rating systems are exactly identical, there is no way of knowing how a 1600 ELO player from Belgium would stack up to a 1550 player from Japan. The numbers suggest that the Belgian player is stronger, but without games between players of the two competitions, this is absolutely unknowable a priori.

To remedy this, let us finally examine a ranking scheme that does not involve computing ratings. As before, let's imagine a network of nodes with arcs from every 'loser' to every 'winner'. We then consider a node xx better than node yy if there exists a path of nodes (x,a1,a2,,an,y)(x, a_1, a_2, \dots, a_n, y) in this graph. In other words: xx ranks above yy if and only if xx beats a1a_1, a1a_1 beats a2a_2, and so on until finally ana_n beats yy.

Such a ranking scheme has some special properties. For example, if a collection is completely ordered in this way, a new element can immediately jump to the top of the ranking by beating the current top element. Another, more important, property is that this generally does not generate a complete ranking, or total order. Only pairs of elements for which there exists an uninterrupted chain of wins can be compared. This explicitly prevents comparing Belgium and Japanese players in the scenario above, since they live in disconnected components of the graph. We say that this algorithm produces a preorder.

Note that for arbitrary graphs, it is possible for an element to be both better and worse than another element. In the game of rock-paper-scissors for example, rock beats scissors, scissors beats paper and paper beats rock. Now there is a path from every node to every other node! How would you rank them in this case? We cannot. Our implicit assumption of the transitive property (a a beats bb, bb beats cc implies that aa beats cc) is violated. Every other ranking system will struggle with this example though. Again, they will produce a result, but it will be without value since they, too, rely on the transitive property. We call preorders which are free of such inconsistencies partially ordered sets, or posets for short.

The example of rock-paper-scissors may be slightly pathological, but it does highlight a deficiency. Inconsistencies like that may show up even when the transitive property generally should hold true, like we assume for most sports. It is entirely possible for a weaker player to defeat a stronger player due to chance for example. Even one such occurrence would bring the above algorithm to a halt because even though there may be many paths from aa to bb, even one path from bb to aa will rank it higher. A node should not be both ranked below and above another node. The algorithm cannot cope with conflicting evidence.

All is not lost, however. We simply have to remove conflicting evidence from the graph, thereby upgrading our preorder to a poset. What does that look like in the context of a graph? By our definition above, conflicts are pairs of nodes (a,b)(a, b) for which there exists both a path from aa to bb and a path from bb to aa. This is called a cycle in graph theory. So all we have to do is remove any cycles from the graph!

Removing all cycles can be done trivially by removing all arcs from the graph, but we'd like to be as non-destructive as possible in order to preserve as much information as possible. In other words, we try to minimize the number of arcs to be deleted. This is known as the minimum feedback arc set problem. Unfortunately, this is in the class of NP-hard problems, which means that solving it optimally can become absolutely intractable for larger graphs. There are however a number of heuristics which provide very good (but possibly imperfect) solutions while maintaining computational feasibility.


It's possible to further restrict what elements can be considered comparable by demanding a minimum number of pieces of supporting evidence. In graph theoretic terms this is the minimum number of arcs that should be removed from the graph until you can no longer reach one node from the other. This is called their edge connectivity number.

As a final note on the poset scheme: it is possible to create a total order from a partial order that respects the original ordering in the sense that if aa ranks above bb in the partial order, aa will also rank above bb in the total order. This is ordinarily done by topological sorts. We lose any information on what elements can actually be compared though.


The methods above are just a few popular ways of producing rankings through pairwise comparisons. Choosing between them should depend on a number of things. When simplicity and predictability is key, a sports league style scheme is your best bet. If your comparisons are unpredictable or uncontrollable but a full ranking is required, ELO type systems will probably work well. The centrality and poset methods are appropriate when your problem is slightly different. Centrality measures capture asymmetry between 'wins' and 'losses' well, and the poset methods yield more control over what elements are actually comparable.

If there's one thing all methods have one thing in common, it's their boundless variety and depth. We've only scratched the surface here. There's so much more to explore.