As in several recent Eurovision Song Contest finals, this year’s competition in Rotterdam ended with a cliff-hanger, with the result uncertain right up until the last few votes were revealed. The Italian group Måneskin finally triumphed with their song Zitti E Buoni. In this article I will discuss why the Eurovision voting system makes this kind of last-minute uncertainty very likely.
The voting this year followed the same pattern as recent years. 26 songs made it to the Grand Final in Rotterdam on 22 May 2021, either by getting through the semi-final stages, or by having an automatic place in the final (as the host country or one of the “Big Five” countries that contribute most to the contest’s funding). After all the acts have performed, the 39 voting countries decide on their scores. Each country has a jury of five people, which allocates points of 12, 10, 8, 7, 6, 5, 4, 3, 2, 1 to its top ten songs. In addition, each country holds a public telephone vote, producing another allocation of 12, 10, …, 1 points to the public’s top ten songs. A country cannot give any points to its own song. The jury results are revealed first, country by country. Finally, the total public votes for each song are revealed, in order of the total jury scores (lowest to highest). The stakes increase as the fates of the top few songs are revealed – and often changed – by the announcement of the last few public votes.
Each of the 39 countries thus has 58 jury points and 58 public points to allocate, giving a total of 2 × 2,262 = 4,524 points to be shared between the 26 songs, or an average of 174 points (87 jury plus 87 public) per song. Full data on the rankings for each individual jury member, the jury overall, and the public votes are available on the Eurovision website.
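For anyone who wants to check the arithmetic, here is a small Python sketch (my own illustration, nothing official) of how a country’s ranking is converted into points, together with the totals above.

```python
# Eurovision points awarded to a country's top ten songs, best first.
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]

def points_from_ranking(ranking):
    """Map a ranking (best song first) to a {song: points} dict.
    Songs ranked 11th or lower receive nothing."""
    return dict(zip(ranking, POINTS))

songs = [chr(ord('A') + i) for i in range(26)]      # stand-in names for the 26 finalists

print(sum(POINTS))                 # 58 points per jury (or public) allocation
print(2 * 39 * sum(POINTS))        # 4524 = 2 x 2262 points in total
print(2 * 39 * sum(POINTS) / 26)   # 174.0 points per song on average

print(points_from_ranking(songs))  # a hypothetical ranking: A gets 12, B gets 10, ..., J gets 1
```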
The following chart shows the total jury and public points for each song, in order of the final result.
The order of the top few songs was quite different between the jury and public votes. Italy was fourth after the jury votes (in red), behind Switzerland, France and Malta. The public (in blue) were not keen on Malta and Switzerland, and instead had Ukraine and Finland in their top four. But they really liked Italy, and it was the 318 points from the public that won Måneskin the competition in the final stages of the show.
Look a little more closely, and you see that the public scores are more polarised than the jury scores. The juries gave points to every song apart from the UK, with a reasonable spread of points, and a high score of 267. The public, in contrast, were more extreme. Four songs got no points, and the highest was 318. Songs at the top of the table tended to get more public points than jury points, whereas those at the bottom tended to get fewer.
This pattern applies not only to points, but also to the overall rankings of the songs. The following chart shows that the average ranking by juries was more tightly bunched around the mean than the average public rankings. The public appeared to be more inclined to reward their favourites and punish the songs they did not like.
A partial explanation for this is that there was higher correlation between countries in the public rankings (and scores) than there was between the juries. The chart below plots the distribution of all of the pairwise between-country correlation coefficients of the rankings given by juries and the public. Both sets are largely positive (i.e. they tend to agree on which songs are better and worse), but the public correlations are noticeably higher, peaking at around 65% correlation. The juries are more independent of each other.
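For readers who want to build this kind of chart themselves, the sketch below computes all the pairwise Spearman rank correlations between countries. It uses simulated stand-in scores rather than the real Rotterdam data, which can be downloaded from the Eurovision website.

```python
import numpy as np
from scipy.stats import spearmanr

# Stand-in data: each country scores the 26 songs as a shared underlying
# "appeal" plus country-specific noise. These are NOT the real Rotterdam
# rankings -- they only show how the chart can be built.
rng = np.random.default_rng(0)
n_countries, n_songs = 39, 26
appeal = np.arange(n_songs, 0, -1, dtype=float)               # song A most appealing
scores = appeal + rng.normal(0.0, 6.0, size=(n_countries, n_songs))

# All pairwise between-country Spearman rank correlations (39*38/2 = 741 pairs).
corrs = [spearmanr(scores[i], scores[j])[0]
         for i in range(n_countries) for j in range(i + 1, n_countries)]

print(f"{len(corrs)} pairs, median correlation {np.median(corrs):.2f}")
```

A histogram of `corrs` gives the distribution plotted in the chart; repeating the exercise with noisier (more independent) scores shifts the whole distribution towards zero.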
More fundamental than this difference in correlations, however, is the voting system itself, and what the jury and public points are actually measuring. The two groups are asked to do different things. Jury members are asked to rank all songs against each other. The five jurors’ rankings are combined to give an overall ranking for the jury, which is translated into points for the top ten songs. The public are not asked to rank all the songs; they simply choose their favourite. The public rankings (and points) are based on how many people have each song as their favourite.
To see why these are different, consider the second-favourite songs. A jury will give the second-favourite song 10 points. The public will not vote for their second-favourite song at all. Instead, the 10 points from the public will go to the song that is the favourite of the second largest number of voters – which is not necessarily the same one!
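A toy example, with entirely made-up numbers for a single country, makes the difference concrete:

```python
from collections import Counter

# Made-up numbers for one country, purely to illustrate the mechanism.
EUROVISION_POINTS = [12, 10, 8, 7]        # points for the top four, for brevity

# The jury ranks the songs outright (best first).
jury_ranking = ["Song A", "Song B", "Song C", "Song D"]
jury_points = dict(zip(jury_ranking, EUROVISION_POINTS))

# Each member of the public names only their single favourite song.
public_favourites = ["Song A"] * 40 + ["Song C"] * 35 + ["Song B"] * 15 + ["Song D"] * 10
tally = Counter(public_favourites).most_common()
public_points = {song: pts for (song, _), pts in zip(tally, EUROVISION_POINTS)}

print(jury_points)    # {'Song A': 12, 'Song B': 10, 'Song C': 8, 'Song D': 7}
print(public_points)  # {'Song A': 12, 'Song C': 10, 'Song B': 8, 'Song D': 7}
# The jury's second favourite (Song B) gets 10 jury points, but the public's
# 10 points go to Song C: the song that is the favourite of the second
# largest number of voters.
```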
Of course, this is just one competition, although similar scenarios have occurred in previous years, and it would be possible to analyse the data for those too. An alternative strategy is to try some simulation, which is quite easy to do with a bit of programming.
Let’s assume that there are 26 songs, called A, B, C, …, Z, and 39 countries, each with five jury members and a population of 1,000 public voters. Let’s also assume that every public voter and jury member, in whichever country, favours song A with probability 26x, song B with probability 25x, …, down to song Z with probability 1x (where x is a constant chosen so that the probabilities add up to 1, in this case 1/351).
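In code, this assumption is little more than a direct translation:

```python
import string

# Assumed favourite-song probabilities: 26x for A, 25x for B, ..., 1x for Z,
# with x chosen so that everything sums to 1.
songs = list(string.ascii_uppercase)            # 'A' .. 'Z'
weights = list(range(26, 0, -1))                # 26, 25, ..., 1
x = 1 / sum(weights)                            # 1/351
probs = {s: w * x for s, w in zip(songs, weights)}

print(f"x = 1/{sum(weights)}")                           # x = 1/351
print(f"P(A) = {probs['A']:.4f}, P(Z) = {probs['Z']:.4f}")
print(f"total = {sum(probs.values()):.4f}")              # 1.0000
```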
For any individual, there are several songs that stand a good chance of being their favourite (the difference between the first five songs, for example, is only 26x to 22x, so less than a 20% difference in likelihood). Individual jurors will have their own rankings, and across five jury members the bias towards A will tend to result in it getting slightly better scores than the other songs on average. Repeating this across the 39 juries, song A will probably do well, but the gap between it and the other top songs will not be great.
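To make this concrete, here is a sketch of a single jury under that assumption. Two modelling choices are mine rather than anything specified above: each juror’s full ranking is drawn by repeatedly picking a favourite from the songs not yet ranked, in proportion to their weights; and the five jurors are combined by total rank position, a simple Borda-style merge rather than the contest’s actual combination rule.

```python
import numpy as np

rng = np.random.default_rng(1)
n_songs = 26
weights = np.arange(n_songs, 0, -1, dtype=float)     # song A .. Z, weights 26 .. 1

def juror_ranking(rng, weights):
    """One juror's full ranking (best first): repeatedly pick a favourite from
    the songs not yet ranked, with probability proportional to its weight."""
    w = weights.copy()
    order = []
    for _ in range(len(w)):
        pick = rng.choice(len(w), p=w / w.sum())
        order.append(int(pick))
        w[pick] = 0.0                                 # cannot be picked again
    return order

# Combine the five jurors by total rank position (lower is better).
positions = np.zeros(n_songs)
for _ in range(5):
    for pos, song in enumerate(juror_ranking(rng, weights)):
        positions[song] += pos

jury_top_ten = np.argsort(positions)[:10]
print([chr(ord('A') + int(s)) for s in jury_top_ten])   # this jury's top ten
```

Running this a few times shows the point: A usually appears near the top of the jury’s list, but rarely by a wide margin.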
For a population of 1,000, however, we would expect A to be the favourite song of 1000*26/351 = 74 people, and B the favourite of 71. The number of voters favouring each song is roughly binomial, with a standard deviation of around 8, so there is about a 95% chance that A will be the favourite of 58-91 people, and B of 55-88. These ranges overlap considerably, so B still has a chance. However, it depends very much on the size of the voting population: with a population of 1,000,000, we would expect A to be favoured by 74,074 people, and B by 71,225, each plus or minus a little over 500, so now the ranges would not overlap. So with very large populations, A is almost certain to get the 12 public points from every country.1
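These figures are just the binomial mean and standard deviation, which a few lines of Python will confirm:

```python
from math import sqrt

# Binomial mean and standard deviation for the number of voters whose
# favourite is song A (p = 26/351) or song B (p = 25/351).
for n in (1_000, 1_000_000):
    print(f"population of {n:,}:")
    for song, w in (("A", 26), ("B", 25)):
        p = w / 351
        mean = n * p
        sd = sqrt(n * p * (1 - p))
        print(f"  {song}: mean {mean:,.0f}, sd {sd:,.1f}, "
              f"approx 95% range {mean - 2 * sd:,.0f} to {mean + 2 * sd:,.0f}")
```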
Running 1,000 simulations, I found that A won the public vote 90% of the time, with B taking the remaining 10% (apart from one simulation won by C). In the jury vote, A won 52% of the time, B 28%, C 14%, D 4%, E 2%, with even an occasional victory for F, G or H. Because the public vote is less evenly spread, the points are more concentrated on the favourite acts, so it tends to dominate the final results, with A winning overall 83% of the time, B 16% and C 1%. Interestingly, in only 50% of simulations did the jury and public agree on the winner. 32% of the time, the public overturned the jury vote – the opposite happened in just 15% of cases. In 3% of the simulations, the overall winner was neither the public’s nor the jury’s favourite!
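The simulation is straightforward to reconstruct. The sketch below is my own compact version of the setup described above (not the original code), using the same assumptions as the single-jury example: juror rankings are drawn favourite-first in proportion to the songs’ weights, and the five jurors are combined by total rank position.

```python
import numpy as np

# Reconstruction of the simulation: 26 songs, 39 countries, 5 jurors and
# 1,000 public voters per country, preference probabilities 26/351 ... 1/351.
rng = np.random.default_rng(2021)

N_SONGS, N_COUNTRIES, N_JURORS, N_PUBLIC, N_SIMS = 26, 39, 5, 1_000, 1_000
WEIGHTS = np.arange(N_SONGS, 0, -1, dtype=float)      # song A .. Z
PROBS = WEIGHTS / WEIGHTS.sum()
POINTS = np.array([12, 10, 8, 7, 6, 5, 4, 3, 2, 1])

def top_ten_points(scores):
    """Award 12, 10, 8, ..., 1 points to the ten songs with the best
    (lowest) scores."""
    pts = np.zeros(N_SONGS)
    pts[np.argsort(scores)[:10]] = POINTS
    return pts

def one_country(rng):
    """Points from one country: a 5-member jury ranking all songs, and
    1,000 public voters each naming a single favourite."""
    # Juror rankings drawn via an 'exponential race', which is equivalent to
    # picking the favourite first, then the next favourite from the rest, and
    # so on, each time in proportion to the songs' weights.
    race_times = rng.exponential(size=(N_JURORS, N_SONGS)) / WEIGHTS
    juror_positions = race_times.argsort(axis=1).argsort(axis=1)   # 0 = best
    jury_pts = top_ten_points(juror_positions.sum(axis=0))         # combine by total rank

    # Public voters: a multinomial count of favourites per song.
    counts = rng.multinomial(N_PUBLIC, PROBS)
    public_pts = top_ten_points(-counts)                           # most favourites = best
    return jury_pts, public_pts

jury_wins = np.zeros(N_SONGS, dtype=int)
public_wins = np.zeros(N_SONGS, dtype=int)
overall_wins = np.zeros(N_SONGS, dtype=int)

for _ in range(N_SIMS):
    jury_total, public_total = np.zeros(N_SONGS), np.zeros(N_SONGS)
    for _ in range(N_COUNTRIES):
        j, p = one_country(rng)
        jury_total += j
        public_total += p
    jury_wins[jury_total.argmax()] += 1
    public_wins[public_total.argmax()] += 1
    overall_wins[(jury_total + public_total).argmax()] += 1

for name, wins in (("jury", jury_wins), ("public", public_wins), ("overall", overall_wins)):
    top3 = wins.argsort()[::-1][:3]
    print(name, {chr(ord('A') + int(s)): f"{wins[s] / N_SIMS:.0%}" for s in top3})
```

The exact percentages will depend on the random seed and on how the juror rankings are modelled and combined, so a rerun will not reproduce the figures above exactly, but the overall pattern, with the public vote concentrating far more heavily on A than the jury vote does, should come through clearly.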
The chart below shows the total jury and public scores for each song in each of the 1,000 simulations, with the winners’ scores in purple. As we expect, the public score is more generous to the top songs than the jury score, and less generous to the bottom ones. The winner typically receives 200-300 jury points, but 350-400 public points, and there are many songs that get “nul points” from the public (about three times as many as get zero from the juries). Remember, these differences are entirely due to the details of the voting system – we have assumed that the voting public and the jury members are statistically all the same, across all countries.
Of course, in the real world, things are a little more complicated. In particular, the voting populations are probably larger than 1,000, and they certainly vary between countries, perhaps by several orders of magnitude. Countries will also have different probabilities assigned to the different songs – not all of them will tend towards A, for example, and not all in the same way or to the same extent (i.e. the set of probabilities will be different for each country). In addition, there may well be genuine differences between the tastes of the general voting population and the five experts or celebrities chosen to form the jury. All of these factors will affect how the competition develops in practice, but they will not offset the structural effect demonstrated above.
Remember the chart above showing the greater correlation at Rotterdam 2021 between countries in their public scores than in their jury scores? Some, perhaps all, of this difference will also be due to the structure of the voting system. We have demonstrated that juries spread their votes more widely than the public, and therefore they will be less correlated between countries. In practice, there may be other cultural or geographical correlations between countries, but the voting system itself will make the jury scores less correlated than the public scores, even if, statistically, they all think alike.
Eurovision is not alone in combining public and jury scores – many competitions do something similar, and they often compare different things, such as a jury’s ranking of all of the acts versus the public’s choice of favourite. In most cases, however, there is just one jury and one public vote – what makes the Eurovision scoring particularly exciting is the combination of 39 separate juries and popular votes, which amplifies the statistical effects considerably.