1 Introduction

The Eurovision Song Contest (ES Contest in what follows) is a popular international song competition featuring participants representing, essentially, European countries.Footnote 1 With the exception of 2020, it has been held annually since 1956. The idea is simple: Each participating country submits an original song to be performed and broadcasted live. Competing countries cast votes to determine the ranking of all candidates.

The voting rules changed 18 times between 1956 and 2018.Footnote 2 The 65th contest (which will be the focus of our analysis in this paper) took place in Rotterdam in May 2021. Thirty-nine countries presented candidate songs.Footnote 3 Thirteen were eliminated during the semifinals.Footnote 4 We mainly discuss the voting rules and the results of the final in which candidates “face” two very different types of juries: experts and tele-voters.

In both the semifinals and the final, professional juries of five experts in each country rank all songs, with the exception of the one presented by their own country. Abstentions are not allowed, and it is forbidden to award the same rank to two different songs. In each country, the song receiving the highest number of votes is ranked first, the song receiving the second highest number of votes is ranked second and so on. Only the first ten collect points: 12 points for the top song, 10 for the second, and 8 down to 1 for the eight remaining songs; all other songs get zero. The rules state that members of each national jury must rank all songs and that “the combined rank of each country’s jury members determines the jury result of that particular country,” but no details are given as to what is meant by combined rank. The literature on social choice discussed in Sect. 2 shows that this is far from being a trivial issue.

Tele-viewers do not vote in the semifinals, but participate in the final via the official app, telephone and/or SMS. Each tele-voter can vote for any finalist (except for the one representing her own country of residence), but voters seem to have the possibility to vote as many times as they wish during the 15 minutes between the last song and the moment at which final results become public. Once votes are cast, they are added to produce a ranking for each country, in a similar way to what happens with the juries of experts. The final ranking is obtained by simply adding the number of points given by experts and tele-voters.

Classical contributions in social choice explain the above voting procedures. We, nevertheless, endorse a different voting method based on the Shapley Value (Shapley, 1953), a classical contribution in game theory that provides a natural way to allocate the total surplus generated by the coalition of all players involved in a joint venture, or a cooperative game, based on the marginal contributions players produce. Beyond the normative foundations for Shapley Voting we offer, we also emphasize an important advantage from a practical viewpoint: It is easier and faster than the current protocol in the ES Contest. To wit, under the current rules in the ES Contest, experts and tele-voters have a short period of time to decide on their choices (points, rankings or votes) after having listened to all 26 singers. There is a vast literature on speed–accuracy trade-offs documenting that people are more likely to make mistakes under time pressure, or that they regret their ranking when forced to do it quickly; see, for instance, Kocher & Sutter (2006), Milosavljevic et al. (2010), Fehr & Rangel (2011) or Heitz (2014). Thus, simplifying the process is a worthy enterprise. As we shall argue later, the Shapley Voting method would help to do so. The method would imply that judges simply say “yes” or “no” to each option, without the need to impose a ranking.

We also consider some biases that may have an effect on the outcomes of the ES Contest. First, there is literature on how alternative framings of information may influence final rankings in a positive or negative way (Flores & Ginsburgh, 1996; Glejser & Heyndels, 2001). We explore whether they may also play a role in the ES Contest, concentrating on the so-called opening advantage, as well as contrast effects. More precisely, we show that being surrounded by bad performers may enhance one’s own performance or the perception of those who have to judge the performers. If a song is performed among objectively poor performances, its quality might be perceived as higher than it objectively is. This is connected to observations made on status-seeking behavior and (relative) performance rankings in flat-wage environments (Charness et al., 2014).

Small world networks (Watts & Strogatz, 1998) abound everywhere. They are regular networks “rewired” to introduce increasing amounts of disorder, which can be highly clustered. We find that clustering effects can also be observed in the ES Contest. More precisely, we observe that obvious pairwise clusters (such as Cyprus and Greece) or larger clusters (such as Scandinavian countries) exhibit a more collusive voting pattern than generic countries. In other words, reciprocity seems to be a non-written norm for the scoring mechanism of countries within the clusters.

We conclude by addressing the long-standing debate of whether experts are better judges than average citizens (tele-voters in this case), resorting to their respective predictive power.

The paper is organized as follows. In Sect. 2, we discuss and formalize the voting rules used in the 2021 ES Contest and their results, as well as alternative voting rules and the comparison of their results with those that were officially announced. Section 3 explores the remaining issues reflecting biases, ranging from contrast effects to clustering effects and expert versus tele-voting. Section 4 concludes.

2 Voting rules

2.1 Classical voting rules

Classical voting rules are often discussed in terms of their theoretical underpinnings or axioms.Footnote 5 Results obtained from one rule might well differ from those obtained from another. Therefore, the rule must be carefully chosen. The ES Contest’s ranking is the outcome of several intermediate steps in which classical voting rules appear, but it is difficult, or even impossible, to judge the final tally in terms of axioms. Thus, small changes in one or the other step may end up with undesired or unplanned consequences.

Arrow’s seminal work makes it clear that there is no perfect voting method, and some trade-offs in the choice of the method are unavoidable. In a two-candidate situation, ordinary majority voting is unambiguously the fairest method.Footnote 6 Nevertheless, if three candidates or more are at stake, ambiguity arises, as we argue in what follows.

Plurality voting, in which each voter selects exactly one candidate and the candidate receiving the largest number of votes wins, comes to mind as a natural extension of majority voting. More generally, a scoring method (Young, 1975) is defined by the choice of a sequence of scores \(s_{1},s_{2},\ldots ,s_{n}\): A candidate scores \(s_{k}\) points for each voter who ranks her in the kth place; the candidate (or candidates) with the highest total score wins (or win). The scores are nonincreasing with respect to ranks, i.e., \(s_{1}\ge s_{2}\ge \cdots \ge s_{n}\).Footnote 7 Plurality corresponds to the scores \(s_{1}=1\), \(s_{k}=0\) for \(k=2,\ldots ,n\). Thus, it reflects only the distribution of the “top” candidates and fails to take into account the entire preference relation of the voters. The so-called Borda rule (de Borda, 1781) addresses this flaw by endorsing the homogeneous (linear) scores \(s_{k}=n-k\) for \(k=1,\ldots ,n\). That is, Borda scores are derived automatically from the rank. This is a very natural protocol. Nevertheless, many institutions endorse other heterogeneous (nonlinear) methods in converting rankings to scores. For instance, Formula One recently moved from a Borda scheme to another scoring scheme in which the first (in a race) gets 25 points, the second 18, the third 15 and the followers 12, 10, 8, 6, 4, 2 and 1, respectively. The underlying logic is that there is additional merit in being best of all. This nonlinearity is also visually apparent in athletics, or the Olympic Games in general, where the podium height difference between first and second is greater than the difference between second and third.

Heterogeneous scoring methods may have flaws too. An interesting example is the one provided by Ashenfelter & Quandt (1999) who considered a case that became famous, as it may have changed the world of wines in 1976, at a time in which American wines were not well known. In the so-called Judgment of Paris, in which 9 judges gave scores between 0 and 20 to ten red wines (four American and six French), an American wine, the 1973 Stag’s Leap Wine Cellars S.L.V. Cabernet Sauvignon, was ranked first, overtaking very famous Bordeaux wines such as Château Mouton-Rothschild, 1970; Château Haut-Brion, 1970; and Château Montrose, 1970.Footnote 8 Ashenfelter & Quandt (1999) and Hulkower (2009) argued that using the Borda rule (a homogeneous scoring method, instead of the heterogeneous one being used at the Judgment of Paris) the 1973 Stag’s Leap Wine Cellars S.L.V. Cabernet Sauvignon no longer dominates all Bordeaux wines. This would have saved the honor of French wines, though it was too late. But it may also be the case that if voters knew that they had to rank the wines instead of grading them using points, the final result would have been different from the two methods just discussed.
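
The scoring methods described above can be illustrated with a minimal sketch. The function names and the seven-voter profile are hypothetical and chosen only to show that plurality and Borda may disagree on the same ballots; the Eurovision and Formula One score vectors are the ones quoted in the text.

```python
from collections import defaultdict

# Score vectors quoted in the text (ranks beyond the tenth earn zero).
EUROVISION = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]
FORMULA_ONE = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]


def scoring_rule_totals(rankings, scores):
    """Total score per candidate under a positional scoring method.

    rankings: list of ballots, each a list of candidates from best to worst.
    scores:   nonincreasing list s_1, s_2, ...; later ranks earn 0.
    """
    totals = defaultdict(int)
    for ballot in rankings:
        for place, candidate in enumerate(ballot):
            totals[candidate] += scores[place] if place < len(scores) else 0
    return dict(totals)


def winners(totals):
    """Candidates with the highest total (several in case of a tie)."""
    best = max(totals.values())
    return sorted(c for c, t in totals.items() if t == best)


# Hypothetical profile: three candidates, seven voters.
ballots = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2 + [["c", "b", "a"]] * 2
n = 3
plurality = [1] + [0] * (n - 1)           # s_1 = 1, s_k = 0 otherwise
borda = [n - k for k in range(1, n + 1)]  # s_k = n - k

plurality_totals = scoring_rule_totals(ballots, plurality)
borda_totals = scoring_rule_totals(ballots, borda)
print(winners(plurality_totals), winners(borda_totals))
```

In this made-up profile, candidate a has the most first places, while candidate b collects the highest Borda total because no voter ranks b last.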

Finally, Approval Voting (Brams & Fishburn, 1978) is another voting method in modern social choice theory (currently in practice in some US local elections, as well as to elect officers in numerous professional organizations). It allows each voter to cast a vote for as many candidates as she wishes; each positive vote is counted in favor of the candidate. The votes are then added candidate by candidate, and the winner is the one who gets the largest number of votes. Under plausible assumptions, approval voting compares favorably with both the plurality rule and Borda’s rule (Weber, 1995). For instance, in the classical three-candidate setting, if two similar candidates share the support of a majority of the voters, a candidate preferred only by a minority of the electorate will never emerge as the clear victor with approval voting. Similarly, in the absence of polling data, when voters can be assumed to vote sincerely, approval voting is more effective in leading to an election outcome that well represents the preferences of the electorate. Only under approval voting do all of the equilibria involve every voter casting a ballot on which the votes for each candidate decrease monotonically with the utility derived by the voter from each candidate’s election.

2.2 The Shapley voting rule

The grand final of the ES Contest unfolds during one evening. In 2021, this took about two and a half hours. After the last song, tele-voters have 15 minutes to tally up their points. This is followed by the announcement of points given by each country’s experts, which is itself followed by the announcement of points given by tele-voters.Footnote 9

As mentioned in the Introduction, there is a vast literature on speed–accuracy trade-offs documenting that people are more likely to make mistakes under time pressure, or that they regret their ranking when forced to do it quickly. Expressing a yes or a no for each candidate instead of rating would be both much easier and faster. Now, in approval voting discussed above, a judge who chooses to vote for a large group of candidates is exercising more political or strategic influence than the one who chooses to vote for one candidate only. A natural suggestion is that each judge should receive a unique vote that can be distributed over candidates. If a certain judge chooses several candidates, each of them receives an equal fraction of her unit of voting. The argument for equal sharing of votes is that a judge votes for a group of candidates without expressing preferences over the members of the group.

Formally, the ballot works as follows. Each judge receives one vote that she can equally divide among a subgroup of size k, \(0 \le k \le n\) of the n possible candidates. Each chosen candidate gets a fraction 1/k of this unique vote, while the others get 0.Footnote 10 A judge who chooses \(k = 0\) does not count. These fractions of votes are then added candidate by candidate. The winner is the candidate who collects the largest number of fractional votes, but all other candidates can be ranked as well. This is more straightforward than ranking or scoring, as judges are simply required to choose the candidates that they like instead of having to rate or rank them.

The method presented above is the so-called Shapley Voting/Ranking, which we endorse here.Footnote 11 It owes its name to the well-known Shapley value (Shapley, 1953) in cooperative game theory, which provides a natural way to allocate the total surplus generated by the coalition of all players involved in a joint venture (a cooperative game), based on the marginal contributions players produce. In our voting context, the players are the candidates. Given a ballot profile, we can define a (cooperative) game associated with it whose characteristic function assigns to each subset S of candidates the number of electors whose approval set is included in S. A solution to this game provides a ranking of candidates by specifying for each of them a score equal to a fraction of the total number of electors. Ginsburgh & Zang (2012) prove that the Shapley value of a game so defined coincides with the voting procedure defined in the previous paragraph. In other words, the “amount of votes” (hereafter AVs) associated with each candidate in the Shapley voting rule yields a measure for its overall contribution (or quality, or weight). Some of the competing candidates likely to be of “better quality” are therefore chosen more often by judges and accumulate larger AVs. In the same way, groups containing substitute candidates are likely to be penalized, while those containing unique complements are likely to be valued by judges and compensated through their overall ranking. As stated by Dehez & Ginsburgh (2020), Shapley Voting is characterized by the following set of weak and natural properties:

  • Efficiency. The total AV, cast by all judges, is fully distributed among the participating candidates,

  • Null Candidate. Candidates appearing on no ballot get a zero AV,

  • Anonymity. If candidates’ names are permuted, AVs are permuted accordingly,

  • Additivity. The AV associated with a sum of ballot profiles on a common set of candidates is equal to the sum of the AVs associated with each ballot profile.
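
One way to see why equal splitting coincides with the Shapley value of the game described above is sketched below. The notation \(B_j\) (the nonempty approval set of judge j), \(u_{B}\) (a unanimity game) and \(\phi_k\) (the Shapley value of candidate k) is introduced here for illustration only and is not taken from the original text. The characteristic function can be written as a sum of unanimity games,

\[ v(S) = \#\{j : B_j \subseteq S\} = \sum_{j} u_{B_j}(S), \qquad u_{B}(S) = \begin{cases} 1 & \text{if } B \subseteq S,\\ 0 & \text{otherwise.} \end{cases} \]

In a unanimity game, symmetry, the null player property and efficiency give \(\phi_k(u_B) = 1/|B|\) if \(k \in B\) and \(\phi_k(u_B) = 0\) otherwise, so that additivity yields

\[ \phi_k(v) = \sum_{j:\, k \in B_j} \frac{1}{|B_j|}. \]

Under these assumptions, the right-hand side is the sum of the vote fractions that candidate k receives from the judges who approve her, and summing over all candidates returns the number of judges who cast a nonempty ballot.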

Table 1a gives an example with 5 judges and n = 10 candidates. Judge A, for example, chooses 5 candidates (1, 2, 3, 6 and 7), which implies that each of them gets 1/5 of her unique vote. Judge B chooses only two candidates (1 and 3), so that each of them gets 1/2. Column “Total” shows that candidate 1 ends up with \(1/5+1/2+1/3 \approx 1.03\), while candidate 2 gets \(1/5+1/4 = 0.45\), and so on.
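
The ballot described above can be sketched in a few lines. Only the ballots of judges A and B are taken from the text; those of judges C, D and E are hypothetical and were chosen solely to be consistent with the totals quoted for candidates 1 and 2, so the remaining totals need not match Table 1a. The function name is also an assumption.

```python
from fractions import Fraction


def shapley_votes(ballots, candidates):
    """Amount of votes (AV) per candidate.

    Each judge splits one vote equally among the candidates she approves;
    a judge who approves nobody (k = 0) does not count.
    """
    av = {c: Fraction(0) for c in candidates}
    for approved in ballots.values():
        if not approved:
            continue
        share = Fraction(1, len(approved))
        for c in approved:
            av[c] += share
    return av


candidates = range(1, 11)
ballots = {
    "A": {1, 2, 3, 6, 7},  # from the text
    "B": {1, 3},           # from the text
    "C": {1, 5, 9},        # hypothetical
    "D": {2, 4, 8, 10},    # hypothetical
    "E": {5},              # hypothetical
}

av = shapley_votes(ballots, candidates)
# Rank candidates by decreasing AV, breaking ties by candidate number.
ranking = sorted(av, key=lambda c: (-av[c], c))
print(av[1], av[2], ranking[0])
```

Exact fractions are used so that ties are detected without rounding issues; the total AV equals the number of judges with a nonempty ballot, in line with the efficiency property.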

In what follows, we try to simulate whether using Shapley ranking would have led to results that are close to (or very different from) the final observed ranking of the contest. The problem is that, as we have no information on the number of candidates each expert or tele-voter would have chosen, we have to make the restrictive assumption that each judge gets the same number of choices, say k, where \(0 \le k \le n\). Table 1b gives an example in which each of the five judges votes for \(k = 5\) candidates. Judge A chooses candidates 1, 4, 5, 7 and 9, judge B chooses 2, 4, 5, 6 and 8, and so on. The outcome in column “Total” is that candidates 1, 2, 4, 6, 7 and 10 are the winners with 3/5 points, candidates 3, 5 and 8 follow with 2/5 points and candidate 9 with 1/5 points is the last. It is obvious that this method may produce ties if the numbers of candidates and judges are small, as is the case here. But this is not so for the ES Contest, in which 26 songs are judged by the experts and tele-voters of 39 countries.

Table 1 Two Examples of Shapley Voting

2.3 Shapley voting and Eurovision

Though there are two stages in the ES Contest (semifinals and final), we concentrate mostly on the final which is judged by professional juries (also called experts in what follows) and tele-voters.

The professional jury from each country \(i = 1, 2,\ldots , 39\) is composed of five experts \(j = 1, \ldots , 5.\) Each expert j of country i has to rank the 26 songs \(k=1,\dots , 26\) admitted to the final. Let \(a_{ij}\) be the ranking given by judge j from country i. The combined rank of the five experts of country i determines its ranking \(b_{i}\). Formally, \(b_i=f(a_{i1},\dots ,a_{i5})\), but the function f is not specified in the rules.Footnote 12 The combined ranking is converted using the 12, 10, 8, \(\dots\), 1 scheme that leads to a scoring vector associated with country i, and denoted by \(e_{i}=(e_{i1},\ldots ,e_{i26})\). The expert score \(s_{k}\) of each song k is obtained by aggregating those scores across countries, that is, \(s_{k} = \sum _i e_{ik}\).

Tele-voting of country i is somewhat different, as there is no fixed number of votes. Let \(d_i=(d_{i1},\ldots ,d_{i26})\) be the vector of tele-votes cast by its citizens.Footnote 13 This aggregation is not explicitly detailed in the rules, but is probably obtained by adding the votes for each song in country i. The elements of vectors \(d_i\) are transformed into scores using the same 12, 10, 8, \(\dots\), 1 scheme. This leads to a scoring vector associated with each country i, denoted by \(g_{i}=(g_{i1},\ldots ,g_{i26})\). The tele-voting score of each song k is \(t_{k} = \sum _i g_{ik}\).

The final score for each song is obtained by an unweighted average of the two scores (experts and tele-voting). Formally, \(v_{k} = (s_{k}+t_{k})/2\), for each \(k=1,\ldots ,26\). Since halving does not change the order, this yields the same ranking as simply adding the two scores. Based on these scores, the final ranking follows immediately.
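
The aggregation steps of this subsection can be sketched as follows. The sketch starts from each country's combined ranking, since the function f is not specified in the rules; the three voting countries X, Y and Z, the three songs and all function names are hypothetical.

```python
POINTS = [12, 10, 8, 7, 6, 5, 4, 3, 2, 1]  # the 12, 10, 8, ..., 1 scheme


def ranking_to_points(ranking):
    """Points vector of one country from its ranking (best song first)."""
    return {song: (POINTS[pos] if pos < len(POINTS) else 0)
            for pos, song in enumerate(ranking)}


def aggregate(rankings_by_country):
    """Sum the points vectors across countries (s_k or t_k in the text)."""
    totals = {}
    for ranking in rankings_by_country.values():
        for song, pts in ranking_to_points(ranking).items():
            totals[song] = totals.get(song, 0) + pts
    return totals


def final_scores(s, t):
    """Unweighted average v_k = (s_k + t_k) / 2 of the two scores."""
    return {k: (s.get(k, 0) + t.get(k, 0)) / 2 for k in sorted(set(s) | set(t))}


# Hypothetical rankings; a country never ranks its own song.
jury = {"X": ["b", "c"], "Y": ["a", "c"], "Z": ["a", "b"]}
tele = {"X": ["c", "b"], "Y": ["c", "a"], "Z": ["a", "b"]}

s = aggregate(jury)
t = aggregate(tele)
v = final_scores(s, t)
final_ranking = sorted(v, key=lambda k: -v[k])
print(s, t, v, final_ranking)
```

With only three songs every ranked song earns points; with 26 finalists, songs ranked eleventh or lower by a country would receive zero from it.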

Table 2 gathers the results. Columns (2) to (4) display the number of points given to each country’s song; columns (5) to (7) display the number of votes that each country received. As can be seen, the points given by both experts and tele-voters differ quite strongly. Switzerland, Malta, Bulgaria and Portugal get a much larger number of points by experts than by tele-voters. The inverse is true for Italy, Ukraine, Finland and Lithuania. Belgium has the largest relative difference in the number of points (and of votes) between experts and tele-voters (moving from 3 to 71 and from 2 to 17, respectively). Spearman’s correlation coefficient r is still positive, but its value (0.38) implies that there is little agreement between experts and tele-voters. The same holds for the number of votes, though the differences are less striking, with, maybe, the exception of Bulgaria, Malta, Ukraine and Finland. Spearman’s correlation coefficient is very similar (\(r = 0.39\)).
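
For readers who wish to recompute such a coefficient, a minimal pure-Python sketch of Spearman's rank correlation (the Pearson correlation of ranks, with tied values sharing their average rank) is given below. The function names are assumptions. The illustration uses only the three songs whose points are quoted in Sect. 3.1 (Italy, France and Switzerland), so it is not meant to reproduce the value obtained on all 26 finalists.

```python
def average_ranks(values):
    """Ranks 1..n in increasing order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Points of Italy, France and Switzerland (experts, tele-voters), Sect. 3.1.
experts = [206, 248, 267]
televoters = [318, 251, 165]
r_top3 = spearman(experts, televoters)
print(r_top3)
```

Among these three songs alone, the two orderings are exactly reversed, which illustrates how strongly the two types of judges can disagree on individual songs.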

Table 2 Points and votes. Experts and tele-voters

In Tables 3 and 4, we compute the Shapley scores of the Contest by experts and tele-voters. We assume that all judges choose the same number of candidates. We thus have to assume that all judges can choose either one, or two, or three,..., or ten candidates. We also assume that if each judge had one choice only (that is \(k = 1\)), she would have granted it to the candidate to whom she gave 12 points in the “real” contest. If judges could choose two candidates (\(k = 2\)), their votes would have been given to the candidates who received 12 and 10 points, and so on.

Table 3 Experts
Table 4 Tele-votes

Table 3a displays the results of the procedure for expert voting. The first column contains the countries of the candidates. Column (2) shows the number of points each candidate collected in the (real) competition. Columns (3)–(12) show the number of votes when each judge could choose \(k = 1, 2, 3,\ldots , 10\) candidates. This leads us to the following results. If judges had only one choice (plurality voting), Switzerland and France would each have been chosen by eight of them, and Malta and Italy by four each, a result that is already very close to the final “real” ranking (Switzerland, France, Malta, Italy and Iceland), though there are ties, and Iceland is not part of the top five. The results show that the final ranking of the top candidates (Switzerland, France, Malta and Italy) would have needed one choice only (\(k = 1\)).

With \(k = 2\) choices per judge (12s and 10s), the winner would have been Switzerland (14), followed by France (12), Italy (10), Iceland (8) and Malta (6), thus the five winners, but not in the final “real” order (Switzerland, France, Malta, Italy and Iceland).

Interestingly, Switzerland would not have been the winner in any of the last three cases (that is, with \(k = 8, 9\) and 10 choices). With \(k = 10\) choices, Malta would have received 35 votes. The group of first five countries (Switzerland, France, Malta, Italy and Iceland) appears, however, almost always among the winners, with the exception of the \(k = 1\) case, in which Iceland would have been excluded. It got one vote only, while Bulgaria, Greece and Moldova got 2 votes.

Following the spirit of the Eurovision scheme (12, 10, 8 to 1 points), we could also assume that each approved country would get a weighted vote, with the weight corresponding to the 12, 10, 8 to 1 scheme just mentioned. The results are shown in columns (3)–(12) of Table 3b. For instance, column (3) simply multiplies the entries in column (3) from Table 3a by 12. Column (4) results from awarding each country 12 points each time the candidate was first and 10 points each time it was second, etc., while the last column obviously repeats column (2), as both columns give the total number of weighted votes. In all cases, Switzerland would have been the winner (tied with France in the first case). But other differences arise. For instance, Malta would have been fifth with \(k = 2, 3, 4\) or 5 and fourth in the remaining cases, except for the \(k = 7\) choices, in which it would have been third (as is its “real” case in the Contest).Footnote 14
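
The truncation procedure behind the unweighted and weighted counts can be sketched as follows, assuming that each voting country's approval set consists of the k songs to which it gave the most points. The function name and the three-country points table are hypothetical.

```python
def top_k_votes(points_by_country, k, weighted=False):
    """Votes per song when each voting country approves its k best songs.

    points_by_country: {voter: {song: points}} under the 12, 10, 8, ..., 1 scheme.
    Unweighted: every approved song receives one vote.
    Weighted:   every approved song keeps the points it originally received.
    """
    totals = {}
    for points in points_by_country.values():
        scored = [song for song in points if points[song] > 0]
        approved = sorted(scored, key=points.get, reverse=True)[:k]
        for song in approved:
            totals[song] = totals.get(song, 0) + (points[song] if weighted else 1)
    return totals


# Hypothetical points given by three voting countries to three songs.
points_by_country = {
    "X": {"a": 12, "b": 10, "c": 8},
    "Y": {"b": 12, "a": 10, "c": 8},
    "Z": {"a": 12, "c": 10, "b": 8},
}

one_choice = top_k_votes(points_by_country, k=1)
two_choices = top_k_votes(points_by_country, k=2)
two_weighted = top_k_votes(points_by_country, k=2, weighted=True)
print(one_choice, two_choices, two_weighted)
```

Because every judge approves the same number k of songs, dividing the unweighted counts by k to obtain fractional votes would rescale all totals equally and leave the ranking unchanged.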

Similar results for tele-voters can be found in Tables 4a and b. In Table 4a, Italy would have remained the winner (tied with Ukraine, Lithuania and Serbia in the case of \(k = 1\), with Ukraine in the case of \(k = 7\), and with Ukraine and France if \(k = 10\) votes). But again, other differences arise. For instance, Serbia ends up being ninth although it ties in the first place with three countries in column (3). On the other hand, Switzerland ends up in the sixth position, but it was only awarded 12 points once, fewer times than 10 other countries. As for Table 4b, where weighted votes are displayed, Italy is clearly always first (only tied in that place in the case of one vote only). But Serbia gradually goes down in the ranking in the following columns, until it ends up being ninth. On the other hand, Switzerland is tied for the 17th place when only \(k = 3\) choices are available (with a score around 20 times lower than that of Italy), but it ends up being in the sixth place (with a score higher than half the score of Italy).

The previous procedure can be mimicked using other scoring schemes. Instead of using the 12, 10, 8 to 1 scheme, we also computed the homogeneous Borda scheme (from 10 to 1) or the current Formula One car racing scheme (25, 18, 15, 12, 10, 8, 6, 4, 2, 1) for the top ten songs.Footnote 15

3 Biases in the results

3.1 Framing effects

The framing effect is “a cognitive bias wherein an individual’s choice from a set of options is influenced more by the presentation than the substance of the pertinent information” (Plous, 1993).Footnote 16 In this paper, we interpret the word framing broadly and, thus, assume that framing effects encompass different effects arising from the structure of the ES Contest (such as order effects and contrast effects).

We start by referring to an example related to our context: the Queen Elisabeth Piano Contest, which takes place in Brussels every four years. The final is spread over six evenings and the results are proclaimed at the end of the sixth evening. Flores & Ginsburgh (1996) noted that those who play during the last evenings are better ranked than those who play during the first ones, though the order in which pianists perform is chosen randomly before the competition starts.Footnote 17 There is thus no reason to think that the quality of the pianists who play first is different (that is, worse) from those who play later. The difference thus comes from the way juries grade. One may wonder whether this may also be the case for the ES Contest.

Table 5 distinguishes three groups of candidates. Two are devoted to the semifinals (16 candidates in the first, 17 in the second one), in which only experts vote. The third is concerned with the grand final (26 candidates), for which we distinguish experts and tele-voters. In each part, countries and their points are shown in the (supposedly) random order in which singers performed. The table is divided into four parts (two semifinals, the final for experts and the final for tele-voters). In each case, column (1) is devoted to the names of the countries, column (2) to the number of points that each country obtained in the semifinals and the final, and column (3) to the mean scores of the two groups of 8 countries in the semifinals,Footnote 18 and of 13 countries in the final. The results show that the average score in the first group is always smaller than in the second one, with the exception of the final graded by experts. This contradicts the so-called opening advantage (Haan et al., 2005), which argues that the singer who performs first has a better chance of winning.Footnote 19 In the first semifinal, the opening singer (Lithuania) obtained the fourth highest score. In the second semifinal, the opening singer (San Marino) obtained essentially the average score of her group.

Table 5 Are the last doing better than the first?

Somewhat related, we now compare the number of points given by experts to the 20 countries which participated in both the semifinals and the final. The number of points distributed during the two semifinals (\(2 \times (12+10+8+\cdots +1) \times 39\)) is twice as large as the number distributed in the final (\((12+10+8+\cdots +1) \times 39\)). To compare the two stages, we thus need to divide by 2 the number of points collected by each candidate in the semifinals. As Table 6 shows, the results between the two sessions are reasonably consistent, though there are a few large differences. The ratio of points is higher than 1 for Belgium (+21%), Malta (+28%), Switzerland (+84%), Iceland (+38%) and Bulgaria (+12%). This brought Malta, Switzerland and Iceland among the top winners, while Albania, Serbia, Azerbaijan and Norway incurred large losses (in percentages).
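
As a quick check of the orders of magnitude involved, and taking the two expressions above at face value, each voting country distributes

\[ 12+10+8+7+6+5+4+3+2+1 = 58 \]

points. The final therefore distributes \(58 \times 39 = 2262\) points and the two semifinals together \(2 \times 58 \times 39 = 4524\) points, so that halving the semifinal points of each candidate puts both stages on the same total of 2262.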

Table 6 Differences of points between semifinals and finals

We also explore contrast effects. There is a large literature on context-dependent choice in various settings such as finance, law, marketing or psychology.Footnote 20 The econometrics used to discern the effect need a large number of observations, which is not so in our case. This renders the remarks that follow a bit loose, but nevertheless interesting. In our setting, by contrast effects we mean the regularity that a given song scores better if it is surrounded by worse songs. Italy, the winner of the ES Contest, happens to be a late performer, but is surrounded by poor performers. The country obtained 206 points from experts and 318 from tele-voters, whereas its immediate predecessors (Azerbaijan, Norway and the Netherlands) obtained only 32, 15 and 11 points from experts and 33, 60 and 0 from tele-voters. The followers (Sweden and San Marino) also obtained low scores. France ended in the second place, obtaining 248 points from experts and 251 from tele-voters. Its three followers (who were precisely Italy’s predecessors mentioned above) were quite poor. Likewise, Switzerland ended up in the third place with 267 points given by experts and 165 by tele-voters. Its four predecessors (Portugal, Serbia, the UK and Greece) were quite weak. Its successor (Iceland), however, was actually strong as it finished fourth, outperforming it with tele-voters (although doing much worse with experts). In summary, it seems that being surrounded by poor singers when performing may enhance one’s own performance (or, at the very least, the perception others get).

3.2 Friends and foes

Although the ES Contest declares itself as a non-political event, geopolitical aspects play a role in it. For instance, in the aftermath of the Russian invasion of Ukraine, Russia was banned from participating in the 2022 edition and Ukraine was largely leading the polls to win it.Footnote 21

The collusive (I vote for you if you vote for me) or strategic (let us vote against a third party) voting behavior in the ES Contest has been studied by several scholars with various backgrounds, including computer science, economics and sociology. Yair (1995) was among the first to do so. Using votes between 1975 and 1992, he found that there are three bloc areas: Western, Mediterranean and Northern Europe. Gatherer (2006) used Monte Carlo simulation methods to study voting patterns from 1975 to 2005. He emphasized that large geographical blocs emerged in the mid-1990s. Ginsburgh & Noury (2008) found that there seems to be no reason to take the results of the ES Contest as mimicking political conflicts and friendships. Felbermayr & Toubal (2010) nevertheless used bilateral score data from the ES Contest to construct a (time-dependent) measure of cultural proximity and show that their measure positively affects trade volumes in a trade gravity equation. More recently, Budzinski & Pannicke (2017) argued that voting biases do not only matter in international contests but also occur in similarly organized national contests with roughly similar magnitude and quality.

In the 2004 Contest, which used tele-voting only, Ukraine, the winning country, benefited from the votes from all its former political “neighbors.” Ukraine’s average rating was equal to 8, but the country collected 12 from Estonia, Latvia, Lithuania, Poland and Russia, and 10 from Belarus, Serbia and Montenegro. Though they were far from winning, Belgium and the Netherlands could be suspected to have colluded: Belgium collected positive scores from Andorra (1), Cyprus (1) and the Netherlands (5) only, while the Netherlands only garnered positive scores from Estonia (3), Malta (2) and Belgium (6).

Tele-voting in the 2021 edition shows some similar patterns. Italy collected 12 points from friendly or neighboring nations such as Malta, San Marino and Serbia (as well as from Bulgaria and Ukraine). Serbia collected 12 points from Croatia, North Macedonia and Slovenia (as well as from Austria and Switzerland), but only 22 points in total from all other 34 countries. France received 12 points from Belgium and Spain (as well as the Netherlands and Portugal). Finland and Iceland reciprocated giving each other 12 points. The same happened with Greece and Cyprus.

In general, results from the 2021 edition show that reciprocity is more likely to manifest among geographically close countries. Greece and Cyprus is a case already mentioned above. So are the following pairs: Bulgaria and Moldova; Moldova and Russia; Russia and Azerbaijan; and Bulgaria and Greece. But the analysis can also be extended from individual reciprocity to group reciprocity. That is, we explore the existence of clusters of countries in the voting process. The Scandinavian countries (Denmark, Finland, Iceland, Norway and Sweden) form a first (larger) cluster (though Denmark did not make it to the final). Finland collected 13 points from the cluster (almost 20% of its overall score), Iceland collected 27 points, and Norway only 5 points, but still a third of its total score. The case of Sweden is remarkable, as it received 25 points (more than half of its overall score). Serbia, the only Balkan country which made it to the final, collected points from North Macedonia, Croatia and Albania. Finally, France received 12 points from four of its neighbors (Spain, Germany, Switzerland and, across the Channel, the UK). The remaining neighbors did not treat France so nicely. Italy (a strong opponent in the Eurovision Song Contest) gave it only 3 points and Belgium gave it zero. France did not reciprocate much with its neighbors and only two among them received points from it: Switzerland (7) and, funnily enough, Belgium (6).

In what follows, we look at the possibly symmetric behaviors (I vote for you if you vote for me) of experts and tele-voters in the final. In both cases, we concentrate on the top ten winners, as this is probably where the game is fiercest, but the same calculations could be made for all 26 finalists. Table 7 displays the votes cast by the Swiss, French, Maltese, Italian, Icelandic, Bulgarian, Portuguese, Russian, Ukrainian and Greek experts for the same ten countries. Table 8 is constructed in the same way for tele-voters.Footnote 22

Table 7 Winners and experts who voted for them
Table 8 Winners and tele-voters who voted for them

If symmetry in the two \(10 \times 10\) matrices mentioned above were perfect, \(t_{ij}\), the number of points given by i to j (where \(i\ne j\)) would be equal to \(t_{ji}\), the number of points given by j to i. The numbers in Tables 7 and 8 show that symmetry is far from being perfect, though there are cases that are close. In Table 7 for example, Portugal gave 7 to Switzerland, and Switzerland gave 7 to Portugal. In Table 8, Finland gave 8 points to Italy, and Italy gave 8 points to Finland.

We now compare the symmetry (or asymmetry) of the two matrices. To do this we resort to a metric of asymmetry. For each matrix A, denote its transpose matrix by \(A^{t}\) and its Euclidean norm by ||A||. Let \(As=\frac{A+A^{t}}{2}\) and \(Aa=\frac{A-A^{t}}{2}\), so that \(A = As + Aa\). Note that if matrix A is symmetric then \(As = A\), while Aa is the zero matrix. The number \(\sigma = \frac{||Aa||}{||As||}\) measures the degree of asymmetry of matrix A. The degree decreases as A gets more symmetric and is equal to 0 if A is (fully) symmetric.Footnote 23 We obtain that \(\sigma = 0.315\) for experts and 0.210 for tele-voters. The difference between the two numbers is quite small, but we can conclude from these values that tele-voting exhibits a little more symmetry (or reciprocity) than expert voting.
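
As an illustration, the asymmetry index can be computed with a few lines of Python. This is only a sketch: the function name `asymmetry_index` and the three-country matrix are hypothetical and are not taken from Tables 7 or 8, and the Euclidean norm of a matrix is assumed to be the square root of the sum of its squared entries.

```python
import math

def asymmetry_index(a):
    """Return sigma = ||Aa|| / ||As|| for a square matrix given as a list of rows."""
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("matrix must be square")
    sym_sq = 0.0   # squared norm of As = (A + A^t) / 2
    anti_sq = 0.0  # squared norm of Aa = (A - A^t) / 2
    for i in range(n):
        for j in range(n):
            sym_sq += ((a[i][j] + a[j][i]) / 2) ** 2
            anti_sq += ((a[i][j] - a[j][i]) / 2) ** 2
    if sym_sq == 0:
        raise ValueError("sigma is undefined when the symmetric part is zero")
    return math.sqrt(anti_sq / sym_sq)

# Hypothetical points matrix: entry [i][j] is the number of points
# given by country i to country j (the diagonal is zero by the rules).
toy_points = [
    [0, 12, 0],
    [10, 0, 8],
    [0, 8, 0],
]
sigma = asymmetry_index(toy_points)
print(round(sigma, 3))
```

Under these assumptions, a matrix in which every pair exchanges identical points yields 0, whereas a matrix in which points only ever flow one way within each pair yields 1.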

3.3 Are experts better predictors than tele-voters?

Artists, critics, philosophers and economists alike have long argued about whether specialists or the general public assess the quality of art more accurately (Wijnberg, 1995). The ES Contest allows us to provide an additional perspective on this debate, as it makes the comparison straightforward: judgments by experts and tele-voters take place at the same time, and neither experts nor tele-voters know the final result.

Haan et al. (2005) addressed precisely this issue using data from the ES Contest. More specifically, they used 42 years of data (1957–1997) and argued that the order of performance should not matter for judgments of quality (as we also did above in Sect. 3.1). They found that experts are less influenced by order than the public and concluded from there that experts are more consistent.

In contrast, we concentrate here on predictive power (which, we acknowledge, might not be equivalent to quality decision making). Formally, denote by \(x_k\) the realization of event k and by \(x_{ik}\) the forecast of event k by forecaster i. Under perfect expectations, forecasters are able to perfectly predict events, that is \(x_{ik}\) would be equal to \(x_k\), for each i.Footnote 24 If their forecasting power had been “perfect,” experts as well as tele-voters should have been able to predict who would be the winner (with \(x_{1}= 12\)), the second (with \(x_2 = 10\)) and so on. In other words, with n denoting the number of forecasters, the average number of points given by experts or tele-voters would be \(\frac{1}{n}\sum _i x_{ik}=x_{k}\) in each case and the forecasting ratio \(\rho = \frac{\frac{1}{n} \sum _i x_{ik}}{x_k}\) would be equal to 1.

We again restrict these calculations to the top ten countries \(k = 1, 2, \ldots , 10\), in their order of success (Italy, France,..., Bulgaria). Table 9 displays the observed number of points that each of the ten finalists received from experts and from tele-voters and compares them to the “perfect” number of points \(x_k\). For experts, the ratios shown in Table 9 are closer to 1 in three cases (Switzerland, Iceland and Malta) whereas tele-voters’ ratios are closer to 1 in seven cases (Italy, France, Ukraine, Finland, Lithuania, Russia and Bulgaria). In addition, the (Euclidean) distances of both vectors of ratios to the vector of perfect forecasting ratios (1, 1,..., 1), which corresponds to the perfect points (12, 10, 8,..., 1), are equal to 1.18 for tele-voters and 2.98 for experts. If we drop Bulgaria, for which the experts' ratio was quite far from 1, the Euclidean distances are 1.16 and 1.30, respectively. Altogether, we can safely argue that tele-voters are closer to the targets than experts.
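
The forecasting ratio and the distance to perfect forecasting can be sketched as follows. The function names and the toy data (three forecasters, three contestants) are hypothetical and do not reproduce Table 9; the sketch assumes that each ratio is the average number of points awarded divided by the “perfect” number of points \(x_k\), and that distances are measured to a vector of ones.

```python
import math

def forecasting_ratios(forecasts, realized):
    """forecasts[i][k] is forecaster i's points for contestant k;
    realized[k] is the 'perfect' number of points x_k."""
    n = len(forecasts)
    ratios = []
    for k, x_k in enumerate(realized):
        mean_k = sum(row[k] for row in forecasts) / n  # (1/n) * sum_i x_ik
        ratios.append(mean_k / x_k)
    return ratios

def distance_to_perfect(ratios):
    """Euclidean distance between the ratio vector and a vector of ones."""
    return math.sqrt(sum((r - 1.0) ** 2 for r in ratios))

# Hypothetical data: the top three contestants deserve 12, 10 and 8 points.
realized = [12, 10, 8]
toy_forecasts = [
    [12, 8, 10],
    [10, 12, 6],
    [8, 10, 8],
]
ratios = forecasting_ratios(toy_forecasts, realized)
distance = distance_to_perfect(ratios)
print([round(r, 3) for r in ratios], round(distance, 3))
```

In this toy example, only the first contestant is under-rated on average, so the distance reduces to the gap between its ratio and 1.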

Table 9 Experts and tele-voters. Perfect expectations

One might be tempted to argue that our finding goes in the opposite direction to the findings of Haan et al. (2005). Nevertheless, as mentioned above, they were not concerned with predictive power, but with immunity to order effects. Furthermore, we reiterate that predictive power might not be equivalent to quality decision making.

Finally, we acknowledge that there might be other reasons behind the finding that the public is a better predictor of the finishing position. A plausible one is that the public is subject to plurality voting (just voting for a particular song, unless they unilaterally decide to vote for more than one song) whereas experts are required to input their full (1–26) ranking by means of another (non-degenerate) scoring method. The latter method is bound to be noisier when it comes to mapping quality to rank. Nevertheless, as we argued in Sect. 2, we believe Shapley voting should be used instead of plurality voting or any other scoring method. Imposing this same method for both experts and the public might make the predictive power of both groups more similar.

4 Concluding remarks

We have analyzed in this paper several problems posed by the 2021 ES Contest. We first focused on the voting process, which led us to suggest an alternative voting rule. Given the short time span (15 minutes) between the last performance and the announcement of the results, it may be much easier for experts to use Shapley voting, which is not based on points, but on yeses and noes only (1 or 0). The method could also be used for tele-voters (instead of plurality), which would put both groups on an equal footing.

Note that our suggestion is actually to replace, rather than complement, the current ES voting system with Shapley voting. In that hypothetical situation, judges would just have to provide a number of “yeses” (with judges having discretion over how many “yeses” they award). The computation of the Shapley value would not need to be performed by the judges themselves. Thus, it would drastically decrease the amount of work relative to the current ES system. And the gained simplicity should help reduce the mistakes that both judges and tele-voters might make.
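
To illustrate how little is asked of the judges, the following sketch tallies a set of yes/no ballots. It rests on an assumption that is not spelled out in this section: that Shapley voting gives each judge one unit of weight, split equally among the songs that judge approves, and that a song's score is the sum of the shares it receives. The function name `shapley_scores` and the ballots are hypothetical.

```python
from fractions import Fraction

def shapley_scores(ballots, candidates):
    """Tally approval ballots, assuming each judge's unit weight is split
    equally among the candidates that judge approves (an assumption)."""
    scores = {c: Fraction(0) for c in candidates}
    for approved in ballots:
        chosen = set(approved)
        unknown = chosen - set(candidates)
        if unknown:
            raise ValueError(f"unknown candidates: {sorted(unknown)}")
        if not chosen:
            continue  # an empty ballot contributes nothing
        share = Fraction(1, len(chosen))
        for c in chosen:
            scores[c] += share
    return scores

# Hypothetical ballots from four judges over three songs.
candidates = ["A", "B", "C"]
ballots = [{"A"}, {"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
scores = shapley_scores(ballots, candidates)
ranking = sorted(candidates, key=lambda c: scores[c], reverse=True)
print(ranking, {c: str(s) for c, s in scores.items()})
```

Under this assumption, the scores always add up to the number of non-empty ballots, so a judge who approves many songs carries no more total weight than one who approves a single song.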

Beyond illustrating the effect that alternative voting rules could have, we have also analyzed how results are subject to specific biases. We have mostly concentrated on framing effects, encompassing opening as well as contrast effects. We have found that (a) the singer who performs first does not have a better chance of winning, (b) a reasonably good performance may be given a higher number of points if it is surrounded by poorer ones, (c) there is some collusive/reciprocal voting behavior within clusters and (d) tele-voters are slightly better predictors of the final results than experts. We should nevertheless state that this evidence refers to the 2021 ES Contest, as we have not explored earlier editions.

The existence of these biases casts doubts on the design of the ES Contest’s convoluted voting sequence, which renders it almost impossible to find where the core of the matter lies. It may well be that adopting Shapley voting for experts and tele-voters would reduce biases. For instance, as mentioned in Sect. 3.3, it seems plausible to argue that the different voting systems experts and tele-voters use might have an effect on their (different) predictive power. Imposing Shapley voting for both groups (instead of plurality voting or any other scoring method) would level the playing field for both groups.

Leveling the playing field for experts and tele-voters seems to be a pressing goal. This is, for instance, illustrated by a recent controversy in the first edition of the so-called Benidorm Fest, which selects Spain's representative for the upcoming editions of the ES Contest. A candidate (Chanel Terrero) narrowly won the 2022 Benidorm Fest, even though two other candidates (Rigoberta Bandini and Tanxugueiras) finished above her in the public vote. This was an extremely controversial outcome, which not only fueled the spread of conspiracy theories in social networks, but also received considerable attention in the conventional media (and even in the political arena). As a way to extinguish the fire, the Spanish broadcaster officially issued a statement calling for acceptance of the rules. At the same time, it also announced that it would launch discussions to improve the Benidorm Fest process in the future. We believe our work might help in that process.Footnote 25

To conclude, we note that the outcome of the comparison between experts and tele-voters might not be entirely surprising, as it has often been argued that experts are not perfect in predicting quality.Footnote 26 Our findings might shed some light on the merit of expert judgment versus public opinion, provided we assume some correlation between quality judgment and predictive power. Now, as acknowledged by Haan et al. (2005), the data we used in this paper are a bit unusual for studying the judgment of quality of cultural output. But, as they put it themselves, the character of the data, referring to an identical contest with the only difference of being judged by experts and the general public, provides a unique opportunity to test for differences between the two. We also believe, agreeing with what they state, that our results generalize to other cases where the quality judgment of cultural output is an issue.