1 Introduction

Article 6 ECHR, like Article 47 of the Charter of Fundamental Rights of the European Union, stipulates that in the determination of his civil rights and obligations or of any criminal charge against him, everyone is entitled to a fair and public hearing within a reasonable time by an independent and impartial tribunal established by law. Two things follow from this fundamental right: first, the primary responsibility of a judge to ensure his impartiality and to recuse himself when his impartiality can reasonably be questioned; and second, when the judge does not do this of his own motion, the need for a legal system to have a mechanism to challenge the jurisdiction of the judge or judges during court proceedings because of real or reasonably apprehended bias or lack of impartiality. This article focuses on the latter. The terminology differs between countries with different legal traditions and between regional and international courts. In this article we will refer to this mechanism as “challenging the (impartiality of the) judge(s)” and “challenging procedure”. The mechanism is triggered by a “request” or “motion” “to recuse” or “for recusal” by one of the parties or by their legal representation on their behalf. We will use these different wordings without meaning any difference in substance. A distinction can be made between objective and subjective impartiality. Objective impartiality concerns the absence of factors that can be objectively established and cast doubt on the impartiality of the judge, especially conflicts of interest stemming from e.g. family or other social or business ties between the judge and a party or his legal representation. Subjective impartiality relates to personal attitudes or opinions of the judge that lead to doubt about his impartiality.
While breaches of objective impartiality can in most countries be established more easily and objectively on the basis of existing guidelines or codes of conduct for judges and the facts of the case, infringement of subjective impartiality is much harder to establish and therefore requires a full judicial procedure to decide charges of subjective partiality. In all legal systems the personal impartiality of a judge must be presumed until there is proof of the contrary.Footnote 1 Nevertheless, for both categories a motion to recuse succeeds once an apprehension of bias is established, i.e. once it is shown that a reasonable and informed person would be concerned that the judge might be biased.

The challenging procedures differ across countries (Giesen et al. 2012). They differ, for instance, in whether the main case is halted once a motion for recusal has been made, which judge or judges hear the motion, and whether (financial) sanctions follow a rejected motion. In some jurisdictionsFootnote 2 a motion for recusal is heard (in first instance) by the judge whose impartiality is questioned; in others (most European countries) it is heard by other judges from the same or, in some cases, another court. If the motion is not followed by recusal, the judge continues. Otherwise, another judge takes his place. In many countries, among them Belgium, France and the UK, there are concerns about improper use of this mechanism (Giesen et al. 2012). Improper use may aim at delaying the main procedure, at replacing the judge when a party has the impression that the judge will not rule in his favor on legal and/or factual grounds, or at venting frustration with the legal system in general or with the course of events in the case without offering any indication of partiality. In several countries, these concerns have led to a higher bar for charges of partiality, for instance through a court fee or a requirement that charges can only be made by a lawyer, and to the introduction of financial sanctions in case of improper use. For instance, financial sanctions can be imposed in England, Belgium, Italy, Spain and Switzerland, while in France a financial sanction must be imposed if a charge is dismissed in a criminal case. Another approach is to summarily dismiss charges that evidently have no ground (see again Giesen et al. 2012). In other countries barriers are resisted. The Netherlands is a case in point.
In 2013, 479 challenges were made in the first instance courts, of which 12 were upheld, while in the appeal courts 147 challenges were made, of which 9 were upheld (Raad voor de rechtspraak 2014, Tables 31 and 32).Footnote 3 These figures point to a very inefficient procedure, as there is no reason to assume that the decisions on challenges were biased against the challengers. It should be noted that improper use of the challenging procedure is not the only possible cause of this huge difference between challenges made and challenges upheld. For instance, lack of information and emotional reactions to unexpected and/or undesired decisions of the judge play a role, especially for parties that do not have (adequate) legal counsel. This, however, does not alter the inefficiency of the procedure. Lawyers and judges alike consider barriers and, in particular, financial barriers an unwarranted infringement of the right to an impartial trial, although the general population believes otherwise (Van Rossum et al. 2012). These actors also consider financial sanctions to be ineffective. Empirical research about actual challenge behavior of parties is scarce. In Sect. 2 we will discuss the literature known to us. To our knowledge, no empirical research has been undertaken to establish the impact of barriers. In the absence of facts, everybody can pick his own argument, congenial to his normative position.

Given the divergent approaches and opinions, the subject merits research. The data for the Netherlands indicate that challenge procedures that have no barriers are extremely inefficient. The inefficiency is likely to increase, as in many countries the role of the judge in a case is changing from a passive to an active role, increasing the potential for conflicts between judge and parties (see Sect. 3). The issue is, therefore, how to increase the efficiency of challenge procedures, while maintaining the right to an impartial trial. We will focus on a financial sanction in the form of a fine and will address two questions. First, are fines effective in discouraging unfounded challenges? Second, do they lead to well-founded challenges not being made? We use experimental methods in an economic framework to address these questions.

Section 2 briefly addresses what is known about challenge behavior, while Sect. 3 examines changes in the role of the judge that will have an effect on the use of challenge procedures. Section 4 sets out the model, Sect. 5 the predictions and Sect. 6 the experimental design. The results are presented in Sect. 7, and Sect. 8 concludes.

2 Challenge behavior in practice

We know of only one study, conducted in the Netherlands (Van Rossum et al. 2012). This study as well as the comparative study mentioned before (Giesen et al. 2012) were commissioned by the Netherlands Council for the Judiciary in response to the sharp increase in the number of challenges experienced in that country. The number increased from 258 in 2007 to 607 in 2011, while the number of confirmed challenges increased from 16 to 36 (Van Rossum et al. 2012, p. 26).Footnote 4 The confirmation rate remained roughly the same. The number of challenges is a very small percentage of the total volume of cases: 0.015 % in 2007, increasing to 0.034 % in 2011 (Van Rossum et al. 2012, p. 28). As pointed out in the study, the number of challenges is, however, high when compared with the number of judges in the Netherlands (2500). Challenges have become a common phenomenon. As to the results of the study, numbers are presented about the grounds that were adduced to substantiate challenges (Van Rossum et al. 2012, p. 33). Challenges of objective impartiality were scarce (2.5 % in 2011). Thus, nearly all challenges were of a subjective nature. A major category concerns procedural decisions by the judges (e.g., planning of the hearings and whether or not to allow witnesses to be heard). This category accounts for 21.5 % of all challenges, and is relatively stable over time, according to the authors. A rapidly growing category is the treatment of parties or conduct of the case (disrespectful behavior, utterances that give an impression of partiality, suggestions that the judge already made up his mind, etc.). In 2011 the proportion was 33 %. Other categories pertain to earlier decisions of the judge that suggest that he is not impartial (10 %), doubts about his professionalism (19 %), information deficiencies (3 %), distrust (2.8 %) and residual categories of unknown and miscellaneous reasons.
The report also gives some anecdotal evidence that the stated grounds are not always the real reasons to challenge a judge. This happens, for instance, when a lawyer (unreasonably) wants to gain more time. The authors found that, while the decisions about challenges conclude in 12 % of the cases that the procedure has been abused, there is no shared definition of what amounts to abuse. What is often treated as abuse are challenges that are not aimed at the judge (but at the judiciary in general) or repeated challenges based on the same grounds. Strategic use aimed at getting more time or at getting a more favorable judge is not treated as abuse of the procedure. Challenges aimed at such objectives have not been identified. They are most likely to be found in the categories procedural decisions and treatment of parties or conduct of the case, although it should be noted that the study finds that anger and frustration play an important role, especially in situations where parties represent themselves. These people generally want to signal their dissatisfaction. Lawyers take a more rational approach by weighing advantages and disadvantages, and behave strategically. Both groups note that no financial costs have to be incurred and that this makes the decision to challenge easier. The authors find it surprising that the mechanism is not used more often by lawyers. While they found that lawyers use the mechanism strategically, they also found that lawyers are reluctant to use the instrument. Expressed reasons are that it prolongs the procedure, that a challenge puts a strain on the relationship with the judge, but also that lawyers still adhere to the social norm that judges are in principle impartial and that a challenge is only fitting in extreme cases. Whether this social norm will continue to be shared can be doubted.

3 The role of judge and challenging behavior

There is reason to believe that unfounded challenges will increase. Generally, judges conducted and in some countries still conduct procedures in a passive manner. The judge behaves like a “sphinx”. To quote from a recent report of the European Network of Councils for the Judiciary about judicial reform: “… in several countries the parties often only give long explanations about their view of the case, without any dialogue or questions from the judge and then the judge only fixes the date when the judgment will be pronounced. At the end of the hearing, parties have no idea which direction the verdict will take. Very much importance is given to the briefs, but the hearing would be much more interesting for the judges and the parties, if there would be more “discussion” or “dialogue”.” (ENCJ 2013, p. 17). As the judge gives no indication of his thinking about the case, even if only by his questions, parties have no reason to doubt his subjective impartiality or to worry about the likely direction the verdict will take. This passive, neutral role is giving way to a much more active role (Bauw 2011). The report of the ENCJ describes this trend and advocates its further adoption as part of essential judicial reform. Case management is an important instrument. This is defined by the ENCJ as “…the judge taking the lead in resolving a legal conflict in a fair, expeditious and efficient manner. Case management applies to all areas of law. Within the law, the judge determines the procedure in cooperation with the parties and their legal representation, and ensures that this procedure is adhered to. The judge ensures that the procedure is commensurate with the complexity, size and relevance of the conflict. Therefore, it is the responsibility of the judge not only to decide the case, but also to direct it.” (ENCJ 2013, p. 14). 
Pre-trial conferences to establish the proper method to resolve the case and to sort out differences of opinion about procedure are an essential component of case management. In such conferences the judge inevitably has to give some insight into his thinking about the case. In doing so, he runs the risk of giving away too much of his views or of giving wrong impressions by expressing himself in an unfortunate manner. According to the ENCJ, judges struggle with this new role: “A typical case is Belgium, where pre-trial conferences are short, and the only purpose is to create a more proactive approach and prepare questions for the parties in the final hearing. However this is more an exception, than the rule. Most of the Belgian judges clearly fear to give, by their questions, a statement about their position in the case, which would permit one of the parties to claim that the judge isn’t neutral.” (ENCJ 2013, p. 15).

The ENCJ also promotes the simplification of procedure, for instance, by restricting the number of procedural steps in a case. In its view, “it is entirely reasonable to require parties to supply the court with all relevant information up front, instead of holding back information for strategic reasons (see above on pre-trial conferences). Repeated exchange of arguments on paper could be disallowed, and replaced by a swift hearing, immediately followed by an oral or written verdict. … Methods currently used in on line dispute resolution may provide the courts with tools to have parties present and discuss their disputes in a more informal and interactive manner.” (ENCJ 2013, p. 18). These developments will lead to more conflict in the court room. As the judge puts more pressure on parties to proceed in an expeditious manner, some parties and their legal representation will disagree with the judge and will try to gain time by the remaining means such as a charge of partiality. Also, as the judge reveals more of his thinking, parties will challenge his impartiality, either to put pressure on him or in the hope that the challenge succeeds and a new judge will think differently. It seems that these developments require the mechanisms to challenge judges to include barriers for unfounded challenges, while upholding the fundamental right to an impartial trial.

4 Model

There are several rational reasons for a party or lawyer to challenge: (1) there may be reasons to believe that the present judge is biased to the party’s disadvantage and that a replacement judge will rule more favorably; (2) there are no signals that the judge is biased, but there are indications that the judge will rule against the party on legal or factual grounds, for instance when the judge gives his interpretation of a relevant law or when he critically examines a witness, and replacement may lead to a new judge who has a different opinion; (3) a lawyer can request a challenge because, even if the challenge will be dismissed with certainty, the short delay gives him more time for preparation; (4) a challenge may signal the aggressiveness of the lawyer to the judge, to the adversary and to his present and prospective clients; (5) the replacement of the judge and the restart of the trial can be very advantageous for the party for whom the status quo is favorable, as the trial will be delayed substantially. For such a party, even a very small possibility of success makes a challenge worthwhile in expectation.

To model the situation in a format that can be studied in a laboratory, we have to simplify; see Table 1 for a summary of the sequence of events. Consider the following court case. Two business partners, Mr. Red and Mr. Blue, have decided to split up their company. However, they are unable to come to an agreement on how to split up the remaining capital (100 points in the experiment) and go to court. Mr. Blue is the claimant and Mr. Red is the defendant. A judge will divide the 100 points between the parties. Judges can be biased in favor of the Blue party, biased in favor of the Red party, or unbiased. The unbiased judge will divide the 100 points evenly, while the Blue (Red) judge will award 75 points to Blue (Red) and the remaining 25 points to Red (Blue). The probability that each type of judge is handling the case is always \(1/3\), and it is never clear beforehand which type of judge has been appointed to the case. However, during the session the judge will give 9 signals that indicate potential partiality. For the unbiased judge each signal is Blue (or Red) with probability 50 %. For the Blue (Red) judge the probability of a Blue signal is 75 % (25 %) and of a Red signal 25 % (75 %). After the signals are observed, a request to challenge can be made. When the judge is not challenged, or when the challenging request is turned down because the judge was in fact impartial, the game ends with the judge dividing the 100 points. When a challenging request is made and the judge was Blue or Red, the judge is replaced and the procedure starts again. For practical reasons, we limited the number of times a judge can be replaced; the fifth judge cannot be challenged.
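The inference a party can draw from the nine signals follows from Bayes' rule. The sketch below is illustrative only and is not the software used in the experiment: the function name `posterior_judge_type` and the dictionary layout are assumptions, while the prior of 1/3 per judge type and the signal probabilities are taken from the description above.

```python
# Probability that a single signal is Blue, by judge type (values from the text).
P_BLUE_SIGNAL = {"blue": 0.75, "unbiased": 0.50, "red": 0.25}
N_SIGNALS = 9  # number of signals each judge gives


def posterior_judge_type(n_blue, n_signals=N_SIGNALS):
    """Posterior probability of each judge type after observing n_blue Blue signals.

    Assumes a uniform prior of 1/3 per type and independent signals.
    The binomial coefficient is the same for every type, so it cancels
    in the normalisation and is left out.
    """
    if not 0 <= n_blue <= n_signals:
        raise ValueError("n_blue must lie between 0 and n_signals")
    likelihood = {
        judge: p ** n_blue * (1 - p) ** (n_signals - n_blue)
        for judge, p in P_BLUE_SIGNAL.items()
    }
    total = sum(likelihood.values())
    return {judge: value / total for judge, value in likelihood.items()}


if __name__ == "__main__":
    for n_blue in range(N_SIGNALS + 1):
        post = posterior_judge_type(n_blue)
        print(n_blue, {judge: round(prob, 3) for judge, prob in post.items()})
```

Under these assumptions, nine Blue signals give a posterior probability of roughly 0.97 that the judge is Blue, so even a unanimous run of signals leaves some doubt about the judge's type.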

Table 1 The sequence of events in a period

The payoffs for the Red and Blue parties in the control treatment (without a summary review of whether a challenge is evidently unfounded) are as follows. To represent the gains from a signal to a client (see point 4 above) or from extra preparation time (point 3), a party that requests a challenge earns 2 points regardless of whether the judge turns out to be partial. However, these points are only awarded when a subject requests a challenge of the first judge, i.e. in the first stage of the period.Footnote 5

Only one party, generally the defendant, gains from the delay caused by the replacement of a judge (point 5), and the other party, the claimant, loses. In the experiment it was assumed that every delay was a disadvantage for the Blue party, while the Red party would gain. This was represented by having the Blue party pay the Red party 5 points whenever a judge was replaced by a new one after a successful challenge. Unlike the gains from the signal to a client, this delaying effect also applied to the second and later judges/stages.

The final payoff for the Blue party consists of the points awarded by the final judge (50 when this judge is unbiased, 25 when this judge is Red and 75 when this judge is Blue), minus the number of replaced judges multiplied by 5 points, plus 2 points only if Blue challenged the first judge. For the Red party the final payoff consists of the points awarded by the final judge (50 when this judge is unbiased, 75 when this judge is Red and 25 when this judge is Blue), plus the number of replaced judges multiplied by 5 points, plus 2 points only if Red challenged the first judge.
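The payoff rules of the control treatment can be written out as a short function. This is a minimal sketch for illustration: the name `final_payoffs` and its parameters are assumptions, fines are not included because the control treatment has none, and the point values are the ones stated above.

```python
# Points awarded to Blue by each type of final judge; Red receives the remainder.
AWARD_TO_BLUE = {"blue": 75, "unbiased": 50, "red": 25}
TOTAL_POINTS = 100
DELAY_TRANSFER = 5         # Blue pays Red this amount per replaced judge
FIRST_CHALLENGE_BONUS = 2  # earned only for challenging the first judge


def final_payoffs(final_judge, replaced_judges,
                  blue_challenged_first, red_challenged_first):
    """Return (blue_payoff, red_payoff) for one period of the control treatment."""
    blue_award = AWARD_TO_BLUE[final_judge]
    red_award = TOTAL_POINTS - blue_award
    transfer = DELAY_TRANSFER * replaced_judges

    blue = blue_award - transfer
    red = red_award + transfer
    if blue_challenged_first:
        blue += FIRST_CHALLENGE_BONUS
    if red_challenged_first:
        red += FIRST_CHALLENGE_BONUS
    return blue, red


if __name__ == "__main__":
    # One judge replaced, the final judge is unbiased, only Blue challenged the first judge.
    print(final_payoffs("unbiased", 1, True, False))
```

In this example Blue ends with 50 - 5 + 2 = 47 points and Red with 50 + 5 = 55 points.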

All three experimental treatments contain a summary review of whether a challenge is evidently unfounded. A challenge is evidently unfounded if fewer signals, or only marginally more signals, of bias are in favor of the opposing party than in favor of the challenger. A challenge after four or fewer signals in favor of the opposing party is always regarded as evidently unfounded, and a request after 5 such signals is evidently unfounded with a probability of 50 %. This probability reflects the uncertainty involved at the margin in judging whether a challenge meets the criterion of being evidently unfounded. In practice, parties cannot be certain about the outcome of the decision.

Challenges that are ruled to be evidently unfounded are summarily dismissed, either without further consequence for the challenger or at a cost to him. The level of the fine is announced at the beginning of a period and is 0, 4 or 12 points.
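The review rule and the fine can be combined into an expected cost of challenging. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the probability of 0 for six or more signals in favor of the opposing party is inferred from the rule (the text states only the cases of four or fewer and of exactly five signals explicitly).

```python
FINE_LEVELS = (0, 4, 12)  # fine levels announced at the start of a period


def prob_evidently_unfounded(signals_for_opponent):
    """Probability that a challenge is summarily dismissed as evidently unfounded.

    signals_for_opponent: number of the 9 signals that favor the opposing party.
    """
    if not 0 <= signals_for_opponent <= 9:
        raise ValueError("signals_for_opponent must lie between 0 and 9")
    if signals_for_opponent <= 4:
        return 1.0  # always regarded as evidently unfounded
    if signals_for_opponent == 5:
        return 0.5  # border case, decided with probability 50 %
    return 0.0      # assumption: six or more signals always pass the review


def expected_fine(signals_for_opponent, fine):
    """Expected fine (in points) a challenger pays for making the request."""
    return prob_evidently_unfounded(signals_for_opponent) * fine


if __name__ == "__main__":
    for fine in FINE_LEVELS:
        print(fine, [expected_fine(k, fine) for k in range(10)])
```

For a challenger facing five signals in favor of the opposing party, the expected fine is 2 points when the fine is 4 and 6 points when the fine is 12.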

Note that the definition of the problem and the choice of parameters are heavily influenced by practical considerations. In the real world the percentage of biased judges is very low and signals in one or the other direction are rare, while in our experiment two thirds of the judges are biased and the parties always observe 9 signals. If we used a more realistic percentage of biased judges and fewer signals, we would need many more periods and participants, but the basic underlying problem is the same. In the real world the costs and benefits can vary widely (for example, the benefit or cost of delay will differ between cases), but in the experiment we had to choose specific numbers. The different treatments (no fine and fines of different sizes) are chosen such that the theoretical predictions would be interesting and reasonable. This is completely in line with standard experimental economics methodology (Fréchette and Schotter 2015).

5 Predictions

We calculate the optimal behavior in this game under the assumption of risk neutrality. In the control condition, all judges are challenged, either by the Blue party if there are four or fewer Blue signals, or by the Red party if there are five or more Blue signals. The first judge in a trial is challenged by both parties in the Nash equilibrium, because of the 2 points one can earn by challenging the first judge (if the judge is likely to be biased in your favor, the other party will request a challenge and you can earn 2 points by also challenging). However, challenging when the judge seems to be on your side is a very risky strategy. For example, if the first judge provides 9 Blue signals, it is in expectation advantageous for Blue to challenge only if he expects Red to challenge with a probability higher than 94.4 %.Footnote 6 In the treatment conditions, it is always best to challenge when it is certain that the challenge will pass the review of whether it is evidently unfounded (0–3 signals of your own color). In the border cases of 4 or 5 Blue signals, optimal behavior depends on the fine and the color: the Blue party will not challenge in the case with 4 Blue signals when the fine is 4 or 12, and the Red party will not challenge with 5 Blue signals when the fine is 12. As Table 2 shows, we expect as many challenges and replacements of judges in the fine = 0 treatment as in the control treatment, but fewer when the fines are higher.

Table 2 Predictions for risk-neutral parties

These predictions rest on the assumption of risk neutrality. Risk neutrality is a reasonable assumption in the field when the decision makers encounter the same problem on a regular basis. Risk aversion will lead to fewer challenges of the first judge in the control treatment when the signals indicate a judge in your favor (see previous paragraph and footnote 6). In all other cases a reasonable level of risk aversion has no effect on the prediction, with only one exception: in the fine = 4 treatment a risk-averse Red party will not challenge when there are 5 Blue signals.Footnote 7 This very limited effect of risk attitude on the predictions can intuitively be explained by the fact that every decision is a choice between two risky situations: the bias of both the current judge and a new judge is uncertain.

There is abundant evidence for social preferences in laboratory experiments: decision-makers care about relative earnings. In games like the ultimatum or dictator game many decision-makers appear to be inequity-averse (Fehr and Schmidt 1999). In other cases, where the situation is more like a competition, decision-makers typically want to earn more than the other and “to win” (e.g. Bault et al. 2008). Our experimental setup, with two antagonistic parties meeting in court, is more like the second case and may enhance competitive attitudes. Competitive preferences would influence specifically the Red party, because a successful challenge transfers 5 points from the Blue to the Red party.

Finally, there are interesting empirical studies that show that fines may have an effect opposite to the one we expect. For example, Gneezy and Rustichini (2000) report a field experiment at day-care centers where some parents used to arrive late to collect their children. When a monetary fine was introduced for parents who arrived more than 10 min late to pick up their children, the number of late-coming parents increased significantly. After the fine was removed no reduction occurred. The explanation of this result is that the fine changed the parents’ perception of the social interaction in which they were involved. Without the fine the costs of coming late were not exactly defined. There were only abstract costs of breaking a social norm, i.e. the norm that it is socially unacceptable to let the teachers wait. By contrast, after the fine had been introduced the consequences of coming late were defined precisely. Parents probably then interpreted the fine as the only cost of letting the teachers wait. Hence, a fine could lead to a different outcome than the deterrence hypothesis would predict. Similar behavior could take place in our experiment, where the social norm might be that the impartiality of judges is not to be put in doubt. While in our experiment the judges are virtual and not participants in the experiment, which makes the social context of the decisions less central, the experiment is deliberately set in a judicial frame. This frame triggers the legal and social norms of the courts, as participants imagine these.

6 Experimental design

The experiment was run in the laboratory of CREED at the University of Amsterdam in January 2013. In total, 80 students participated as subjects in four sessions and earned on average 30 euro in about 2 h. A within-subject design was employed: in the first part (10 periods) there are no checks of whether challenges are evidently unfounded (control), and in the second part (30 periods) unfounded challenges are refused and the requester has to pay a fine of 0, 4 or 12 points (treatment).Footnote 8 Each period could consist of a maximum of five sequential stages in which all subjects that were still in play had to decide whether or not to request a challenge. Before the experiment started, subjects’ risk preferences were measured using the Holt and Laury test.Footnote 9

Subjects were given a monetary incentive to perform: in each period they could earn points, depending on their own choices and the choices of their opponent, with 1000 points equal to € 10. At the end of the experiment their earnings from all periods and from the Holt and Laury test were totaled and paid out in cash anonymously.

In the experiment the following court case was sketched. Two business partners, Mr. Red and Mr. Blue, have decided to split up their company. However, they are unable to come to an agreement on how to split up the remaining 100 points and go to court. In each period two participants representing opposing parties were randomly and anonymously matched to play against each other. All participants were informed that they would never play the same opponent in consecutive periods.

6.1 Procedure

At the beginning of the experiment each of the subjects was assigned to a private cubicle and computer according to a randomly drawn code. Communication between subjects was not allowed. When all the subjects were seated the experiment started with the Holt and Laury (2002) test in which subjects have to choose ten times between two lotteries.

Subsequently, the instructionsFootnote 10 of the first part of the experiment (control treatment) were shown on the screen and were handed out on paper. Subjects had to answer three questions that checked their understanding of the experiment, i.e. they had to calculate a party’s payoff for an outlined situation that could occur during the experiment. It was stressed to the participants that the situations were random and that the choices mentioned were not necessarily wise. They were also given the opportunity to ask questions, which an experimenter answered privately.

The control treatment consisted of 10 periods with re-matching after each period.Footnote 11 At the beginning of each period the subjects learned their color for that period and the nine signals, represented by red and blue balls, were shown. Hence, the participants could make inferences about the partiality of the sitting judge and the corresponding potential verdict. Based on this information both opposing lawyers simultaneously had to decide whether or not to request a challenging procedure.

When both lawyers decided not to request a challenge, the judge would come to a verdict and the 100 points were distributed accordingly. When one or both lawyers decided to challenge, the partiality of the judge was checked. In case the judge turned out to be neutral, that judge would remain on the case and would give the verdict. However, in case the sitting judge turned out to be partial, this judge would be replaced by a new one. The period would then proceed to the second stage, which starts off, as do all stages, with nine signals from the new judge. In effect, this would return the game to the starting point, where both parties again have to decide whether to request a challenge. A period could last for at most five stages: the participants were informed that after the replacement of four judges, the fifth judge cannot be challenged and will immediately come to a verdict. When a judge has come to a verdict the period ends and the earnings (the division of the 100 points according to the verdict, and the gains or losses from the signals to clients and from the delay of the court) are revealed.

After the end of part one of the experiment subjects received the instructions for the second part of the experiment. These were also shown on the computer screen. Again subjects had to answer control questions to check the understanding of the experiment.

The second part of the experiment consisted of 30 periods, with random re-matching after each period. At the start of each period the subjects learned the fine for an evidently unfounded challenging request (0, 4 or 12), which would stay constant during that period. As in the control treatment, subjects had to decide whether or not to put forward a challenging request after learning the nine signals. When both lawyers decided not to request a challenge, the verdict of the judge was revealed. When one or both lawyers decided to put forward a challenge, the request was first reviewed to determine whether it was evidently unfounded. If the challenge was found to be unfounded, the requester would have to pay a fine and the challenge of the sitting judge was refused. Subsequently, the judge would come to a verdict and both parties’ earnings for that period were revealed. However, if the request was found not to be unfounded, the partiality of the judge was checked. As in part one, if the judge turned out to be neutral, that judge would remain on the case and would give the verdict, and in case the sitting judge turned out to be partial, the judge would be replaced by a new one and the period would proceed to stage two. As in part one, the fifth judge could not be challenged.

After the end of part two, the earnings of the subjects were totaled and revealed to the participants. For this purpose the computer randomly picked which of the ten lotteries chosen earlier by the subject in the Holt and Laury test was played out. Subsequently, both the earnings from the Holt and Laury lottery and the earnings from the rest of the experiment were revealed privately to the subject. Lastly, the participants were asked to give some personal information regarding their age, gender and study. Before leaving the lab all participants were paid out privately.

7 Results

To give a first impression, we consider the overall fraction of challenges in the treatments, compared with the predictions of Table 2 (Fig. 1). We have to distinguish between the first judge and subsequent judges, because of the 2 points premium for challenging the first judge, and between the behavior of Blue and Red parties. Introducing a review of whether a challenge is evidently unfounded, without a fine, decreases the challenges by Red parties substantially (see Table 3 for statistical tests and p values). For Blue parties we do not find a statistically significant effect for the first judge, but for subsequent judges the number of challenges significantly increases. Apparently, the review also communicates that challenges based on solid information are allowed. Interestingly, it can be concluded that a review without fine leads to an improvement of efficiency (fewer unfounded challenges) as well as a higher degree of legal protection (more not-unfounded challenges).

Fig. 1 The overall fraction of challenges by the Red players (top) and Blue players (bottom), with next to it the predictions of Table 2

Table 3 First four rows: percentage of challenges with between brackets the number of choices

Introducing a fine of 4 decreases the number of challenges of both the Blue parties (in line with the prediction; p < 0.05 for first and p < 0.10 for subsequent judges) and Red parties (against the prediction, p < 0.05 for the first judge and n.s. for subsequent judges), compared with a 0 fine. Increasing the fine from 4 to 12 decreases the Red challenges of the first (p < 0.05) but not subsequent judges, and keeps the Blue challenges of the first judge unchanged and decreases the challenges of subsequent judges (p = 0.05).

Next we study the occurrence of challenging for the signals where challenges would be founded, unfounded or uncertain. Figure 2 and Table 3 show averages and statistical tests for the cases with signals that indicate a negatively biased judge.

Fig. 2 Percentage of challenge requests by Red (left) and Blue (right) players for the first (top) or subsequent judges in a period

When a challenge is well-founded, both the red party and the blue party are very likely to challenge in all treatments (94–98 % for red and 87–97 % for blue, see Fig. 2 or Table 3). This is in line with our prediction: introducing a review with or without fine does not influence the challenging behavior when the judge is clearly biased.

In the control treatment a party who receives favorable signals about the first judge, but is quite sure that the other party will challenge, should challenge too because of the 2 points premium. As explained in footnote 3, this can be quite risky, especially for the blue party. Indeed, the blue party is on average more reluctant than the red party to take that risk and to challenge a favorable judge (borderline favorable 26 vs 79 %, 6–9 favorable signals 38 vs 57 %, see Table 3: Wilcoxon test on matching group level, two-sided p = 0.012 and p = 0.018).

In case of a favorable signal subsequent judges should not be challenged, according to our prediction. Indeed, blue parties rarely challenge when there are 5 or more blue signals. However, red parties challenge subsequent judges with 0–4 blue signals in 22 % of the cases in the control treatment (see Fig. 2) and in 46 % when there are exactly 4 Blue signals (not in table). These challenges are considered evidently unfounded in the fine-treatments and become rare when the fine is larger than 0.

Now we turn to the borderline “uncertain” cases where a challenge is considered unfounded with 50 % chance. We first consider the Red parties when there are 5 Blue signals. The prediction is that the Red party will only cease challenging when the fine is increased to 12 points. For the first judge the percentages of challenges do not significantly differ between the control and the fine = 0 or fine = 4 treatments, but we find significantly fewer challenges in the fine = 12 treatment than in the other three treatments (see Table 3). This change is in line with our predictions. However, there is still a considerable number of challenges when the fine is 12 (51 % for the first and 40 % for subsequent judges). This suggests competitive social preferences of at least some subjects.

When there are 4 Blue signals the Blue parties are predicted to challenge only in the control and the fine = 0 treatments. Surprisingly, in the control treatment they challenge in only 50 % (first judge) and 16 % (subsequent judges) of these cases. Challenges are more common in the fine = 0 treatment, where they increase to 63 % (n.s.) and 41 % (statistically significant), respectively. Defining the cases in which a challenge is unfounded implicitly marks challenges in all other situations as founded, and this may have increased the occurrence of challenging. However, learning is an alternative explanation of this result. Challenges decrease again with the introduction of fines: for the first judge from 63 % (fine = 0) to 29 % (fine = 4; comparison of fine = 0 and 4: p = 0.069) to 7 % (fine = 12; comparison of fine = 4 and 12: p = 0.043), and for subsequent judges from 41 % (fine = 0) to 29 % (fine = 4; comparison of fine = 0 and 4: p = 0.025) to 10 % (fine = 12; comparison of fine = 4 and 12: p = 0.063).
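The signal thresholds used in this section can be made explicit in a small Python sketch. It assumes, as the text indicates, that Blue signals count in favour of the Blue party and against the Red party, that a challenge is founded with six or more signals against the challenger, uncertain with exactly five, and evidently unfounded otherwise. The function names are hypothetical, and the expected fine is plain arithmetic that ignores risk attitudes and the 2 points premium.

```python
N_SIGNALS = 9  # each party observes nine signals about the judge


def dismissal_probability(party: str, blue_signals: int) -> float:
    """Chance that the summary review finds a challenge evidently unfounded."""
    if party not in ("red", "blue"):
        raise ValueError("party must be 'red' or 'blue'")
    if not 0 <= blue_signals <= N_SIGNALS:
        raise ValueError("blue_signals must be between 0 and 9")
    # Blue signals work against Red; the remaining signals work against Blue.
    against = blue_signals if party == "red" else N_SIGNALS - blue_signals
    if against >= 6:
        return 0.0  # founded: never dismissed as evidently unfounded
    if against == 5:
        return 0.5  # uncertain: the decision can fall both ways
    return 1.0      # evidently unfounded: always dismissed


def expected_fine(party: str, blue_signals: int, fine: int) -> float:
    """Expected fine for a challenge, ignoring any other payoff component."""
    return dismissal_probability(party, blue_signals) * fine
```

Under these assumptions a Red party with 5 Blue signals faces an expected fine of 2 points when the fine is 4 and 6 points when the fine is 12; the same holds for a Blue party with 4 Blue signals.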

Finally, we turn to the effect of the risk attitude on the challenge decisions. Based upon the Holt and Laury test we divide the participants into risk-averse (≥6 A-choices, N = 43), risk-neutral (5 A-choices, N = 9) and risk-loving (≤4 A-choices, N = 19) subjects. Four subjects never switched from the A to the B-choice and 5 subjects made inconsistent choices (switched more than once): these participants are excluded from this analysis. We find no systematic effect of risk attitude on challenge decisions (see appendix 2). In the specific case where we predicted fewer challenges for risk-averse participants (fine = 4, Red party, 5 Blue signals) the data are in the predicted direction (the first judge is challenged in 69 % of the cases by risk-averse subjects and in 90 % and 100 % by risk-neutral and risk-loving subjects, respectively; for the subsequent judges these numbers are 63.3, 100 and 94.4 %, respectively), but these differences are not statistically significant (note, however, that the number of observations and thus the power of the tests are small).
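The classification rule can be written down as a short Python sketch. The function name and the encoding of the ten lottery decisions as a string of `A` and `B` are assumptions made for illustration. Two further assumptions go beyond the text: any return from B to A is treated as inconsistent, and a subject who chooses B in all ten decisions is counted as risk-loving with zero A-choices.

```python
from typing import Optional


def classify_risk_attitude(choices: str) -> Optional[str]:
    """Classify ten Holt and Laury decisions; None means excluded."""
    if len(choices) != 10 or set(choices) - {"A", "B"}:
        raise ValueError("expected ten decisions, each 'A' or 'B'")
    if "B" not in choices:
        return None  # never switched from the A to the B-choice
    if "BA" in choices:
        return None  # went back to A after choosing B: inconsistent
    a_choices = choices.count("A")
    if a_choices >= 6:
        return "risk-averse"
    if a_choices == 5:
        return "risk-neutral"
    return "risk-loving"  # four or fewer A-choices
```

For example, `classify_risk_attitude("AAAAAABBBB")` returns `"risk-averse"`, while a subject with `"AABAABBBBB"` is excluded.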

Interestingly, it can be concluded that a summary review without fine leads to an improvement of efficiency (fewer unfounded challenges) as well as a higher degree of legal protection (more founded challenges). A small fine of 4 effectively reduces evidently unfounded challenges to close to zero and also reduces uncertain challenges that may or may not be deemed evidently unfounded by a judge. A fine of 12 leads only to a further reduction of uncertain challenges. It can be concluded that with the introduction of a fine a trade-off is created between efficiency and legal protection. This is especially the case with a fine of 12.

8 Conclusion

In this paper we have addressed the strategic (mis)use of the procedures for challenging judges because of alleged partiality. Such procedures exist in most legal systems. The strategic use of these procedures, to gain time or to get a judge who might think differently about specific legal matters than the current judge, is an issue in several jurisdictions. Procedures for challenging judges exist to provide legal safeguards against the eventuality of judges that are not impartial. Legal protection is paramount. However, challenges prolong procedures and consume resources. An efficient mechanism is, therefore, also important. Ideally, such a mechanism would deter challenges made for strategic reasons and without (substantial) evidence, but would not deter parties from challenging judges when (substantial) evidence is available. Evidently, there will be a trade-off between legal protection and efficiency when it comes to determining the burden of proof. How this trade-off actually works out is an empirical matter that is difficult to research other than by experimental means.

We designed an experiment to capture key aspects of the interaction in a simplified and abstract manner. It was set up in such a way that, without countermeasures, rational parties challenge judges frequently. This provides the opportunity to test procedural remedies. Subjects actually challenged judges very often. When there are strong indications of partiality and challenges are thus founded on fact (six or more signals out of nine are against a party, and thus three or fewer in favour), challenges are nearly always made by parties that benefit from delay (“red” parties, generally the defendants): 94.4 % of initial judges and 97.8 % of subsequent judges, i.e. judges who are called upon after a successful initial challenge and possibly further successful challenges. The opposing parties that are disadvantaged by delay (“blue” parties, generally the claimants) also challenge very often in these situations (86.8 % of first and 83.7 % of subsequent judges). Many challenges are also made even if the signals are in favour of the parties themselves (five or more signals in favour of a party and thus four or fewer against it), especially by the red parties (62.5 % of first and 22.0 % of subsequent judges) and, to a lesser degree, by the blue parties (34.8 % of first and 5.3 % of subsequent judges). In between are uncertain situations with weak evidence of partiality (five signals against a party and four in favour of it). Then, red parties challenged 94.4 % of first judges (85.7 % of subsequent judges), blue parties only 50.0 % (15.9 %).

A summary review of whether a challenge is evidently unfounded was introduced. If that was found to be the case, the challenge was dismissed without further consequences for the requesting party. The outcome of the review is determined with certainty if the signals point in a clear direction as demarcated above. In the intermediate situations the decision can fall both ways. The impact of the introduction of the review is that the number of unfounded challenges is reduced from 62.5 to 23.9 %Footnote 12 (from 22.0 to 12.9 % for subsequent judges) for red parties and from 34.8 to 14.5 % (from 5.3 to 2.1 % for subsequent judges, which is not significant) for blue parties. Firmly founded challenges are not affected significantly. Uncertain challenges by red parties are not affected, but these challenges by blue parties of first judges increase from 50.0 to 63.3 % (not significant) and of subsequent judges from 15.9 to 40.7 % (significant). The total number of challenges by red parties of first judges declines (from 79.8 to 58.4 %), while the challenges of subsequent judges do not significantly decline; the challenges by blue parties of first judges remain constant and increase for subsequent judges from 35.4 to 44.3 %. Overall, the total number of challenges declines, as mentioned, for first judges and remains constant for subsequent judges. To conclude, the review mechanism leads to a reduction of unfounded challenges and an increase of challenges that stand a good chance of success. Overall, the number of challenges declines. The review mechanism serves both purposes: legal protection and efficiency.

Next, fines were attached to the review. If a challenge was ruled to be evidently unfounded, a fine was imposed. The impact of a small fine is a sharp decline of unfounded challenges relative to a review without fine: for red parties from 23.9 to 7.0 % of first judges and from 12.9 to 3.4 % of subsequent judges, and for blue parties from 14.5 to 3.2 % of first judges, while their challenges of subsequent judges remain constant. The number of warranted challenges does not change significantly. The uncertain challenges, however, decrease: for red parties from 91.8 to 75.5 % of first judges (not significant) and from 95.7 to 76.5 % of subsequent judges, and for blue parties from 63.3 to 29.3 % of first judges (only marginally significant, p = 0.07) and from 40.7 to 28.8 % of subsequent judges. Overall, the total number of challenges declines, both for first and for subsequent judges. There is a trade-off between legal protection and efficiency.

Increasing fines threefold (from 4 to 12) results, in particular, in a vast reduction of uncertain challenges. For all situations except blue parties and subsequent judges the number of challenges is far below the number without any intervention. For blue parties the number of challenges of subsequent judges is at the same (low) level as initially. Consequently, the total number of challenges decreases further. It was pointed out by an anonymous referee that, if courts are allowed to keep the proceeds from the fines, this may lead to, or (we may add) raise the suspicion that it will lead to, a stronger incentive to dismiss challenges. This potential incentive problem was not included in the experiment, but the issue can be easily avoided by allocating the proceeds elsewhere.

The results are clear and consistent. Introduction of a review to filter out evidently unfounded challenges increases the effectiveness of legal protection, as more parties with a substantial chance of success use the opportunity, and it reduces the number of challenges. This dominates the control condition, not even taking into account that it saves the judiciary time when challenges are summarily dismissed. Attaching a fine to the review increases the efficiency of the challenge mechanism, but reduces the effectiveness of the legal protection offered by the mechanism. These effects are more extreme for higher fines.

Clear policy advice can be given when strategic use of challenging procedures leads to inefficiencies. A review without a fine should be considered if the policy maker attaches foremost value to legal protection. This actually increases the effectiveness of legal protection, while at the same time increasing efficiency. A high fine should be considered if efficiency considerations are dominant and one therefore wants to confine legal protection to challenges that are founded without any uncertainty. There is a trade-off between legal protection and efficiency. More emphasis on legal protection requires a lower fine. It is up to the policy maker to decide on the balance. It should be noted that we focused on informed, strategic litigation, generally requiring knowledgeable counsel. Self-represented litigants are likely to display much more erratic, ill-informed and emotion-based behaviour, and, while our policy advice may well affect the decisions of these litigants in the desired direction, the experiment does not address that behaviour. Still, in the vast majority of major criminal and civil cases lawyers play a central role, and judicial procedures must be capable of dealing with their strategic behaviour in an efficient manner.