All for one or some for all? Evaluating informative hypotheses using multiple N = 1 studies

Analyses are mostly executed at the population level, whereas in many applications the interest is on the individual level instead of the population level. In this paper, multiple N = 1 experiments are considered, where participants perform multiple trials with a dichotomous outcome in various conditions. Expectations with respect to the performance of participants can be translated into so-called informative hypotheses. These hypotheses can be evaluated for each participant separately using Bayes factors. A Bayes factor expresses the relative evidence for two hypotheses based on the data of one individual. This paper proposes to “average” these individual Bayes factors in the gP-BF, the average relative evidence. The gP-BF can be used to determine whether one hypothesis is preferred over another for all individuals under investigation. This measure provides insight into whether the relative preference of a hypothesis from a pre-defined set is homogeneous over individuals. Two additional measures are proposed to support the interpretation of the gP-BF: the evidence rate (ER), the proportion of individual Bayes factors that support the same hypothesis as the gP-BF, and the stability rate (SR), the proportion of individual Bayes factors that express a stronger support than the gP-BF. These three statistics can be used to determine the relative support in the data for the informative hypotheses entertained. Software is available that can be used to execute the approach proposed in this paper and to determine the sensitivity of the outcomes with respect to the number of participants and within condition replications.


Introduction
There is increasing attention for individual-centered analyses (e.g., Molenaar, 2004; Hamaker, 2012). For example, in personalized medicine, it is not relevant to find whether a treatment works on average in a group of individuals but rather whether it works for any individual (Woodcock, 2007). This paper is concerned with individual-centered analyses in the form of multiple N = 1 studies. A core feature of this paper is that multiple hypotheses are formulated for each person. These hypotheses are first evaluated at the individual level, and subsequently conclusions are formed at the group level. Specifically, this will be done in the context of a within-subject experiment (see Kluytmans et al., 2014, for a pilot study into using informative hypotheses in the context of multiple N = 1 studies). In a within-subject experiment each person $i = 1, \ldots, P$ is exposed to the same set of experimental conditions $j = 1, \ldots, J$. By conducting $R$ replications with a dichotomous outcome (0 = failure, 1 = success) in condition $j$, the number of successes $x^i_j$ of person $i$ can be obtained. This can be modeled using a binomial model with $R$ trials and unknown success probability $\pi^i_j$. This paper proposes a Bayesian method that evaluates informative hypotheses (Hoijtink, 2012) for multiple within-subject N = 1 studies. Researchers can formulate informative hypotheses based on (competing) theories or expectations. This can be achieved by using the relations '>' and '<' to impose constraints on the parameters $\pi^i = [\pi^i_1, \ldots, \pi^i_J]$. For example, '$\pi^i_1 > \pi^i_2$' states that $\pi^i_1$ is larger than $\pi^i_2$ and, conversely, '$\pi^i_1 < \pi^i_2$' states that $\pi^i_1$ is smaller than $\pi^i_2$. When a comma is used to separate two parameters, such as '$\pi^i_1, \pi^i_2$', no constraint is imposed between these parameters. For each person, multiple informative hypotheses can be evaluated by means of Bayes factors (Kass & Raftery, 1995).
Using the Bayes factor, it can be determined for each person which hypothesis is most supported by the data. Here, our method departs from traditional analyses. Rather than evaluating hypotheses at the group level, the hypotheses are evaluated for each person separately. In social psychology, for example, it is often hoped or thought that if a hypothesis holds at the group level, this also applies to all individuals (see for example, Moreland & Zajonc, 1982;Klimecki, Mayer, Jusyte, Scheeff, & Schönenberg, 2016). Hamaker (2012) describes the importance of individual analyses using an example: Cross-sectionally, the number of words typed per minute and the percentage of typos might be negatively correlated. That is, people that type fast tend to be good at typing and thus make fewer mistakes than people that type slow. However, at the individual level, a positive correlation exists between these variables, i.e., if a fast typer goes faster than his normal typing speed, the number of mistakes will increase (Hamaker, 2012). Similarly, if multiple persons aim to score a penalty several times, we might find that the average success probability is smaller than 0.5, however this does not imply that each individual has a penalty scoring probability smaller than 0.5. Differently from Hamaker (2012) and Molenaar (2004), our approach does not stop at a single N = 1 study. Rather, when individual analyses have been executed, it is interesting to see if all individuals support the same hypothesis. Thus, when multiple hypotheses are evaluated for P individuals, two types of conclusions can be drawn. First, by executing multiple N = 1 studies, it can be determined for each person if any hypothesis can be selected as the best, and if so, which hypothesis this is. Second, it can be determined if the sample comes from a population that is homogeneous with respect to the support of the specified hypotheses, and if so, which hypothesis is supported most. 
This paper is structured as follows: First, the difference between analyses at the group level and multiple N = 1 analyses is elaborated upon by means of an example that will be used throughout the paper. Second, it will be described how informative hypotheses can be evaluated for one N = 1 study. Third, it will be explained how multiple N = 1 studies can be used to evaluate each hypothesis and detect if any can be selected as the best hypothesis for all individuals. The appropriate number of replications and the number of participants can be determined using a sensitivity analysis. The paper is concluded with a short discussion.

P-population and WP-population
An example of a within-subject experiment is Zedelius, Veling, and Aarts (2011). These researchers investigated the effect of interfering information and reward on memory. In each trial, participants were shown five words on a screen and asked to remember these for a brief period of time. During this time, interfering information was presented on the screen. Afterwards, they were asked to recall the five words verbally in order to obtain a reward. Three factors with two levels each were manipulated over the trials. Before each trial started, participants were shown on the screen a high (hr) or a low (lr) reward that they would receive upon completing the task correctly. This reward could be displayed subliminally (sub), that is, very briefly (17 ms), or supraliminally (sup), that is, for a longer duration of 300 ms. Finally, the visual stimulus interfering with the memory task was either a sequence of letters, low interference (li), or eight words that were different from the five memorized, high interference (hi). Combining these factors results in eight conditions, for example hr-sub-hi and lr-sup-li. Seven trials were conducted in each condition, resulting in a total of 56 trials per participant. After each trial, the participant was given a score of 1 if all five words were recalled and 0 if not. Zedelius et al. (2011) specified expectations regarding the ordering of success probabilities that can be translated into many different hypotheses. One example of an informative hypothesis based on the expectations of Zedelius et al. (2011) is given in Eq. 1, where hr-sup-li is shorthand for $\pi_{\text{hr-sup-li}}$, the success probability in condition hr-sup-li. To simplify the notation in the remainder of this paper, $\pi$ is omitted in all examples using the conditions from Zedelius et al. (2011). Alternatively, the hypothesis could be formulated for each person $i$, as in Eq. 2, where hr-sup-li$^i$ is the success probability in condition hr-sup-li of person $i$.

To illustrate the difference between Eqs. 1 and 2, let us consider a population of persons (P-population from here on) and a within-person population (WP-population from here on). Each individual in the P-population has their own success probabilities $\pi^i$. The averages of these individual probabilities are the P-population probabilities $\pi = [\pi_1, \ldots, \pi_J]$, where $\pi_j = \frac{1}{P} \sum_{i=1}^{P} \pi^i_j$. Equation 1 is a hypothesis regarding the ordering of these P-population probabilities. Equation 2 is a hypothesis regarding the ordering of the WP-population probabilities for person $i$. Evaluating this hypothesis for person $i$ is an example of an N = 1 study.
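The distinction between the two populations can be made concrete with a small numerical sketch. The success probabilities below are hypothetical values chosen for illustration only (three persons, two conditions); they are not taken from Zedelius et al. (2011). The sketch shows that an ordering can hold for the averaged P-population probabilities without holding in every WP-population.

```python
# Hypothetical success probabilities for P = 3 persons in J = 2 conditions
# (illustrative values, not data from the paper).
pi = [
    [0.9, 0.3],  # person 1: condition 1 > condition 2
    [0.8, 0.4],  # person 2: condition 1 > condition 2
    [0.2, 0.5],  # person 3: condition 1 < condition 2
]

P = len(pi)
J = len(pi[0])

# P-population probabilities: average over persons within each condition.
pi_bar = [sum(person[j] for person in pi) / P for j in range(J)]

# Ordering "condition 1 > condition 2" at the P-population level ...
holds_on_average = pi_bar[0] > pi_bar[1]
# ... and in each WP-population separately.
holds_per_person = [person[0] > person[1] for person in pi]

print(pi_bar)
print(holds_on_average, holds_per_person)
```

In this constructed case the ordering holds on average although it fails for the third person, which is the situation the N = 1 analyses are meant to detect.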
Many statistical methods are suited to draw conclusions at the P-population level. However, if a hypothesis is true at the P-population level, there is no guarantee that it holds for all WP-populations (Hamaker, 2012). Thus, a conclusion at the P-population level does not necessarily apply to each individual. Rather than π , this paper concerns the individual π i . If multiple hypotheses are formulated for each person i, it can be determined for each person which hypothesis is most supported. Furthermore, it can be assessed whether the sample of P persons comes from a population that is homogeneous with respect to the informative hypotheses under consideration.

N = 1: how to analyze the data of one person
This section describes how the data of one person can be analyzed. First, the general form of hypotheses considered for every person are introduced. Subsequently, the statistical model used to model the N = 1 data is introduced. Finally, the Bayes factor is introduced and elaborated upon.
Hypotheses

Researchers can formulate informative hypotheses regarding $\pi^i$. The general form of the informative hypotheses used in this paper is

$$H^i_m: R_m \pi^i > \mathbf{0}, \qquad (3)$$

where $m, m' = 1, \ldots, M$ ($m \neq m'$) is the label of a hypothesis, $M$ is the number of hypotheses considered, $m'$ is another hypothesis than $m$, $\pi^i = [\pi^i_1, \ldots, \pi^i_J]$, and $R_m$ is the constraint matrix with $J$ columns and $K$ rows, where $K$ is the number of constraints in a hypothesis. The constraint matrix can be used to impose constraints on (sets of) parameters. An example of a constraint matrix $R$ for $J = 4$ is

$$R_1 = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \qquad (4)$$

which renders

$$H^i_1: \pi^i_1 > \pi^i_2 > \pi^i_3 > \pi^i_4, \qquad (5)$$

which specifies that the success probabilities $\pi^i$ are ordered from large to small. Note that the first row of $R_1$ specifies that $1 \cdot \pi^i_1 - 1 \cdot \pi^i_2 + 0 \cdot \pi^i_3 + 0 \cdot \pi^i_4 > 0$, that is, $\pi^i_1 > \pi^i_2$. The constraint matrix

$$R_2 = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix} \qquad (6)$$

renders the informative hypothesis

$$H^i_2: \frac{\pi^i_1 + \pi^i_2}{2} > \frac{\pi^i_3 + \pi^i_4}{2}, \qquad (7)$$

which states that the average of the first two success probabilities is larger than the average of the last two. Hypotheses constructed using Eq. 3 are a translation of the expectations researchers have with respect to the outcomes of their experiment into restrictions on the elements of $\pi^i$. Another hypothesis that is considered in this paper is the complement of an informative hypothesis:

$$H^i_{m^c}: \text{not } H^i_m. \qquad (8)$$

The complement states that $H^i_m$ is not true in the WP-population. Stated otherwise, the researchers' expectation does not hold. Finally, $H^i_u$ denotes the unconstrained hypothesis

$$H^i_u: \pi^i_1, \ldots, \pi^i_J, \qquad (9)$$

where each parameter is 'free'. An informative hypothesis $H^i_m$ constrains the parameter space such that only particular combinations of parameters are allowed; $H^i_{m^c}$ comprises that part of the parameter space that is not included in $H^i_m$; and together $H^i_m$ and $H^i_{m^c}$ make up $H^i_u$. The difference in use of $H^i_u$ and $H^i_{m^c}$ will be elaborated further in the section on Bayes factors. Zedelius et al. (2011) formulated several expectations concerning the ordering of success probabilities over the experimental conditions.
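To illustrate how a constraint matrix encodes a hypothesis, the following minimal sketch checks whether a given vector of success probabilities satisfies every row of $R \pi^i > 0$. The function name `satisfies`, the scaling of `R2` (one half per entry, so that it literally compares averages), and the two probability vectors are assumptions made for this illustration.

```python
def satisfies(R, pi):
    """Return True if every row r of R gives r . pi > 0."""
    return all(sum(r_k * p for r_k, p in zip(row, pi)) > 0 for row in R)

# Constraint matrix for the full ordering pi_1 > pi_2 > pi_3 > pi_4.
R1 = [[1, -1, 0, 0],
      [0, 1, -1, 0],
      [0, 0, 1, -1]]

# Constraint matrix comparing the average of the first two probabilities
# with the average of the last two (assumed scaling of 1/2 per entry).
R2 = [[0.5, 0.5, -0.5, -0.5]]

pi_a = [0.9, 0.7, 0.5, 0.2]  # hypothetical: strictly decreasing
pi_b = [0.9, 0.5, 0.7, 0.2]  # hypothetical: second and third swapped

for pi in (pi_a, pi_b):
    in_h1 = satisfies(R1, pi)
    in_h2 = satisfies(R2, pi)
    # The complement of a hypothesis holds exactly when the hypothesis does not.
    print(pi, "H1:", in_h1, "H1 complement:", not in_h1, "H2:", in_h2)
```

The second vector violates the full ordering but still satisfies the less specific hypothesis about averages, which reflects that the first hypothesis is a special case of the second.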
The main expectation was that high-reward trials would have a higher success probability than low-reward trials. This main effect and the expectations regarding the other conditions (interference level and visibility duration) can be translated into various informative hypotheses (Kluytmans et al., 2014). A first translation of the expectations is

$$H^i_1: \text{hr-sup-li}^i > \text{hr-sup-hi}^i > \text{hr-sub-li}^i > \text{hr-sub-hi}^i > \text{lr-sup-li}^i > \text{lr-sup-hi}^i > \text{lr-sub-li}^i > \text{lr-sub-hi}^i, \qquad (10)$$

which states that for any person $i$ the success probabilities are ordered from high to low. To give some intuition for this hypothesis, Fig. 1 shows eight bars that represent the experimental conditions; the height of each bar indicates the success probability in that condition, and the ordering of probabilities adheres to $H^i_1$. Substantively, this hypothesis specifies that all conditions with a high reward have a higher success probability than those with a low reward, which in Fig. 1 can be verified since all dark gray bars are higher than any light gray bar. Furthermore, $H^i_1$ specifies that within this main reward value effect, that is, looking only at high-reward conditions or only at low-reward conditions, a supraliminally shown reward (solid border) results in a higher success probability than a subliminally shown reward (dotted border). Finally, within the visibility duration effect, that is, looking only at conditions with the same reward and same visibility duration, low interference (no pattern) results in a higher success probability than high interference (diagonally striped pattern). Alternatively, two less specific hypotheses can be formulated that include the main effect of reward and only one of the remaining main effects:

$$H^i_2: \text{hr-li}^i > \text{hr-hi}^i > \text{lr-li}^i > \text{lr-hi}^i \qquad (11)$$

and

$$H^i_3: \text{hr-sup}^i > \text{hr-sub}^i > \text{lr-sup}^i > \text{lr-sub}^i, \qquad (12)$$

where, for example, hr-li$^i$ indicates the average success probability of the hr-sup-li$^i$ and hr-sub-li$^i$ conditions. In Fig. 1, both $H^i_2$ and $H^i_3$ are true.
Different from H i 1 , these hypotheses do not state that any high-reward condition has a higher success probability than any low-reward condition, but rather that averaged over both interference level and visibility duration high-reward conditions have a higher success probability than low-reward conditions. Additionally, H i 2 further specifies that averaged over visibility duration, the success probability is always higher in high-reward conditions compared to low-reward conditions. Within this main effect of reward value, the success probability is higher for low interference than for high interference. Analogously, H i 3 states that averaged over interference level, the success probability is always larger in high-compared to low-reward conditions. Within this pattern, the success probability is larger for supraliminally compared to subliminally shown rewards.
A fourth hypothesis relates to the interaction effect between reward type and visibility duration:

$$H^i_4: \text{hr-sup}^i - \text{lr-sup}^i > \text{hr-sub}^i - \text{lr-sub}^i, \qquad (13)$$

which states that the benefit of high reward over low reward is larger when the reward is shown supraliminally compared to when the reward is shown subliminally. This, too, is presented in Fig. 1.

For data such as those of Zedelius et al. (2011), obtained for person $i$ in $R$ replications in each condition $j$, the density of the data is

$$f(x^i \mid \pi^i) = \prod_{j=1}^{J} \binom{R}{x^i_j} (\pi^i_j)^{x^i_j} (1 - \pi^i_j)^{R - x^i_j}, \qquad (14)$$

that is, in each condition $j$ the response $x^i_j$ is modeled by a binomial distribution. The prior distribution $h(\pi^i \mid H^i_u)$ for person $i$ is a product over Beta distributions

$$h(\pi^i \mid H^i_u) = \prod_{j=1}^{J} \frac{\Gamma(\alpha_0 + \beta_0)}{\Gamma(\alpha_0)\Gamma(\beta_0)} (\pi^i_j)^{\alpha_0 - 1} (1 - \pi^i_j)^{\beta_0 - 1}, \qquad (15)$$

where $\alpha_0 = \beta_0 = 1$, such that $h(\pi^i \mid H^i_u) = 1$, that is, a uniform distribution. As will be elaborated upon in the next section, only $h(\pi^i \mid H^i_u)$ is needed for the computation of the Bayes factors involving $H^i_m$, $H^i_{m'}$ and $H^i_u$ (Klugkist, Laudy, & Hoijtink, 2005). The interpretation of $\alpha_0$ and $\beta_0$ is the prior number of successes and failures plus one. In other words, using $\alpha_0 = \beta_0 = 1$ implies that the prior distribution is uninformative. Consequently, the posterior distribution based on this prior is completely determined by the data. Furthermore, by using $\alpha_0 = \beta_0 = 1$ for each $\pi^i_j$, the prior distribution is unbiased with respect to informative hypotheses that belong to an equivalent set (Hoijtink, 2012, p. 205). As will be elaborated in the next section, unbiased prior distributions are required to obtain Bayes factors that are unbiased with respect to the informative hypotheses under consideration.
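The unconstrained posterior distribution is proportional to the product of the prior distribution and the density of the data:

$$g(\pi^i \mid x^i, H^i_u) \propto \prod_{j=1}^{J} (\pi^i_j)^{x^i_j + \alpha_0 - 1} (1 - \pi^i_j)^{R - x^i_j + \beta_0 - 1}, \qquad (16)$$

where, for $\alpha_0 = \beta_0 = 1$, the exponents reduce to $x^i_j$ and $R - x^i_j$. As can be seen in Eq. 16, the posterior distribution is indeed only dependent on the data.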

Bayes factor
We will use the Bayes factor to evaluate informative hypotheses. A Bayes factor (BF) is commonly represented as the ratio of the marginal likelihoods of two hypotheses (Kass & Raftery, 1995). Klugkist et al. (2005) and Hoijtink (2012, pp. 51-52, 57-59) show that for inequality constrained hypotheses of the form presented in Eq. 3 the ratio of marginal likelihoods expressing support for $H^i_m$ relative to $H^i_u$ can be rewritten as

$$BF^i_{mu} = \frac{f^i_m}{c^i_m}. \qquad (17)$$

The Bayes factor balances the relative fit and complexity of two hypotheses. Fit and complexity are called relative because they are relative with respect to the unconstrained hypothesis. In the remainder of this text, references to fit and complexity should be read as relative fit and complexity.
The complexity $c^i_m$ is the proportion of the unconstrained prior distribution in agreement with $H^i_m$:

$$c^i_m = \int_{\pi^i \in H^i_m} h(\pi^i \mid H^i_u) \, d\pi^i.$$

Using Eq. 15 with $\alpha_0 = \beta_0 = 1$ for each $\pi^i_j$, it is ensured that the prior distribution is unbiased with respect to hypotheses that belong to an equivalent set. Consider, for example, $H_1: \pi_1 > \pi_2 > \pi_3 > \pi_4$ and $H_2: \pi_1 > \pi_2 > \pi_4 > \pi_3$. These hypotheses, and the other 22 possible orderings of $\pi^i$, are equally complex and should thus have the same complexity. Using Eq. 15, this complexity is computed as $1/24$ for each of the set of 24 equivalent hypotheses (Hoijtink, 2012, p. 60).
The fit $f^i_m$ is the proportion of the unconstrained posterior distribution in agreement with $H^i_m$:

$$f^i_m = \int_{\pi^i \in H^i_m} g(\pi^i \mid x^i, H^i_u) \, d\pi^i.$$

The Appendix describes how stable estimates of the complexity and fit can be computed using MCMC samples from the prior and posterior distribution, respectively. Since Eq. 17 is a ratio of two marginal likelihoods (one for $H^i_m$ and one for $H^i_u$), it follows that

$$BF^i_{mm'} = \frac{BF^i_{mu}}{BF^i_{m'u}} = \frac{f^i_m / c^i_m}{f^i_{m'} / c^i_{m'}}$$

and that

$$BF^i_{mm^c} = \frac{f^i_m / c^i_m}{(1 - f^i_m) / (1 - c^i_m)}.$$

Three hypothetical N = 1 datasets with $J = 4$ and $R = 7$ are presented in Table 1. Informative hypotheses considered for these data include $H^i_1$ from Eq. 5 and $H^i_2$ from Eq. 7. The table presents the complexity, fit and Bayes factors of these hypotheses. As can be seen in the table, the complexity of $H^i_1$ is $.04 \approx 1/24$ and $c^i_2 = .5$. The table illustrates that complexity depends on the hypotheses but not on the data: for each of the three data examples the complexities are the same.
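The computation of complexity, fit and the resulting Bayes factors can be sketched with a few lines of simulation. The sketch below is an illustration under stated assumptions, not the procedure from the Appendix: it uses plain independent Monte Carlo draws from the Beta prior and posterior rather than MCMC, the data vector `x` is hypothetical (it is not one of the persons in Table 1), and the second hypothesis is coded as a comparison of sums, which is equivalent to comparing averages.

```python
import random

def proportion_in_agreement(draw, constraint, n_draws):
    """Share of sampled parameter vectors that satisfy the constraint."""
    return sum(constraint(draw()) for _ in range(n_draws)) / n_draws

def h1(pi):
    # Full ordering pi_1 > pi_2 > pi_3 > pi_4.
    return pi[0] > pi[1] > pi[2] > pi[3]

def h2(pi):
    # Average of the first two larger than the average of the last two.
    return pi[0] + pi[1] > pi[2] + pi[3]

R = 7                 # replications per condition
x = [6, 5, 4, 2]      # hypothetical numbers of successes in J = 4 conditions
a0 = b0 = 1           # uniform Beta(1, 1) prior for every condition
rng = random.Random(2018)

def prior_draw():
    return [rng.betavariate(a0, b0) for _ in x]

def posterior_draw():
    # Beta posterior: prior parameters plus successes and failures.
    return [rng.betavariate(a0 + x_j, b0 + R - x_j) for x_j in x]

n_draws = 50_000
results = {}
for label, constraint in (("H1", h1), ("H2", h2)):
    c = proportion_in_agreement(prior_draw, constraint, n_draws)      # complexity
    f = proportion_in_agreement(posterior_draw, constraint, n_draws)  # fit
    results[label] = {
        "c": c,
        "f": f,
        "bf_mu": f / c,                            # against the unconstrained hypothesis
        "bf_mc": (f / c) / ((1 - f) / (1 - c)),    # against the complement
    }

bf_12 = results["H1"]["bf_mu"] / results["H2"]["bf_mu"]
for label, r in results.items():
    print(label, {k: round(v, 3) for k, v in r.items()})
print("BF_12:", round(bf_12, 3))
```

The estimated complexities should be close to 1/24 and .5 regardless of the data, whereas the fit, and hence the Bayes factors, change with `x`.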
The first example (Person 1) in Table 1 contains data that are in agreement with $H^i_1$, and therefore also with $H^i_2$, since $H^i_1$ is a specific case of $H^i_2$. This is reflected by $f^1_1 = .556$ and $f^1_2 = .996$. Because $H^i_1$ is quite specific, it can easily conflict with the data. For example, based on $x^1_2 = 5$ and $x^1_3 = 4$, it is not very certain that $\pi^1_2 > \pi^1_3$. In contrast, $H^i_2$ is less specific, does not involve the constraint $\pi^1_2 > \pi^1_3$, and therefore $f^1_2$ is larger than $f^1_1$. Bayes factors balance complexity and fit of the hypotheses, resulting in $BF^1_{1u} = 13.16$, $BF^1_{2u} = 2.00$, $BF^1_{12} = 6.59$, and the Bayes factor against the complement, $BF^1_{11^c}$, reported in Table 1. Interpreting the size of Bayes factors is a matter that needs some discussion. Firstly, $BF^1_{12} = 6.59$ means that the support for $H^1_1$ compared to $H^1_2$ has increased by a factor 6.6. However, $BF^i_{12}$ is only a relative measure of support, that is, the best of the hypotheses involved may still be an inadequate representation of the within-person population that generated the data. Note that $BF^i_{mu}$ and $BF^i_{mm^c}$ are always both larger or both smaller than 1. However, by definition $BF^i_{mu}$ ranges from 0 to $1/c^i_m$, whereas $BF^i_{mm^c}$ ranges from 0 to infinity. Therefore, we prefer to interpret the latter to determine if the best of a set of hypotheses is also a good hypothesis. By computing $BF^i_{mm^c}$, we can determine whether the best hypothesis, in this case $H^i_m$, is also a good hypothesis, because we get an answer to the question "is or isn't $H^i_m$ supported by the data?". In Table 1, the Bayes factor of $H^i_m$ against its complement indicates that the data caused an increase in belief for $H^i_m$ compared to $H^i_{m^c}$, which implies that it is a good hypothesis. Note that this does not rule out the possibility of other, perhaps better, good hypotheses.
A second issue is the interpretation of the strength of Bayes factors. Although some guidelines have been provided (e.g., Kass & Raftery, 1995, interpret 3 as the demarcation for the size of $BF_{ab}$, providing marginal and positive evidence in favor of $H_a$), we choose not to follow them. In the spirit of a famous quote from Rosnow and Rosenthal (1989), "surely God loves a BF of 2.9 just as much as a BF of 3.1", we want to stay away from cut-off values in order not to provide unnecessary incentives for publication bias and sloppy science (Konijn, Van de Schoot, Winter, & Ferguson, 2015). In our opinion, claiming that a Bayes factor of 1.5 is not very strong evidence and that a Bayes factor of 100 is strong evidence will not result in much debate. It is somewhere between those values that scientists may disagree about the strength. In this paper, we used the following strategy to decide when a hypothesis can be considered best for a person: a hypothesis $m$ is considered the best of a set of $M$ hypotheses if the evidence for $H_m$ is at least $M - 1$ times (with a minimum value of 2) stronger than for any other hypothesis $m'$. This requirement ensures that the posterior probability for the best hypothesis is at least .5 if all hypotheses are equally likely a priori. For example, if two hypotheses are considered, one should be at least two times more preferred than the other, resulting in posterior probabilities of at least .67 versus .33. If three hypotheses are considered, the resulting posterior probabilities will be at least .50 versus .25 and .25, which corresponds to a twofold preference of one hypothesis over both alternatives. For four hypotheses the posterior probabilities should be at least .50 versus .17, .17 and .17, corresponding to relative support of at least 3 times more for the best hypothesis than for any other hypothesis. Note that, although these choices seem reasonable to us, other strategies can be thought of and justified.
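The posterior probabilities quoted above follow from normalizing the Bayes factors under equal prior probabilities. The short sketch below reproduces the boundary cases, in which the best hypothesis is exactly the required number of times more supported than each alternative; the function names are chosen for this illustration.

```python
def posterior_probabilities(bfs):
    """Posterior probabilities from Bayes factors against a common reference,
    assuming all hypotheses are equally likely a priori."""
    total = sum(bfs)
    return [bf / total for bf in bfs]

def required_ratio(M):
    """Minimum relative support for the best of M hypotheses: M - 1, but at least 2."""
    return max(M - 1, 2)

boundary = {}
for M in (2, 3, 4):
    k = required_ratio(M)
    # Boundary case: the best hypothesis is exactly k times as supported
    # as each of the M - 1 alternatives.
    boundary[M] = posterior_probabilities([k] + [1] * (M - 1))
    print(M, k, [round(p, 2) for p in boundary[M]])
```

For two hypotheses the minimum of 2 applies, so the best hypothesis reaches a posterior probability of two thirds; for three and four hypotheses the boundary value is exactly one half.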
For Person 2 in Table 1, $H^i_2$ has gained slightly more belief than $H^i_1$, since $BF^2_{12} = .78$ ($BF^2_{21} = 1.28$). Based on this Bayes factor, $H^i_2$ is not convincingly the better hypothesis of the two. It is important to note that Bayes factors for different persons do not necessarily express support in favor of one or the other hypothesis. It is very possible that Bayes factors for different persons are indecisive. Looking at $BF^2_{11^c}$ and $BF^2_{22^c}$, $H^i_2$ seems quite a good hypothesis, whereas $H^i_1$ is not much more supported than its complement. Finally, together with Person 3, the examples in Table 1 show the variety in conclusions that can be obtained. There may or may not be a best hypothesis, and the best hypothesis may or may not be a good hypothesis.

Illustration
For Zedelius et al. (2011), the main goal was to select the best hypothesis from H i 1 , H i 2 , H i 3 and H i 4 presented in Eqs. 10, 11, 12 and 13. The Bayes factors presented in the first four columns of Table 2 can be used to select the best hypothesis for each person. If a best hypothesis is selected, it is also of interest to determine whether this hypothesis is a good hypothesis. The last four columns of Table 2 can be used to determine whether the best hypothesis is also 'good'.
For Person 1, $H^1_3$ is 1.98/.59 ≈ 3.36 times more supported than $H^1_1$, 1.98/.93 ≈ 2.13 times more supported than $H^1_2$ and 1.98/.26 ≈ 7.62 times more supported than $H^1_4$. Although $H^1_3$ is more supported than the other three hypotheses, a Bayes factor of 2.13 does not seem very convincing: with $M = 4$ hypotheses, our strategy requires the best hypothesis to be at least $M - 1 = 3$ times more supported than any other. Comparing the relative strength of the support for all informative hypotheses therefore leaves us with the conclusion that for Person 1 no single best hypothesis could be detected.
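The selection strategy can be written down as a small function and applied to the four values quoted for Person 1. This is a sketch of the rule as described in this paper; the function name is hypothetical, and it is assumed that the four values are Bayes factors against a common reference hypothesis, so that their ratios give the Bayes factors between hypotheses.

```python
def best_hypothesis(bfs):
    """Return the label of the best hypothesis, or None if there is none.

    bfs maps a hypothesis label to its Bayes factor against a common
    reference. A hypothesis is 'best' if its support is at least
    max(M - 1, 2) times that of every other hypothesis.
    """
    k = max(len(bfs) - 1, 2)
    for m, bf_m in bfs.items():
        if all(bf_m >= k * bf_o for o, bf_o in bfs.items() if o != m):
            return m
    return None

# Values quoted in the text for Person 1 (assumed to share one reference).
person_1 = {"H1": 0.59, "H2": 0.93, "H3": 1.98, "H4": 0.26}
print(best_hypothesis(person_1))

# Hypothetical person for whom one hypothesis clearly dominates.
person_x = {"H1": 4.0, "H2": 1.0, "H3": 1.2, "H4": 0.5}
print(best_hypothesis(person_x))
```

For Person 1 the function returns no best hypothesis, because the support for the third hypothesis is only about 2.13 times that for the second, which is below the required factor of 3.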
For Person 8, none of the informative hypotheses is preferred over the unconstrained hypothesis. Thus, for each of the formulated hypotheses, our belief has decreased after obtaining the data. These examples show that it differs per person whether a best hypothesis can be detected, which hypothesis this is, and how strong the evidence is relative to the other hypotheses. Based on Table 2, Zedelius et al. (2011) can conclude for each individual what the best hypothesis is, and whether it is a good hypothesis. We find that the sample contains persons for whom a best hypothesis can be detected, but this hypothesis is not a good hypothesis (Persons 20 and 21). Additionally, there are individuals for whom a best hypothesis can be detected and the best hypothesis is good (Persons 6, 14, 15, 16, 17, 19, 22, and 23). For the remaining individuals, no best hypothesis could be selected. Someone else evaluating these Bayes factors might come to slightly different conclusions if they apply a different rule to decide what makes a hypothesis the best from a set.
The second goal of this paper was to determine whether the sample of individuals comes from a homogeneous population with respect to the support for the hypotheses of interest. The first impression gained from Table 2 is that this is not the case. However, this topic will be pursued in depth in the next section.

A P-population of WP-populations
Looking at the Bayes factors in Table 2 is a rather ad hoc manner to answer the question whether the sample comes from a population that is homogeneous in its support for the hypotheses under consideration and which hypothesis is the best. By aggregating the individual Bayes factors we can try to evaluate in more detail to what extent individuals are homogeneous with respect to a hypothesis. If $H^i_m$ is evaluated for $P$ independent persons, the corresponding individual Bayes factors can be multiplied into a P-population Bayes factor (Stephan & Penny, 2007):

$$\text{P-BF}_{mu} = \prod_{i=1}^{P} BF^i_{mu},$$

which expresses the support for $H_m$ relative to $H_u$, where

$$H_m: H^1_m \,\&\, \ldots \,\&\, H^P_m,$$

which states that $H^i_m$ holds for every person $i = 1, \ldots, P$, and

$$H_u: H^1_u \,\&\, \ldots \,\&\, H^P_u,$$

which is the union of $H^i_u$ for $i = 1, \ldots, P$. In this section, using the Bayes factor, $H^i_m$ and $H_m$ are compared with $H^i_u$ and $H_u$, respectively. However, analogously, $H^i_u$ could be replaced by $H^i_{m'}$ or $H^i_{m^c}$, rendering $\text{P-BF}_{mm'}$ and $\text{P-BF}_{mm^c}$, respectively. Note that this is not the Bayes factor describing the relative evidence for $H_m$ and $H_{m'}$ with regard to the P-population parameters $\pi$. Individual data could be used to evaluate a Bayes factor with respect to the P-population $\pi$, but our focus here is on the collection of individual WP-populations $\pi^i$. Another way to interpret this P-BF is in the context of synthesis of knowledge with respect to the individually evaluated hypotheses $H^i_m$. Thus, it is a measure of the extent to which a hypothesis holds for every individual, rather than on average.

Table 3 shows seven hypothetical sets of up to six individual Bayes factors comparing $H^i_m$ to $H^i_u$. The P-BF is presented for each set. For example, Set 1 results in a P-BF of 64, indicating that the data are 64 times more supportive of $H^i_m$ holding for all persons $i$ than of $H_u$. However, the table shows an undesirable property of the P-BF, namely that it is a function of $P$. As can be seen, in each of Sets 1, 2 and 3 the P-BF is 64. Nevertheless, it is clear that all individual Bayes factors in Set 1 express stronger evidence than in Sets 2 and 3.
Stephan and Penny (2007) have suggested using the geometric mean of the individual Bayes factors to render a summary that is independent of $P$:

$$\text{gP-BF}_{mu} = \left( \prod_{i=1}^{P} BF^i_{mu} \right)^{1/P} = \left( \text{P-BF}_{mu} \right)^{1/P},$$

which is a measure of the 'average' support in favor of $H_m$ relative to $H_u$ found in $P$ persons. In other words, it can be interpreted as the Bayes factor that is expected for the $(P + 1)$-th individual sampled from the P-population. As can be seen in Table 3, the $\text{gP-BF}_{mu}$ does not depend on $P$. For example, in Set 1 the gP-BF is 8.00 and in the larger Sets 2 and 3, the average support for $H_m$ is 2.83 and 2.00, respectively, while $\text{P-BF}_{mu} = 64$ for each of these sets.
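The relation between the two summaries can be checked numerically. Table 3 is not reproduced here, so the three sets below are hypothetical stand-ins with equal Bayes factors within each set, chosen only so that they reproduce the quoted summaries (a product of 64 and geometric means of 8.00, 2.83 and 2.00); the function names are likewise chosen for this sketch.

```python
import math

def p_bf(bfs):
    """P-population Bayes factor: product of the individual Bayes factors."""
    return math.prod(bfs)

def gp_bf(bfs):
    """Geometric mean of the individual Bayes factors, computed on the log scale."""
    return math.exp(sum(math.log(bf) for bf in bfs) / len(bfs))

# Hypothetical sets of sizes 2, 4 and 6 with the same product.
set_1 = [8.0] * 2
set_2 = [64 ** 0.25] * 4
set_3 = [2.0] * 6

for name, bfs in (("Set 1", set_1), ("Set 2", set_2), ("Set 3", set_3)):
    print(name, "P =", len(bfs), "P-BF =", round(p_bf(bfs), 2), "gP-BF =", round(gp_bf(bfs), 2))
```

Working on the log scale is a design choice that avoids overflow or underflow when many Bayes factors are multiplied.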
If multiple hypotheses are considered, $\text{gP-BF}_{mm'}$ and $\text{gP-BF}_{mm^c}$ can be derived similarly to $BF^i_{mm'}$ and $BF^i_{mm^c}$. It is important to keep in mind that the $\text{gP-BF}_{mu}$ is a summary measure and does not have the same properties as individual Bayes factors. One such property is that $BF^i_{mu}$ and $BF^i_{mm^c}$ are always both smaller or both larger than 1. For example, if $BF^1_{1u} = 0.2$, then $BF^1_{11^c} < 1$, and if $BF^2_{1u} = 1.8$, then $BF^2_{11^c} > 1$ (the exact values depend on the complexity). This need not be true for $\text{gP-BF}_{mu}$ and $\text{gP-BF}_{mm^c}$. To continue the example based on the Bayes factors for Persons 1 and 2, $\text{gP-BF}_{1u} = \sqrt{0.2 \times 1.8} = 0.6$, whereas $\text{gP-BF}_{11^c}$, the geometric mean of the two complement Bayes factors, can be larger than 1. For interpretation of the gP-BF, it is important to keep in mind that $\text{gP-BF}_{mu}$ is a summary of all $BF^i_{mu}$, and thus cannot be translated into $\text{gP-BF}_{mm^c}$, which is a summary of all $BF^i_{mm^c}$. Note that if a switch in direction occurs, both geometric Bayes factors are generally close to 1, and therefore do not lead to strongly contradictory conclusions.
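The possible switch in direction can be demonstrated numerically. The sketch below uses the two Bayes factors from the example (0.2 and 1.8) and converts each into its complement counterpart via the fit implied by $BF^i_{mu} = f^i_m / c^i_m$. The complexity `c = 0.55` is an assumed value chosen for illustration (it must satisfy 1.8 × c < 1); it is not taken from the paper.

```python
import math

def bf_complement(bf_mu, c):
    """Bayes factor against the complement, given BF_mu and complexity c."""
    f = bf_mu * c  # fit implied by BF_mu = f / c
    return (f / c) / ((1 - f) / (1 - c))

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

c = 0.55             # assumed complexity, for illustration only
bf_mu = [0.2, 1.8]   # individual Bayes factors from the example
bf_mc = [bf_complement(bf, c) for bf in bf_mu]

gp_mu = geometric_mean(bf_mu)
gp_mc = geometric_mean(bf_mc)
print("individual BF_mu:", bf_mu, "individual BF_mm^c:", [round(b, 3) for b in bf_mc])
print("gP-BF_mu:", round(gp_mu, 3), "gP-BF_mm^c:", round(gp_mc, 3))
```

Under this assumed complexity each individual pair agrees in direction, yet the geometric mean against the unconstrained hypothesis is below 1 while the geometric mean against the complement is above 1. The asymmetry arises because the Bayes factor against the unconstrained hypothesis is bounded by 1/c, whereas the Bayes factor against the complement is unbounded.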
However, the $\text{gP-BF}_{mu}$ has another issue. Table 3 shows that different sets of individual Bayes factors can lead to the same $\text{gP-BF}_{mu}$. For example, in Sets 3, 4, and 5 the same gP-BF is obtained. Set 3 contains only Bayes factors that are close to the gP-BF = 2 and all support $H^i_m$. Set 4 seems similar in the strength of support in the individual Bayes factors, although there seems to be more variation than in Set 3, and we find one Bayes factor that does not support $H^i_m$. Finally, Set 5 contains four Bayes factors that express support for $H^i_u$ over $H^i_m$, while two Bayes factors express relatively strong support in favor of $H^i_m$ over $H^i_u$. The fact that the Bayes factors from Sets 3 and 4 come from populations that are more homogeneous in their preference for $H^i_m$ than Set 5 is not represented well by the $\text{gP-BF}_{mu}$. Therefore, an additional measure, the evidence rate ($ER_{mu}$), is introduced that describes the consistency in the preferred hypothesis in multiple individual Bayes factors:

$$ER_{mu} = \begin{cases} \frac{1}{P} \sum_{i=1}^{P} I_{BF^i_{mu} > 1} & \text{if } \text{gP-BF}_{mu} > 1, \\ \frac{1}{P} \sum_{i=1}^{P} I_{BF^i_{mu} < 1} & \text{if } \text{gP-BF}_{mu} < 1, \end{cases}$$

that is, the proportion of individual Bayes factors that support the same hypothesis as the $\text{gP-BF}_{mu}$.

There is still one issue that needs to be resolved. Sets 6 and 7 result in the same $\text{gP-BF}_{mu}$ and $ER_{mu}$ as Set 3, but are not similar in individual contributions. Set 6 contains an outlier that expresses strong evidence for $H^i_m$, whereas all other cases express only weak support for $H^i_m$. Without this outlier, the $\text{gP-BF}_{mu}$ would be much lower. Set 7 contains two Bayes factors that express very little support for $H^i_m$, whereas the other four cases express stronger support for $H^i_m$. Without these two 'weak' cases, the $\text{gP-BF}_{mu}$ would be somewhat higher. In contrast, Set 3 contains Bayes factors that are rather constant around the gP-BF; removing any of these cases would not affect the $\text{gP-BF}_{mu}$ much. To describe the presence and direction of skewness among individual Bayes factors with respect to the $\text{gP-BF}_{mu}$, a final measure is introduced: the stability rate.
The stability rate ($SR_{mu}$) is a measure of skewness among individual Bayes factors with respect to the $\text{gP-BF}_{mu}$. It can be written as:

$$SR_{mu} = \begin{cases} \frac{1}{P} \sum_{i=1}^{P} I_{BF^i_{mu} > \text{gP-BF}_{mu}} & \text{if } \text{gP-BF}_{mu} > 1, \\ \frac{1}{P} \sum_{i=1}^{P} I_{BF^i_{mu} < \text{gP-BF}_{mu}} & \text{if } \text{gP-BF}_{mu} < 1, \end{cases}$$

where $I_{BF^i_{mu} < \text{gP-BF}_{mu}} = 1$ if $BF^i_{mu} < \text{gP-BF}_{mu}$ and 0 otherwise, and analogously for the other indicator. The $SR_{mu}$ describes the proportion of individual Bayes factors that express support stronger than the gP-BF for the hypothesis preferred by the $\text{gP-BF}_{mu}$. The $SR_{mu}$ values for Sets 1, 2, 3, and 4 are given in Table 3. An $SR_{mu}$ smaller than .50, as in Sets 5 and 6, indicates that less than half of the individual Bayes factors express stronger support for $H^i_m$ than the gP-BF. Consequently, the $\text{gP-BF}_{mu}$ is relatively large because of a minority of individual Bayes factors that are relatively large. The $\text{gP-BF}_{mu}$ is overestimated because of this minority. In Set 5 the $\text{gP-BF}_{mu}$ supports $H^i_m$, while the majority of individual Bayes factors support $H^i_u$. The $\text{gP-BF}_{mu}$ is then no longer a representative 'average' support. Conversely, an $SR_{mu}$ larger than .50 indicates that only relatively few individual Bayes factors express weaker support than the gP-BF (see Set 7). Thus, for $SR_{mu} > .50$, the $\text{gP-BF}_{mu}$ is relatively close to 1 because of a minority of individual Bayes factors that express support that is relatively weak. As a result, the strength of support is underestimated.
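The three summary statistics can be computed together from a list of individual Bayes factors. The sketch below follows the verbal definitions given above; the function names and the two example sets are hypothetical and are not the sets from Table 3. How to treat a geometric mean of exactly 1 is not specified in the text, so the sketch assigns that case to the second branch as an arbitrary assumption.

```python
import math

def gp_bf(bfs):
    """Geometric mean of the individual Bayes factors."""
    return math.exp(sum(math.log(bf) for bf in bfs) / len(bfs))

def evidence_rate(bfs):
    """Proportion of individual Bayes factors preferring the same hypothesis as the gP-BF."""
    g = gp_bf(bfs)
    if g > 1:
        return sum(bf > 1 for bf in bfs) / len(bfs)
    return sum(bf < 1 for bf in bfs) / len(bfs)  # also used when g == 1 (assumption)

def stability_rate(bfs):
    """Proportion of individual Bayes factors expressing stronger support than the gP-BF."""
    g = gp_bf(bfs)
    if g > 1:
        return sum(bf > g for bf in bfs) / len(bfs)
    return sum(bf < g for bf in bfs) / len(bfs)  # also used when g == 1 (assumption)

# Hypothetical sets: similar geometric means, very different composition.
homogeneous = [1.5, 2.0, 2.5, 1.8, 2.2, 2.1]
skewed = [0.5, 0.6, 0.7, 0.8, 20.0, 30.0]

for name, bfs in (("homogeneous", homogeneous), ("skewed", skewed)):
    print(name, round(gp_bf(bfs), 2), round(evidence_rate(bfs), 2), round(stability_rate(bfs), 2))
```

Both sets have a geometric mean of roughly 2, but only the evidence rate and stability rate reveal that in the second set the 'average' support is carried by two large Bayes factors.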
Thus, the $\text{gP-BF}_{mu}$ can be used to express the average support of the individual Bayes factors. In order to assess whether the individual Bayes factors come from a homogeneous population, the $ER_{mu}$ can be used. A high evidence rate indicates high agreement in preferred hypothesis among individual Bayes factors, and thus more homogeneity. Finally, the $SR_{mu}$ gives an indication of how the individual Bayes factors are distributed around the $\text{gP-BF}_{mu}$. Note that the equations presented for the ER and SR describe those corresponding to $\text{gP-BF}_{mu}$. If the interest is in $\text{gP-BF}_{mm'}$ or $\text{gP-BF}_{mm^c}$, the ER and SR should be computed using the individual $BF^i_{mm'}$ and $BF^i_{mm^c}$, because the individual Bayes factors are the relevant quantities in the ER and SR.

Using the individual Bayes factors presented in Table 2, the $\text{gP-BF}_{mu}$, $ER_{mu}$ and $SR_{mu}$ can be computed for the data of Zedelius et al. (2011). The first row of Table 4 gives the $\text{gP-BF}_{mu}$ based on the individual Bayes factors from Table 2. From the gP-BFs in Table 4, it can be concluded that none of the hypotheses is convincingly the best description for all individuals and that none of the hypotheses is clearly a better description of all individuals than its complement is.
Additionally, we find that the ER mu for the comparison of H 1 with H u is .500, indicating that approximately half of the individual Bayes factors express support for H i 1 , while the other half express support for H i u . Similarly, ER 2u , ER 3u and ER 4u are .346, .615 and .423, indicating that for these hypotheses, too, there is little homogeneity among the individual Bayes factors. Only SR 1u is rather close to .50, and consequently it is not likely that the gP-BF mu is affected by one or more influential cases having a (much) smaller BF than the majority. For the other hypotheses, there is an indication that the strength of the gP-BF mu is affected by skewness among the individual Bayes factors.
Based on the gP-BF mu , ER mu , and SR mu , we can draw the following conclusions. Firstly, using the gP-BF mu no hypothesis could be selected as the best hypothesis from the set. The SR mu values indicate that for all hypotheses but H i 1 an imbalance among individual Bayes factors was present. Furthermore, the relatively low ER mu values indicate that it is unlikely that the individuals come from a homogeneous population with respect to any of the specified hypotheses. Finally, none of the hypotheses appears to be a good description of the ordering of the individual success probabilities. Thus, based on these findings it seems unlikely that the P-population is homogeneous with respect to the WP-population hypotheses that were considered.
A within-person experiment, such as conducted by Zedelius et al. (2011), is quite common in social and neuropsychological research. For these experiments, although WP-population hypotheses are formulated, the analyses are usually executed at the P-population level. In the original Zedelius et al. (2011) paper, the data were analyzed by means of a repeated measures ANOVA, which tests differences in the P-population means. The conclusions obtained from this analysis imply that H 2 holds at the P-population level. The often implicit assumption is that if a hypothesis holds at the P-population level, it holds for all individuals. The current analysis shows that although H 2 is a reasonable hypothesis at the P-population level, it appears not to be the single best hypothesis under consideration and is not a good hypothesis for all individuals. The assumption that an average conclusion holds for all individuals is in this case violated. It is important that psychological researchers are aware that conclusions at the P-population level cannot be transferred to the individual level without testing this. Within-person experiments offer rich data that allow for the evaluation of individual hypotheses, through which the assumption that a hypothesis holds for everyone can be tested. This paper introduces an approach with which this can be done.
Table 4 The gP-BF, ER and SR for the data of Zedelius et al. (2011)

Determining the sample size and number of replications for a study
Suppose a researcher has a research question that they want to test by means of an experiment. This research question defines which and how many conditions J should be considered and results in one or multiple hypotheses of interest. The researcher is then left with two choices regarding the experiment, namely, the number of replications R used in each condition and the sample size P . This section describes a method to choose R and P .
In the previous section, a method to evaluate a set of individual Bayes factors has been introduced in the form of three measures: gP-BF mu , ER mu and SR mu . It is important to investigate the properties of these measures as a function of sample size and the number of replications. In other words, if all individuals are indeed homogeneous with respect to an individual informative hypothesis, what sample size and number of replications are required for gP-BF mu , ER mu and SR mu to succeed in detecting this? Analogously, if individuals are not homogeneous, can this be derived from these measures?
Through a sensitivity analysis it can be determined for which sample size and number of replications the gP-BF mu can be expected to prefer the hypothesis that is in agreement with the true P-population, the ER mu is sufficiently high and the SR mu is close to .5. The choice of the values at which the gP-BF mu , ER mu and SR mu are considered to behave as desired is subjective. In line with our reasoning for the interpretation of individual Bayes factors described earlier, the choice of when the strength of support in the gP-BF is sufficient to prefer one hypothesis over another is subjective and no guidelines are provided. Additionally, we will consider .9 to be sufficiently high for the ER mu , that is, at most 10% of the individual Bayes factors prefer a different hypothesis than the majority, and a .1 margin around .5 to be reasonable for the SR mu , that is, the proportion of individual Bayes factors expressing stronger support than the gP-BF mu is between .4 and .6.
Using R version 3.3.1 (R Core Team, 2013), software has been developed with which such a study design analysis can be executed. While discussing the options of this program, we focus on the evaluation of the gP-BF of H i m against its complement, in order to arrive at an appropriate study design to determine whether H i m holds for everyone in the P-population. The program can analogously be used for study design analyses for gP-BF mm' or gP-BF mu . The required input and the algorithm used are illustrated using Zedelius et al. (2011), as the analysis could have been conducted before the data collection started.
The R program requires as input the number of conditions J and the hypotheses that a researcher wants to investigate. Additionally, the numbers of replications R and the sample sizes P that a researcher is willing to consider should be specified. Using this input, the following steps are executed:
• For each hypothesis of interest H i m , three P-populations are specified: one where H i m is true for all WP-populations, one where its complement is true for all WP-populations, and a mixture of these two populations. In the next section these P-populations are specified in more detail for the example from Zedelius et al. (2011).
• For each P-population, the program generates 10,000 WP-populations, that is, parameter vectors π i of size J .
• For each R specified by the user, data are generated for each WP-population and the corresponding individual Bayes factor is computed.
This results in 10,000 individual Bayes factors for each combination of P-population and R. For computational reasons, this set is used as a surrogate for the true infinite P-population. For each surrogate P-population, the following steps are then followed:
• For each sample size P and number of replications R, 1000 sets of individual Bayes factors are sampled with replacement from the surrogate P-population.
• For each set, the gP-BF mu , ER mu and SR mu are computed, resulting in 1000 values of each measure for every sample size P and number of replications R.
• From these 1000 values of gP-BF mu , ER mu and SR mu the 2.5, 50 and 97.5 percentiles are obtained. The 50th percentile, the median, is used to summarize what values can be expected for each of these measures. The desired values of these expectations are, as described above, subjectively defined for the gP-BF mu , above .9 for the ER mu and within a .1 margin from .5 for the SR mu . The 2.5 and 97.5 percentiles indicate the range in which 95% of the sampled gP-BF mu , ER mu and SR mu lie. If this range is very wide and includes unreasonable values, the combination of R and P might not be appropriate even when the expected value is at a desired level.
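The resampling part of these steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R program itself: the surrogate P-population is taken as a given list of individual Bayes factors (here filled with hypothetical log-normal values purely for demonstration), and the names `measures`, `percentile` and `bootstrap_summary` are invented for this sketch.

```python
import math
import random


def measures(bfs):
    """gP-BF (geometric mean), ER and SR for one set of Bayes factors."""
    p = len(bfs)
    gp_bf = math.exp(sum(math.log(bf) for bf in bfs) / p)
    if gp_bf >= 1:
        er = sum(bf > 1 for bf in bfs) / p
        sr = sum(bf >= gp_bf for bf in bfs) / p
    else:
        er = sum(bf < 1 for bf in bfs) / p
        sr = sum(bf < gp_bf for bf in bfs) / p
    return gp_bf, er, sr


def percentile(sorted_values, q):
    """Percentile q (0-100) of a sorted list, with linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_summary(surrogate_bfs, sample_size, n_sets=1000, seed=1):
    """2.5, 50 and 97.5 percentiles of gP-BF, ER and SR over resampled sets."""
    rng = random.Random(seed)
    values = {"gP-BF": [], "ER": [], "SR": []}
    for _ in range(n_sets):
        # Sample P individual Bayes factors with replacement.
        sample = rng.choices(surrogate_bfs, k=sample_size)
        for name, value in zip(values, measures(sample)):
            values[name].append(value)
    return {
        name: tuple(percentile(sorted(v), q) for q in (2.5, 50, 97.5))
        for name, v in values.items()
    }


# Hypothetical surrogate population of 10,000 Bayes factors (demo only).
demo_rng = random.Random(0)
surrogate = [demo_rng.lognormvariate(2.0, 1.5) for _ in range(10000)]
for p_size in (5, 25, 50):
    print(p_size, bootstrap_summary(surrogate, p_size))
```

Running the function for each candidate P (and for the surrogate population belonging to each R) gives the medians and 95% ranges that are compared against the subjective criteria described above.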
In the next section we will illustrate how this information can be used to determine the R and P required to execute a study.

Illustration
This section describes a sensitivity analysis for the determination of the number of replications R and sample size P , where the setup of Zedelius et al. (2011) will be used as a starting point. Of course, such an analysis should be executed prior to the data collection, which Zedelius et al. (2011) have already completed. However, for the illustration we will do the analysis as if no data have been collected yet. This will tell us whether the eventually chosen R and P were sufficient according to the sensitivity analysis. The first step of the sensitivity analysis described in the previous section requires a research question leading to the number of conditions J and a set of hypotheses representing the researchers' expectations. The research question of Zedelius et al. rendered three hypotheses, Eqs. 10-12, about the ordering of success probabilities in the J = 8 experimental conditions. For this illustration, only H i 1 as in Eq. 2 is considered. This results in the following parameters for the sensitivity analysis:
• Number of conditions. Zedelius et al. (2011) considered 8 different conditions, so J = 8.
• Hypothesis. The hypothesis that will be considered for this illustration is H i 1 . From this hypothesis, three relevant P-populations are derived.
- P-population 1. In this P-population all individuals adhere to H i 1 . Using this population, the median values of the gP-BF mu , ER mu and SR mu can be determined if H i 1 holds for everyone. To compute these median values, the individual parameters π i are repeatedly sampled from the prior distribution under H i 1 :
$$
h(\pi^i \mid H^i_1) \propto h(\pi^i \mid H^i_u)\, I_{\pi^i \in H^i_1}, \tag{28}
$$
where $I_{\pi^i \in H^i_1} = 1$ if $\pi^i$ is in agreement with $H^i_1$ and 0 otherwise.
- P-population 2. In this P-population all individuals adhere to the complement of H i 1 , denoted $H^i_{1^c}$ in the equations below. Using this population, the expected values of the gP-BF mu , ER mu and SR mu can be determined if the complement holds for everyone. The individual parameters π i are sampled from the prior distribution under the complement, that is:
$$
h(\pi^i \mid H^i_{1^c}) \propto h(\pi^i \mid H^i_u)\, I_{\pi^i \in H^i_{1^c}}, \tag{29}
$$
where $I_{\pi^i \in H^i_{1^c}} = 1$ if $\pi^i$ is in agreement with $H^i_{1^c}$ and 0 otherwise.
- P-population 3. For the third P-population, a mixture of P-populations 1 and 2 is considered. Using this population, the expected values of the gP-BF, ER and SR can be determined if H i 1 holds for a proportion θ of individuals in the P-population, and its complement holds for a proportion 1 − θ of individuals. The individual parameters π i are sampled from Eq. 28 if u i , sampled from U(0, 1), is smaller than or equal to the specified proportion θ , and sampled from Eq. 29 if u i is larger than θ :
$$
\pi^i \sim
\begin{cases}
h(\pi^i \mid H^i_1) & \text{if } u_i \leq \theta,\\
h(\pi^i \mid H^i_{1^c}) & \text{if } u_i > \theta.
\end{cases} \tag{30}
$$
The proportion θ is set to .5, thus half of all individuals adhere to H i 1 and the other half adhere to its complement.
Next, the sample sizes P and numbers of replications R that the researchers want to consider should be chosen. Based on the choices made by Zedelius et al. (2011), the following values for P and R are considered for the sensitivity analysis:
• Number of replications. Zedelius et al. (2011) used seven replications in their experiment. Additionally, it is of interest whether more replications would result in better performance, therefore R = 7, 14, 21 are considered.
• Number of individuals. Zedelius et al. (2011) used 26 participants in their experiment. In order to mimic an a priori sensitivity analysis, the sample sizes P = 5, 7, 10, 15, 20, 25, 30, 40, 50 are considered.
Figure 2 shows the results of the sensitivity analysis for the determination of sample size P and number of replications R. The results are presented for each of the three simulated P-populations described in the previous section. The first column of the figure shows the performance of the gP-BF mu , ER mu and SR mu if H i 1 is true for all individuals (P-population 1). As can be seen in the top left panel, already for small sample sizes the gP-BF mu expresses strong support for H 1 : the 2.5th percentile of the gP-BF mu is larger than 10 for R > 7 and P > 5. The 2.5th percentile of the ER mu only stabilizes above .9 for P > 30 when R = 7, whereas for R = 14, 21 this is already achieved for P > 10. Stated otherwise, if H i 1 holds for all individuals, for samples larger than 30 it is likely that less than 10 per cent of the individual Bayes factors express support for its complement. Finally, the bottom panel shows that the SR mu stabilizes around .55, reflecting that it is reasonable to expect slightly more than half of the individual Bayes factors to express stronger support than the gP-BF mu . This implies that the gP-BF mu is, on average, slightly more influenced by the 'weaker' and contradicting individual Bayes factors. The 2.5 and 97.5 percentiles are within a margin of .1 from the median for P > 25. Furthermore, we see that from around P = 25 the median and the 2.5 and 97.5 percentiles stabilize.
Fig. 2 Results of the sensitivity analysis for the three generated true P-populations for J = 8. P-population 1 is described in Eq. 28, P-population 2 in Eq. 29, and P-population 3 in Eq. 30. Both the median and 95% interval are shown in the figures
Thus, if H i 1 is true for all individuals, with sample size P around 25 to 30 and R = 7, the gP-BF mu , ER mu and SR mu perform as desired: the gP-BF mu shows strong evidence for the true hypothesis, the ER mu is high and the SR mu is around .5.
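The three simulated P-populations (Eqs. 28 to 30) can be generated along the following lines. This sketch makes several assumptions that should be checked against the actual setup: the unconstrained prior is taken to be independent uniform distributions on (0, 1), H i 1 is represented by the hypothetical full ordering π1 > π2 > ... > πJ (the specific ordering in the paper's H i 1 is not reproduced here), and all function names are invented for illustration.

```python
import random


def satisfies_h1(pi):
    """True if pi_1 > pi_2 > ... > pi_J (assumed form of H1)."""
    return all(a > b for a, b in zip(pi, pi[1:]))


def sample_h1(rng, j=8):
    """Uniform prior truncated to the ordering: sort uniform draws."""
    return sorted((rng.random() for _ in range(j)), reverse=True)


def sample_complement(rng, j=8):
    """Rejection sampling: redraw until the ordering is violated."""
    while True:
        pi = [rng.random() for _ in range(j)]
        if not satisfies_h1(pi):
            return pi


def sample_population(kind, n, theta=0.5, j=8, seed=1):
    """Draw n parameter vectors from P-population 1, 2 or 3."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if kind == 1:
            draws.append(sample_h1(rng, j))
        elif kind == 2:
            draws.append(sample_complement(rng, j))
        elif kind == 3:
            # Mixture: H1 with probability theta, complement otherwise.
            u = rng.random()
            draws.append(sample_h1(rng, j) if u <= theta
                         else sample_complement(rng, j))
        else:
            raise ValueError("kind must be 1, 2 or 3")
    return draws


def simulate_successes(rng, pi, r):
    """Number of successes out of r replications in each condition."""
    return [sum(rng.random() < p for _ in range(r)) for p in pi]


population = sample_population(kind=3, n=10000)
share_h1 = sum(satisfies_h1(pi) for pi in population) / len(population)
data = simulate_successes(random.Random(2), population[0], r=7)
print(round(share_h1, 3), data)
```

Sorting uniform draws yields an exact sample from a uniform prior restricted to one ordering, which avoids rejecting roughly 40,319 out of every 40,320 draws when J = 8. For the complement, plain rejection sampling is efficient because almost every unconstrained draw already violates the ordering.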

Results
The middle column of Fig. 2 shows the results when the complement of H i 1 is true for all individuals. For P > 10 and R > 7, the gP-BF is smaller than .01, indicating at least 100 times more support for the complement than for H i 1 . As R increases, so does the median support found in the data. The 2.5th percentile of the ER is above .9 for P > 30 and R = 14, 21 and close to .9 for R = 7. The median SR is almost exactly .5 for all R for P > 20, and the 2.5 and 97.5 percentiles are within .1 of the median for P > 30. Thus, for sample sizes of 30 and larger, the gP-BF mu , ER mu and SR mu behave as desired for R = 7 and even better for R = 14, 21.
Finally, Population 3, depicted in the right column of Fig. 2, was chosen to be a mixture of the first two populations. Here it can be seen that if H i 1 holds for 50% of the individuals in the population, the complement of H i 1 is generally preferred over H i 1 , although with less strength than when Population 2 was the true population. Note that this happens because it is more likely that a person coming from h(π i |H i 1 ) provides evidence in agreement with the complement than vice versa. For example, if H 1 is true but the ordering in the data is off by one order constraint, we are likely to prefer the complement. However, if one of the orderings that comprise the complement is true, a 'mistake' in one or more of the order constraints in the data does not necessarily lead to a preference for H 1 , but might point to one of the other orderings under the complement. The complexity of H 1 is 2.48 × 10^−5 and the complexity of its complement is 1 − 2.48 × 10^−5 ≈ 1. Thus, even though θ = .5, the complement is preferred because it has a higher complexity. The ER is of use here, indicating that there are multiple populations and stabilizing around .5 for P > 30. Although the median support found in the gP-BF might indicate a preference for the complement over H i 1 , the ER indicates inconsistency among individual Bayes factors. Finally, the median SR for this population is slightly below .5, and the 2.5 and 97.5 percentiles are further than .1 from this median until P is around 40 for R = 14, 21, or around 50 for R = 7. Thus, if neither of the two hypotheses holds for everyone, this is reflected in the ER for every P and R that seemed reasonable if H i 1 or its complement were true for everyone.
Zedelius et al. (2011) eventually used 26 participants and seven replications in their study. This is slightly lower than the suggested 30 based on the sensitivity analysis. Consulting the figures, it seems that, if H i 1 is true and P = 26 and R = 7, the gP-BF is expected to be between 30 and 100, the ER is expected to be above .9 and the SR between .43 and .67.
On the other hand, if the complement of H i 1 is true for all individuals, the gP-BF can be expected to indicate between 1000 and 10,000 times more support for the complement, with the ER similarly above .9 and the SR between .35 and .6. Consulting the results in Table 4, we find, next to the reported gP-BF, ER mu = .500 and SR mu = .436. These results do not seem in line with either Population 1 or 2, but consulting the right column of Fig. 2, they do seem in line with the mixture population. Of course, this is no evidence that this mixture population with θ = .5 is indeed the most likely true P-population. However, it does indicate that even though the gP-BF shows some support for the complement relative to H i 1 , it is not likely that the complement holds for everyone in the P-population.

Discussion
After formulating within-person (WP) hypotheses, individual Bayes factors can be computed with which the support for a particular hypothesis can be derived for each person, or the best from a set of informative hypotheses can be selected. A method has been proposed to combine the individual Bayes factors of some in order to draw conclusions for all, by answering the question whether an individual hypothesis holds for all persons in the population, and for one, by determining the average support for H i m relative to H i m' , which describes what could be expected for a next individual. The geometric average of P individual Bayes factors (gP-BF) describes the average support for one hypothesis relative to another. It describes what individual Bayes factor could be expected for a next person. Together with the evidence rate and stability rate, the gP-BF can be used to assess whether one hypothesis is more supported than another for all individuals in a population. By means of a sensitivity analysis for a set of hypotheses, it can be determined for what sample size P and number of replications R in an experiment these measures behave as desired.
An R Shiny application has been developed with which a sensitivity analysis can be executed prior to data collection. By specifying hypotheses of interest, the behavior of the gP-BF, ER and SR can be evaluated for various combinations of R and P . This allows researchers to collect the appropriate data for their question of interest. In addition to running their own sensitivity analysis, users can access and view the data of the simulations used as examples in this paper within the application. Furthermore, data can be analyzed in the application, which then computes the gP-BF, ER, and SR. The application and manual can be accessed at https://github.com/fayetteklaassen/OneForAll.
Author Note This research was supported by a grant from The Netherlands Organisation for Scientific Research (NWO): NWO 406-12-001 (FK and HH). The last author is supported by the Consortium on Individual Development (CID), which is funded through the Gravitation program of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research (NWO grant number 024.001.003). The starting point for this paper was the Master thesis by Kluytmans et al. (2014).
FK wrote the paper, designed, programmed, and executed the simulation. HH provided input for the project. HH and FK further conceptualized this project. HH provided feedback on writing, concepts, and programming. CZ, HV and HA provided the data and hypotheses used in the illustration and feedback on the writing.

Appendix: Computation of fit and complexity through decomposition
In order to compute the Bayes factor that expresses the support in favor of H m : R m π > 0 against the unconstrained hypothesis H u , the complexity and fit of H m should be computed. 2 Complexity and fit can be determined by taking samples from the unconstrained prior and posterior distribution, respectively. A common approach is to take Q samples and determine what proportion of the samples is in agreement with H m , such that
$$
f_m = \int_{\pi \in H_m} g(\pi \mid x, H_u)\, d\pi \approx \frac{1}{Q}\sum_{q=1}^{Q} I_{\pi_q \in H_m},
$$
where $\pi_q$ is the $q$th sample from the unconstrained posterior and $I_{\pi_q \in H_m} = 1$ if $\pi_q$ is in agreement with $H_m$, and 0 otherwise. The complexity can be computed analogously, with the difference that samples are taken from the prior distribution.
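The sampling approach described above can be sketched as follows for a small example. The sketch assumes independent Beta(1, 1) priors for the success probabilities, so that the unconstrained posterior in condition j is Beta(1 + x j , 1 + R − x j ), takes the Bayes factor against H u to be the ratio of fit to complexity, and uses a hypothetical hypothesis π1 > π2 > π3 with made-up data. None of these specifics are taken from the paper's software.

```python
import random


def in_hypothesis(pi):
    """True if pi_1 > pi_2 > ... > pi_J (illustrative ordering)."""
    return all(a > b for a, b in zip(pi, pi[1:]))


def naive_bayes_factor(successes, replications, q=20000, seed=1):
    """Monte Carlo estimates of fit, complexity and BF_mu = fit / complexity.

    Assumes Beta(1, 1) priors, hence Beta(1 + x, 1 + R - x) posteriors.
    """
    rng = random.Random(seed)
    prior_hits = 0
    post_hits = 0
    for _ in range(q):
        prior = [rng.betavariate(1, 1) for _ in successes]
        post = [rng.betavariate(1 + x, 1 + replications - x)
                for x in successes]
        prior_hits += in_hypothesis(prior)
        post_hits += in_hypothesis(post)
    if prior_hits == 0:
        raise ValueError("no prior sample satisfied the hypothesis; "
                         "increase q or use a decomposition")
    complexity = prior_hits / q
    fit = post_hits / q
    return fit, complexity, fit / complexity


# Made-up data: successes out of R = 7 replications in J = 3 conditions.
fit, complexity, bf = naive_bayes_factor([6, 4, 1], replications=7)
print(round(fit, 3), round(complexity, 3), round(bf, 2))
```

With three conditions the complexity is 1/3! = 1/6, so a moderate number of samples suffices. The guard on `prior_hits` marks the point where this naive approach breaks down for hypotheses with many constraints.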
If H m concerns the ordering of 8 parameters, the complexity can be derived analytically and is 1/8! = 1/40,320. Using Q = 100,000 samples from the unconstrained prior, only 2 or 3 samples of π are expected to adhere to the constraints under H m . This implies that the estimate of c m is very unstable. To obtain stable estimates, impractically large samples are needed. Similarly, the fit of a hypothesis with eight parameters might be too small to approximate accurately using 100,000 samples. One solution is to increase the number of samples, which increases the computational time. Mulder, Hoijtink, and de Leeuw (2012) present another solution that makes use of a decomposition of the complexity and fit. This procedure determines decomposed fit and complexity for each constraint in a hypothesis. Equation 33 shows how the probability that all constraints hold, given H u and the data x, can be rewritten as a product of decomposed probabilities:
$$
f_m = P(R_m \pi > 0 \mid H_u, x)
= P\!\left(R^{(1)}_m \pi > 0 \mid H_u, x\right) \prod_{k=2}^{K} P\!\left(R^{(k)}_m \pi > 0 \mid R^{(1:k-1)}_m \pi > 0, H_u, x\right)
= \prod_{k=1}^{K} f^{(k)}_m,
\qquad
f^{(k)}_m \approx \frac{1}{Q}\sum_{q=1}^{Q} I_{R^{(k)}_m \pi_q > 0}, \tag{33}
$$
where $K$ is the number of constraints in hypothesis $m$, $R^{(k)}_m$ is the $k$th row of $R_m$, $R^{(1:k-1)}_m$ are the first $k-1$ rows of $R_m$, $f^{(k)}_m$ is the decomposed fit for the $k$th constraint, the indicator function $I_{R^{(k)}_m \pi_q > 0} = 1$ if $R^{(k)}_m \pi_q > 0$ and 0 otherwise, and $\pi_q$ is sampled from $g(\pi \mid H_u, x, R^{(1:k-1)}_m)$.
2 Note that for notational simplification the superscript i is dropped from the hypotheses, Bayes factors, and parameters in this Appendix.
Since each f (k) m is defined by only one constraint, it is not a very small value and can be estimated with relatively few samples. The R Shiny application OneForAll belonging to this paper uses Q = 10,000. By multiplying the decomposed fit components as in Eq. 33, the total fit can be obtained accurately.
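The complexity can be derived analogously:
$$
c_m = \prod_{k=1}^{K} c^{(k)}_m,
\qquad
c^{(k)}_m \approx \frac{1}{Q}\sum_{q=1}^{Q} I_{R^{(k)}_m \pi_q > 0},
$$
where $c^{(k)}_m$ is the decomposed complexity for the $k$th constraint and $\pi_q$ is sampled from $h(\pi \mid H_u, R^{(1:k-1)}_m)$.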
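To illustrate why the decomposition helps, the sketch below estimates the complexity of a full ordering of J = 8 parameters constraint by constraint. It assumes independent uniform priors and the hypothetical ordering π1 > π2 > ... > π8. Under these assumptions a draw from the prior that satisfies the first k − 1 constraints can be obtained exactly by sorting the first k components of a uniform draw. This shortcut is specific to this example and is not necessarily how the OneForAll application samples.

```python
import math
import random


def decomposed_complexity(j=8, q=10000, seed=1):
    """Estimate the complexity of pi_1 > ... > pi_J as a product of
    per-constraint probabilities (uniform prior assumed)."""
    rng = random.Random(seed)
    parts = []
    for k in range(1, j):  # constraint k: pi_k > pi_{k+1}
        hits = 0
        for _ in range(q):
            pi = [rng.random() for _ in range(j)]
            # Impose constraints 1..k-1 by ordering the first k components.
            pi[:k] = sorted(pi[:k], reverse=True)
            hits += pi[k - 1] > pi[k]
        parts.append(hits / q)
    return math.prod(parts), parts


estimate, parts = decomposed_complexity()
exact = 1 / math.factorial(8)
print([round(c, 3) for c in parts])
print(estimate, exact)
```

Each decomposed complexity is close to 1/(k + 1), that is, between 1/2 and 1/8, so Q = 10,000 samples per constraint estimate it reliably, and the product approximates 1/8! ≈ 2.48 × 10^−5 without needing millions of unconstrained draws.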
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.