Evaluating false positive rates of standard and hierarchical measures of metacognitive accuracy

A key aspect of metacognition is metacognitive accuracy, i.e., the degree to which confidence judgments differentiate between correct and incorrect trials. To quantify metacognitive accuracy, researchers are faced with an increasing number of different methods. The present study investigated false positive rates associated with various measures of metacognitive accuracy by hierarchical resampling from the confidence database to accurately represent the statistical properties of confidence judgements. We found that most measures based on the computation of summary-statistics separately for each participant and subsequent group-level analysis performed adequately in terms of false positive rate, including gamma correlations, meta-d′, and the area under type 2 ROC curves. Meta-d′/d′ is associated with a false positive rate even below 5%, but log-transformed meta-d′/d′ performs adequately. The false positive rate of HMeta-d depends on the study design and on prior specification: For group designs, the false positive rate is above 5% when independent priors are placed on both groups, but the false positive rate is adequate when a prior was placed on the difference between groups. For continuous predictor variables, default priors resulted in a false positive rate below 5%, but the false positive rate was not distinguishable from 5% when close-to-flat priors were used. Logistic mixed model regression analysis is associated with dramatically inflated false positive rates when random slopes are omitted from model specification. In general, we argue that no measure of metacognitive accuracy should be used unless the false positive rate has been demonstrated to be adequate.

Unfortunately, such a vast number of methodological options may be harmful to progress in the field: If different measures do not converge on the same results, it is unclear which measure of metacognitive accuracy researchers should trust. Moreover, if there is a large number of analysis options, some researchers may be tempted to run multiple analyses and report only those analyses that 'worked', thus publishing effects that do not in fact exist (Gelman & Loken, 2014; Simmons et al., 2011; Steegen et al., 2016). Many researchers have recently accepted meta-d′/d′ as the gold standard to measure metacognitive accuracy (e.g., Alkan et al., 2020; Barrientos et al., 2022; Davies et al., 2018). Meta-d′/d′ is widely believed to allow for a more straightforward interpretation of the results than other methods because meta-d′/d′ was designed to control for task performance, choice bias, and confidence criteria (Maniscalco & Lau, 2012, 2014). Unfortunately, it has recently been demonstrated that the control meta-d′/d′ provides is not necessarily effective (Boundy-Singer et al., 2022; Guggenmos, 2021; Rahnev, 2023; Rausch et al., 2023; Shekhar & Rahnev, 2021; Zhu et al., 2023). While controlling for theoretically irrelevant variables is an important consideration when choosing an adequate measure of metacognitive accuracy, the statistical properties of measures of metacognitive accuracy have not yet received the same amount of attention from the field. The present study examined one critical statistical property of previously proposed measures of metacognitive accuracy: the false positive rate, i.e., the probability that a measure of metacognitive accuracy will lead to the detection of an effect that does not in fact exist.

Summary-statistics vs. hierarchical measures of metacognitive accuracy
Measures of metacognitive accuracy fall into two categories depending on how the method deals with the clustered statistical structure of data typical of studies of metacognition. Participants in studies measuring metacognitive accuracy typically perform multiple trials of a perceptual, memory, or cognitive task and report their confidence in being correct in each single trial. As a consequence, datasets in metacognition research consist of many observations which are clustered within subjects and conditions. We refer to the two approaches to deal with the clustered nature of the data as summary-statistics and hierarchical analysis. Summary-statistics are by far the most common analysis approach in psychology (Judd et al., 2017) and can be used with gamma correlation, confidence slopes, type 2 ROCs, meta-d′, and meta-d′/d′. For the summary-statistics approach, two levels of analysis are required (McNabb & Murayama, 2021): For the first level of analysis, a coefficient quantifying metacognitive accuracy is computed separately for each participant in each condition. For the second level of analysis, the coefficients of metacognitive accuracy obtained during the first analysis step are subjected to a standard statistical test, such as a t-test or an ANOVA. However, when the data deviate substantially from normal distributions, it is possible for standard statistical tests to produce false positive rates other than the nominal alpha level. In particular for meta-d′/d′ and meta-d_a/d_a, two ratio-based measures, the distribution is expected to be non-normal, hence log transformation is recommended (Fleming & Lau, 2014). To our knowledge, it has not been investigated whether the various summary-statistic-based measures of metacognitive accuracy tend to produce normal distributions and, if not, whether the deviations are such that the false positive rates are no longer at the nominal alpha level.
For hierarchical analysis, the statistical test is performed on the level of single observations, not on the summary statistics. The clusters in the data caused by different participants are accounted for by specifying fixed and random effects in the regression model: Fixed effects are factors whose levels are experimentally determined or for which the interest lies in the specific effects of each level. Fixed effects are represented by parameters that are assumed to be constant within one condition. In contrast, random effects are factors whose levels are assumed to be sampled from a larger population, or for which the interest lies in the variation among the levels rather than the specific effects of each level. Random effects are represented by parameters that assess the variability associated with the random effect (Bolker et al., 2008).
In principle, hierarchical analysis seems to be well-suited to account for the statistical properties of common data sets in metacognition studies (Fleming, 2017; Kristensen et al., 2020; Murayama et al., 2014; Paulewicz & Blaut, 2020). In line with this intuition, simulations showed that when there was a random effect of item, the false positive rate of mixed-model logistic regression is robust, but the false positive rate of gamma correlation coefficients is inflated (Murayama et al., 2014). In general, hierarchical analyses are more powerful to detect true effects because they separate trial-to-trial variance from between-subject variance, whereas analyses based on summary-statistics misinterpret trial-to-trial variance as between-subject variance and as a consequence are biased against the alternative hypothesis (Boehm et al., 2018). However, hierarchical models are only able to maintain an acceptable false positive rate if the random effect structure is correctly specified (Barr et al., 2013; Hesselmann, 2018). Specifically, there is a risk of increased false positive rates if random slopes are omitted from model specification (Oberauer, 2022). Unfortunately, there is no consensus on how complex the random effect structure needs to be (Barr et al., 2013; Matuschek et al., 2017; McNabb & Murayama, 2021). Moreover, while model misspecification in hierarchical models is associated with a risk of false positives in general, to our knowledge, it is not known if this risk applies to data sets with statistical properties characteristic of the study of metacognition. In addition, for many previously proposed hierarchical and summary-statistic-based measures of metacognitive accuracy, it has never been empirically investigated whether these measures maintain an acceptable false positive rate.

Rationale of the present study
The aim of the present study is to investigate empirically the false positive rates associated with summary-statistics and hierarchical measures of metacognitive accuracy. To assess the false positive rate associated with each measure of metacognitive accuracy, we performed simulations where a grouping variable and a continuous predictor variable were randomly sampled independently of confidence and responses. A previous study used a mathematical model to randomly generate data to investigate false positive rates of gamma correlations and logistic mixed model regression (Murayama et al., 2014), but there are many different proposals for how to model confidence judgments (Aitchison et al., 2015; Boundy-Singer et al., 2022; Guggenmos, 2022; Hellmann et al., 2023; Mamassian & de Gardelle, 2021; Moran et al., 2015; Pereira et al., 2021; Peters et al., 2017; Pleskac & Busemeyer, 2010; Ratcliff & Starns, 2009, 2013; Rausch et al., 2018, 2020; Reynolds et al., 2020; Shekhar & Rahnev, 2021, 2022) and a clear consensus is still pending (Rahnev et al., 2022). In addition, the statistical properties of confidence judgments may vary heavily across different datasets (Rahnev et al., 2020). Therefore, the present study used hierarchical bootstrapping (Saravanan et al., 2020) based on the confidence database, an online collection of datasets (Rahnev et al., 2020), to simulate datasets with realistic statistical properties. Thus, each simulated dataset represents a justified guess as to what kind of data may be expected in new experiments without subscribing to a specific mathematical model of confidence.
We estimated the false positive rate for those measures that can be computed in an experimental design with a binary stimulus, a binary response, and a confidence judgment and for which computation is sufficiently robust and efficient to perform a large number of simulations: gamma correlation coefficients (Nelson, 1984), confidence slopes (Yates, 1990), type 2 receiver operating characteristics (Fleming et al., 2010), meta-d′ and meta-d′/d′ (Maniscalco & Lau, 2014), meta-d_a and meta-d_a/d_a (Maniscalco & Lau, 2012), HMeta-d (Fleming, 2017), and logistic mixed model regression (Sandberg et al., 2010). It should be noted that the authors of the meta-d′ and meta-d_a method later argued that meta-d′ should be preferred over meta-d_a (Maniscalco & Lau, 2014); nevertheless, meta-d_a was included in our analysis as well because the false positive rate is a crucial piece of information for the interpretation of already published studies using meta-d_a. For logistic mixed-model regression, we tested two different model specifications, one with a random effect of participant on the intercept (Sandberg et al., 2010) and one with random slopes (Wierzchoń et al., 2019).

The confidence database
The confidence database is a collection of openly available datasets with a broad range of experimental paradigms, participant populations, and fields of study (Rahnev et al., 2020). All data sets from the confidence database include a measurement of participants' confidence in their response to a task. To compare the false positive rate of different measures of metacognitive accuracy based on identical datasets, we restricted our simulations to a subset of datasets because not all datasets of the confidence database allow the computation of all measures of metacognitive accuracy. Specifically, we used only those datasets where stimulus and task response can be classified into two categories, because binary stimulus and response categories are required to compute meta-d′, meta-d_a, and HMeta-d. Finally, we included only those datasets where there were at least 20 participants with at least 20 trials each, because we considered 20 observations to be the bare minimum to represent the variance between subjects as well as within subjects. Overall, 46 datasets of the database met the inclusion criteria and were used for our simulations. The majority of these 46 studies involved a perceptual task (e.g., a masked orientation discrimination task; Rausch et al., 2018), but there were also cognitive tasks (e.g., a general knowledge task; Mazancieux et al., 2020) and memory tasks (e.g., an old vs. new word recognition task; Kantner & Lindsay, 2012).

Hierarchical resampling
The 46 datasets from the confidence database were used to simulate 5000 experiments.
In each simulation, we first randomly selected one of the datasets. Second, if the study included multiple experimental manipulations or difficulty levels, we randomly selected only one of these conditions. Then, we created two groups of simulated participants: For each group, we sampled with replacement from the participants of the original study as many subjects as there were in the original study. For each simulated participant, we sampled as many trials as there were in the original study from the data of the corresponding participant in the original study. To keep computation time manageable, we capped the number of subjects per simulated study at 200 and the number of trials per subject at 500. In addition, we simulated a continuous between-subject predictor variable by randomly sampling one value for each participant from an identical Gaussian distribution. Then, we computed measures of metacognitive accuracy (see below) and performed two analyses per simulated data set. First, we tested if there was a difference between the two simulated groups with respect to each measure of metacognitive accuracy. Second, we tested if the continuous predictor variable was associated with metacognitive accuracy.
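The resampling scheme described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the data layout (a dictionary from participant identifier to a list of trials), sampling trials with replacement, and the standard normal distribution for the predictor are all assumptions made for illustration.

```python
import random

def resample_group(data, rng, max_subjects=200, max_trials=500):
    """Draw one simulated group from `data` (hypothetical layout:
    dict mapping participant id -> list of trials)."""
    ids = list(data)
    n_subjects = min(len(ids), max_subjects)       # cap on subjects
    group = []
    for _ in range(n_subjects):
        source = rng.choice(ids)                   # participants drawn with replacement
        trials = data[source]
        n_trials = min(len(trials), max_trials)    # cap on trials
        # Trials come only from the selected participant
        # (with replacement: an assumption of this sketch).
        group.append([rng.choice(trials) for _ in range(n_trials)])
    return group

def simulate_experiment(data, rng):
    """Two groups plus a predictor that is independent of the data."""
    group_a = resample_group(data, rng)
    group_b = resample_group(data, rng)
    n_total = len(group_a) + len(group_b)
    # One value per simulated participant from the same Gaussian
    # (mean 0, SD 1 is an assumption).
    predictor = [rng.gauss(0.0, 1.0) for _ in range(n_total)]
    return group_a, group_b, predictor
```

Because group membership and the predictor are generated independently of confidence and accuracy, any significant effect in such a simulated experiment is by construction a false positive.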

Gamma correlation
The gamma correlation coefficient, which was proposed by Nelson (1984) as a measure of metacognitive accuracy, is a nonparametric measure of association between two binary or ordered variables. To compute the gamma coefficient, it is necessary to categorize all possible pairs of trials as concordant, discordant, or tied. A pair is concordant if the ordering of the two trials with respect to accuracy of the primary task response is consistent with the ordering of the same two trials with respect to confidence. A pair is discordant if the ordering of the two trials with respect to accuracy of the primary task response is inconsistent with the ordering of the same two trials with respect to confidence. Finally, a pair is tied if the two trials have the same value either in terms of accuracy or in terms of confidence. Then, the frequency N_s of concordant pairs and the frequency N_d of discordant pairs are counted, and gamma is computed as
$$G = \frac{N_s - N_d}{N_s + N_d} \tag{1}$$
G, like a standard correlation coefficient, takes on values between -1 and 1, with values larger than zero indicating a positive association between confidence and accuracy.
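As an illustration of the pair-counting logic, the following Python sketch computes gamma for a single participant. The function name and the input format (accuracy coded 0/1, confidence as ordered numbers) are assumptions; returning NaN when all pairs are tied is a choice of this sketch, not something specified in the text.

```python
def goodman_kruskal_gamma(correct, confidence):
    """Gamma between accuracy (0/1) and confidence (ordered numbers)."""
    n_s = 0  # concordant pairs
    n_d = 0  # discordant pairs
    n = len(correct)
    for i in range(n):
        for j in range(i + 1, n):
            # Sign of the product tells whether both orderings agree.
            product = (correct[i] - correct[j]) * (confidence[i] - confidence[j])
            if product > 0:
                n_s += 1
            elif product < 0:
                n_d += 1
            # product == 0: tied on accuracy or confidence, ignored
    if n_s + n_d == 0:
        return float("nan")  # undefined when every pair is tied
    return (n_s - n_d) / (n_s + n_d)
```

The double loop is quadratic in the number of trials, which is acceptable for a few hundred trials per participant; a contingency-table implementation would be preferable for larger data sets.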

Confidence slopes
The confidence slope proposed by Yates (1990) assumes a linear regression to predict the confidence rating with the dichotomous variable correct/incorrect as predictor. It is calculated according to the expression
$$\text{slope} = \bar{C}_{S=R} - \bar{C}_{S \neq R} \tag{2}$$
with $\bar{C}_{S=R}$ the mean of all confidence ratings when the task response was correct, and $\bar{C}_{S \neq R}$ the mean of all confidence ratings when the task response was incorrect.
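A minimal Python sketch of the confidence slope for one participant follows. The function name and input format are assumptions, as is the decision to return NaN when a participant has no correct or no incorrect trials.

```python
from statistics import mean

def confidence_slope(correct, confidence):
    """Mean confidence on correct trials minus mean confidence on errors."""
    conf_correct = [c for ok, c in zip(correct, confidence) if ok]
    conf_error = [c for ok, c in zip(correct, confidence) if not ok]
    if not conf_correct or not conf_error:
        return float("nan")  # undefined without both trial types
    return mean(conf_correct) - mean(conf_error)
```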

Area under the type 2 ROC curve
The area under the type II ROC curve is a measure derived from type II signal detection theory (Clarke et al., 1959; Galvin et al., 2003; Pollack, 1959), an extension of signal detection theory (Green & Swets, 1966; Peterson et al., 1954; Tanner & Swets, 1954). The aim of type II ROC curves is to quantify participants' ability to differentiate between correct and incorrect trials irrespective of rating criteria (Fleming et al., 2010; Fleming & Lau, 2014; but see Shekhar & Rahnev, 2021). To construct a type II ROC curve, it is necessary to determine type II hit rates and type II false alarm rates associated with multiple criteria. The type II hit rate is defined as the proportion of high confidence trials when the participant is correct, and the type II false alarm rate is the proportion of high confidence trials when the participant is incorrect. Type II hit rates and false alarm rates based on multiple criteria are obtained by means of rating scales with multiple confidence levels: It is assumed that there is a criterion that separates each confidence level from the adjacent confidence levels on the rating scale. For example, for a four-level confidence scale, there is a liberal criterion that assigns low confidence only to the first confidence level and high confidence to the other three confidence levels, then a higher criterion that assigns low confidence to the first and the second confidence level, and so on. For each criterion, type II hit rate and type II false alarm rate are plotted as individual points with the type II hit rate on the y-axis and the type II false alarm rate on the x-axis. The curve that passes through these different points is referred to as the type II ROC curve. The area under the type II ROC curve is a measure of the ability to differentiate between correct and incorrect trials; it is 1 if confidence ratings differentiate perfectly between correct and incorrect trials and 0.5 if confidence ratings do not differentiate between correct and incorrect trials at all. The area under the type II ROC curve, A_ROC, can be calculated following Song et al. (2011) from h_i, the type II hit rate associated with confidence level i, f_i, the type II false alarm rate associated with confidence level i, and n, the number of confidence levels measured by the confidence rating scale.
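The construction of the type II ROC curve can be sketched in Python as follows. Note that this sketch integrates the curve with the trapezoidal rule, which is a swapped-in technique and not necessarily identical to the formula of Song et al. (2011). The function name is hypothetical, and confidence is assumed to be coded as integers from 1 to `n_levels`.

```python
def type2_roc_area(correct, confidence, n_levels):
    """Trapezoidal area under the type II ROC curve.
    Assumes confidence is coded as integers 1..n_levels."""
    n_correct = sum(1 for ok in correct if ok)
    n_error = len(correct) - n_correct
    if n_correct == 0 or n_error == 0:
        return float("nan")  # curve undefined without both trial types
    points = [(0.0, 0.0)]    # (type II false alarm rate, type II hit rate)
    # Sweep from the strictest to the most liberal criterion; trials with
    # confidence >= criterion count as "high confidence".
    for criterion in range(n_levels, 0, -1):
        hits = sum(1 for ok, c in zip(correct, confidence)
                   if ok and c >= criterion)
        false_alarms = sum(1 for ok, c in zip(correct, confidence)
                           if not ok and c >= criterion)
        points.append((false_alarms / n_error, hits / n_correct))
    # The loop ends at criterion 1, which yields the point (1, 1).
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

With confidence ratings that separate correct from incorrect trials perfectly, the function returns 1; with ratings unrelated to accuracy, it returns 0.5, matching the interpretation given in the text.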

Meta-d′, meta-d′/d′, meta-d_a and meta-d_a/d_a
The conceptual idea of meta-d′ and meta-d_a is to quantify metacognitive accuracy in terms of discrimination sensitivity in a hypothetical signal detection model inferred from confidence judgments, assuming participants had perfect access to the sensory evidence underlying the discrimination choice and were perfectly consistent in placing their confidence criteria (Maniscalco & Lau, 2012, 2014). Meta-d′ and meta-d_a can therefore be directly compared to d′ and d_a, respectively, the corresponding measures of task performance: If meta-d′ equals d′ or meta-d_a equals d_a, it means that metacognitive accuracy is exactly as good as expected from task performance. If meta-d′ is lower than d′, it means that metacognitive accuracy is worse than expected from task performance. The computation of meta-d′ and meta-d_a is based on a hypothetical signal detection model of confidence judgments (Maniscalco & Lau, 2014). The underlying model assumes that observers select a binary response R ∈ {0, 1} about a stimulus characterized by two classes S ∈ {0, 1} as well as a confidence rating out of an ordered set of confidence categories C ∈ {1, 2, …, C_max}. To estimate meta-d′ and meta-d_a, we used an R implementation of MATLAB code provided by Brian Maniscalco (http://www.columbia.edu/~bsm2105/type2sdt, last accessed 2021-09-20).

Meta-d′
The algorithm to compute meta-d′ involved the following computational steps: First, the frequency of each confidence category was determined depending on the stimulus class and the accuracy of the response. To correct for extreme proportions, 1/(2C_max) was added to each cell of the frequency table. Second, discrimination sensitivity d′ and discrimination criterion c were calculated using standard formulae,
$$d' = \Phi^{-1}\!\left(\frac{n_{S1R1}}{n_{S1}}\right) - \Phi^{-1}\!\left(\frac{n_{S0R1}}{n_{S0}}\right) \tag{4}$$
$$c = -\frac{1}{2}\left[\Phi^{-1}\!\left(\frac{n_{S1R1}}{n_{S1}}\right) + \Phi^{-1}\!\left(\frac{n_{S0R1}}{n_{S0}}\right)\right] \tag{5}$$
with n_S1 the number of trials when S = 1, n_S0 the number of trials when S = 0, n_S1R1 the number of trials when S = 1 and R = 1, n_S0R1 the number of trials when S = 0 and R = 1, and Φ^{-1} the quantile function of the standard normal distribution. Third, the meta-d′ model was fitted. For this purpose, a maximum likelihood optimization procedure was used with respect to the confidence data given stimulus and response as well as the parameters determined at previous steps, i.e., d′ and c. The model included a free parameter for meta-d′, d_meta, and a set of confidence criteria. To enforce that the criteria were ordered, all free criteria were parametrized as the log of the distance to the adjacent criterion. The probability of a specific confidence rating given stimulus and response is given by formula (6), in which Φ indicates the Gaussian cumulative distribution function with a given mean and a variance of 1, θ_0 is −∞, and θ_{2C_max} is ∞. Finally, meta-d′ is equal to the parameter d_meta.
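The first two steps (correction of the frequency table and computation of d′ and c) can be sketched in Python as shown below. This is an illustration under stated assumptions, not the implementation used in the study: the function name is hypothetical, and the sketch assumes that d′ and c are computed from the corrected counts. Since each stimulus class has 2 × C_max cells (two responses times C_max confidence levels), adding 1/(2C_max) to every cell adds 1 to each stimulus total and 0.5 to each stimulus-by-response total.

```python
from statistics import NormalDist

def type1_sensitivity_and_criterion(stimulus, response, c_max):
    """d' and c from binary stimulus/response codes (0 or 1), after adding
    1 / (2 * c_max) to each cell of the frequency table (assumption:
    the corrected counts enter the type 1 formulae)."""
    pad = 1.0 / (2 * c_max)
    cells_per_stimulus = 2 * c_max   # 2 responses x c_max confidence levels
    cells_per_response = c_max       # c_max confidence levels
    n_s1 = sum(1 for s in stimulus if s == 1) + cells_per_stimulus * pad
    n_s0 = sum(1 for s in stimulus if s == 0) + cells_per_stimulus * pad
    n_s1r1 = sum(1 for s, r in zip(stimulus, response)
                 if s == 1 and r == 1) + cells_per_response * pad
    n_s0r1 = sum(1 for s, r in zip(stimulus, response)
                 if s == 0 and r == 1) + cells_per_response * pad
    z = NormalDist().inv_cdf         # quantile function of the standard normal
    z_hit = z(n_s1r1 / n_s1)
    z_false_alarm = z(n_s0r1 / n_s0)
    d_prime = z_hit - z_false_alarm
    criterion = -0.5 * (z_hit + z_false_alarm)
    return d_prime, criterion
```

The correction guarantees that both proportions lie strictly between 0 and 1, so the quantile function is always defined, even for participants without any misses or false alarms.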

Meta-d_a
The computation of meta-d_a was similar to the computation of meta-d′ but with the following differences: First, to determine the parameter a, which quantifies the ratio of the standard deviations of the signal associated with the two classes of the stimulus, an auxiliary signal detection rating model was fitted to the binary response and confidence data using a maximum likelihood optimization procedure. The signal detection rating model included the parameters d, a, and a set of criteria θ_1, θ_2, …, θ_{2C_max−1} as free parameters. In the likelihood of the auxiliary signal detection rating model, Φ indicates the Gaussian cumulative distribution function with a given mean and standard deviation, θ_0 is −∞, and θ_{2C_max} is ∞. To enforce that the criteria were ordered, all criteria except for θ_1 were parametrized as the log distance to the adjacent more negative criterion during the fitting procedure. Having obtained a, we computed discrimination sensitivity d_a as well as the discrimination criterion c_1. The next step included fitting the meta-d_a model, which included the same parameters as the meta-d′ model, but θ_{C_max} was fixed at d_meta × c_1 ÷ d_1. The meta-d_a model specifies the probability of a specific confidence rating given stimulus and response. Finally, meta-d_a was calculated from d_meta and a.

HMeta-d
HMeta-d provides an estimate of the meta-d′/d′ ratio based on a hierarchical Bayesian model (Fleming, 2017) and thus its computation is closely related to the computation of meta-d′. To estimate HMeta-d, we used R code provided by Steve Fleming (https://github.com/metacoglab/HMeta-d, last accessed 2020-08-04), which relies on the free software JAGS to sample from the posterior distribution (Plummer, 2003). Sampling was performed in three separate Markov chains to allow computation of Gelman and Rubin's convergence diagnostic R̂ (Gelman & Rubin, 1992). When R̂ ≥ 1.1, the corresponding data set was discarded from the analysis.
According to the HMeta-d method, just as for standard meta-d′, discrimination performance d′ and discrimination criterion c were computed first using formulae (4) and (5) separately for each participant and then submitted to JAGS as constants. The hierarchical estimation procedure was used only for meta-d′/d′ and the confidence criteria. For this purpose, the absolute frequency of each confidence rating of participant j given stimulus and response, f(C|S, R), was modelled as a multinomial distribution M,
$$f(C \mid S, R) \sim M\big(n_{SR},\; p(C \mid S, R)\big) \tag{10}$$
where n_SR is the number of trials with stimulus S and response R, and p(C|S, R) is calculated using formula (6). However, for HMeta-d, unlike meta-d′, θ_{C_max} is fixed at c. p(C_i|S, R) depends on the free parameters d_meta and a set of criteria. The priors placed on the parameters depended on whether we examined the false positive rate associated with a grouping variable or with a continuous predictor variable.

HMeta-d with a grouping variable
According to the computer code published by Fleming, the effect of group on HMeta-d can be assessed by using the HMeta-d algorithm to sample from the posterior for the average meta-d′/d′ separately for each group. Then, each sample of the posterior of group 2 is subtracted from the corresponding sample of the posterior of group 1 to obtain a posterior of the group difference in terms of mean meta-d′/d′. For this purpose, on the level of a single participant j, priors were specified for the confidence criteria of participant j when the response was 0 and for the confidence criteria of participant j when the response was 1, using a truncated Gaussian distribution (trN) with a location parameter, a scale parameter, lower bound a, and upper bound b. The parameters of the prior distribution of the criteria were defined on the group level. M is the mean of the prior distribution of log(d_meta/d′) on the group level, a further parameter controls the variability of log(d_meta/d′) on the group level, and two redundant multiplicative parameters facilitate sampling from the posterior via parameter expansion. Priors were also specified on the group level.
Placing independent prior distributions on both groups may not be considered a valid approach because the prior biases the estimates of single participants towards the group average and thus the difference between the two groups is increased artificially. Therefore, we repeated this analysis using a prior placed on the difference between groups. For this purpose, we recoded the two groups as -0.5 and 0.5 and used the computer code (including default priors) Fleming proposed for continuous predictor variables (see below). The regression coefficients obtained in this way can therefore be interpreted as the difference between the groups in terms of the logarithm of meta-d′/d′.

HMeta-d with a continuous predictor variable
According to the default implementation of HMeta-d for continuous predictor variables, priors were specified on the level of a single participant j for the confidence criteria of participant j when the response was 0 and for the confidence criteria of participant j when the response was 1, again using the truncated Gaussian distribution trN with a location parameter, a scale parameter, lower bound a, and upper bound b. In this specification, x_j is the continuous predictor variable, a subject-specific coefficient captures the effect of the continuous predictor variable on log(d_meta/d′), and M_j is the subject-specific intercept. On the group level, two parameters describe the distribution of M_j, one parameter quantifies the overall effect of the continuous predictor variable on log(d_meta/d′), and one parameter quantifies the variability of the effect across subjects.
On the group level, we used three different sets of hyperpriors. First, we used the standard priors for HMeta-d as specified in the computer code provided by Fleming, which uses informative priors for two of the group-level parameters and relatively flat priors for all other parameters.
Second, we used close-to-flat priors for all parameters.
Finally, we examined the false positive rate with a set of narrower prior distributions.

Logistic mixed-model regression
Logistic regression is a specific case of a generalized linear regression model (Bolker et al., 2008). In general, it is a method to quantify the relationship between a binary outcome variable and one or several dichotomous or continuous predictors. There is quite a variety of different logistic regression models to measure metacognitive accuracy (Barthelmé & Mamassian, 2009; Murayama et al., 2014; Rahnev et al., 2020; Sandberg et al., 2010; Wierzchoń et al., 2019), which is why we selected two logistic regression models with wider applicability for the present study (Sandberg et al., 2010; Wierzchoń et al., 2019). In both models, the probability of being correct in the primary task, p(T), is modelled as a linear function of a confidence rating C. A linear relationship between confidence and accuracy is obtained by transforming the probability of being correct into the logarithm of the odds of the primary response being correct versus incorrect. The two logistic regression models used in the present study differed in terms of the random effect structure: According to the regression model proposed by Sandberg et al. (2010), the clustered nature of the data is modelled by a random effect of participant on the intercept:
$$\log\frac{p(T)}{1 - p(T)} = \beta_0 + u_j + \beta_1 C$$
where β_0 is the overall intercept, u_j is the random intercept, and β_1 is the slope of the effect of confidence, which is considered to be fixed across participants. In contrast, according to the regression model by Wierzchoń et al. (2019), there is not only a random effect on the intercept, but also on the slope:
$$\log\frac{p(T)}{1 - p(T)} = \beta_0 + u_{0j} + (\beta_1 + u_{1j})\, C$$
where β_0 is the overall intercept, u_{0j} is the random intercept, β_1 is the overall slope of the effect of confidence, and u_{1j} is the random slope of participant j. All logistic regression models were fit using the lme4 library in R (Bates et al., 2015).

Statistical analysis
All analyses were conducted using the free software R (R Core Team, 2020).
For each simulated data set, we first excluded simulated subjects whose performance was below chance level. Summary-statistics to measure metacognitive accuracy were computed separately for each participant. We tested whether the summary-statistics were normally distributed using a series of Kolmogorov-Smirnov tests. Then we compared the two simulated groups using two-sample t-tests. For meta-d′/d′ and meta-d_a/d_a, we performed two additional tests because ratio distributions are often non-Gaussian: a non-parametric Mann-Whitney U test as well as t-tests on log-transformed meta-d′/d′ and meta-d_a/d_a. For the logistic regression models, we used Wald z-tests to test the interaction effect between the fixed effect of group and the fixed effect of confidence. For all measures except for HMeta-d, the false positive rate was estimated by dividing the number of simulations with a significant effect of group by the total number of simulations. For HMeta-d, a false positive was defined as a simulated experiment where the 95% CI of the difference between the two simulated groups in terms of mean meta-d′/d′ excluded zero.
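The logic of estimating a false positive rate as the proportion of significant tests under a true null effect can be sketched in Python. This toy example is not the analysis of the study: it draws both groups from the same Gaussian distribution instead of resampling the confidence database, and it uses a Mann-Whitney U test with the normal approximation (no tie or continuity correction) because that test can be written with the standard library alone. All function names and default values are assumptions.

```python
import random
from statistics import NormalDist

def mann_whitney_p(x, y):
    """Two-sided p-value of the Mann-Whitney U test using the normal
    approximation; assumes continuous data without ties."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = sum(rank for rank, (_, g) in enumerate(pooled, start=1)
                     if g == 0)
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_sim=2000, n_per_group=30, alpha=0.05, seed=1):
    """Proportion of simulated null experiments with p < alpha."""
    rng = random.Random(seed)
    n_significant = 0
    for _ in range(n_sim):
        # Both groups come from the same distribution: no true effect.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if mann_whitney_p(group_a, group_b) < alpha:
            n_significant += 1
    return n_significant / n_sim
```

For a well-behaved test, the returned proportion should be close to the nominal alpha level of 0.05, up to Monte Carlo error.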
For each simulation, we also tested if the Pearson correlation between the measure of metacognitive accuracy and the continuous predictor variable was significant. For meta-d′/d′ and meta-d_a/d_a, we also tested Spearman's ρ as well as Pearson's r with log-transformed meta-d′/d′ and meta-d_a/d_a. For the logistic regression models, we again used Wald z-tests to test the interaction effect between the fixed effect of predictor and the fixed effect of confidence. For all measures except for HMeta-d, the estimated false positive rate was obtained by dividing the number of simulations with a significant effect of predictor by the number of simulations after exclusion of convergence errors. For HMeta-d, a false positive was defined as a simulated experiment where the 95% CI of the regression coefficient excluded zero.
Statistical evidence on whether the false positive rate was different from the nominal alpha level of 0.05 was quantified by Bayes factors computed with default priors and the BayesFactor package (Morey & Rouder, 2015). Bayes factors were interpreted according to standard guidelines (Lee & Wagenmakers, 2013). We assumed a logistic prior distribution of the logit-transformed false positive error rate with a location parameter corresponding to a false positive rate of 0.05 and a scale parameter of 0.5. The prior distribution implied a 95% prior probability that the empirical false positive rate would fall between 0.8% and 25.0%. This prior distribution represents the belief that false positive error rates close to the nominal alpha level of 0.05 are more likely a priori than more extreme false positive error rates. The same prior distribution was used to construct posterior distributions of the false positive rate for each measure of metacognitive accuracy. As an exploratory analysis, we assessed the relationship between false positives and trial number as well as between false positives and subject number using logistic regression, which we converted into Bayes factors using the BFpack package (Mulder et al., 2021).
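The range implied by the prior can be checked with a few lines of Python. The sketch assumes the standard parametrization of the logistic distribution, whose quantile function is location + scale × log(q / (1 − q)); the helper names are hypothetical.

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + exp(-x))

def logistic_quantile(q, location, scale):
    """Quantile function of the logistic distribution."""
    return location + scale * log(q / (1 - q))

location = logit(0.05)   # prior centred on a false positive rate of 5%
scale = 0.5

# Central 95% prior interval, transformed back to the probability scale.
lower = inv_logit(logistic_quantile(0.025, location, scale))
upper = inv_logit(logistic_quantile(0.975, location, scale))
print(f"{lower:.3f} to {upper:.3f}")
```

Under these assumptions, the interval runs from roughly 0.8% to roughly 25%, in line with the range stated in the text.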

Testing summary-statistic based measures for normality
The distributions of meta-d′/d′ and meta-d_a/d_a were often heavily leptokurtic (kurtosis for meta-d′/d′ Mdn = 8.53 and for meta-d_a/d_a Mdn = 8.89) but only slightly skewed (skewness for meta-d′/d′ Mdn = 0.68 and for meta-d_a/d_a Mdn = 0.58; see Supplementary Fig. 1 for example distributions). We detected a significant deviation from the normal distribution in 47.54% of simulations for meta-d′/d′ and 51.46% for meta-d_a/d_a (see Fig. 1). Log-transforming meta-d′/d′ and meta-d_a/d_a reduced the kurtosis to Mdn = 5.63 for meta-d′/d′ and Mdn = 5.87 for meta-d_a/d_a, without strongly affecting skewness (meta-d′/d′ Mdn = -1.01; meta-d_a/d_a Mdn = -0.92), resulting in a reduced number of significant deviations from normality (28.24% for meta-d′/d′ and 30.06% for meta-d_a/d_a; see Supplementary Fig. 2). Gamma correlations also frequently deviated from normality (18.84%). The deviations appeared to be related to the fact that gamma is bounded between -1 and 1 (see Supplementary Fig. 3). For the other measures, kurtosis ranged between Mdn = 2.67 and Mdn = 3.12 and skewness between Mdn = 0.00 and Mdn = 0.31, resulting in between 1.26% (meta-d_a) and 5.66% (slopes) significant deviations from normality.

Grouping variable
Figure 2 shows the posterior distribution of the false positive rate associated with different summary-statistics and hierarchical measures of metacognitive accuracy, assuming comparisons between two independent groups.
Table 1 provides Bayes factors quantifying the evidence on whether the observed false positive rate of different measures of metacognitive accuracy is consistent with a nominal alpha level of 5%. There was moderate evidence that the empirical false positive rate is identical to 5% for gamma correlations, type 2 ROC curves, meta-d′, and meta-d_a. For confidence slopes, the evidence was not conclusive.
Concerning meta-d′/d′, there was moderate evidence that the empirical false positive rate is below 5% when significance testing was performed using t-tests. There was moderate evidence that the false positive rate is 5% when meta-d′/d′ was log-transformed before using t-tests and strong evidence that the false positive rate is 5% when significance testing was based on the Mann-Whitney-U-test. Likewise, for meta-d_a/d_a, there was moderate evidence that the empirical false positive rate is lower than 5% when t-tests were used. However, there was moderate evidence that the empirical false positive rate is 5% when meta-d_a/d_a was log-transformed before using t-tests and when Mann-Whitney-U-tests were used. For hierarchical measures, there was extremely strong evidence that the false positive rate of logistic mixed model regression with a random effect on intercepts only is larger than 5%, but there was also moderate evidence that the false positive rate is 5% when logistic mixed model regression included random slopes. There was extremely strong evidence that the false positive rate is larger than 5% for HMeta-d when independent priors were placed on both groups. However, when a prior was placed on the difference between groups in terms of meta-d′/d′, there was a moderate amount of evidence that the false positive rate is 5%.
Table 1 Estimated false positive rates for different measures of metacognitive accuracy in comparisons between two separate groups. Note. BF10 is a Bayes factor quantifying the evidence that the empirical false positive rate is different from 5%, assuming a logistic prior distribution of the logit-transformed false positive rate with a location parameter that corresponded to a false positive rate of 5%, as well as a scale parameter of 0.5. 95% credible intervals were based on the same prior. p(H0|Data) is the posterior probability that the false positive rate is 5% given the data, assuming that the prior probability that the false positive rate is 5% is equal to 0.5
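The posterior probability of the null hypothesis described in the note to Table 1 follows directly from the Bayes factor once a prior probability is fixed. The sketch below shows that conversion; the Bayes factor values in the loop are hypothetical and are not taken from Table 1.

```python
def posterior_prob_h0(bf10, prior_h0=0.5):
    """Posterior probability of H0 given BF10 and the prior probability of H0.

    Posterior odds (H1 : H0) = BF10 * prior odds (H1 : H0).
    """
    prior_odds_h1 = (1.0 - prior_h0) / prior_h0
    return 1.0 / (1.0 + bf10 * prior_odds_h1)

# Hypothetical Bayes factors, for illustration only.
for bf10 in (1 / 3, 1.0, 3.0, 100.0):
    print(f"BF10 = {bf10:7.3f} -> p(H0|Data) = {posterior_prob_h0(bf10):.3f}")
```

With equal prior probabilities, the expression reduces to 1 / (1 + BF10), so a Bayes factor of 1/3 in favour of the null corresponds to a posterior probability of 0.75 for the null.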

An exploratory analysis investigated whether false positives are associated with mean accuracy in a simulated experiment, the number of trials, and the number of participants. Supplementary Table S1 shows that the correlation with mean accuracy, the number of trials, and the number of subjects was negligibly small for all measures, with one exception: For logistic mixed model regression with fixed slopes, there was very strong evidence that the probability of a false positive increased with average accuracy and with trial number.

Continuous predictor variables
Figure 3 shows the posterior distribution of the false positive rate associated with different summary-statistics and hierarchical measures of metacognitive accuracy when assessing the relationship to a continuous predictor variable.
Table 2 provides Bayes factors quantifying the evidence on whether the observed false positive rate of different measures of metacognitive accuracy is consistent with 5%. Concerning summary-statistic based measures, there was moderate evidence that the empirical false positive rate is 5% for type 2 ROC curves, confidence slopes, gamma correlations, meta-d′, meta-d′/d′ (tested using Pearson's r), and log-transformed meta-d_a/d_a. For meta-d_a, log-transformed meta-d′/d′, meta-d′/d′ (tested using Spearman's ρ), and meta-d_a/d_a (both when tested using Pearson's r and Spearman's ρ), there was even strong evidence that the observed false positive rate is 5%.
Concerning hierarchical measures, there was extremely strong evidence that the false positive rate of logistic mixed model regression with a random effect on intercepts only is larger than 5%, but there was also moderate evidence that the false positive rate of logistic mixed model regression with random slopes is identical to 5%. We also found strong evidence that HMeta-d using default priors is associated with a false positive rate below 5%. HMeta-d with stronger priors on its parameters resulted in an even smaller false positive rate. In contrast, the evidence was not conclusive as to whether the false positive rate was different from 5% when we tested a variant of HMeta-d with close-to-flat priors.
Table 2 Estimated false positive rates for different measures of metacognitive accuracy in simulations with a continuous predictor. Note. BF10 is a Bayes factor quantifying the evidence that the empirical false positive rate is different from 5%, assuming a logistic prior distribution of the logit-transformed false positive rate with a location parameter that corresponded to a false positive rate of 5%, as well as a scale parameter of 0.5. 95% credible intervals were based on the same prior. p(H0|Data) is the posterior probability that the false positive rate is 5% given the data, assuming that the prior probability that the false positive rate is 5% is equal to 0.5
Fig. 4 Phi-correlations between the outcomes of the statistical tests for each possible pair of measures of metacognitive accuracy. Note. The outcome of the statistical test was coded as 1 if the 95% CI excluded zero (HMeta-d) or if the test was significant (all other measures) and as 0 otherwise. (A) Correlation between the outcome of statistical tests with respect to the grouping variable. (B) Correlation between the outcome of statistical tests with respect to the continuous predictor variable

Again, we performed an exploratory analysis to examine whether false positives are associated with mean accuracy, the number of trials, or the number of participants. Supplementary Table S2 shows that only for logistic mixed model regression with fixed slopes, the probability of a false positive increased with mean accuracy and with trial number. For all other measures, we detected no association with the probability of a false positive.

P-hacking by selecting measures of metacognitive accuracy
Next, we examined whether it is possible to p-hack results by computing several measures of metacognitive accuracy and selecting the measure based on the outcome of the statistical test, without correcting the alpha level for multiple comparisons. P-hacking by post-hoc selection of a measure of metacognitive accuracy is possible if the results of the corresponding statistical tests are not perfectly correlated. Figure 4 shows that for most pairs of measures of metacognitive accuracy, the correlations between the outcomes of statistical tests are only moderate.
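The correlation between two binary test outcomes, as used in this analysis, is the phi coefficient. The following is a minimal sketch of how it can be computed from two vectors of outcomes coded 1 (significant) and 0 (not significant); the example vectors are made up for illustration and do not come from the simulations.

```python
import math

def phi_coefficient(x, y):
    """Phi coefficient between two equally long sequences of 0/1 outcomes."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined if one outcome never varies
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical outcomes of two measures across eight simulated experiments.
measure_a = [1, 1, 0, 0, 0, 0, 0, 0]
measure_b = [1, 0, 1, 0, 0, 0, 0, 0]
print(f"phi = {phi_coefficient(measure_a, measure_b):.3f}")
```

A phi coefficient of 1 would mean that two measures always agree on significance, in which case switching between them could not change the result.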
Finally, we examined the false positive rate if researchers compute n different measures of metacognitive accuracy and report only those that yielded a significant result without correction for multiple comparisons. Because the purpose of the analysis was specifically to assess the impact of switching between different measures of metacognitive accuracy, not of switching between statistical methods, we did not include all variants of testing meta-d′/d′ and meta-d_a/d_a; instead, we assumed for this specific analysis that meta-d′/d′ and meta-d_a/d_a were tested using Mann-Whitney-U-tests or Spearman's rho, respectively. We also excluded measures for which we had evidence that the false positive rate is not 5%. Figure 5 shows that the false positive rate is close to 5% when only one measure is randomly selected. However, randomly selecting two different measures of metacognitive accuracy without correction is sufficient to increase the false positive rate to approximately 8%. When all nine different measures are computed, the false positive rate increases to 16.6% and 17.9%, respectively. Figure 5 also shows that unreported changes of the measure of metacognition increase the false positive rate to no lesser extent when the sample size is at least 100 subjects.
Fig. 5 Frequency of false positive results if researchers compute n measures of metacognitive accuracy and report only favorable outcomes
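As a point of reference for these figures, it can help to compare them with the two limiting cases. If the test outcomes were perfectly correlated, the rate would stay at 5% regardless of n; if they were statistically independent, it would follow the familiar 1 - (1 - alpha)^n formula. The sketch below computes the independent case; it is a benchmark under an assumed independence, not a reanalysis of the simulations.

```python
def familywise_rate_if_independent(n_measures, alpha=0.05):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_measures

for n in (1, 2, 9):
    rate = familywise_rate_if_independent(n)
    print(f"n = {n}: upper benchmark under independence = {rate:.1%}")
```

The rates reported above (approximately 8% for two measures, 16.6% and 17.9% for nine) lie between the two limits, which fits the observation that the test outcomes are positively but only moderately correlated.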

Discussion
We found that the false positive rates associated with most summary-statistic based measures of metacognitive accuracy were not distinguishable from 5%, including gamma correlations, type 2 ROC curves, meta-d′, and meta-d_a. For group comparisons using meta-d′/d′ or meta-d_a/d_a, the false positive rate is too low when meta-d′/d′ or meta-d_a/d_a is directly subjected to a t-test. However, the problem can be solved either by log transformation or by using the Mann-Whitney-U-test. In contrast, hierarchical measures of metacognitive accuracy did not perform consistently well: For logistic mixed model regression, the false positive rates are dramatically inflated when random slopes are omitted from the model specification. For HMeta-d, the false positive rate is slightly increased for group comparisons with independent priors on both groups, but the false positive rate is adequate when the prior is placed on the difference between groups. When the standard priors of HMeta-d are used with a continuous predictor variable, the false positive rate is below 5%; the false positive rate cannot be distinguished from 5% if close-to-flat priors are used. We also found that false positive rates increase dramatically when researchers choose a measure of metacognitive accuracy based on the results of statistical tests without correction of the alpha level for multiple comparisons. In general, the present study demonstrates that it should not be assumed a priori that the false positive rate associated with a measure of metacognitive accuracy is acceptable; we strongly recommend using only measures for which the false positive rate has been demonstrated to be appropriate.

Robustness of measures of metacognitive accuracy
In the present study, none of the summary-statistic based measures of metacognitive accuracy were associated with an increased false positive rate. Thus, we did not replicate the increased false positive rate of gamma correlations reported in a previous simulation study (Murayama et al., 2014). In the present study, only hierarchical measures were found to have an increased false positive rate. First, the error rates associated with logistic mixed model regression with fixed slopes are extremely high. Previous studies have already shown that omitting random slopes from the model specification carries the risk of an increased false positive rate (Oberauer, 2022). Our study shows that such a high false positive rate not only occurs in ad-hoc simulations, but is also found in simulations whose characteristics are very similar to those of real datasets.
Second, the error rate associated with HMeta-d in simulations with two groups and independent priors is slightly above 5%, which is within the range of empirical false positive rates that are considered to be robust (Bradley, 1978). Nevertheless, from the perspective of the ongoing reproducibility crisis in psychology (Nelson et al., 2018; Nosek et al., 2022; Pashler & Harris, 2012), it seems to us that even such a small excess in empirical false positive rate is too large: If the true false positive rate is 6.7% instead of 5%, the risk of published studies reporting statistical evidence for a non-existent effect increases by about 25%. For this reason, we recommend placing a prior on the between-group difference, which in our simulations is sufficient to avoid an increased false positive rate.

Considerations beyond false positive rates
The false positive rate is only one factor that researchers need to consider when choosing a measure of metacognitive accuracy. A more commonly discussed factor is the validity of different measures of metacognitive accuracy: Measures of metacognitive accuracy should not be confounded by other theoretically important variables. Specifically, previous studies have examined whether measures of metacognitive accuracy are contaminated by task performance, task criteria, and confidence criteria (Barrett et al., 2013; Guggenmos, 2021; Masson & Rotello, 2009; Rahnev, 2023; Rausch & Zehetleitner, 2017; Rausch et al., 2023; Shekhar & Rahnev, 2021). Measures based on type 2 signal detection theory have become very popular because it was promised that type 2 signal detection theory isolates measures of metacognitive accuracy from some of these confounds (Fleming & Lau, 2014). For example, type 2 ROC curves have been designed to control for the criteria that participants apply when reporting their level of confidence (Fleming et al., 2010). Unfortunately, type 2 ROC curves do not control for task performance and task criteria (Fleming & Lau, 2014), and even the control for confidence criteria is not necessarily robust (Shekhar & Rahnev, 2021). A reanalysis of seven experiments showed a medium-sized correlation between type 2 ROC curves and task performance and only a very small correlation of type 2 ROC curves with task criteria and with confidence criteria (Rahnev, 2023). Meta-d′/d′ was designed to explicitly control for task performance, task criterion, and confidence criteria simultaneously without assuming a specific generative model underlying confidence judgments (Maniscalco & Lau, 2014). Unfortunately, it has been repeatedly shown that the control provided by meta-d′/d′ is not necessarily effective (Guggenmos, 2021; Rahnev, 2023; Shekhar & Rahnev, 2021) and depends on the generative model underlying confidence judgments (Boundy-Singer et al., 2022; Rausch et al., 2023; Zhu et al., 2023). In addition, meta-d′/d′ is influenced by the dynamics of the decision process (Desender et al., 2022). However, a recent study showed that the correlation between meta-d′/d′ and task performance and between meta-d′/d′ and task criteria is very small, although there is a medium-sized correlation between meta-d′/d′ and confidence criteria (Rahnev, 2023). Others have proposed measures of metacognitive accuracy that depend on a specific generative model of confidence (Boundy-Singer et al., 2022; Desender et al., 2022; Guggenmos, 2022; Mamassian & de Gardelle, 2021; Shekhar & Rahnev, 2021, 2022); however, there is currently no consensus on which model best describes human confidence for the maximum number of data sets (Rahnev et al., 2022). Overall, the appropriate choice of a measure of metacognitive accuracy depends on the theoretical variables that need to be controlled for a specific research question, as well as on the statistical properties of confidence judgments in a specific experiment.
A second factor that researchers need to consider is statistical power. Although the present study did not explicitly address statistical power, the present results are informative about the statistical power of meta-d′/d′ and meta-d_a/d_a in group designs with t-tests, and of HMeta-d with continuous predictors and standard priors, because the false positive rate was even lower than 5%. An empirical false positive rate below the nominal alpha level indicates that the statistical power is too low (Bradley, 1978). This is because the alpha level is a criterion that researchers use to make statistical inferences, controlling both power and false positive rate. If researchers are willing to accept a higher false positive rate by increasing the alpha level, they are more likely to detect effects if they exist. Likewise, if researchers keep the false positive rate excessively small, the power to detect effects that do exist will also be limited. The low power of meta-d′/d′ and meta-d_a/d_a can be explained by the high number of simulations in which meta-d′/d′ and meta-d_a/d_a were not normally distributed. Importantly, low statistical power also paradoxically implies that the probability that a published significant research finding is true is low (Ioannidis, 2005). Thus, previously reported results based on meta-d′/d′ and meta-d_a/d_a in group designs (assuming parametric statistics and without log-transformation) or results based on HMeta-d with regression analysis and standard priors should be interpreted with caution.
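The link between the effective false positive rate and power can be illustrated with a simple two-sided z-test. This is a sketch under stated assumptions: the z-test is a stand-in for the tests used in the study, and both the standardized effect of 2.5 and the effective alpha of 0.02 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def z_test_power(effect_z, alpha):
    """Power of a two-sided z-test when the test statistic is N(effect_z, 1)."""
    standard_normal = NormalDist()
    critical = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - standard_normal.cdf(critical - effect_z)
    lower_tail = standard_normal.cdf(-critical - effect_z)
    return upper_tail + lower_tail

effect_z = 2.5  # hypothetical standardized effect (expected z statistic)

for alpha in (0.05, 0.02):
    print(f"effective alpha = {alpha:.2f}: "
          f"rejection rate under H0 = {z_test_power(0.0, alpha):.3f}, "
          f"power = {z_test_power(effect_z, alpha):.3f}")
```

Under the null hypothesis the rejection rate equals the effective alpha, and for any fixed true effect a stricter effective alpha yields lower power, which is the reasoning behind reading a false positive rate below 5% as a sign of reduced power.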
For group designs, the false positive rate of meta-d′/d′ and meta-d_a/d_a is as expected when the non-parametric Mann-Whitney-U-test is used or when meta-d′/d′ and meta-d_a/d_a are log-transformed, so we recommend using one of these procedures in future studies. For HMeta-d, the false positive rate was closer to 5% when a close-to-flat prior was used, but a close-to-flat prior may not be considered a reasonable choice because it gives too much weight to extremely large meta-d′/d′ ratios that are not plausible from a theoretical point of view. Instead, it is preferable to ensure sufficient statistical power of HMeta-d through sample size planning using Bayesian power analysis (Kruschke, 2014).

Preregistration is recommended
Regardless of the measure researchers choose to use, we recommend that researchers preregister their choice of measure of metacognitive accuracy prior to data collection, especially in cases where more than one choice of measure is defensible to a sceptical audience. The reason is that the false positive rate is unacceptably high if the measure of metacognitive accuracy is chosen based on the outcome of the statistical test. It is well known that undisclosed flexibility in data analysis causes an unacceptably high number of false positive results (Simmons et al., 2011). The present study demonstrates that the measure of metacognitive accuracy is a choice that gives researchers the opportunity to engage in so-called p-hacking (Nelson et al., 2018). This means that it is crucial that researchers select measures of metacognitive accuracy independently of the outcome of their statistical tests. The best way to demonstrate that analysis procedures have been selected independently of the results is preregistration before data collection (Chambers, 2013; Nosek & Lakens, 2014; Wagenmakers et al., 2012). Unfortunately, if measures disagree on whether an effect is present, researchers may be tempted to convince themselves that the significant measure is the more accurate or powerful one. Thus, preregistration is an important tool for researchers to try to be objective about their own results (Oberauer & Lewandowsky, 2019).

Conclusion
The present analysis showed that gamma correlations, confidence slopes, type 2 ROC curves, meta-d′, and meta-d_a are all adequate in terms of false positive rate. The false positive rate associated with meta-d′/d′ is adequate when meta-d′/d′ is log-transformed or when significance is assessed by a non-parametric statistical test. For logistic mixed model regression, the false positive rate is too high if random slopes are omitted from the model. For HMeta-d, the false positive rate is slightly too high if independent priors are placed on two different groups. Selecting measures of metacognitive accuracy based on the outcome of statistical tests inflates false positive rates, which is why preregistration is recommended for future studies investigating metacognitive accuracy. Finally, because it cannot be assumed a priori that newly proposed measures of metacognitive accuracy are adequate in terms of false positive rates, we recommend that any new method to assess metacognitive accuracy should be carefully validated in terms of false positive rate.

Fig. 1 P-values of Kolmogorov-Smirnov tests for normality of summary-statistic based measures of metacognitive accuracy. Note. The expected null distribution was simulated by sampling from a Gaussian distribution using the same sample sizes as in the other simulations

Fig. 2 Posterior distributions of false positive rate in comparisons between two groups associated with different summary-statistics and hierarchical measures of metacognitive accuracy. Note. Colours indicate Bayes factors quantifying the strength of evidence that the false positive rate is different from 5%

Fig. 3 Posterior distributions of false positive rate associated with different summary-statistics and hierarchical measures of metacognitive accuracy in simulations with a continuous predictor variable. Note. Colours indicate Bayes factors quantifying the strength of evidence that the false positive rate is different from 5%