Detecting and diagnosing prior and likelihood sensitivity with power-scaling

Determining the sensitivity of the posterior to perturbations of the prior and likelihood is an important part of the Bayesian workflow. We introduce a practical and computationally efficient sensitivity analysis approach using importance sampling to estimate properties of posteriors resulting from power-scaling the prior or likelihood. On this basis, we suggest a diagnostic that can indicate the presence of prior-data conflict or likelihood noninformativity and discuss limitations to this power-scaling approach. The approach can be easily included in Bayesian workflows with minimal effort by the model builder and we present an implementation in our new R package priorsense. We further demonstrate the workflow on case studies of real data using models varying in complexity from simple linear models to Gaussian process models.


Introduction
Bayesian inference is characterised by the derivation of a posterior from a prior and a likelihood. As the posterior is dependent on the specification of these two components, investigating its sensitivity to perturbations of the prior and likelihood is a critical step in the Bayesian workflow (Depaoli et al., 2020; Gelman, Vehtari, et al., 2020; Lopes & Tobias, 2011). Along with indicating the robustness of an inference in general, such sensitivity is related to issues of prior-data conflict (Al Labadi & Evans, 2017; Evans & Moshonov, 2006; Reimherr et al., 2021) and likelihood noninformativity (Gelman et al., 2017; Poirier, 1998). Historically, sensitivity analysis has been an important topic in Bayesian methods research (e.g. Berger, 1990; Berger et al., 1994; Canavos, 1975; Hill & Spall, 1994; Skene et al., 1986). However, the amount of research on the topic has diminished (Berger et al., 2000; Watson & Holmes, 2016) and results from sensitivity analyses are seldom reported in empirical studies employing Bayesian methods (van de Schoot et al., 2017). We suggest that a reason for this is the lack of sensitivity analysis approaches that are easily incorporated into existing modelling workflows.
In this work, we present a sensitivity analysis approach that fits into workflows in which modellers use probabilistic programming languages, such as Stan (Stan Development Team, 2021) or PyMC (Salvatier et al., 2016), and employ Markov chain Monte Carlo (MCMC) methods to estimate posteriors via posterior draws (e.g. workflows described in Gelman, Vehtari, et al., 2020; Grinsztajn et al., 2021; Schad et al., 2021). The number of active users of such frameworks is currently estimated to be over a hundred thousand (Carpenter, 2022). We provide examples with models that are commonly used by this community, but the general principles are not tied to any specific model or prior families. Furthermore, as the approach focuses on MCMC-based workflows, analytical derivations that would rely on conjugate priors or specific model families are not the focus and are not presented here.
A common workflow is to begin with a base model with template or 'default' priors, and iteratively build more complex models (Gelman, Vehtari, et al., 2020). Recommended template priors, and default priors in higher-level interfaces to Stan and PyMC, such as rstanarm (Goodrich et al., 2020), brms (Bürkner, 2017), and bambi (Capretto et al., 2022), are designed to be weakly informative and should work well when the data is highly informative so that the likelihood dominates. However, the presence of prior and likelihood sensitivity should still be checked, as no prior can be universally applicable. Considering the prevalence of default priors, a tool that assists in checking for prior (and likelihood) sensitivity is a valuable contribution to the community.
User-guided sensitivity analysis can be performed by fitting models with different specified perturbations to the prior or likelihood (Spiegelhalter et al., 2003), but this can require substantial amounts of both user and computing time (Jacobi et al., 2018; Pérez et al., 2006). Using more computationally efficient methods can reduce the computation time, but existing methods, while useful in many circumstances, are not always applicable. They are focused on particular types of models (Hunanyan et al., 2022; Roos et al., 2021) or inference mechanisms (Roos et al., 2015), rely on manual specification of perturbations (McCartan, 2022), require substantial or technically complex changes to the model code that hinder widespread use (Giordano et al., 2018; Jacobi et al., 2018), or may still require substantial amounts of computation time (Bornn et al., 2010; Ho, 2020).
We present a complementary sensitivity analysis approach that aims to
• be computationally efficient;
• be applicable to a wide range of models;
• provide automated diagnostics;
• require minimal changes to existing model code and workflows.
We emphasise that the approach should not be used for repeated tuning of priors until diagnostic warnings no longer appear. Instead, the approach should be considered as a diagnostic to detect accidentally misspecified priors (for example default priors) and unexpected sensitivities or conflicts. The reaction to diagnostic warnings should always involve careful consideration about domain expertise, priors, and model specification.

Figure 1: Here, the prior is power-scaled, and the effect on the posterior is shown. In this case the prior is normal(0, 2.5) and the likelihood is equivalent to normal(5, 1). Power-scaling the prior by different α values (in this case 0.5 and 2.0) shifts the posterior (shaded for emphasis), indicating prior sensitivity.
Our proposed approach uses importance sampling to estimate properties of perturbed posteriors that result from power-scaling (exponentiating by some α > 0) the prior or likelihood (see Figure 1). We use a variant of importance sampling (Pareto smoothed importance sampling; PSIS) that is self-diagnosing and alerts the user when estimates are untrustworthy (Vehtari et al., 2022). We propose a diagnostic, based on the change to the posterior induced by this perturbation, that can indicate the presence of prior-data conflict or likelihood noninformativity. Importantly, as long as the changes to the priors or likelihood induced by power-scaling are not too substantial, the procedure does not require refitting the model, which drastically increases its efficiency. The envisioned workflow is as follows (also see Figure 2):
(1) Fit a base model (either a template model or a manually specified model) to data, resulting in a base posterior distribution.
(2, 3) Estimate properties of perturbed posteriors that result from separately power-scaling the prior and likelihood.
(4, 5) Evaluate the extent to which the perturbed posteriors differ from the base posterior numerically and visually.
(6) Diagnose based on the pattern of prior and likelihood sensitivity.
(8) Continue with use of the model for its intended purpose.

Power-scaling perturbations
The proposed sensitivity analysis approach relies on separately perturbing the prior or likelihood through power-scaling (exponentiating by some α > 0 close to 1). This power-scaling is a controlled, distribution-agnostic method of modifying a probability distribution. Intuitively, it can be considered to weaken (when α < 1) or strengthen (when α > 1) the component being power-scaled in relation to the other. Although power-scaling changes the normalising constant, this is not a concern when using Monte Carlo approaches for estimating posteriors via posterior draws. Furthermore, while the posterior can become improper when α approaches 0, this is not an issue as we only consider values close to 1. For all non-uniform distributions, as α diverges from 1, the shape of the distribution changes. However, it retains the support of the base distribution (if the density at a point in the base distribution is zero, raising it to any power will still result in zero; likewise any nonzero density will remain nonzero). In the context of prior perturbations, these properties are desirable as slight perturbations from power-scaling result in distributions that likely represent similar implied assumptions to those of the base distribution. A set of slightly perturbed priors can thus be considered a reasonable class of distributions for prior sensitivity analysis (see Berger, 1990; Berger et al., 1994). For the likelihood, power-scaling acts as an approximation for decreasing or increasing the number of (conditionally independent) observations, akin to data cloning (Lele et al., 2007) and likelihood weighting (Agostinelli & Greco, 2013; Greco et al., 2008).
The power-scaling approach is not dependent on the form of the distribution family and will work providing that the distribution family is non-uniform (distributions with parameters controlling the support will only be power-scaled with respect to the base support).To provide intuition, we present analytically how power-scaling affects several exponential family distributions commonly used as priors (Figure 3 and Table 1).

Figure 3: The effect of power-scaling on exponential family distributions commonly used as priors (panels: normal(0, 1), beta(2, 1), gamma(2, 2); each panel shows the base and power-scaled densities). In each case, the resulting distributions can be expressed in the same form as the base distribution with modified parameters. The power-scaling approach is not tied to specific distribution families, and these formulations are provided for intuition.
For instance, a normal distribution, normal(θ | µ, σ) ∝ exp(−(θ − µ)² / (2σ²)), when power-scaled by some α > 0 simply scales the σ parameter by α^(−1/2), so that normal(θ | µ, σ)^α ∝ normal(θ | µ, α^(−1/2) σ).

Power-scaling, while effective, is only able to perturb a distribution in a particular manner. For example, it is not possible to directly shift the location of a distribution via power-scaling, without also changing other aspects. Like most diagnostics, when power-scaling sensitivity analysis does not indicate sensitivity, this only means that it could not detect sensitivity to power-scaling, not that the model is certainly well-behaved or insensitive to other types of perturbations. Nevertheless, power-scaling remains an intuitive perturbation as it mirrors increasing or decreasing the strength of prior beliefs or amount of data.
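The normal case can be checked numerically. The following is a minimal sketch, not part of priorsense: the helper names are hypothetical, and it only verifies that raising a normal(µ, σ) density to the power α gives something proportional to a normal(µ, σ/√α) density, i.e. that their ratio does not depend on θ.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of normal(mu, sigma) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def power_scaled_ratio(x, mu, sigma, alpha):
    """Ratio normal(x | mu, sigma)^alpha / normal(x | mu, sigma / sqrt(alpha)).

    If power-scaling only rescales sigma, this ratio is the same constant
    (a normalising constant) for every x.
    """
    scaled_sigma = sigma / math.sqrt(alpha)
    return normal_pdf(x, mu, sigma) ** alpha / normal_pdf(x, mu, scaled_sigma)

# Illustrative values (assumed): prior normal(0, 2.5), alpha = 2.
mu, sigma, alpha = 0.0, 2.5, 2.0
ratios = [power_scaled_ratio(x, mu, sigma, alpha) for x in (-3.0, -1.0, 0.0, 0.5, 4.0)]
ratio_spread = max(ratios) - min(ratios)  # close to zero up to floating-point error
print(ratios, ratio_spread)
```

The constant ratio is the changed normalising constant, which is why it can be ignored when working with posterior draws.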

Power-scaling priors
In order for the sensitivity analysis approach to be independent of the number of parameters in the model, generally all priors are power-scaled simultaneously. However, in some cases, certain priors should be excluded from this set or others selectively power-scaled. For example, in hierarchical models, power-scaling both top- and intermediate-level priors can lead to unintended results. To illustrate this, consider two forms of prior, a non-hierarchical prior with two independent parameters p(θ) p(ϕ) and a hierarchical prior of the form p(θ | ϕ) p(ϕ). In the first case, the appropriate power-scaling for the prior is p(ϕ)^α p(θ)^α, while in the second, only the top-level prior should be power-scaled, that is, p(θ | ϕ) p(ϕ)^α. If the prior p(θ | ϕ) is also power-scaled, θ will be affected by the power-scaling twice, directly and indirectly, perhaps even in opposite directions depending on the parameterisation.

Estimating properties of perturbed posteriors
As the normalising constant for the posterior distribution can rarely be computed analytically in real-world analyses, our approach assumes that the base posterior is approximated using (Markov chain) Monte Carlo draws (workflow step 1, see Figure 2). These draws are used to estimate properties of the perturbed posteriors via importance sampling (workflow steps 2 and 3, see Figure 2). Importance sampling is a method to estimate expectations of a target distribution by weighting draws from a proposal distribution (Robert & Casella, 2004). After computing these weights, there are several possibilities for evaluating sensitivity. For example, different summaries of perturbed posteriors can be computed directly, or resampled draws can be generated using importance resampling (Rubin, 1988).
Importance sampling as a method for efficient sensitivity analysis has been previously described by Berger et al. (1994), Besag et al. (1995), O'Neill (2009), and Tsai et al. (2011). However, one limitation of importance sampling is that it can be unreliable when the variance of importance weights is large or infinite. Hence, as described by Berger et al. (1994), relying on importance sampling to estimate a posterior resulting from a perturbed prior or likelihood, without controlling the width of the perturbation class (e.g. through a continuous parameter to control the amount of perturbation, α in our case), is likely to lead to unstable estimates.
To further alleviate issues with importance sampling, we use Pareto smoothed importance sampling (PSIS; Vehtari et al., 2022), which stabilises the importance weights in an efficient, self-diagnosing and trustworthy manner by modelling the upper tail of the importance weights with a generalised Pareto distribution. In cases where PSIS does not perform adequately, weights are adapted with importance weighted moment matching (IWMM; Paananen, Piironen, et al., 2021), which is a generic adaptive importance sampling algorithm that improves the implicit proposal distribution by iterative weighted moment matching. The combination of using a continuous parameter to control the amount of perturbation, along with PSIS and IWMM, allows for a reliable and self-diagnosing method of estimating properties of perturbed posteriors.

Calculating importance weights for power-scaling perturbations
Consider an expectation of a function h of parameters θ, which come from a target distribution f(θ):

\[ \mathrm{E}_f[h(\theta)] = \int h(\theta)\, f(\theta)\, d\theta. \tag{1} \]

In cases when draws can be generated from the target distribution, the simple Monte Carlo estimate can be calculated from a sequence of S draws from f(θ):

\[ \mathrm{E}_f[h(\theta)] \approx \frac{1}{S} \sum_{s=1}^{S} h(\theta^{(s)}), \quad \text{where } \theta^{(s)} \sim f(\theta). \tag{2} \]

As an alternative to calculating the expectation directly with draws from f(θ), the importance sampling estimate instead uses draws from a proposal distribution g(θ) and the ratio between the target and proposal densities, known as the importance weights w. The self-normalised importance sampling estimate does not require known normalising constants of the target or proposal. Thus, it is well suited for use in the context of probabilistic programming languages, which do not calculate these:

\[ \mathrm{E}_f[h(\theta)] \approx \frac{\sum_{s=1}^{S} h(\theta^{(s)})\, w^{(s)}}{\sum_{s=1}^{S} w^{(s)}}, \quad w^{(s)} = \frac{f(\theta^{(s)})}{g(\theta^{(s)})}, \quad \text{where } \theta^{(s)} \sim g(\theta). \tag{3} \]

In the context of power-scaling perturbations, the proposal distribution is the base posterior and the target distribution is a perturbed posterior resulting from power-scaling. If the proposal and target distributions are expressed as the products of the prior p(θ) and likelihood p(y | θ), with one of these components raised to the power of α, the importance sampling weights only depend on the density of the component being power-scaled. For prior power-scaling, the importance weights are

\[ w_{\alpha}^{(s)} = \frac{p(\theta^{(s)})^{\alpha}\, p(y \mid \theta^{(s)})}{p(\theta^{(s)})\, p(y \mid \theta^{(s)})} = p(\theta^{(s)})^{\alpha - 1}. \tag{4} \]

Analogously, the importance weights for likelihood power-scaling are

\[ w_{\alpha}^{(s)} = p(y \mid \theta^{(s)})^{\alpha - 1}. \tag{5} \]

As the importance weights are only dependent on the density of the power-scaled component at the location of the proposal draws, they are easy to compute for a range of α values. See Appendix B for practical implementation details about computing the weights.
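To make the weight calculation concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the priorsense implementation: the function names are hypothetical, exact draws from a conjugate normal posterior stand in for MCMC draws, and no Pareto smoothing is applied. The prior normal(0, 2.5) and likelihood equivalent to normal(5, 1) are taken from the Figure 1 example, so the self-normalised estimate of the perturbed posterior mean can be compared with the closed-form value.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log density of normal(mu, sigma) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def powerscale_weights(log_component, alpha):
    """Self-normalised weights proportional to p_ps(theta)^(alpha - 1).

    `log_component` holds the log density of the power-scaled component
    (prior or likelihood) at each base posterior draw. Work on the log
    scale and subtract the maximum for numerical stability.
    """
    log_w = [(alpha - 1.0) * lc for lc in log_component]
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def weighted_mean(values, weights):
    """Self-normalised importance sampling estimate of a mean (weights sum to 1)."""
    return sum(v * w for v, w in zip(values, weights))

# Base posterior for prior normal(0, 2.5) and likelihood normal(5, 1) (conjugate case).
prior_sd, lik_mean, lik_sd = 2.5, 5.0, 1.0
post_prec = 1.0 / prior_sd**2 + 1.0 / lik_sd**2
post_mean = (lik_mean / lik_sd**2) / post_prec
post_sd = math.sqrt(1.0 / post_prec)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
draws = [rng.gauss(post_mean, post_sd) for _ in range(20000)]
log_prior = [log_normal_pdf(t, 0.0, prior_sd) for t in draws]

alpha = 2.0  # strengthen the prior
weights = powerscale_weights(log_prior, alpha)
is_mean = weighted_mean(draws, weights)

# Closed-form mean of the posterior with the prior power-scaled by alpha.
exact_mean = (lik_mean / lik_sd**2) / (alpha / prior_sd**2 + 1.0 / lik_sd**2)
print(is_mean, exact_mean)
```

Only the log density of the scaled component at the draws is needed, so the same draws can be reweighted for any number of α values without refitting.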

Measuring sensitivity
There are different ways to evaluate the effect of power-scaling perturbations on a posterior (workflow steps 4 and 5, see Figure 2). Here we present two options: first, a method that investigates changes in specific posterior quantities of interest (e.g. mean and standard deviation), and second, a method based on the distances between the base marginal posteriors and the perturbed marginal posteriors. These methods should not be considered competing, but rather allow for different levels of sensitivity analysis, and depending on the context and what the modeller is interested in, one may be more useful than the other. Importantly, the proposed power-scaling approach is not tied to any particular method of evaluating sensitivity. These methods are our suggestions, but once quantities or weighted draws from perturbed posteriors are computed, a multitude of comparisons to the base posterior and other posteriors can be performed.

Quantity-based sensitivity
In some cases it can be most useful to investigate sensitivity of particular quantities of interest. Expectations of interest for a perturbed posterior can be calculated from the base draws and the importance weights using Equation (3). Other quantities that are not expectations (such as the median and quantiles) can be derived from the weighted empirical cumulative distribution function (ECDF). Computed quantities can then be compared based on the specific interests of the modeller, or local sensitivity can be quantified by derivatives with respect to the perturbation parameter α.
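As an illustration of deriving a non-expectation quantity from the weighted ECDF, the sketch below computes a weighted quantile as the smallest draw at which the weighted ECDF reaches the requested probability. The function name and the toy draws are hypothetical, and other quantile conventions (for example interpolating between draws) are equally possible.

```python
def weighted_quantile(draws, weights, prob):
    """Smallest draw at which the weighted ECDF reaches `prob`.

    `weights` need not be normalised; they are divided by their sum.
    """
    total = sum(weights)
    cumulative = 0.0
    pairs = sorted(zip(draws, weights))  # sort draws, keeping weights aligned
    for value, weight in pairs:
        cumulative += weight / total     # height of the ECDF step at this draw
        if cumulative >= prob:
            return value
    return pairs[-1][0]                  # guard against rounding in the last step

# Toy example (assumed values): equal weights give the base posterior median,
# weights tilted towards the largest draw move the median upwards.
draws = [1.0, 2.0, 3.0, 4.0]
equal = [1.0, 1.0, 1.0, 1.0]
tilted = [1.0, 1.0, 1.0, 5.0]
median_base = weighted_quantile(draws, equal, 0.5)
median_tilted = weighted_quantile(draws, tilted, 0.5)
print(median_base, median_tilted)
```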

Distance-based sensitivity
We can investigate the sensitivity of marginal posteriors using a distance-based approach. Here, we follow previous work which has quantified sensitivity based on the distance between the base and perturbed posteriors (Al Labadi et al., 2021; Kurtek & Bharath, 2015; O'Hagan, 2003). In principle, many different divergence or distance measures can be used, although there may be slight differences in interpretation (see, for example, Cha, 2007; Lek & van de Schoot, 2019). However, the cumulative Jensen-Shannon divergence (CJS; Nguyen & Vreeken, 2015) has two properties that make it appropriate for our use case. First, its symmetrised form is upper-bounded, like the standard Jensen-Shannon divergence (Lin, 1991), which aids interpretation. Second, instead of comparing probability density functions (PDFs) or empirical kernel density estimates, as the standard Jensen-Shannon divergence does, it compares cumulative distribution functions (CDFs) or ECDFs, which can be efficiently estimated from Monte Carlo draws. Although PDFs could be estimated from the draws using kernel density estimation and then the standard Jensen-Shannon distance used, this relies on smoothness assumptions and may require substantially more draws to be accurate, and lead to artefacts otherwise (for further discussion of the benefits of ECDFs, see, for example, Säilynoja et al., 2022).
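Given two CDFs P(θ) and Q(θ),

\[ \mathrm{CJS}(P(\theta) \,\|\, Q(\theta)) = \int P(\theta) \log_2\!\left(\frac{2 P(\theta)}{P(\theta) + Q(\theta)}\right) d\theta + \frac{1}{2 \ln 2} \int \bigl(Q(\theta) - P(\theta)\bigr)\, d\theta. \tag{6} \]

As a distance measure, we use the symmetrised and metric (square root) version of CJS, normalised with respect to its upper bound, such that it is bounded on the 0 to 1 interval (for further details see Nguyen & Vreeken, 2015):

\[ \mathrm{CJS}_{\mathrm{dist}}(P(\theta) \,\|\, Q(\theta)) = \sqrt{\frac{\mathrm{CJS}(P(\theta) \,\|\, Q(\theta)) + \mathrm{CJS}(Q(\theta) \,\|\, P(\theta))}{\int \bigl(P(\theta) + Q(\theta)\bigr)\, d\theta}}. \tag{7} \]

As CJS is not invariant to the sign of the parameter values, CJS(P(θ)∥Q(θ)) ≠ CJS(P(−θ)∥Q(−θ)), we use max(CJS_dist(P(θ)∥Q(θ)), CJS_dist(P(−θ)∥Q(−θ))) to account for this and ensure applicability regardless of possible transformations applied to posterior draws that may change the sign.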
In our approach, we compare the ECDFs of the base posterior to the perturbed posteriors. The ECDF of the base posterior is estimated from the base posterior draws, whereas the ECDFs of the perturbed posteriors are estimated by first weighting the base draws with the importance weights. The ECDF is a step-function derived from the draws. In an unweighted ECDF, the heights of each step are all equal to 1/S, where S is the number of draws. In the weighted ECDF, the heights of the steps are equal to the normalised importance weights of each draw. As described by Nguyen and Vreeken (2015), when using ECDFs, the integrals in Equations (6) and (7) reduce to sums, which allows for efficient computation.
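The reduction of the integrals to sums can be sketched as follows. This is an assumption-laden illustration, not the priorsense code: the function names are hypothetical, both ECDFs are taken to be weighted step functions on the same set of draws, the integration range is the range of the draws, and the symmetrised CJS is normalised by the integral of P + Q so that the result lies between 0 and 1. In the symmetrised form the linear correction terms of the two directed divergences cancel, so only the logarithmic terms are summed.

```python
import math

def _plog2(p, q):
    """p * log2(2p / (p + q)), using the convention 0 * log(0) = 0."""
    if p <= 0.0:
        return 0.0
    return p * math.log2(2.0 * p / (p + q))

def cjs_dist_one_sided(draws, weights_p, weights_q):
    """Normalised, symmetrised, square-root CJS between two weighted ECDFs.

    Both ECDFs are step functions over the same draws; on each interval
    between consecutive sorted draws they are constant, so the integrals
    become sums of (interval width) * (integrand value).
    """
    order = sorted(range(len(draws)), key=lambda i: draws[i])
    total_p, total_q = sum(weights_p), sum(weights_q)
    cum_p = cum_q = 0.0
    numerator = denominator = 0.0
    for a, b in zip(order[:-1], order[1:]):
        cum_p += weights_p[a] / total_p   # ECDF P on [draws[a], draws[b])
        cum_q += weights_q[a] / total_q   # ECDF Q on the same interval
        width = draws[b] - draws[a]
        numerator += width * (_plog2(cum_p, cum_q) + _plog2(cum_q, cum_p))
        denominator += width * (cum_p + cum_q)
    if denominator == 0.0:
        return 0.0
    return math.sqrt(max(numerator, 0.0) / denominator)

def cjs_dist(draws, weights_p, weights_q):
    """Maximum over theta and -theta, since CJS is not invariant to sign."""
    negated = [-x for x in draws]
    return max(cjs_dist_one_sided(draws, weights_p, weights_q),
               cjs_dist_one_sided(negated, weights_p, weights_q))

# Toy example (assumed values): base ECDF with equal weights versus a
# reweighted ECDF that puts more mass on the largest draw.
draws = [0.0, 1.0, 2.0, 3.0, 4.0]
base_w = [1.0] * 5
tilted_w = [1.0, 1.0, 1.0, 1.0, 6.0]
d_same = cjs_dist(draws, base_w, base_w)
d_tilted = cjs_dist(draws, base_w, tilted_w)
print(d_same, d_tilted)
```

Because each term p·log2(2p/(p+q)) is at most p, the numerator never exceeds the denominator, which is what bounds the distance by 1.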

Local sensitivity
Both distance-based and quantity-based sensitivity can be evaluated for any α value. It is also possible to obtain an overall estimate of sensitivity at α = 1 by differentiation. This follows previous work which defines the local sensitivity as the derivative with respect to the perturbation parameter (Giordano et al., 2018; Gustafson, 2000; Maroufy & Marriott, 2015; Sivaganesan, 1993). For power-scaling, we suggest considering the derivative with respect to log₂(α) as it captures the symmetry of power-scaling around α = 1 and provides values on a natural scale in relation to halving or doubling the log density of the component.
Because of the simplicity of the power-scaling procedure, local sensitivity at α = 1 can be computed analytically with importance sampling for certain quantities, such as the mean and variance, without knowing the analytical form of the posterior. This allows for a highly computationally efficient method to probe for sensitivity in common quantities before performing further sensitivity diagnostics. For quantities that are computed as an expectation of some function h, the derivative at α = 1 can be computed as follows. We denote the power-scaling importance weights as p_ps(θ^(s))^(α−1), where p_ps(θ^(s)) is the density of the power-scaled component, which can be either the prior or likelihood depending on the type of scaling. Then

\[ \left.\frac{\partial\, \mathrm{E}_{\alpha}[h(\theta)]}{\partial \alpha}\right|_{\alpha = 1} \approx \frac{1}{S} \sum_{s=1}^{S} h(\theta^{(s)}) \ln\bigl(p_{ps}(\theta^{(s)})\bigr) - \left(\frac{1}{S} \sum_{s=1}^{S} h(\theta^{(s)})\right) \left(\frac{1}{S} \sum_{s=1}^{S} \ln\bigl(p_{ps}(\theta^{(s)})\bigr)\right). \tag{8} \]
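Consider for example that we are interested in the sensitivity of the posterior mean of the parameters θ. For prior scaling, the derivative of the mean with respect to log₂(α) at α = 1 is then

\[ \left.\frac{\partial\, \mathrm{E}_{\alpha}[\theta]}{\partial \log_2(\alpha)}\right|_{\alpha = 1} \approx \ln(2) \left[ \frac{1}{S} \sum_{s=1}^{S} \theta^{(s)} \ln\bigl(p(\theta^{(s)})\bigr) - \left(\frac{1}{S} \sum_{s=1}^{S} \theta^{(s)}\right) \left(\frac{1}{S} \sum_{s=1}^{S} \ln\bigl(p(\theta^{(s)})\bigr)\right) \right]. \tag{9} \]

As with quantity-based sensitivity, distance-based sensitivity can also be quantified by taking the corresponding derivative. CJS_dist increases from 0 as α diverges from 1 (approximately linearly in log scale) and its derivative is discontinuous at α = 1. As a measure of local power-scaling sensitivity, we take the average of the absolute derivatives of the divergence in the negative and positive α directions, with respect to log₂(α). We approximate this numerically from the ECDFs with finite differences:

\[ D_{\mathrm{CJS}} = \frac{\mathrm{CJS}_{\mathrm{dist}}\bigl(\hat{P}_{1}(\theta) \,\|\, \hat{P}_{1/(1+\delta)}(\theta)\bigr) + \mathrm{CJS}_{\mathrm{dist}}\bigl(\hat{P}_{1}(\theta) \,\|\, \hat{P}_{1+\delta}(\theta)\bigr)}{2 \log_2(1 + \delta)}, \tag{10} \]

where P̂_1(θ) is the ECDF of the base posterior (when α = 1), P̂_{1/(1+δ)}(θ) is the weighted ECDF when α = 1/(1 + δ) and P̂_{1+δ}(θ) is the weighted ECDF when α = 1 + δ. For implementation we use δ = 0.01.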
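The local sensitivity of a posterior mean amounts to a sample covariance between the quantity and the log density of the power-scaled component, multiplied by ln(2) to move to the log₂(α) scale. The sketch below illustrates this under stated assumptions: the function name is hypothetical, exact draws from a conjugate normal posterior stand in for MCMC draws, and the prior normal(0, 2.5) with a likelihood equivalent to normal(5, 1) is the Figure 1 example, for which the derivative is available in closed form for comparison.

```python
import math
import random

def mean_sensitivity_log2(values, log_component):
    """Estimate d E_alpha[h] / d log2(alpha) at alpha = 1.

    `values` are h(theta) at the base posterior draws and `log_component`
    is the log density of the power-scaled component at the same draws.
    Additive constants in the log density cancel in the covariance.
    """
    n = len(values)
    mean_h = sum(values) / n
    mean_lp = sum(log_component) / n
    mean_prod = sum(h * lp for h, lp in zip(values, log_component)) / n
    return math.log(2.0) * (mean_prod - mean_h * mean_lp)

# Conjugate normal example: prior normal(0, 2.5), likelihood normal(5, 1).
prior_sd, lik_mean = 2.5, 5.0
prior_prec = 1.0 / prior_sd**2
post_prec = prior_prec + 1.0
post_mean = lik_mean / post_prec
post_sd = math.sqrt(1.0 / post_prec)

rng = random.Random(2)  # fixed seed so the sketch is reproducible
draws = [rng.gauss(post_mean, post_sd) for _ in range(20000)]
log_prior = [-0.5 * prior_prec * t * t for t in draws]  # unnormalised log prior

estimate = mean_sensitivity_log2(draws, log_prior)

# Closed form: the perturbed mean is lik_mean / (1 + alpha * prior_prec), so its
# derivative w.r.t. alpha at 1 is -lik_mean * prior_prec / (1 + prior_prec)^2.
exact = math.log(2.0) * (-lik_mean * prior_prec / (1.0 + prior_prec) ** 2)
print(estimate, exact)
```

The negative sign reflects that strengthening the prior, which is centred at zero, pulls the posterior mean towards zero.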

Diagnostic threshold
We consider D_CJS ≥ 0.05 to be a reasonable indication of sensitivity. Distance metrics (and corresponding sensitivity diagnostics) can be calibrated and transformed with respect to perturbing a known distribution, such as a standard normal (e.g. Roos et al., 2015). While we do not transform the value of D_CJS in such a way, a comparison with the normal distribution can aid interpretation: For a standard normal, a D_CJS of 0.05 corresponds to the mean being shifted by more than approximately 0.3 standard deviations, or the standard deviation differing by a factor greater than approximately 0.3, when the power-scaling α is changed by a factor of two. A graphical depiction of this distance is shown in Figure S1 in Appendix A. However, depending on how concerned a modeller is with sensitivity, this threshold can be adapted to reflect what constitutes a meaningful change in the specific model.

Diagnosing sensitivity
Sensitivity can be diagnosed by comparing the amount of exhibited prior and likelihood sensitivity (workflow step 6, see Table 2). When a model is well-behaved, it is expected that there will be likelihood sensitivity, as power-scaling the likelihood is analogous to changing the number of (conditionally independent) observations. In hierarchical models, it is important to recognise that this is analogous to changing the number of observations within each group, rather than the number of groups. As such, in hierarchical models, lack of likelihood sensitivity based on power-scaling does not necessarily indicate that the likelihood is weak overall. As there can be relations between parameters, the pattern of sensitivity for a single parameter should be considered in the context of others. Cases in which the posterior is insensitive to both prior and likelihood power-scaling (i.e. uninformative prior with likelihood noninformativity) will likely be detectable from model fitting issues, and are not further addressed by our approach.

Figure 4: A weakly informative normal(0, 10) prior and a well-behaving normal(5, 1) likelihood lead to likelihood domination. This is indicated by little to no prior sensitivity and expected likelihood sensitivity. This is the outcome that many default priors aim for, as the prior has little influence on the posterior. Top row: the prior is power-scaled; bottom row: the likelihood is power-scaled. Note that in the figure the likelihood and posterior densities are almost completely overlapping.
Likelihood domination (the combination of a weakly informative or diffuse prior with a well-behaving and informative likelihood) will result in likelihood sensitivity but no prior sensitivity. This indicates that the posterior is mostly reliant on the data and likelihood rather than the prior (see Figure 4). This is the outcome that default priors aim for, as the prior has little influence on the posterior.
In contrast, prior sensitivity can result from two primary causes, both of which are indications that the model may have an issue: 1) prior-data conflict and 2) likelihood noninformativity.In the case of prior-data conflict, the posterior will exhibit both prior and likelihood sensitivity, whereas in the case of likelihood noninformativity (in relation to the prior) there will be some marginal posteriors which are not as sensitive to likelihood power-scaling as they are to prior power-scaling (or not at all sensitive to likelihood power-scaling).
Prior-data conflict (Evans & Moshonov, 2006; Nott, Wang, et al., 2020; Walter & Augustin, 2009) can arise due to intentionally or unintentionally informative priors disagreeing with, but not being dominated by, the likelihood. When this is the case, the posterior will be sensitive to power-scaling both the prior and the likelihood, as illustrated in Figure 5. When prior-data conflict has been detected, the modeller may wish to modify the model by using a less informative prior (e.g., Evans & Jang, 2011; Nott, Seah, et al., 2020) or using heavy-tailed distributions (e.g., Gagnon, 2022; O'Hagan & Pericchi, 2012).
The presence of prior sensitivity but relatively low (or no) likelihood sensitivity is an indication that the likelihood is weakly informative (or noninformative) in relation to the prior. This can occur, for example, when there is complete separation in a logistic regression. The simplest case of complete separation occurs when there are observations of only one class. For example, suppose a researcher is attempting to identify the occurrence rate of a rare event in a new population. Based on previous research, it is believed that the rate is close to 1 out of 1000. The researcher has since collected 100 observations from the new population, all of which are negative. As the data are only of one class, the posterior will then exhibit prior sensitivity as the likelihood is relatively weak. In the case of weakly informative or noninformative likelihood, the choice of prior will have a direct impact on the posterior and is therefore of greater importance and should be considered carefully. In some cases, the likelihood (or the data) may not be problematic in and of itself, but if the chosen prior is highly informative and dominates the likelihood, the posterior may be relatively insensitive to power-scaling the likelihood. As such, when interpreting sensitivity it is important to consider both the prior and the likelihood and the interplay between them (see related discussion by Gelman et al., 2017).

Figure 5: Conflict between a t₄(0, 2.5) prior and a t₄(5, 1) likelihood results in the posterior (shaded for emphasis) being sensitive to both prior and likelihood power-scaling. Top row: the prior is power-scaled; bottom row: the likelihood is power-scaled.
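The sensitivity patterns described in this section can be summarised as a simple decision rule. The sketch below is a deliberately coarse illustration, not the logic used in priorsense: the function name and the returned labels are hypothetical, and it applies a single threshold (0.05 by default, matching the suggested D_CJS threshold) to each of the two sensitivities, whereas the text also treats likelihood sensitivity that is merely low relative to prior sensitivity as a sign of a weak likelihood.

```python
def diagnose(prior_sensitivity, likelihood_sensitivity, threshold=0.05):
    """Map a pair of sensitivity values for one quantity to a tentative diagnosis.

    The result is a prompt for further investigation, not a verdict: the
    pattern for one parameter should be read in the context of the others.
    """
    prior_flag = prior_sensitivity >= threshold
    likelihood_flag = likelihood_sensitivity >= threshold
    if prior_flag and likelihood_flag:
        return "possible prior-data conflict"
    if prior_flag:
        return "possible weak or noninformative likelihood (relative to the prior)"
    if likelihood_flag:
        return "likelihood domination"
    return "no sensitivity detected"

# Hypothetical sensitivity values for three parameters.
examples = {
    "beta_a": (0.01, 0.12),
    "beta_b": (0.09, 0.11),
    "beta_c": (0.20, 0.02),
}
for name, (prior_s, lik_s) in examples.items():
    print(name, "->", diagnose(prior_s, lik_s))
```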

Sensitivity for parameter combinations and other quantities
As discussed, sensitivity can be evaluated for each marginal distribution separately in a relatively automated manner. This approach may lead to interpretation issues when individual parameters are by definition not informed by the likelihood, or are not readily interpretable. In the case when the likelihood may be informative for a combination of parameters, but not any of the parameters individually, it can be useful to perform a whitening transformation (such as principal component analysis; Kessy et al., 2018) on the posterior draws and then investigate sensitivity in the compressed parameter space. This can indicate which parameter combinations are sensitive to likelihood perturbations, indicating that they are jointly informed by the likelihood, and which are not. This whitening approach works when there are few parameters, but as the number of parameters grows, the compressed components can be more difficult to interpret. Instead, in more complex cases, we suggest the modeller focus on target quantities of interest. For example, in the case of Gaussian process regression or models specifically focused on predictions, it can be more useful to investigate the sensitivity of predictive distributions (Paananen, Andersen, & Vehtari, 2021; Paananen et al., 2019) than posterior distributions of model parameters.

Software implementation
Our approach for power-scaling sensitivity analysis is implemented in priorsense (https://github.com/n-kall/priorsense), our new R (R Core Team, 2022) package for prior sensitivity diagnostics. The implementation focuses on models fit with Stan (Stan Development Team, 2021), but it can be extended to work with other probabilistic programming frameworks that provide similar functionality. The package includes numerical diagnostics and graphical representations of changes in posteriors. These are available for both distance- and quantity-based sensitivity. Further details on the usage and implementation are included in Appendix B.

Simulations
Here we present two simulations demonstrating how the diagnostic D_CJS performs in two scenarios: (a) when the likelihood corresponds to the true model, but the data realisation may weakly inform some parameters, and (b) when the prior is changed to be in increasing conflict with the likelihood. We show that the diagnostic can detect these two cases.

Separation simulation
We generated 1000 data realisations of N = 25 observations, with the following structure: […] We then fit a Bernoulli logit model to each realisation as follows: […] The model is correctly specified and matches the data generating process. However, a single realisation of 25 observations may be weakly informative due to complete or near-complete separation.
For each data realisation, we compare a measure of separation, n_complete (Christmann & Rousseeuw, 2001), to the power-scaling sensitivity diagnostic D_CJS. n_complete is defined as the minimum number of observations that need to be removed to result in complete separation. As shown in Figure 6, separability induces high sensitivity. When the data is completely or nearly separable, the prior sensitivity is high and when the data is far from completely separable, the prior sensitivity is low.
We then transform each data realisation such that x_{1,i} ← x_{1,i}/c and x_{2,i} ← x_{2,i}/c, for c ∈ {1, 2, 4, 6, 8}, to change the scale of the x_1 and x_2 variables and the corresponding coefficients, but not the values of y. We then fit the following model to each transformed data set: β_0 ∼ t_3(0, 2.5), β_k ∼ normal(0, 1).
The model is well specified in the sense that the parameter space of the model includes the parameter value of the data generating process.
As c is increased, the priors on β_1 and β_2 will begin to conflict with the likelihood from finite data. We investigate the effect of this increase on the power-scaling sensitivity diagnostic D_CJS for each regression coefficient.
As shown in Figure 7, the coefficients for the scaled predictors (β_1, β_2) exhibit different degrees of sensitivity depending on the degree of scaling. Prior sensitivity increases as the scaling factor increases, indicating prior-data conflict. Importantly, likelihood sensitivity decreases when c = 4, indicating that the prior is beginning to dominate the likelihood. As expected, the other coefficients (β_3, β_4) do not exhibit sensitivity or changes in sensitivity.

Case studies
In this section, we show how priorsense can be used in a Bayesian model building workflow to detect and diagnose prior sensitivity in realistic models fit to real data (corresponding data and code are available at https://github.com/n-kall/powerscaling-sensitivity). We present a variety of models and show sensitivity diagnostics for different quantities, including regression coefficients (Sections 5.1 and 5.2), scale parameters (Sections 5.1, 5.3, 5.4), model fit (Section 5.5), and posterior predictions (Sections 5.4 and 5.6).
We use the brms package (Bürkner, 2017), which is a high-level R interface to Stan, to specify and fit the simpler regression models and Stan directly for the more complex models.Unless further specified, we use Stan to generate posterior draws using the default settings (4 chains, 2000 iterations per chain, half discarded as warm-up).Convergence diagnostics and effective sample sizes are checked for all model fits (Vehtari et al., 2021), and sampling parameters are adjusted to relieve any identified issues before proceeding with sensitivity analysis.As the quantitative indication of sensitivity, we use D CJS and the threshold of 0.05 as described in Section 2.4, but we also present graphical checks.
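The way the 0.05 threshold is combined with the pattern of prior and likelihood sensitivity in the case studies can be sketched as a small helper. This is a minimal illustration in Python rather than the priorsense implementation; the function name and the label strings are assumptions chosen for this sketch.

```python
THRESHOLD = 0.05  # sensitivity threshold used in the case studies


def diagnose(prior_sens, lik_sens, threshold=THRESHOLD):
    """Map a pair of D_CJS values for one quantity to a qualitative label.

    The labels are illustrative wording for this sketch only.
    """
    prior_flag = prior_sens > threshold
    lik_flag = lik_sens > threshold
    if prior_flag and lik_flag:
        # Both perturbations move the posterior.
        return "potential prior-data conflict"
    if prior_flag:
        # The prior moves the posterior but the likelihood does not.
        return "potential strong prior or weak likelihood"
    return "no prior sensitivity detected"
```

A label of this kind is only a prompt for follow-up checks, such as the graphical diagnostics used in the case studies.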

Body fat (linear regression)
This case study shows a situation in which prior-data conflict can be detected by power-scaling sensitivity analysis. This conflict results from choosing priors that are not of appropriate scales for some predictors. For this case study, we use the bodyfat data set (Johnson, 1996), which has previously been the focus of variable selection experiments (Heinze et al., 2018; Pavone et al., 2022). The aim of the analysis is to predict an expensive and cumbersome water immersion measurement of body fat percentage from a set of thirteen easier-to-measure characteristics, including age, height, weight, and circumferences of various body parts.
We begin with a linear regression model to predict body fat percentage from the aforementioned variables. By default, in brms the β_0 (intercept) and σ parameters are given data-derived weakly informative priors, and the regression coefficients are given improper flat priors. Power-scaling will not affect flat priors, so we specify proper priors for the regression coefficients. We specify the same prior for all coefficients, normal(0, 1), which does not seem unreasonable based on preliminary prior-predictive checks. We arrive at a normal linear regression, y_i ∼ normal(β_0 + Σ_k β_k x_{k,i}, σ), with β_k ∼ normal(0, 1) for each of the thirteen coefficients and the brms default priors on β_0 and σ.
From the marginal posteriors, there do not appear to be issues, and all estimates are in reasonable ranges (Figure S2). Power-scaling sensitivity analysis, performed with the powerscale_sensitivity function, however, shows that there is both prior sensitivity and likelihood sensitivity for one of the parameters, β_wrist (Table 3). This indicates that there may be prior-data conflict. We then check how the ECDF of the posterior is affected by power-scaling of the prior and likelihood. In priorsense, this is done by creating a sequence of weighted draws (for a sequence of α values) using powerscale_sequence, and then plotting this sequence with powerscale_plot_ecdf (Figure 8, left). We see that the posterior is sensitive to both prior and likelihood power-scaling, and that it shifts right (towards zero) as the prior is strengthened, and left (away from zero) as the likelihood is strengthened. This is an indication of prior-data conflict, which can be further seen by plotting the change in quantities using powerscale_plot_quantities (Figure 9). Prior-data conflict is evident from the 'X' shape of the mean plot, as the mean shifts in opposite directions.
As there is prior sensitivity arising from prior-data conflict, which is unexpected and unintentional as our priors were chosen to be weakly informative, we consider modifying the priors. On inspecting the raw data, we see that although the predictor variables are all measured on similar scales, the variances of the variables differ substantially. For example, the variance of wrist circumference is 0.83, while the variance of abdomen is 102.65. This leads our chosen prior to be unintentionally informative for some of the regression coefficients, including wrist, while being weakly informative for others. To account for this, we refit the model with priors empirically scaled to the data, β_k ∼ normal(0, 2.5 s_y / s_{x_k}), where s_y is the standard deviation of y and s_{x_k} is the standard deviation of predictor variable x_k. This corresponds to the default priors used for regression models in the rstanarm package (Goodrich et al., 2020), as described in Gelman, Hill, and Vehtari (2020) and Gabry and Goodrich (2020). After refitting, the posterior mean for β_wrist changes from -1.45 to -1.86, indicating that the base prior was indeed unintentionally informative and in conflict with the data, pulling the estimate towards zero. Power-scaling sensitivity analysis on the adjusted model fit shows that there is no longer prior sensitivity, and there is appropriate likelihood sensitivity (Table 3, Figure 8 right).
This is a clear example of how power-scaling sensitivity analysis can detect and diagnose prior-data conflict. Unintentionally informative priors resulted in the conflict, which could not be detected by only inspecting the posterior estimates of the base model. Once detected and diagnosed, the model could be adjusted and analysis could proceed. It is important to emphasise that the model was modified as the original priors were unintentionally informative. If the original priors had been manually specified based on prior knowledge, it may not have been appropriate to modify the priors after observing the sensitivity, as the precise prior specification would be an inherent part of the model.
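The weighted draws behind such a power-scaling sequence can be illustrated with a minimal Python sketch. It assumes the one-dimensional setting of Figure 1 (prior normal(0, 2.5), likelihood equivalent to normal(5, 1)), uses evenly spaced posterior quantiles as a deterministic stand-in for MCMC draws, and applies plain self-normalised importance weights without the Pareto smoothing or moment matching used in the paper. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def powerscale_weights(log_component, alpha):
    """Self-normalised importance weights proportional to component**(alpha - 1)."""
    log_w = [(alpha - 1.0) * lc for lc in log_component]
    max_log_w = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - max_log_w) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def weighted_mean(draws, weights):
    """Importance-weighted estimate of the posterior mean."""
    return sum(d * w for d, w in zip(draws, weights))


# Conjugate normal example: the base posterior is available in closed form.
prior = NormalDist(0.0, 2.5)
post_prec = 1.0 / 2.5**2 + 1.0
posterior = NormalDist(5.0 / post_prec, math.sqrt(1.0 / post_prec))

# Deterministic stand-in for posterior draws: evenly spaced quantiles.
n = 4000
draws = [posterior.inv_cdf((i + 0.5) / n) for i in range(n)]
log_prior = [math.log(prior.pdf(d)) for d in draws]


def exact_mean(alpha):
    """Posterior mean when the prior is raised to the power alpha."""
    return 5.0 / (alpha / 2.5**2 + 1.0)


for alpha in (0.5, 1.0, 2.0):
    est = weighted_mean(draws, powerscale_weights(log_prior, alpha))
    print(f"alpha={alpha}: weighted mean {est:.3f}, exact {exact_mean(alpha):.3f}")
```

In this toy setting the weighted mean moves towards the prior mean of zero as α increases, which is the same qualitative behaviour as the shift towards zero described for β_wrist.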

Banknotes (logistic regression)
This case study is an example of using power-scaling sensitivity analysis to detect and diagnose likelihood noninformativity. We use the banknote data set (Flury & Riedwyl, 1988) available from the mclust package (Scrucca et al., 2016), which contains measurements of six properties of 100 genuine (Y = 0) and 100 counterfeit (Y = 1) Swiss banknotes. We fit a logistic regression on the status of a note based on these measurements. For priors, we use the template priors normal(0, 10) for the intercept and normal(0, 2.5/s_{x_k}) for the regression coefficients, where s_{x_k} is the standard deviation of predictor k. The model is then y_i ∼ Bernoulli(logit⁻¹(β_0 + Σ_k β_k x_{k,i})), β_0 ∼ normal(0, 10), β_k ∼ normal(0, 2.5/s_{x_k}).
Power-scaling sensitivity analysis indicates prior sensitivity for all predictor coefficients (Table 4). Furthermore, most exhibit low likelihood sensitivity, indicating a weak likelihood. In a Bernoulli model, this may arise if the binary outcome is completely separable by the predictors. This can be confirmed using the detectseparation package (Kosmidis & Schumacher, 2021), which detects infinite maximum likelihood estimates (caused by separation) in binary outcome regression models without fitting the model. Indeed, according to this method, the data set is completely separable and the prior sensitivity will remain, regardless of choice of prior. As shown in the simulation study in Section 4.1, this is not necessarily an indication that the model is misspecified or problematic, but rather the complete separation in the data realisation may be causing issues for estimating the regression coefficients.
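To make the notion of complete separation concrete, the following Python sketch checks the simplest case of a single predictor. It is a deliberately reduced illustration and not the method used by detectseparation, which handles several predictors jointly; the function name is hypothetical.

```python
def completely_separable_1d(x, y):
    """Return True if one threshold on x splits the y == 0 and y == 1 groups perfectly."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both outcome classes must be present")
    # Separation holds if the two groups occupy non-overlapping ranges of x.
    return max(x0) < min(x1) or max(x1) < min(x0)
```

When such a split exists, the likelihood keeps increasing as the coefficient grows in magnitude, so the maximum likelihood estimate is infinite and the prior determines where the posterior concentrates.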

Bacteria treatment (hierarchical logistic regression)
Here, we use the bacteria data set, available from the MASS package (Venables & Ripley, 2002), to demonstrate power-scaling sensitivity analysis in hierarchical models. This data has previously been used by Kurtek and Bharath (2015) in a sensitivity analysis comparing posteriors resulting from different priors. We use the same model structure and similar priors and arrive at matching conclusions. Importantly, we show that the problematic prior can be detected from the resulting posterior, without the need to compare to other posteriors (and without the need for multiple fits). The data set contains 220 observations of the effect of a treatment (placebo, drug with low compliance, drug with high compliance) on 50 children with middle ear infection over 5 time points (week). The outcome variable is the presence (Y = 1) or absence (Y = 0) of the bacteria targeted by the drug. We fit the same generalised linear multilevel model on the data as Kurtek and Bharath (2015), based on an example from Brown and Zhou (2010).
We try different priors for the precision hyperparameter τ. We compare the sensitivity of the base model, with prior τ ∼ gamma(0.01, 0.01), to the comparison priors. Three of these are considered reasonable, τ ∼ normal⁺(0, 10), Cauchy⁺(0, 100) and gamma(1, 2), and one is considered unreasonable, τ ∼ gamma(9, 0.5). These priors are shown in Figure 10. We fit each model with four chains of 10000 iterations (2000 discarded as warm-up) and perform power-scaling sensitivity analysis on each. As discussed in Section 2.2, only the top-level parameters in the hierarchical prior are power-scaled (i.e. the prior on V_i is not power-scaled). Posterior quantities and sensitivity diagnostics for all models are shown in Appendix D.
It is apparent that the τ parameter is sensitive to the prior when using the gamma(9, 0.5) prior. This indicates that the prior may be inappropriately informative. Although there is no indication of power-scaling sensitivity for the µ and β parameters, comparing the posteriors for the models indicates differences in these parameters for the unreasonable τ prior compared to the other priors. This is an important observation, and highlights that power-scaling is a local perturbation and may not influence the model strongly enough to change all quantities, yet can indicate the presence of potential issues.
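For intuition about what power-scaling does to a gamma prior, the following derivation may help. It assumes the shape-rate parametrisation (as in Stan); a and b denote shape and rate, and α is the power-scaling factor.

```latex
p(\tau) \propto \tau^{a-1} e^{-b\tau}
\;\Longrightarrow\;
p(\tau)^{\alpha} \propto \tau^{\alpha(a-1)} e^{-\alpha b \tau}
\propto \mathrm{gamma}\bigl(\tau \mid \alpha(a-1)+1,\ \alpha b\bigr),
\qquad \alpha(a-1)+1 > 0 .
```

Under this assumption, gamma(9, 0.5) becomes gamma(17, 1) for α = 2 and gamma(5, 0.25) for α = 0.5, so the perturbation changes the concentration of the prior while keeping it in the same family.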

Motorcycle crash (Gaussian process regression)
Here, we demonstrate power-scaling sensitivity analysis on a model without readily interpretable model parameters. We fit a Gaussian process regression to the mcycle data set, also available in the MASS package, and show the sensitivity of predictions to perturbations of the prior and likelihood. For a primer on Gaussian process regression, see Seeger (2004).
The data set contains 133 measurements of head acceleration at different time points during a simulated motorcycle crash. It is further described by Silverman (1985). We fit a Gaussian process regression to the data, predicting the head acceleration (y) from the time (x). We use two Gaussian processes: one for the mean and one for the standard deviation of the residuals. The priors on the covariance function parameters are ρ_f ∼ normal⁺(0, 1), ρ_g ∼ normal⁺(0, 1), σ_f ∼ normal⁺(0, 0.05), σ_g ∼ normal⁺(0, 0.5). For the covariance functions K_1 and K_2 of the two processes we use Matérn covariance functions with ν = 3/2. These functions are controlled by the ρ and σ parameters. The ρ parameters are the length-scales of the processes and define how close two points x and x′ must be to influence each other. The σ parameters define the standard deviations of the noise. For efficient sampling with Stan, we use Hilbert space approximate Gaussian processes (Riutort-Mayol et al., 2022; Solin & Särkkä, 2020). The number of basis functions (m_f = m_g = 40) and the proportional extension factor (c_f = c_g = 1.5) are adapted such that the posterior length-scale estimates ρ_f and ρ_g are above the threshold of that which can be accurately approximated (see Riutort-Mayol et al., 2022). We can then focus on the choice of priors for the length-scale parameters (ρ_f, ρ_g) and the marginal standard deviation parameters (σ_f, σ_g). It is known that for a Gaussian process, the ρ and σ parameters are not well informed independently (Diggle & Ribeiro, 2007), so the sensitivity of the marginals may not be properly representative as there may be prior sensitivity no matter the choice of prior. We first demonstrate the sensitivity of the marginals before proceeding with a focus on the sensitivity of the model predictions, in accordance with Paananen, Andersen, and Vehtari (2021).
As expected, there is prior sensitivity in the marginals (Table 5). The prior and likelihood sensitivity for the parameters is high, which may be an indication of an issue; however, it is difficult to determine based on the parameter marginals alone. Instead we follow up by plotting how the predictions are affected by power-scaling. As shown in Figure 11 (top), the predictions around 20 ms exhibit sensitivity to both prior and likelihood power-scaling. The prediction interval widens as the prior is strengthened (α > 1), and narrows as it is weakened (α < 1). Likelihood power-scaling has the opposite effect. This indicates potential prior-data conflict from an unintentionally informative prior. Widening the prior on σ_f from normal(0, 0.05) to normal(0, 0.1) alleviates the conflict such that it is no longer apparent in the predictions (Figure 11, bottom). Plotting the predictions with the raw data indicates a good fit (Figure S3). However, there remains sensitivity in the parameters, although it is lessened (Table 5). This further demonstrates that depending on the model, prior sensitivity may be present, but is not necessarily an issue. We advise modellers to pay attention to specific quantities and properties of interest, particularly when performing sensitivity analyses on more complex models, rather than focusing on parameters without clear interpretations.
Figure 11: Sensitivity of posterior predictions to prior and likelihood power-scaling in the motorcycle case study. Shown in the plots are the mean, 50% and 95% credible intervals for the posterior predictions. Top: original prior σ_f ∼ normal(0, 0.05). There is clear prior and likelihood sensitivity in the predictions around 20 ms after the crash. Bottom: alternative prior σ_f ∼ normal(0, 0.1). There is now no prior sensitivity and minimal likelihood sensitivity for the predictions.
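The role of the length-scale and marginal standard deviation can be seen directly in the Matérn covariance function with ν = 3/2. The Python sketch below evaluates the exact kernel in its standard form; it is not the Hilbert space approximation used for fitting, and the function name is chosen here for illustration.

```python
import math


def matern32(x, x_prime, rho, sigma):
    """Matern covariance with nu = 3/2, length-scale rho and marginal standard deviation sigma."""
    r = abs(x - x_prime)
    scaled = math.sqrt(3.0) * r / rho
    return sigma**2 * (1.0 + scaled) * math.exp(-scaled)
```

The covariance equals sigma squared at zero distance and decays with the distance measured in units of rho, which is why a small prior scale on σ_f restricts how far the mean function can move away from zero.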

US Crime (linear regression with shrinkage prior)
Here, we show how sensitivity can be analysed with respect to model fit. We fit a regression to the UScrime data set, available from the MASS package (Venables & Ripley, 2002), and use a joint prior on the regression coefficients based on a prior on the model fit, Bayesian R² (Gelman et al., 2019). Such a prior structure can be used to specify a weakly informative prior on the model fit to prevent overfitting (Gelman, Hill, & Vehtari, 2020). We use the R2-D2 prior (Zhang et al., 2022) as implemented in brms and check for sensitivity of the posterior R² to changes to the prior on R². The data set has observations from 47 US states in the year 1960. See Clyde et al. (2022) for further details on the data set. We model the crime rate y from 15 predictors x_k using a log-normal observation model. All continuous predictors are log transformed, following Venables and Ripley (2002). We use the brms default weakly informative priors on the intercept β_0 and residual standard deviation σ. In the full model, which includes the R2-D2 prior, s²_{x_k} denotes the sample variance of predictor x_k. We contrast two prior specifications, prior 1: R² ∼ beta(3, 7) and prior 2: R² ∼ beta(0.45, 1.05), shown in Figure 12. The sensitivity analysis indicates that prior 1 may be informative and affecting the posterior. Indeed, the posterior for R² is lower with prior 1 than prior 2 (Table 6).
To follow up on this, we perform leave-one-out cross-validation on both models to compare predictive performance using the elpd_loo metric (Vehtari et al., 2020). The results, also shown in Table 6, indicate that prior 1 leads to lower predictive performance than prior 2, and induces a lower effective number of parameters (p_loo). This further corroborates the results of the sensitivity analysis, and shows that the power-scaling sensitivity diagnostic can be used as an early indication of issues that can influence predictive performance.
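The difference between the two priors on R² can be made concrete by computing their means and standard deviations from the standard beta formulas. This is a small illustrative Python sketch; the function name is hypothetical.

```python
import math


def beta_mean_sd(a, b):
    """Mean and standard deviation of a beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


for name, (a, b) in {"prior 1": (3.0, 7.0), "prior 2": (0.45, 1.05)}.items():
    mean, sd = beta_mean_sd(a, b)
    print(f"{name}: beta({a}, {b}) mean {mean:.2f}, sd {sd:.2f}")
```

Both priors have the same mean, but prior 1 is roughly twice as concentrated around it, which is consistent with it being the more informative specification.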

COVID-19 interventions (infections and deaths model)
In this case study, we evaluate the prior and likelihood sensitivity in a model of deaths from the COVID-19 pandemic (Flaxman et al., 2020). The Stan code and data for this model are available from PosteriorDB (Magnusson et al., 2021). We focus on the effects of power-scaling the priors on three parameters of the model: τ, ϕ and κ. Due to the complexity of the model, we separately power-scale each prior to determine their individual effects. We evaluate the sensitivity of predictions (expected number of deaths due to COVID-19) in 14 countries over 100 days. For a full description of the model, see Flaxman et al. (2020). The parts of the model on which we focus are as follows: the prior on ϕ, which partially controls the variance of the negative binomial likelihood of observed daily deaths D_{t,m}, modelled from the expected deaths due to the virus d_{t,m} for a given day t and country m, where d_{t,m} is a function of R_{0,m} and c_{1,m}, ..., c_{6,m} (among other parameters); the prior on κ, which controls the variance of the baseline reproductive number R_0 of the virus for each country m: R_{0,m} ∼ normal⁺(3.28, κ), κ ∼ normal⁺(0, 0.5); and the prior on τ, which affects the number of seed infections (infections in the six days following the beginning of the seed period, which is defined as the 30 days before a country observes a total of ten or more deaths): c_{1,m}, ..., c_{6,m} ∼ exponential(1/τ), τ ∼ exponential(0.03).
Here we focus on a subset of four countries, but results for all 14 countries are presented in Appendix F. The results shown in Figure 13 show that there is likelihood sensitivity throughout the time period, indicating the data is informative, as seen in Figure 14. Furthermore, there is clear sensitivity to the κ prior, and some sensitivity to the τ prior. This is most pronounced in the predictions of deaths from day 30 to 70, shortly after the first major governmental interventions. Sensitivity is particularly high for the predictions for Germany. Following this up by plotting the sensitivity of predictions on day 50 in Germany (Figure 15), there is an indication that the prior is in conflict with the data, as the mean is shifted in opposite directions by prior and likelihood scaling.
These results are an indication that the chosen prior on κ may be informative and in conflict with the data, and the justification for this prior should be carefully considered. As the prior on R_0 for each country is centred around a specific value, 3.28, based on previous literature (Liu et al., 2020), some sensitivity to the prior on κ may be expected; however, the finding that it may be in conflict with the data is nevertheless important and may warrant further attention. This is an example of how a more complex model can be checked for prior and likelihood sensitivity by selectively perturbing priors and focusing on predictions.
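Separately power-scaling one prior amounts to building the importance weights from the log density of that prior alone, leaving the other priors and the likelihood at α = 1. The Python sketch below illustrates this under the assumption that per-draw log densities of each prior component are available; the function name and the dictionary layout are hypothetical, and no Pareto smoothing is applied.

```python
import math


def selective_powerscale_weights(log_prior_components, selected, alpha):
    """Self-normalised weights when only the prior named `selected` is raised to the power alpha.

    log_prior_components maps a prior name to a list of per-draw log densities.
    """
    log_w = [(alpha - 1.0) * lp for lp in log_prior_components[selected]]
    max_log_w = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - max_log_w) for v in log_w]
    total = sum(w)
    return [v / total for v in w]
```

Repeating the computation with a different `selected` name gives the individual effect of each prior from a single set of posterior draws.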
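The directional check used here, where the mean moves in opposite directions under prior and likelihood scaling, can be written as a short helper. This Python sketch is illustrative only: the function name is hypothetical, and the comparison against twice the Monte Carlo standard error is an assumption modelled on the ±2 MCSE guides shown in the quantity plots.

```python
def opposite_mean_shifts(prior_means, lik_means, mcse=0.0):
    """Check whether strengthening the prior and the likelihood moves the mean in opposite directions.

    Each argument is a pair (mean at alpha < 1, mean at alpha > 1).
    Shifts no larger than 2 * mcse are treated as not meaningful.
    """
    prior_shift = prior_means[1] - prior_means[0]
    lik_shift = lik_means[1] - lik_means[0]
    meaningful = abs(prior_shift) > 2 * mcse and abs(lik_shift) > 2 * mcse
    return meaningful and prior_shift * lik_shift < 0
```

With hypothetical values such as `opposite_mean_shifts((-1.60, -1.30), (-1.25, -1.65), mcse=0.02)` the function returns `True`, corresponding to the 'X' pattern that suggests prior-data conflict.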

Discussion
We have introduced an approach and corresponding workflow for prior and likelihood sensitivity analysis using power-scaling perturbations of the prior and likelihood. The proposed approach is computationally efficient and applicable to a wide range of models with minor changes to existing model code. This will allow automated prior sensitivity diagnostics for probabilistic programming languages such as Stan and PyMC, and higher-level interfaces like brms, rstanarm and bambi, and make the use of default priors safer as potential problems can be detected and warnings presented to users. The approach can also be used to identify which priors may need more careful specification. The use of PSIS and IWMM ensures that the approach is reliable while being computationally efficient. These properties were demonstrated in several simulated examples and case studies of real data, and our sensitivity analysis workflow easily fits into a larger Bayesian workflow involving model checking and model iteration.
Rather than fixing the power-scaling α values, it could be possible to include the α parameters in the model and place hyperpriors on them. However, this naturally complicates the model by adding additional levels of hierarchy. In addition, the question of sensitivity to the choice of hyperprior would then be raised, which may require further sensitivity analysis, or additional levels of hierarchy, the parameters of which become less and less informed by the data (Goel & DeGroot, 1981). Instead, the power-scaling sensitivity approach can be seen as a controlled method for automatically comparing alternative priors, which are interpretable by the modeller.
We have demonstrated checking the presence of sensitivity based on the derivative of the cumulative Jensen-Shannon distance between the base and perturbed posteriors with respect to the power-scaling factor. While this is a useful diagnostic, power-scaling sensitivity analysis is a general approach with multiple valid variants. Future work could include further developing quantity-based sensitivity to identify meaningful changes in quantities and predictions with respect to power-scaling, and working towards automated guidance on safe model adjustment after sensitivity has been detected and diagnosed. Other extensions include developing additional perturbations that affect different aspects of distributions, such as inducing a mean shift via exponential tilting (Siegmund, 1976). Additionally, it is possible to use the same framework to investigate the influence of specific observations, by power-scaling the likelihood contribution of a single observation or a subset of observations. Finally, we emphasise that the presence of prior sensitivity or the absence of likelihood sensitivity are not issues in and of themselves. Rather, context and intention of the model builder need to be taken into account. We suggest that the model builder pay particular attention when the pattern of sensitivity is unexpected or surprising, as this may indicate the model is not behaving as anticipated. We again emphasise that the approach should be coupled with thoughtful consideration of the model specification and not be used for repeated tuning of the priors until diagnostic warnings disappear.
Figure S3: Prediction plot for the adjusted model with the data superimposed. Shown in the plot are the mean, 50% and 95% credible intervals for the posterior predictions. The predictions capture the raw data well, indicating that we have arrived at a reasonable model.

Figure 1: Example of our power-scaling sensitivity approach. Here, the prior is power-scaled, and the effect on the posterior is shown. In this case the prior is normal(0, 2.5) and the likelihood is equivalent to normal(5, 1). Power-scaling the prior by different α values (in this case 0.5 and 2.0) shifts the posterior (shaded for emphasis), indicating prior sensitivity.

Figure 6: Relationship between data separability and sensitivity in the separation simulation. n_complete is the minimum number of observations that need to be removed to result in complete separation. Each point represents the mean over the data realisations for which the n_complete values were equal.

Figure 7: Relation between scaling the covariates and the sensitivity. Each point represents the mean of 100 model fits (using different data realisations).

Figure 8: Power-scaling diagnostic plot of marginal ECDFs for posterior β_wrist in the body fat case study. (Left) Original prior; there is both prior and likelihood sensitivity, as the ECDFs are not overlapping. (Right) Adjusted prior; there is now no prior sensitivity, as the ECDFs are overlapping, whereas there is still likelihood sensitivity.

Figure 9: Posterior quantities of β_wrist as a function of power-scaling for the body fat case study. With this plot, we can compare the effect of prior and likelihood power-scaling on specific quantities. Shown as dashed lines are ±2 Monte Carlo standard errors (MCSE) of the base posterior quantity, as guides to whether an observed change is meaningful. Top: original prior; the pattern of the change in the mean indicates prior-data conflict, as power-scaling the prior and likelihood have opposite directional effects on the posterior mean. Bottom: adjusted prior; there is no longer prior or likelihood sensitivity for the mean, indicating no prior-data conflict. Likelihood sensitivity for the posterior standard deviation remains, indicating that the likelihood is informative.

Figure 10: Priors for the hyperparameter τ in the bacteria case study. Priors considered reasonable for this application are shown on the left while priors considered unreasonable are shown on the right.

Figure 12: The priors and corresponding posteriors for the model fit (R²) in the US Crime case study.

Figure 13: Likelihood sensitivity of posterior predictions (expected deaths due to COVID-19) for four countries. Vertical lines indicate the onset of major governmental intervention. The dotted lines indicate the sensitivity threshold of 0.05, above which we consider sensitivity to be present.

Figure S2: Marginal posteriors for the body fat case study. Points show means, thick and thin lines correspond to 50% and 95% credible intervals.

Table 1: Forms of power-scaled distributions for common distributions.

Table 2: The interplay between prior sensitivity and likelihood sensitivity can be used to diagnose the cause.

Table 3: Sensitivity diagnostic values for the body fat case study.

Table 4: Sensitivity diagnostic values for the bank notes case study.

Table 5: Prior and likelihood sensitivity in the motorcycle crash case study using the original prior.

Table 6: Power-scaling sensitivity and predictive model performance for prior specifications in the US crime case study.

Table 7: Sensitivity diagnostic values for the bacteria case study.