Recommended conventions for reporting results from direct dark matter searches

The field of dark matter detection is a highly visible and highly competitive one. In this paper, we propose recommendations for presenting dark matter direct detection results particularly suited for weak-scale dark matter searches, although we believe the spirit of the recommendations can apply more broadly to searches for other dark matter candidates, such as very light dark matter or axions. To translate experimental data into a final published result, direct detection collaborations must make a series of choices in their analysis, ranging from how to model astrophysical parameters to how to make statistical inferences based on observed data. While many collaborations follow a standard set of recommendations in some areas, for example the expected flux of dark matter particles (to a large degree based on a paper from Lewin and Smith in 1995), in other areas, particularly in statistical inference, they have taken different approaches, often from result to result by the same collaboration. We set out a number of recommendations on how to apply the now commonly used Profile Likelihood Ratio method to direct detection data. In addition, updated recommendations for the Standard Halo Model astrophysical parameters and relevant neutrino fluxes are provided. The authors of this note include members of the DAMIC, DarkSide, DARWIN, DEAP, LZ, NEWS-G, PandaX, PICO, SBC, SENSEI, SuperCDMS, and XENON collaborations, and these collaborations provided input to the recommendations laid out here. Wide-spread adoption of these recommendations will make it easier to compare and combine future dark matter results.


Introduction and Purpose of this Paper
The nature of dark matter (DM) is one of the highestpriority topics in high energy particle physics. Many collaborations around the world are building exquisitely sensitive detectors to search for dark matter particles, often in direct competition with each other, and in the future, collaborations may wish to combine data from complementary targets to draw even stronger conclusions about dark matter models, especially in light of neutrino backgrounds [1] and model uncertainties [2].
In going from data to a final dark matter result, or even in projecting the potential sensitivity of a proposed experiment, direct detection collaborations make a series of choices, ranging from how to model the dark matter halo in the Milky Way to which test statistic to use to perform statistical inference. Different approaches can lead to significant differences in the interpretation of a result even if the underlying data are the same, complicating comparisons and combinations of results. In a recent example, the LUX collaboration deployed a power constrained limit [3] (discussed in Sec. 2.2.1) for their dark matter limits [4,5], but chose a different power threshold in the two results; making the same choice in Ref. [4] as in Ref. [5] would have changed the resulting limit by a factor of ∼2. Similarly, the XENON1T collaboration presented a first result by approximating their likelihood ratio with an asymptotic distribution [6], an approximation that led incorrectly to a ∼50% more sensitive result. For their second science run, XENON1T corrected this treatment [7].
Background modeling is another area where collaborations make choices with potentially significant implications on inferred results. While many backgrounds are unique to each detector, there are some elements that are shared by all direct detection experiments, such as those induced by astrophysical neutrinos. To model solar or atmospheric neutrino backgrounds, collaborations rely on external data, with varying possible interpretations of the rates in dark matter detectors. As direct detection experiments increase in exposure, measurements of these astrophysical neutrino fluxes will be among the primary determinants of sensitivity [8].
Dating back to the paper of Lewin and Smith [9], dark matter collaborations have mostly (but not entirely) used similar assumptions about the phase-space distribution of dark matter. However, the community has not converged on a similar consensus regarding how to make statistical inferences from direct detection data. In this paper, we lay out recommendations for statistical methods aimed primarily at the Profile Likelihood Ratio (PLR) method, now commonly used in searches for weak-scale dark matter candidates, although some of these recommendations do apply more generally. We recognize that not all analyses lend themselves to the PLR and we hope that collaborations will follow the spirit of these recommendations when applicable. We take the opportunity to make updated recommendations for modeling the distribution of dark matter in our galaxy, as well as to discuss neutrino backgrounds that will be observed by many experiments in the near future.
This effort grew out of a Phystat-DM workshop [10] held in Stockholm, Sweden, in August 2019, under the umbrella of the Phystat conference series. The authors of this note include members of the DAMIC, Dark-Side, DARWIN, DEAP, LZ, NEWS-G, PandaX, PICO, SBC, SENSEI, SuperCDMS, and XENON collaborations, and these collaborations provided input to the recommendations laid out here. Our approach is similar in spirit to that of the ATLAS and CMS experiments in the period prior to the discovery of the Higgs, when the two collaborations agreed in advance on what statistical treatment to use in combining Higgs data sets [11], although we make different recommendations that we feel are more appropriate for our application.
In writing this white paper, we recognize the large influence of chance when analysing dark matter data; due to the low backgrounds, the expected statistical fluctuations for direct detection upper limits are around twice as large as those in the Higgs discovery [12,13]. Nevertheless, settling on common standards will enable more accurate comparisons of projections and results from different experiments and technologies as well as statistical combinations across collaborations. If, as we hope will be the case, this work is used as a reference in future dark matter publications, the underlying works on which our recommendations are based should also be cited. In addition to the specific recommendations given here, we suggest that collaborations continue to communicate with each other on these topics and adapt as necessary when new results are released.
The paper is organized as follows: Section 2 discusses Profile Likelihood Ratio analyses, Section 3 discusses astrophysical models, with Section 3.1 focusing on the dark matter halo distribution, summarized in Table 1, and Section 3.2 focusing on astrophysical neutrinos, summarized in Table 4. An overall summary of our recommendations is provided in Section 4.

Profile Likelihood Ratio Analyses
Frequentist hypothesis testing has traditionally been the preferred method in direct dark matter searches to place constraints on regions of parameter space. Our recommendations are developed for analyses deploying the profile likelihood ratio (PLR) method [14][15][16], although some can be applied more generally. Using a likelihood-based test statistic like the PLR has the advantage that experimental uncertainties can conveniently be accounted for by parameterizing them as nuisance parameters. The PLR method has been described in great detail elsewhere, and we follow the discussion and notation of Ref. [15]. We strongly recommend readers to review Sec. 2 of Ref. [15] and Ref. [16] as we do not attempt to cover the subject fully here.
For a set of parameters of interest, µ, and a collection of nuisance parameters, θ, the profile likelihood ratio is defined as with L as the likelihood function. The maximum of L is found atμ andθ, the maximum likelihood (ML) estimators of the parameters, while the maximum for a given µ is found atθ. By construction, the parameter λ(µ) is constrained between 0 and 1, and values of λ close to 1 are indicative of a good agreement between the data and the hypothesized value of µ. Direct dark matter searches most often take the hypothesis under test to be a signal model (generally the signal strength or cross section σ) at a single dark matter mass, M , and then 2D curves are constructed by computing significance and confidence intervals for each mass separately. In this strategy, known as a "raster scan", only a single parameter of interest is constrained and therefore, µ = µ (as the name suggests, the procedure is typically repeated for different, fixed values of other signal parameters such as particle mass). An alternative 2D approach would be to constrain σ and M at the same time. As discussed in Ref. [17], the raster scan looks for the best region of σ at each mass M separately, while the 2D approach searches for a region around optimal values of σ and M . For the reasons laid out in Ref. [17] and in keeping with convention to date, we advocate following the raster scan approach in most of what follows, but we return to this question in Sec. 2.4.
Given λ(µ), one can define which is distributed between 0 and infinity. As originally shown in Ref. [18], Wilks' theorem states that the distribution of t µ approaches a chi-square distribution in the asymptotic limit of infinite data. Several conditions must be fulfilled for Wilks' theorem to hold, including that the true value of all parameters should be in the interior of the parameter space, that the sample should be sufficiently large, that no parameter may vary without the model changing, and that the hypotheses are nested [19]. The level of disagreement between the observed data and the hypothesis under test (a given value of µ) is usually quantified via the p-value. This corresponds to the probability for getting a value of t µ for a given µ as large, or larger, than the one observed: where f (t µ |µ) is the probability density function for t µ .
In the case of dark matter, the sought-after signal can only increase the data count (i.e. µ is defined strictly positive). One can modify Eq. 2 to becomẽ which takes into account that forμ < 0, the maximum likelihood estimator will always be µ = 0. Note that here we follow the prescription in Ref. [15], which treatŝ µ as an effective estimator that can take negative values even if the condition µ ≥ 0 is required by the physical model.

Discovery
The primary objective for direct detection experiments is to search for the presence of new signal processes. In this case, the background-only null hypothesis, H 0 , with µ = 0, is the crucial hypothesis to test. With signals expected to lead to an excess of events over the background, a special case of the test statistic in Eq. 4 evaluated at µ = 0,t 0 (also called q 0 in Ref. [15]), should be used to assess the compatibility between the data and H 0 : The level of disagreement with the background-only hypothesis is computed as where f (t 0 |0) is the probability distribution oft 0 under the assumption of the background-only hypothesis, µ = 0. The background-only hypothesis is rejected if p 0 falls below a predefined value, indicating that the data are not compatible with the no-signal hypothesis.

Discovery claims
As is conventional in particle physics, this p-value can be expressed as a discovery significance in terms of the number of standard deviations Z above the mean of a Gaussian that would result in an upper-tail probability equal to p 0 , Φ −1 (1 − p 0 ), where Φ −1 is the inverse of the cumulative distribution function of the normal Gaussian distribution. In this formulation, a 3σ significance corresponds to a p-value of p 3σ = 1.4 × 10 −3 and a 5σ significance to a probability of p 5σ = 2.9 × 10 −7 . Following the convention in particle physics, we recommend that a global p-value smaller than p 3σ is required for a claim of "evidence." Section 2.1.2 details the difference between the global p-value, which takes into account the effect of searching for several signals, and the local, p 0 , which is computed only with reference to a fixed signal model. We recommend always reporting the smallest observed p 0 regardless of the presence or absence of any claim. Lastly, we recommend making available in supplementary material or upon request a plot of p 0 as a function of particle mass. We do not make a recommendation regarding the level of significance needed to claim "discovery."

Look Elsewhere Effect
The "look elsewhere effect" (LEE) is a well-known phenomenon in searches for new physics [20] where, if testing the null hypothesis on the same set of data with respect to multiple alternatives, such as signal hypotheses featuring differing particle masses, the p-value needs to be corrected to account for the fact that a statistical fluctuation might be observed for any of the possible signal hypotheses 1 . Failing to account for the LEE can lead to an overestimate in the apparent significance of a result. The size of the effect can be quantified by calculating the trial factor, the ratio of the p-value for observing an excess in a particular region to the global p-value for observing an excess for any of the signalhypotheses. The size of this effect depends on the number of alternative models to the null hypothesis tested and the ability of the analysis to distinguish between them -high-resolution peak searches will feature a large trial factor, while a counting experiment that cannot distinguish what signal model produces an excess will have a trial factor of 1. Therefore, the class of hypotheses considered when applying this correction should be included in reporting results.
The LEE has not historically been evaluated in direct dark matter searches, except for a recent XENON1T 1 Technically, the condition is that one or more parameters of the signal hypothesis are degenerate under the null hypothesis publication [21]. To test the necessity of the LEE for dark matter searches, we follow the prescription of Ref [22]. Toy Monte-Carlos (MC) are generated and the discovery significance for every candidate mass M is computed for each data set. The smallest local p-value for each toy data set, p = min M (p 0 (M )), is recorded to estimate a "probability distribution of smallest local pvalues," called f (p). The global significance p global data for an observed excess with the smallest p-value, p data , is then: Figure 1 shows the LEE evaluated for various types of searches in a simplified model of a LXe-TPC, demonstrating that the LEE can be significant when considering a range of dark matter masses and detector resolution typical for LXe-TPC searches. For searches restricted to masses above about 40 GeV/c 2 , the LEE is less important as the predicted recoil spectra are almost degenerate in the observable space. A search for monoenergetic peaks (such as an axion search), which effectively scans a range of statistically independent regions, leads to global p-values that are an order of magnitude greater than the minimum local p-value. Given the large computational cost associated with calculating the LEE, we propose that it be accounted for only if a local excess approaching or exceeding 3σ is observed. If the computation needed to reach the relevant significance is unfeasible, alternative methods may be deployed if they can be shown to correctly account for the size of the effect (see for instance Ref. [20]).

Limit Setting
Confidence intervals and upper limits may be constructed via repeated hypothesis testing of a range of µ values, setting the endpoints of the interval (which may be oneor two-sided) at the critical point such that the p-value is equal to a predetermined value α, also called "the size of the test," or equivalently that the confidence level (CL) is 1 − α. Deciding which possible observations should be included in the confidence band for a certain true parameter value is referred to as choosing an "ordering parameter," a test statistic with which to compute p-values that are used to define the confidence interval. This test statistic may or may not be the same test statistic that is used to compute discovery significance. Using the log-likelihood ratio as the test statistic to define the confidence interval yields the "unified" or Feldman-Cousins intervals [23], or, if there are nuisance parameters that are profiled over, the "profile construction" [16,24]. In this way, for a two-sided  [21,22]. A search for spinindependent recoils with WIMP masses between 50 GeV/c 2 and 1000 GeV/c 2 shows a negligible trial effect, as the spectra are almost degenerate. A search for spin-independent WIMPs with masses from 4 GeV/c 2 to 1000 GeV/c 2 , representing a typical LXe-TPC mass range is shown in blue. The green line shows the trial factor for a search for 29 monoenergetic nuclear recoil peaks between 1 keV and 30 keV.
Here, the interval endpoints µ 1 and µ 2 are random variables that depend on the experiment, with µ true the true, unknown value of the parameter of interest. A confidence interval method that fulfills Eq. 8 for all possible signal hypotheses is said to have coverage -a fraction (1 − α) of the confidence intervals would contain the true value over repeated experiments. For direct detection of dark matter, α is most often 0.1, leading to 90% confidence levels, although a value of 0.05 (95% CL) is sometimes used.
The two-sided test statistic most often used in dark matter limit setting ist µ of Eq. 4. An alternative onesided test statistic is: where λ(µ) is the profile likelihood ratio defined in Eq. 1.
With this definition, only the caseμ ≤ µ is regarded as incompatible with the null hypothesis (see Ref. [15] for more details). Comparison of the upper limit obtained using a onesided (orange) and a two-sided (black) test statistic, respectively, for the same data set. The relative difference is indicated in the lower panel, and for this study, it can can be as large as 20% for some masses.
It is important to note that the choice of one-sided or two-sided test statistic can change the inferred result by a significant fraction for the same data set, as shown in Fig. 2. Different direct dark matter experiments have used either the one-or two-sided PLR test statistic in their science papers (see, for instance, Refs. [21,25,26]). Here, we recommend the two-sided construction of Eq. 4. This decision is motivated by the desire to use the same test statistic for limit setting as for discovery (recall that q 0 of Eq. 5 is a limiting case of Eq. 4), with the only difference being the size of the test. If an excess is observed, the two-sided interval will naturally "lift off" from a value of µ = 0, rejecting cases where µ is too small, while results compatible with the background still yield an upper limit. Using a single, unified Neyman construction [23] that provides both these results as a function of the data avoids the potential of an experiment flip-flopping between several constructions 2 . Equation 4 corresponds to the profile construction described by the Particle Data Group [16], and, in the absence of nuisance parameters, is equivalent to that of Feldman&Cousins [15,23]. The cost of choosing the two-sided construction is a marginally weaker upper limit (see Fig. 2). We argue that this is acceptable if the recommendation is widely adopted among dark matter collaborations, as no "unfair" advantage in the apparent limit can be gained by switching from two-sided to one-sided. We also note that assessing the viability of a particular physics model in light of a published upper limit is subject to hidden uncertainties that dominate the difference between the two test statistics; in any case, such assessments should always be undertaken with caution.
We recommend the use of MC techniques to construct the test-statistic distributions (see Sec. 2.3), as opposed to assuming that these distributions follow an asymptotic approximation. We also recommend performing coverage checks to show that the actual coverage of the hypothesis test is similar to the nominal confidence, including if the true values of nuisance parameters differ from those assumed in the construction of the confidence interval; in the presence of nuisance parameters, coverage is not guaranteed, but practice has shown that it generally provides correct coverage. In [21], the coverage was checked with MC simulations assuming a different true nuisance parameter value than that assumed for the profile construction, investigating the robustness of the method to errors in the estimated nuisance parameters.
Because limits are commonly set at 90% CL, in the two-sided construction it is not unlikely that a data set will result in a non-zero lower limit on the parameter of interest despite not satisfying the requirement that the statistical significance is at least 3σ to claim evidence of a positive signal. This is a natural consequence of frequentist hypothesis testing. As an example of such a case, the top panel of Figure 3 shows 90% CL upper and lower limits from a hypothetical background-only experiment. Because α = 0.10, the backgrounds will fluctuate to give a lower bound in 10% of cases. The lower panel presents the p-value versus WIMP mass, to show that these data do not approach a 3σ significance.
For a case like this, we recommend that collaborations should decide in advance on a significance threshold for reporting of a lower limit; for example, in the recent XENON1T publication [7], if a result was less than 3σ significant, no lower limit would be shown. As stated previously, we recommend publication of the smallest observed p-value for the background-only hypothesis in addition to an upper limit in all cases, even if that pvalue is not significant. We also recommend collaborations publish the expected sensitivity of a result by showing a median expected limit with an uncertainty band (often called the "Brazil band").

Cases with limited power
Sometimes confidence interval constructions may yield upper limits corresponding to signals much smaller than the ones to which the detector has any appreciable sensitivity or discovery power 3 . In the case of an upper- limit only construction, this is purely an effect of the requirement to not cover even arbitrarily small signals a fraction α of the time. As an example, Figure 4 shows in gray an expected distribution of upper limits from the XENON1T experiment, with the distribution of upper limits extending to signal expectations of less than 2 events due to downward fluctuations of the background.
A number of alternatives have been developed to address this concern [3, [27][28][29], and LUX [26], PandaX-II [25], and XENON1T [7] have all at times applied the "power-constraint limit" (PCL) of Ref. [3] to their upper limits, while the LHC community settled on the CLs construction of Refs. [27,28]. Either of these constructions will cause overcoverage at very low signals, illustrated for example in Fig. 11 of [21].
Here, we recommend applying the power constraint to limits obtained following Sec. 2.2. We choose the PCL over the alternatives for its conceptual simplicity and because the CLs overcovers to higher quantiles. The principle behind power-constrained limits (PCL) is to use the power of the experiment, π(µ), defined as the probability of rejecting the null, background-only hypothesis in the presence of a signal, as the metric to decide on a smallest signal that an experiment will exclude. Setting a minimal discovery power, π crit = π(µ crit ), gives a minimal signal µ crit . Upper limits that would otherwise fall below this threshold are then set to µ crit .
The publication of Ref. [3] led to vigorous discussion on potential limitations of PCL in the literature and at various Phystat workshops, for example Ref. [29]. A significant concern was whether increasing systematic uncertainties could lead to more stringent limits for certain choices of π crit away from the median. For our purposes, an increase in systematic uncertainty not only widens the expected sensitivity bands of the limit but also raises (makes less sensitive) the median limit. The result is that both the median limit and, for example, the −1σ band move up when a systematic is increased, and conservatism is maintained. For this reason, we feel comfortable moving forward with the PCL. For an upper-limit-only construction, the PCL method gives a coverage of 1 for µ < µ crit , and 1 − α otherwise, where α is the size of the test. Reference [3] includes an example using a Gaussian measurement, and shows that an intuitive rule of thumb exists in this case; for α = 0.05, and a power threshold of π crit = Φ −1 (−1σ) = 0.159, µ crit corresponds to the −1σ line of the sensitivity plot.
The power threshold used in the power constraint is a fiducial choice in the analysis. A more conservative analysis might choose a higher threshold, such as the first LUX analysis [4], which demanded π crit = 0.5. However, given the large random variation in results of rare event searches (about a factor 2 around the median upper limit, see the difference between the median and 1σ lines in green in Figure 4), this choice would somewhat arbitrarily limit the ability of experiments to constrain a considerable swath of parameter space. The most recent publications by LUX, PandaX-II and XENON1T used a power threshold of π crit = Φ −1 (−1σ) = 0.159 to maximize sensitivity while preserving the original purpose of the power constraint.
For unified intervals, the intuitive properties of onesided intervals are not exactly retained. For a Gaussian, with mean bounded to be non-negative with α = 0.1, the power that corresponds to the 15.9th percentile of upper limits is π crit = 0.32, as shown in Fig. 5. The overcoverage varies with the signal, but does not exceed two percentage points for the Gaussian case. We therefore recommend using π crit = 0.32 when using PCL in conjunction with the test statistic of Eq. 4.

Asymptotic approximations
Asymptotic formulae for test statistic distributions exist in the limit of infinite data [15], and using the asymptotic approximation is a reasonable decision to save on computing time. In many cases, the approach to the asymptotic limit can be swift; for example, a counting experiment will reasonably approach the asymptotic result even for moderate expectation values (∼5 events for α = 0.1). However, given the large background discrimination power in direct detection experiments, even results with hundreds of events may not converge to the asymptotic case because the expectation value in the signal region after discrimination is O(1) or less. Figure 6 shows the distribution of a test statistic (solid colors) compared to an asymptotic approximation (dashed black) as the signal size increases for a simplified but representative simulation of a 1000 GeV/c 2 dark matter search, similar to what is shown in Ref. [7]. For small numbers of signal events (darker colors), the asymptotic result poorly approximates the true test statistic distribution which is needed to compute discovery significances and confidence intervals.
If, as is often the case, toy MC simulations are used to estimate the distribution of the test statistic, a very significant result may require commensurately significant computational power to generate, for instance, the > 10 7 toy simulations needed to characterize a 5σ result. Nevertheless, we recommend that any usage of the asymptotic approximation be supported by MC studies to show its validity. In general, we recommend that sensitivity be calculated directly using adequate simulation, with the MC studies cross checked against uncertainty in the simulated values of the nuisance parameters. If a set procedure for this computation is in place, the actual simulation may need only be performed in the case that a highly significant result is seen. In the absence of adequate computing power to do full MC studies, arguments must be presented to justify whatever alternative methods are deployed.

Contours in the Event of Discovery
In the discussion so far, we have assumed the hypothesis under test to be a signal model at a single dark matter mass. For excesses that approach discovery significance, however, collaborations may wish to perform parameter estimation of both the mass M and cross section σ in a vector-like parameter of interest µ to form a 2D confidence contour. In such a case, we do recommend that collaborations set a significance threshold including the LEE effect before completing the analysis and removing any bias mitigation steps (see Sec. 2.6) to determine whether a mass-cross-section contour should be included in a publication. The pre-determined threshold should be set high enough that flip-flopping between the per-mass cross sections and the 2D contour would introduce minimal bias (most likely satisfied by the requirement for a significant excess in the first place). Even if a 2D contour is reported, the per-mass confidence limit should still be included.

Modeling Backgrounds and Detector Response
One of the requirements of the PLR method is that the model of detector response and backgrounds is correct. Modelling these, and the validation thereof, is highly detector-specific and outside the scope of this paper. However, we believe it is essential for experiments to satisfactorily demonstrate goodness-of-fit for their background and detector models in order to properly utilize the methods presented here. This includes setting criteria for background model acceptance prior to an analysis, and clear presentation of those criteria in any eventual publication. One example of a goodness-of-fit criterion is the recent XENON publication, which required a background model p-value ≥ 0.05 in a validation region in order to search for DM and solar 8 B neutrino events in their data [30]. Whenever possible, models should be validated, both on calibration data or side-bands, and computing the goodness-of-fit of the best-fit model. The power of the goodness-of-fit test to detect impactful deviations from the assumed model should ideally be investigated. Uncertainties in the background model, when quantifiable, should be incorporated directly into the likelihood function as nuisance parameters and acknowledged as such in any publications.

Experimenter Bias Mitigation
Experimenter bias is an effect which can, in general, drive a reported, measured value of a quantity away from its true value. In this case, the choices that the analyzer makes regarding cuts and cut thresholds, analysis methods, and when to stop searching for errors in an analysis, are influenced by the quantitative result of the analysis. Numerous examples in the historical physics literature have been identified in which new measurements of a physical quantity appear to be scattered around previous measurements, instead of being scattered around what we currently accept as the true values of those quantities [31].
Methods to control for experimenter bias share a common approach: all choices an analyzer makes are taken without the analyzer knowing what effect those choices have on the final result. In the case of DM experiments, four approaches have been employed, listed below. We make no specific recommendations regarding bias mitigation, and leave such choices to the authors of a given result.
• Signal blinding: A plot is generally made in which observed events fall into various regions of parameter space characterized as more or less signal-like. Often, DM experiments plot an electronic/nuclear recoil discriminant versus energy, and the low-energy "nuclear-recoil-like" area of the plot is considered to be the signal region. In signal blinding, this region is masked for science data, but not for calibration data. Only after all details of the analysis are frozen is this mask removed. In this way, analysis details cannot be tuned based on the number of DM-like events that were observed. The benefit of this type of bias mitigation strategy is that it is robust and simple to implement. The drawback is that rare backgrounds might exist in the data which will not be discovered until after the mask is removed. Many examples of DM searches using signal blinding exist in the literature, including Refs. [7,[32][33][34][35][36]. • Veto blinding: A rare-event search such as a DM experiment will often entail the use of veto signals, which can identify when an event definitely does not result from the process under study. Examples of such veto signals are ones which can tag cosmic rays in nearby materials, or acoustic sensors which can tag alpha decays in bubble chambers. If such a veto signal uniquely identifies background signals, one can choose to blind analyzers to that signal until all analysis details are finalized. This provides analyzers a view of the signal region, but they are not able to know which events are signal and which are not. The benefit of this type of approach is that analyzers have the opportunity to discover rare backgrounds because the signal region is viewable. The drawback is that the background signals vetoed by such a tag may often not look quite like true signals, and therefore this technique may not be viable for some experiments. Examples of this technique in use can be found in Refs. [37,38]. • Salting: An approach similar to that of veto blinding, salting is a technique where fake signal events are injected into the data stream. Analyzers may explore the signal region, but the identity, quantity, and distribution of these fake, injected events are kept blind to the analyzers. The identities of the fake events are revealed only after the details of the analysis are finalized. In this way, like veto blinding, this technique provides benefit of allowing the analyzer to identify rare backgrounds while being ignorant of the effect that analysis details have on the signal result. The drawback to this approach is that it can be difficult to generate a collection of fake signal events that are believable. The LIGO experiment has been able to inject fake gravitational waves by the use of hardware actuators [39]; the LUX experiment constructed fake signal events from a sequestered calibration data set [26]. • Data Scrambling: An experiment may randomly smear data so that data in a control region and data in the signal region are randomly mixed. As an example, Antares introduced a random time-offset to each event when searching for neutrinos from dark matter in the Sun [40]-without removing this offset, it was impossible to determine if an event came from the Sun or another location on the sky. Similarly to salting, this allows all real events to be scrutinised before unblinding.
While one may, and often should, take steps to control for experimenter bias, it is important to note that this is not the only effect which can adversely influence the results of an analysis. A holistic view, in which all systematic features are considered, is warranted. hypothesis. Their galactic-frame velocity distribution, f ( v gal ), is usually assumed to be an isotropic Maxwell-Boltzmann distribution whose velocity dispersion σ 0 is defined by the local standard of rest at the location of the Sun, | v 0 | = √ 2σ 0 , the Sun's peculiar velocity, v , and the Earth's velocity relative to the Sun, v . Requiring that dark matter be gravitationally bound in the galaxy imposes an additional cut-off at the galactic escape speed, v esc . These assumptions result in a galactic-frame velocity distribution, where ρ χ and m χ are the local WIMP density and the WIMP mass, respectively, v lab is the lab-frame WIMP velocity, and Θ(x) is the Heaviside step function. This "Standard Halo Model" (SHM) speed distribution is illustrated in the lab-frame in Figure 7.
The time-dependent velocity of the Earth relative to the Sun is calculated in Refs. [41,42]. Defining the velocity vector as (v r , v φ , v θ ), with r pointing radially inward and φ in the direction of the Milky Way's rotation, this can be written as, where ω = 0.0172 d −1 and ∆t is the number of days since March 22, 2018 (an arbitrary date, and the choice of year has little effect). The average speed of the Earth is | v | , given in Table 1.
The variation of the lab-frame WIMP speed due to the time evolution of v (t) is illustrated in Figure 7. For most analyses not looking for annular modulation effects, it is sufficient to use the distribution averaged over the full year, which comes out to be approximately equivalent to the distribution evaluated at March 9, v (March 9) = (29.2, −0.1, 5.9) km/s. (12) With the exception of m χ , the parameters in Eq. 10 constitute the SHM astrophysical parameters. Since the model used to describe the flux of WIMPs influences the exclusion curves that are drawn, a unified approach to excluding WIMPs requires a consistent treatment of these parameters. Recommended values for them are given in Table 1. The rationales for these parameter choices are discussed below.
Other authors have suggested updates to the SHM that differ from those presented here, including Refs. [43,44] among others. We recommend to report results with respect to the nominal halo model, which will give a common point of comparison, while knowledge of the dark matter halo continues to improve, unless new measurements significantly alter the expected spectra, particularly at high masses. Shape variations in the velocity distribution can have an appreciable effect for limits on WIMPs that produce signals near the energy threshold of an experiment, but is otherwise not expected to have a major effect. Changes that affect the signal normalization but not the shape of the signal distribution, such as variations in the local dark matter density, can be easily accounted for by scaling published limits.
The SHM WIMP speed distribution is illustrated in Figure 7, where the effects of varying the SHM parameters over the range of values motivated by galactic survey analyses are shown. In general, these effects tend to be comparable to or much smaller than the variation of the speed distribution over the course of a year. The effects of varying these parameters on XENON1T's limits [6] are explored in Ref. [49].
Recent observations call into question the adequacy of the SHM itself, as evidence for several kinematically distinctive substructures have emerged from studies of data released by the Gaia mission [50] and the Sloan Digital Sky Survey (SDSS) [51]. These substructures are likely the result of the Milky Way's formation history and merger events with other galaxies, and may include the Gaia Sausage (or Gaia Enceladus) [44, 52- [2,62]. Additionally, N -body simulations of the Large Magellanic Cloud indicate that its passage through the Milky Way could have produced a significant fraction of the local dark matter above the galactic escape speed [63]. Due to these effects, quoted uncertainties on the SHM parameters do not accurately reflect the uncertainties in the dark matter halo, nor do they represent likelihood distributions that can be meaningfully profiled over. The authors of this document therefore suggest that these parameters be fixed to clearly stated values, so that they can be reinterpreted under varying halo models. We note that this is the approach followed by most collaborations in the field over the last decade.
Most of the values suggested in Table 1 are consistent with those already in common use for WIMP direct detection experiments [5,64,65]. The most significant change suggested here is an updated value of v 0 . We emphasize here again that if these parameter values are adopted, the relevant references should always be cited.

Local dark matter density: ρ χ
Values for ρ χ vary significantly between different measurements, typically in the range 0.2 to 0.6 GeV/c 2 /cm 3 . The range of past and proposed measurements are best described in Refs. [66,67]. This parameter normalizes the overall flux, but does not affect the predicted velocity distribution or the resulting WIMP-nucleon recoil spectra; as such, the total number of WIMP events expected in a direct detection experiment scales directly with ρ χ , and the net effect of changing its value is to linearly scale exclusion curves with the same factor by which ρ χ changed. Interpreting current limits in terms of different values of this parameter is therefore trivial, and the recommended value is the one most commonly used in direct detection experiments, as suggested by Ref. [9].

Galactic escape speed: v esc
The galactic escape speed was measured by the RAVE survey [68] and later improved with the additions of SDSS [51] and Gaia [50] data. Measurements of v esc are summarized in Table 2. While some recent measurements seem to be trending towards somewhat lower values of v esc , the values in Table 2 are broadly consistent with each other and with a value around 550 km/s. This value is also consistent with the value estimated in Ref. [69], using the Gaia DR2 dataset. As such, the recommendation put forth in this document is to use v esc = 544 km/s to maintain consistency with assumptions used for existing WIMP-nucleon cross section limits.

Average galactocentric Earth speed: | v |
In this document, we support the use of as suggested in Ref. [41], along with the accompanying time-evolving definition of v (t) approximately summarized in Eq. 11. This value of | v | is consistent with the one suggested in Ref. [9].

Solar peculiar velocity: v
The Sun's peculiar velocity was determined in [46], by fitting data from the Geneva-Copenhagen Survey [77]. Based on this analysis, the authors of Ref. [ with additional systematic uncertainties of (1, 2, 0.5) km/s. We support using this value in dark matter searches. We note that the velocity in the galactic plane is faster than had been reported by previous measurements, based on an analysis of the Hipparcos catalog [78], which reported a value of v = (10.00 ± 0.36, 5.25 ± 0.62, 7.17 ± 0.38) km/s [79]. The decision to support the more recent measurement over the older one is based on the arguments in Ref. [46]. Tab. 2: Reported values of galactic escape speed. The measurement reported in [70]* is a re-analysis of the data set using the same priors used in [71]. In Ref. [80], the proper motion of Sagittarius A * was measured to high precision, implying that the angular velocity of the Sun around the center of the galaxy is given by where v and v 0 give the components of v and v 0 in the galactic plane (the φ component), and R is the distance from the Sun to the galactic center. Uncertainties in most previous estimates of v 0 were driven by uncertainties in R . This distance was recently reported as R =8275 ± 9(stat.) ± 33(syst.) pc [48], implying | v + v 0 | =250.2 ± 1.4 km/s. Combined with measurements of the Sun's peculiar velocity, v , this velocity implies that the local standard of rest has a speed of 238.0 ± 1.5 km/s. We note that this velocity is consistent with the independently measured circular speed of 240 ± 8 km/s suggested in Ref. [81], and the value 229 ± 11 km/s in Ref. [82], albeit with smaller uncertainties. Uncertainties in v 0 are driven by uncertainties in the Sun's peculiar velocity.
Previous limits on WIMP-nucleon cross sections used a value of 220 km/s [5,7,65], as suggested by Refs. [83][84][85][86], which quote an uncertainty around ±20 km/s. We recommend updating this parameter to 238 km/s; while this new value is within the uncertainty of the old one, the choice of this parameter and its smaller uncertainty can have a material impact on dark matter searches [87].

Astrophysical Neutrinos
Astrophysical neutrinos are expected to be an important background for the next generation of direct detection experiments. There are several sources of neutrinos arriving at Earth [88], but not all of them are relevant for direct dark matter searches. In this section we outline the dominant neutrino background sources  Table 4. and make some recommendations that are pertinent to direct detection experiments. Figure 8 shows the neutrino fluxes that populate the relevant energy range for direct detection experiments. Low energy neutrinos from the pp and 8 Be solar reactions give rise to neutrino-electron scattering, which can become a prominent source of low energy electronic recoils. Nevertheless, the ultimate background might come from neutrino-induced nuclear recoils created by coherent neutrino-nucleus scattering, a process that has been recently confirmed experimentally by COHERENT [89]. For example, in a xenon target 8 B and hep solar neutrinos can mimic a WIMP signal with a mass of approximately 6 GeV/c 2 , while atmospheric neutrinos and neutrinos from the diffuse supernova neutrino background (DSNB) will mimic a WIMP signal for masses above 10 GeV/c 2 . Next, we describe each of these neutrino sources separately.

Solar neutrinos
Our current understanding of the processes happening inside the Sun is best summarised by the Standard Solar Model (SSM), which originated more than three decades ago [90]. According to the SSM, the Sun produces its energy by fusing protons into 4 He via the pp chain (∼99%) and the CNO cycle (∼1%). The SSM has been under constant revision since then, as more precise measurements and calculations of the solar surface composition and nuclear reaction rates become available. However, when modelling the solar interior based on a new generation of solar abundances [91], the recent SSMs have failed to reproduce helioseismology data [92][93][94][95]; this is the so-called "solar abundance problem".
The new generation of solar models are usually classified as high-Z and low-Z models, which reflect their different assumptions on the solar metallicity (Z). In the present work we adopt the most recent solar models developed in Ref. [96]. Figure 8 shows the main contributions from the pp chain and CNO cycle, and the overall normalization values are shown in Table 3 for a high-Z (B16-GS98) and a low-Z (B16-AGSS09met) model, respectively. There is also a neutrino component arising from electron capture on the 13 N, 15 O and 17 F nuclei [97,98], but their expected fluxes are very low. Since the CNO cycle has a strong dependence on the assumed metallicity, high-Z and low-Z models will predict different CNO neutrino fluxes. Also, a low-metallicity Sun will have a moderately cooler core, lowering the expected flux from the most temperature-sensitive neutrinos, such as those from the 8 B and 7 Be reactions [99,100]. Note that more precise measurements of the neutrino fluxes, by solar neutrino experiments or a nextgeneration liquid noble detector, will be crucial to resolve the solar abundance problem.
The photon luminosity of the Sun has been measured to a precision of less than 1% [101]. The solar output is distributed into photon and neutrino channels, which introduces a direct constraint to the neutrino fluxes based on the measurement of the photon luminosity, most commonly known as the "luminosity constraint" [102]. Since the pp and 7 Be reactions are dominant, their neutrino fluxes will also be dominant and the predicted uncertainty will be small to satisfy this constraint [103,104]. The CNO fluxes are also affected by this constraint, but on a smaller scale. For a recent discussion on this topic, see Ref. [105].
Experimental measurements are also indicated in the last column of Table 3. These measurements are not entirely model-independent, and correlations between CNO and pp chain neutrinos must be taken into account. There are two notable exceptions: the mea-surements of the 8 B and 7 Be neutrino fluxes. In the former case, the SNO experiment observed 8 B neutrinos via three different reactions: neutral current (NC), charged current (CC), and elastic scattering (ES) [107]. Due to this favourable situation, the only theoretical input required for this analysis was the shape of the 8 B energy spectrum, with the overall normalization being constrained by the NC measurement. In the latter case, the end-point energy of 7 Be is well separated from all the known backgrounds and other neutrino signals, allowing for a measurement of this flux with an uncertainty below 3% [106].
If a direct detection experiment were to take only the experimental values from Table 3, there would be a risk of adopting some measurements with overly large uncertainties, which are mainly driven by detector-specific effects. This could potentially be controlled by performing a global analysis that includes the likelihood from each of these neutrino experiments, albeit this might prove to be impractical. Similarly, the predictions from the solar models also present some problems and, as mentioned above, there is currently not one fully consistent solar model. Taking all this information into account, we recommend using the solar neutrino predictions described in Ref. [96], except for the 8 B and 7 Be fluxes, for which we recommend adopting the experimental values due to their small uncertainty and independence from other neutrino signals. We believe this choice will provide the best sensitivity for direct detection experiments, while using a reasonable collection of flux uncertainties.
Furthermore, there are a few important ingredients that need to be taken into account when converting a neutrino flux into a recoil rate: neutrino oscillations [110,111], the choice of form factor [112,113], electron binding effects [114], and electroweak uncertainties [16], to name the main ones. We leave the particular considerations for each of these factors at the discretion of each collaboration. Also, we recommend using the prediction from the high-Z model presented in Table 3, except for those cases in which the difference in the expected event count between the two models is sufficiently large, in which case we recommend that the predictions from both models be reported. The level at which this difference is considered important is also left at the discretion of each collaboration, but the crucial point is that this comparison should be made in terms of expected counts at the detector under consideration.

Atmospheric neutrinos
Atmospheric neutrinos arise from the collision of cosmic rays in the atmosphere and the subsequent decay of Tab. 3: List of the solar neutrino fluxes that are relevant for direct dark matter searches. HZ and LZ stand for "high-metallicity" and "low-metallicity", respectively. Values for the B16 solar models are from Ref. [ mesons and muons. This neutrino flux spans a wide range of energies, and while the high-end (> 1 GeV) has been well studied, the low energy region remains largely unexplored, which is the most relevant for dark matter searches. Currently, the best predictions on the atmospheric neutrino flux in the sub-GeV regime are based on the 2005 FLUKA simulations [115]. The sum of the predicted electron, anti-electron, muon and antimuon neutrino fluxes from this simulation is shown in Fig. 8. At higher energies, we recommend adopting the more recent calculation of Honda et al [116]. The two main uncertainties associated with this flux at low energies are the uncertainty on the interaction cross section between cosmic rays and air nuclei, and the one arising from the Earth's geomagnetic field, which introduces a cut-off in the low end of the energy spectrum. Taking into account these two effects, the uncertainty on the atmospheric neutrino flux below 100 MeV is approximately 20% [117,118]. It should be highligthed that the cut-off induced by the Earth's geomagnetic field is dependent on the detector's location, resulting in a larger atmospheric flux for detectors that are nearer to the poles [117]. Our recommendation for the total flux and its uncertainty is shown in Table 4.

Diffuse supernova neutrinos
The diffuse supernova neutrino background (DSNB) refers to the cumulative flux of neutrinos from supernova explosions over the history of the Universe. The expected total flux of the DSNB is not large compared to other neutrino sources, but it can be relevant for direct dark matter searches since it extends to a higher energy range than solar neutrinos.
The neutrino spectrum of a core-collapse supernova is well-approximated by a Fermi-Dirac distribution, with temperatures in the range 3-8 MeV [119]. The DSNB flux shown in Fig. 8 assumes the following temperatures for each neutrino flavour: 3 and 5 MeV for electron and anti-electron neutrinos, respectively, and 8 MeV for the total contribution of the remaining neutrinos. For more details on this calculation, see Refs. [120,121]. There are some large theoretical uncertainties in this calculation, and therefore, following the recommendations from Refs. [121,122], we recommend assigning an uncertainty of 50% on the DSNB flux.

Overall Recommendations
We conclude by providing a list of the main recommendations from the sections above. These recommendations do not preclude the development of new methods if they can be shown to have appropriate statistical properties, as long as comparisons with previous results are reported in a transparent manner. Instead, this set of recommendations provide a common framework that will facilitate the comparison of results between different experiments.

Statistical analysis
• We recommend collaborations to decide on all the choices covered in this paper before proceeding with final analyses, regardless of whether collaborations are employing bias mitigation techniques such as blinding or salting. Changes in the analysis due to, e.g. discovering bugs after an unblinding, should be pointed out when reporting results. • We recommend PLR as the test statistic to use to assess discovery significance and to construct confidence intervals. Alternate methods should fulfill similar statistical properties, in particular of coverage. • For standard WIMP searches we advocate performing these assessments on a per-mass basis, as in a raster scan. • Discovery significance should be assessed with the discovery test statistic, Eq. 5 (Eq. 4 evaluated at µ = 0). • If the signal hypothesis has free parameters not defined under the null hypothesis, a look-elsewhereeffect (LEE) computation should be performed to calculate a global significance, at least for a local significance that approaches or exceeds 3σ. • Claims of evidence should require at least a 3σ global discrepancy with the background-only hypothesis. We do not make a recommendation regarding discovery significance. • Experiments should publish their discovery p-value, both local and, if needed, global for any analysis. • The unified confidence interval approach should be used to construct confidence intervals, using the twosided test statistic of Eq. 4. Staying with past convention in the field, the primary limit should use α = 0.1 (i.e. 90% CL). We recommend collaborations publish the expected sensitivity of a result by showing a median expected limit with an uncertainty band (often called the "Brazil band").
• The two-sided confidence interval will "lift off" from 0 signal when p < α. Collaborations may decide, before proceeding to the final analysis, to apply an excess reporting threshold to report the lower limit only above some greater significance level. Note, however, that this approach will in general lead to overcoverage. See Ref. [21] for a previous example by XENON1T. • To avoid large underfluctuations that would exclude parameter space where a discovery is very unlikely, we advocate the use of a power-constraint (PCL) on the confidence intervals obtained using the test statistic of Eq. 4, with a power of π crit = 0.32. This value corresponds to a sensitivity to µ crit at the -1σ contour of predicted sensitivity in the asymptotic case. We leave it to the individual collaborations whether or not to also publicly present the unconstrained bound in some form (e.g. a dashed line), though we recommend that data be available upon request. • For excesses approaching or exceeeding 3σ, a separate mass-cross-section confidence contour could be included. The complete procedure including this step would be: • Compute per-mass (local) discovery significance and per-mass confidence intervals-both of these should always be reported. • If a local discovery significance indicates an excess, compute and apply the look-elsewhere-effect to report a global discovery significance. • If the global discovery significance exceeds a predetermined threshold, a separate mass-cross-section contour may also be included as part of reporting on the excess. The pre-determined threshold should be set high enough that flip-flopping between the per-mass cross-sections and the twodimensional contour is less of a concern, and the per-mass confidence limit should still be included when reporting the result. • We recommend that the distribution of test statistics be estimated using either toy simulations or approximations (asymptotic or otherwise) verified using toy simulations. • Whenever possible, models should be validated, both on calibration data or side-bands, using goodnessof-fit tests chosen to discover relevant model discrepancies. Tests and criteria should be decided before data is unblinded. • We recommend that collaborations work to make their data more usable to the physics community than specific limits, by making results computer-readable and accessible by default, and by working to develop open statistical models/likelihoods for use by the community. • To avoid analysis biases, experiments should perform blind or salted analyses to the extent possible, committing to analysis and statistical conventions before studying the science data.

Astrophysical models
• The overall recommendation is to use the SHM parameters in Table 1. The most significant is an updated value of v 0 , with all the other parameters being equal to the most commonly used values. We emphasize that if these parameter values are adopted, the relevant references should always be cited and citing this reference only is not sufficient. • Due to non-parametric uncertainties in the form of the SHM itself, it is recommended not to profile over the SHM parameters' uncertainties. Instead, we recommend that these parameters are fixed to clearly stated values, so they can easily be reinterpreted under different halo models. • The list of suggested normalization values for the relevant neutrino fluxes is shown in Table 4. We recommend using the theoretical prediction for all the neutrino sources, except for 7 Be and 8 B, for which the most recent experimental values have a low uncertainty and are completely uncorrelated to other neutrino signals. • We leave at the discretion of each collaboration to make the choices that they consider most appropriate to convert neutrino fluxes into recoil rates.