Global fits of axion-like particles to XENON1T and astrophysical data

The excess of electron recoil events seen by the XENON1T experiment has been interpreted as a potential signal of axion-like particles (ALPs), either produced in the Sun, or constituting part of the dark matter halo of the Milky Way. It has also been explained as a consequence of trace amounts of tritium in the experiment. We consider the evidence for the solar and dark-matter ALP hypotheses from the combination of XENON1T data and multiple astrophysical probes, including horizontal branch stars, red giants, and white dwarfs. We briefly address the influence of ALP decays and supernova cooling. While the different datasets are in clear tension for the case of solar ALPs, all measurements can be simultaneously accommodated for the case of a sub-dominant fraction of dark-matter ALPs. Nevertheless, this solution requires the tuning of several a priori unknown parameters, such that for our choices of priors a Bayesian analysis shows no strong preference for the ALP interpretation of the XENON1T excess over the background hypothesis.


Introduction
The XENON Collaboration recently reported an excess of electronic recoil events over known backgrounds [1]. The statistically preferred explanation in the original analysis was that the excess is due to solar axion-like particles (ALPs) with a significance of about 3.5σ over the background-only hypothesis. This anomaly has already garnered considerable interest . However, it was quickly noted that a solar ALP explanation is in conflict with astrophysical observations, including stellar evolution and cooling [8,16,[41][42][43], SN1987A [8,16], pulsating White Dwarfs (WDs) [8], and the predicted mass of astrophysical black holes [44], although this tension can be reduced in more complicated ALP scenarios [41,45]. Interestingly, WD cooling presents a different anomaly that can also be explained by axions, but the preferred axion couplings appear to be in conflict with the results of XENON1T.
A large number of physics scenarios have been put forward to explain the excess in the electronic recoil spectrum observed by XENON1T. One set of options is based around the existence of dark matter (DM) particles that either scatter inelastically in the detector [4,5,12,21,37,[46][47][48] or are boosted to semi-relativistic velocities via some other process before scattering elastically off electrons [10,15,17,[49][50][51][52]. A 2 keV -3 keV dark photon with a small (10 −15 ) kinetic mixing with ordinary photons [2,16,[53][54][55] or a massive dark photon produced from solar emission [6,11,54] (with the caveat of ref. [54]) may also explain the excess. Weakly-interacting relativistic bosons that are produced by the annihilation or decay of heavier dark particles have also been proposed to account for the data [9,54,56,57]. Alternatively, the anomaly might result from new neutrino-lepton or non-standard neutrino interactions mediated by a light scalar or a vector particle [3,13,14,45,[58][59][60][61][62]. Yet more potential explanations include exotic radioactivity affecting hydrogen decays [25], fermionic DM with an electric dipole moment (EDM) sourced by an oscillating axion-like field [63], and the resurrection of the solar ALP explanation via the postulate of a "stellar basin" of gravitationally-bound axions in the Sun [64]. Finally, tritium ( 3 H) [1,65] or argon [66] present in the detector material have also been identified as possible causes, though the latter has since been excluded by the XENON Collaboration in a revised version of their initial submission [1].
Here we focus on ALP explanations for the excess similar to the ones originally considered by XENON1T. We consider solar ALPs and a scenario recently proposed by ref. [19], in which ALPs constitute a fraction of the local DM. This latter setup can potentially reconcile the different ALP-electron couplings favoured by XENON1T and by WD cooling hints, since the DM ALP signal in XENON1T scales with the DM fraction of ALPs. Hence, if only a fraction of the DM is made out of ALPs, the ALP-electron couplings favoured by XENON1T can be large enough to simultaneously explain the anomalous WD cooling.
In this work, we carefully investigate the impact of the XENON1T electronic recoil data on solar and DM ALPs using the GAMBIT global fitting software [67][68][69], considering Xe data and existing astrophysical constraints on ALPs previously considered in ref. [70], including a careful treatment of WD cooling hints. However, because of the systematic uncertainties inherent to these constraints, we also provide results without WD cooling hints. We include inverse Primakoff processes, as recently pointed out and examined by refs. [71,72], and consider the potential 3 H background. We consider the impact of the Xe data on the ALP parameter space and the extent to which, when taken in combination with astrophysical data, it favours or disfavours ALP models.
We analyse the data by simply comparing best-fit likelihoods, by using the best-fit likelihoods in a frequentist analysis and by Bayesian methods. Reporting best-fit likelihoods and differences between them allows for a simple presentation of our findings without adopting either frequentist or Bayesian methods. In our frequentist analysis, we then use these results to compute confidence intervals and make estimates of p-values. Finally, we compute Bayes factors, which tell us how to update our belief in ALPs relative to the background model in light of all the data, and partial Bayes factors, which tell us how to update our belief in light of the Xe data, supposing we already knew about the astrophysical data. A controversial aspect of any such analysis is that the findings are sensitive to our choices of prior (further discussion and details are provided in the appendices).
The paper is structured as follows: section 2 presents the ALP models that we analyse, section 3 discusses the various experimental results that enter into our analysis, along with their individual impacts, and section 4 describes the combined impact of all the constraints. Finally, we conclude in section 5. For the interested reader, we further discuss Bayes factors and our Bayesian numerical methods in appendix A, and our choices of prior and the prior dependence of our results in appendix B. In appendix C, we discuss and compute an alternative Bayesian measure, namely the Deviance Information Criterion. Lastly, we discuss the Monte Carlo simulations used in the frequentist analysis in appendix D.

Models
In order to examine the plausibility of ALP models explaining the excess events, we compare the "solar ALP" and "DM ALP" models introduced in detail below to a backgroundonly model, in which we set all ALP couplings to zero. In the case of the XENON1T electronic recoil data, this corresponds to the known backgrounds included by the XENON Collaboration as described in section 3.1. We furthermore include an additional background contribution from tritium, which we discuss in section 2.3.

Solar ALP model
In order to compare directly to the analysis of ref. [1], and also to investigate the broadest possible parameter space, we consider ALPs with three independent couplings: to photons (g aγ ), electrons (g ae ), and nucleons (g eff aN ). 1 The axion mass, m a , is not a parameter in our solar ALP model, since the axions produced in the Sun are relativistic, E a m a . We recall that for the QCD axion, all three couplings are linearly related to m a . However, even for the QCD axion, there is considerable variation in the values of the couplings between different models, in particular for g ae which can be loop suppressed [74]. A general ALP model is defined as one in which the couplings do not obey any particular relation to one another, and QCD axion models are a subset of the general ALP models considered here.
The XENON1T signal prediction for the solar ALP case consists of the Atomic recombination and de-excitation, bremsstrahlung, and Compton (ABC), Primakoff (denoted by "P"), and 57 Fe ("Fe") fluxes. These can either be deposited in the detector via the axio-electric effect ("aee") or, as pointed out in follow-up studies [71,72], via the inverse Primakoff effect ("iP"). The latter was not considered in the original XENON1T analysis. The individual components are scaled by the effective axion couplings, i.e., we can calculate them at a reference scale. Schematically, s = g 2 ae · g 2 ae · s aee ABC + g 2 aγ · s aee P + (g eff aN ) 2 · s aee Fe + g 2 aγ · g 2 ae · s iP ABC + g 2 aγ · s iP P + (g eff aN ) 2 · s iP Fe , where s is the signal, the subscripts denote the production channel in the Sun, and the superscripts denote the detection channel. We take the ABC, Primakoff and 57 Fe signal components, and backgrounds, from figure 1 of ref. [1]. We compute the inverse Primakoff contributions following ref. [72].

Dark matter ALP model
ALPs are viable DM candidates with a large parameter space spanning many orders of magnitude in mass and coupling [75,76]. We consider three parameters for the DM ALP model: the coupling to electrons (g ae ), the ALP mass (m a ), and the fraction of the (local) DM around the Earth (η) that is made up of ALPs. We ignore the g eff aN coupling for the DM ALP model since it is not involved in the detection channels. While the g aγ coupling could potentially be relevant because of the inverse Primakoff process, it needs to effectively be zero as a result of the x-ray constraints discussed in section 3.4. We therefore ignore this possibility. We comment on the relation between solar ALPs and DM ALPs at the end of this section.
Given that the local DM moves non-relativistically with velocities of the order 10 −3 c, i.e., E a m a , we neglect the velocity effect in the DM ALP signal strength, which is given by where σ pe is the photoelectric cross section which we adopt from ref. [77] (who use results from ref. [78]) and ρ 0 is the local DM density for which we use the constraints implemented in ref. [68], i.e., we use a log-normal distribution with a median value of 0.40(15) GeV/cm 3 . Note that this results (in general) in a value that is larger than the value of ρ 0 = 0.3 GeV/cm 3 considered in ref. [1]. The required relic abundance of cold keV DM ALPs can be produced in the early Universe by the non-thermal vacuum realignment mechanism if the scale of spontaneous symmetry breaking, f a , is of order 10 10 GeV, assuming a temperature-independent ALP mass, and the standard thermal history up to T ∼ f a .
A keV ALP can also be produced thermally by the freeze-in mechanism, in which case it constitutes a warm DM component [79,80]. The allowed abundance of warm DM is constrained by the observed Lyman-alpha flux power spectrum, which favours η 0.1 for m a ∼ 1 keV [81][82][83]. For further discussion of the production mechanisms relevant to this scenario, and the warm DM limits, see ref. [19]. Several explicit models for a DM ALP with the required mass and Standard Model couplings are given in refs. [19,45].
In general, solar ALPs could be DM ALPs at the same time, but in our scenarios we assume that the solar and DM ALPs that explain XENON1T have rather different masses.
To explain the excess events in XENON1T, we must consider electron recoil energies of more than about 1 keV. As DM ALPs in the halo are non-relativistic, this implies that m a 1 keV as well. Solar ALPs are usually considered to be much lighter since the production of ALPs at the keV scale or above, i.e. the typical energy scales of the processes inside the Sun, would be suppressed. For the Primakoff flux, 2 and a recent solar model [84], we estimate that the total integrated axion flux in the energy range relevant for XENON1T is reduced by about 27% (70%) for an ALP mass of 3 keV (5 keV) compared to effectively massless ALPs. While this would allow heavier DM ALPs to also be produced inside the Sun and influence the statistical inference on the values of the ALP couplings, we assume that we can treat the two hypotheses as distinct scenarios.
Lastly, we note that our DM ALP cannot be the QCD axion: among many constraints, a keV QCD axion has a lifetime shorter than the age of the Universe.

The tritium background hypothesis
Let us now comment on the possible presence of a relevant 3 H background in the XENON1T experiment, which could give rise to an excess of events at about 1 keV -15 keV. The XENON Collaboration quote a conservative upper limit of 4 × 10 −20 mol/mol for the 3 H abundance relative to Xe from exposure to cosmic rays, but expect it to be reduced to at most 10 −27 mol/mol after purification. Ref. [65] suggests that these numbers should be considered uncertain by an order of magnitude. The XENON Collaboration furthermore discuss other mechanisms by which tritium may be introduced to the detector, possibly at most at the 10 −26 mol/mol level. For reference, fitting the anomaly with a tritium component requires about 5 × 10 −25 mol/mol.
In light of these uncertainties, in our Bayesian analyses we make a conservative treatment of the tritium level, employing a log-normal prior for the tritium fraction α t with a central value at the upper estimate of the level of tritium and a moderate standard deviation. Of course, the XENON Collaboration itself would be better placed to construct a prior describing plausible levels of tritium in the detector. Another possible choice of prior would be a log-uniform prior between the expected 3 H levels after purification and the amount expected from cosmic rays, effectively encoding the assumption that the purification process was inefficient to an unknown degree. In our frequentist analyses, we permit an unconstrained tritium component, following the XENON1T methodology.

Experimental constraints and hints
As the ALP interpretation of the XENON1T anomaly is in tension with astrophysical observables, we now discuss some of these as well as our implementation of the relevant Table 1: Likelihoods included in the analysis. Note that the Xe likelihood and the highmass corrections for the astrophysical likelihoods will be made available in a future release of the GAMBIT software.

XENON1T
Eq. (3.1) for binned data from electronic recoils [1] lnL XENON1T Anomaly likelihood functions. We later show the impact of the observables in their entirety in section 4. For a discussion and recent global analyses considering astrophysical constraints, see e.g., refs. [70,85]. The likelihood functions used in this analysis are summarised in table 1.

XENON1T
We implement the XENON1T likelihood (hereafter "Xe likelihood") from the binned data between 1 and 30 keV as made available by the XENON Collaboration on Zenodo [92]. We infer an exposure of about 0.6473 t yr from the Zenodo data. Our likelihood is the product of Poisson distributions, where o i are the observed counts; s i are the binned signal predictions; b i are the binned backgrounds other than tritium, which are scaled by a factor α b ; and t i is the binned tritium background, scaled by the tritium fraction α t (relative number of atoms w.r.t. Xe atoms; in units of mol/mol). The overall expected events are scaled by the efficiency . The efficiency and the background scale α b are varied with Gaussian uncertainties 0.03 and 0.026, respectively, which were estimated from refs. [1,93]. We note that additional possible contributions to the background [94] are not included. We calculate the binned signals using the energy resolution for XENON1T as determined in ref. [95] and the detection efficiencies from Zenodo [92]. We emphasise that XENON1T in fact perform an unbinned analysis, which is expected to have higher statistical power than what can be done with publicly available data.

Horizontal and Red Giant Branch stars
One of the most robust constraints on axions and ALPs comes from the lifetime of stars [96] such as Horizontal Branch (HB) and Red Giant Branch (RGB) stars. The stellar plasma is transparent to weakly-coupled ALPs, so that once they are produced, they easily escape the star, leading to an additional cooling channel. These theoretical constraints can be turned into a likelihood by counting the number of HB and the number of RGB stars in e.g. Galactic globular clusters. The ratio of their numbers, the so-called R parameter, has been used to place strong constraints on the ALP-photon and ALP-electron coupling. We use a likelihood based on results from ref. [86], which was first implemented in ref. [70]. Note that we include the rather small correction for this likelihood that arises for higher ALP masses [97]. We hereafter refer to this as the R likelihood.
The R likelihood is the most robust astrophysical constraint that we consider in the sense that it is not affected by large systematic uncertainties. The measurement itself is based upon a large number of systems, which show good agreement with each other, suggesting that the measurement error is statistic dominated [98]. The theory prediction relies on the helium abundance, which introduces an additional potential source of error. 3 After including the inverse Primakoff contribution, we find best-fit values ofĝae = 2.86 × 10 −12 , gaγ = 1.28 × 10 −10 GeV −1 ,ĝ eff aN = 8.55 × 10 −7 ,α b = 0.98 andˆ = 0.98 with χ 2 = 29.2. It has been observed in follow-up works [71,72] that after including the inverse Primakoff contribution, the XENON1T best-fit regions move closer to the region in which astrophysical constraints are satisfied. This has been interpreted as a reduction of the tension. However, that fact that the value of χ 2 = 43.4 of the best-fit point in the Xe + R likelihood combination analysis stays the same -with or without including the inverse Primakoff contribution -already indicates that the tension is not relieved significantly after including it.
However, ref. [87] argues that systematic uncertainties related to this measurement are under control. Hence, the R likelihood allows for a consistent statistical interpretation in the context of global fits. Our results from combining XENON1T and the R likelihood show that the latter puts the solar ALP interpretation of the XENON1T anomaly into question. Indeed, the R likelihood dominates over the XENON1T likelihood in the g ae -g aγ plane of the parameters. This forces the solar ALP best-fit couplings to occupy a degenerate line away from values that can explain the excess. Note that not combining the R parameter with the XENON1T likelihood in the solar ALP case would be inconsistent since both consider ALP interactions with stellar systems.
For an impression of the importance of the R likelihood, consider the solar ALP model. With only XENON1T, we find a best-fit of χ 2 = 29.3. With XENON1T and the R parameter, we find a best-fit of χ 2 = 43.4, a considerable increase, despite only adding one additional data point to the fit. Indeed, the increase by about ∆χ 2 = 14 indicates that although g eff aN (with a best-fit atĝ eff aN = 1.08 × 10 −5 ) is not constrained by the R parameter, it cannot alleviate the tension between XENON1T and the R parameter, mostly because the 57 Fe signal associated with it is monochromatic and only contributes to the spectrum near 14.4 keV, whereas the excess is observed below 7 keV.

White Dwarf cooling hints
Similar to the XENON1T anomaly, observations of anomalous cooling in WDs can be interpreted as being due to ALPs with non-vanishing ALP-electron coupling g ae . Indeed, measurements of the period increase in a number of pulsating WDs show anomalous cooling that is consistent with an ALP-electron coupling of g ae ∼ O(10 −13 ) [86]. Another observable that can be used to infer the WD cooling is the white dwarf luminosity functions, see e.g., ref. [99].
Here we use a likelihood based on the findings of refs. [88][89][90][91], first implemented and described in ref. [100]. Similar to the R likelihood, we have to include correction terms for higher ALP masses (see e.g., ref. [101]). We estimate the WDs internal temperature from their astroseismological properties in refs. [88][89][90]102] using Kramer's opacity law [96, section 2.2.2]. The typical corrections at m a = 3 keV increase the ALP-electron coupling by a factor of 1.4 (except for PG1351+489, which has a higher internal temperature than the others and the correction is less than 1.1).
More specifically, for light ALPs (e.g., the solar ALPs that we consider) "WD cooling hints" therefore point to a value of g ae ∼ 3.4 × 10 −13 , which is an order or magnitude lower than the coupling expected to fit the XENON1T anomaly with solar ALPs [1]. This fact, together with the importance of the R parameter constraint mentioned above, takes our combined best fit point for solar ALPs to a region in significant tension with the XENON1T anomaly.
The situation is reversed for the DM ALP case. Here, the required value of g ae ∼ 3.7 × 10 −13 to fit the cooling hints 4 is larger than the constraints placed by the XENON1T experiment (assuming that ALPs make up all of the local DM) [1]. This constraint can be evaded if ALPs are allowed to only constitute a fraction of the local DM.
It should be noted that the WD cooling hints are somewhat speculative due to the difficulties involved in both the measurement and the modelling of WD evolution. Nonetheless, just as with the XENON1T anomaly, it is interesting to consider the consequences of the WD cooling anomaly. When included, the combined likelihood of the four WDs considered constitutes a strong constraint that can dominate the statistical inference on the ALP-electron coupling.

DM ALP decays
If ALPs constitute some or all of DM, their decays into photons would lead to potentially observable x-ray lines. The strongest constraints in the mass range of interest stem from observations of M31 [103] and from NuSTAR [104] and require which is many orders of magnitude stronger than the constraints from stellar cooling. However, the ALP-photon coupling is not necessary to explain either the XENON1T anomaly or the WD cooling hint, so we will assume that the ALP-photon coupling for the DM ALP case is sufficiently suppressed (see refs. [19,80] for how to obtain the necessary suppression in explicit models) to satisfy this constraint, and we set g aγ = 0 explicitly. Note that ignoring the g aγ coupling will have an impact on the statistical inference compared to including it together with the x-ray constraints mentioned above. In both frequentist (via the model DOFs) and Bayesian (via an Occam's penalty) analyses, the presence of a strongly constrained parameter will disfavour the ALP hypothesis -unless that parameter vanishes or can at least be suppressed sufficiently in a given model. By ignoring the ALP-photon coupling for the DM ALP case, we therefore assume that such models can be constructed in a natural way. This might be overly optimistic but also means that if the DM ALP model is disfavoured in this setting, it would perform even worse with g aγ as a free parameter.

SN1987A cooling
We end this section with a note on the possible impact of further cooling constraints from supernova SN1987A, though as later explained, we do not include SN1987A in our statistical analysis. SN1987A constrains axions and ALPs in numerous ways [96] such as the conversion of axions in interstellar magnetic fields, the decay of ALPs, and the neutrino cooling time. For the present work, the latter would be the most relevant one as it presents one of the strongest constraints on the ALP-nucleon coupling, g eff aN . Unfortunately, the usually cited theoretical cooling bound has so far not been cast into a statistical framework. One of the difficulties is that the uncertainties on the observed neutrino flux from SN1987A are large and ALPs hardly affect its value, as noted in ref. [105]. 5 It thus appears that statistical statements about the bound would require full supernova simulations including ALPs to find which ALP-nucleon couplings are consistent with SN1987A happening at all.
Although the lack of a likelihood function prevents us from including this bound in our analysis, let us now nonetheless discuss its possible implications on the XENON1T anomaly. The bound, usually quoted as appears to exclude the XENON1T anomaly best-fit ALP-nucleon coupling,ĝ eff aN ≈ 10 −6 , by several orders of magnitude. 6 However, the ALP-nucleon coupling only impacts the 57 Fe component of the signal, a monochromatic feature at around 14.4 keV, whereas it is the ABC and Primakoff components that could explain the XENON1T excess. Setting g eff aN = 0, we find that the minimum χ 2 value for the solar ALP case would change from χ 2 = 29.2 to 30.9, i.e., the effect of the SN1987A constraint on the solar ALPs hypothesis would be small. Importantly, though, the SN1987A bound could prevent the QCD axion from playing the role of a solar ALP that explains the XENON1T excess. This is most easily seen by noting that the SN1987A limit on g eff aN places the strongest upper bound on the QCD axion mass, m a 10 −2 eV. This is incompatible with the smallest value of the QCD axion mass allowed by the XENON1T 90% CL region, which occurs for the DFSZ model with g aγ ≈ 10 −11 GeV −1 , leading to m a ≈ 4 × 10 −2 eV. However, variations of the QCD axion model [108] with smaller g aγ remain compatible.

Results
We now turn to the detailed discussion of the results from our frequentist and Bayesian analyses. Our choice of priors and the prior sensitivity of our conclusions for the Bayesian analyses are summarised in appendix B. In our frequentist analyses the priors reflect the sampling strategy of the parameter scans and we anticipate limited impact.

Frequentist results
First, we consider frequentist results for the solar ALP model for only Xe data, when adding the R parameter and finally when adding WD hints. In table 2, we show the χ 2 ≡ −2 log max L at every step and the log-likelihood ratio test statistic, Note that the effective ALP-nucleon coupling for SN1987A [107], written in terms of the ALP-neutron (gan) and ALP-proton (gap) couplings, differs from the effective ALP-nucleon coupling g eff aN considered by XENON1T. Despite these subtleties, we can still compare the effective ALP-nucleon couplings in order of magnitude. The χ 2 values associated with the best-fit points in our models for the Xe data, when adding the R parameter, and finally when adding the WD hints. The ∆χ 2 columns show the test statistic in eq. (4.1). For each model (solar or DM ALP) the test statistic is computed using the corresponding background (with or without 3 H) as the null hypothesis.

Model
Xe where max L s+b is the maximum likelihood in the ALP + background hypothesis and max L b is the maximum likelihood in the background hypothesis. In each case, this involves maximising the likelihood over several parameters. With just Xe data and without tritium, we see a ∆χ 2 15 preference for solar ALPs over the background model. With an unconstrained tritium component, this reduces to only ∆χ 2 5. When we include the R parameter, the incompatibility between XENON1T and the R parameter in the solar ALP model destroys the preference for solar ALPs and we find ∆χ 2 1.5. Finally, adding the WD hints restores some preference for solar ALPS, ∆χ 2 11. As we see in the recoil energy spectra in figure 2, in the latter case, the solar ALP model is barely distinguishable from the background model in XENON1T, as the ABC and Primakoff contributions must be suppressed to satisfy the R parameter constraint. The 57 Fe component, however, remains visible above the background. As discussed in section 3.5, even the visible 57 Fe component could be ruled out by SN1987A.
As mentioned in section 3.2 and section 3.3, fitting both the R parameter and WD cooling hints requires ALP couplings away from the XENON1T preferred region as depicted in figure 1. This can be readily seen in figure 3, where we show profile likelihoods for the solar ALP model with constraints from XENON1T, the R parameter and WD cooling hints. For comparison, we also show the 90% CL from the results of a fit including only Xe data (with the inverse Primakoff contribution [72]). We note that our best fit point lies outside of the Xe 90% CL in both the g ae -g aγ and g ae -g eff aN planes. 7 This tension is significantly driven by stellar cooling, which imply g ae 10 −12 and g aγ O(10 −10 ). For our best-fit point, the larger value of g eff aN , outside the 90% CL for Xe only, can be understood by the need to compensate the very small value of g aγ in order to reproduce the right signal around 14.4 keV.
Deviation from obs. value Deviation from obs. value We also show the best-fit ALP + B 0 model for the XENON1T and astrophysical data (dashed-dotted blue).
Right: A comparison of the best-fit predictions for the WD period decrease and R parameter for no ALP (red stars) and including ALPs (blue stars; with XENON1T and astrophysical data).

Bayesian results
For our Bayesian results, in table 3 we show the Bayes factors in favour of our solar ALP model when omitting a tritium component, including it in the ALP and background models, and when including it only in the background model. For example, the Bayes factor in the third row and third column (B = 0.25) is obtained by considering the ratio of probabilities i.e., for that entry, we consider the probability of the XENON1T, R parameter and WD data in the solar ALP model with the XENON1T B 0 background versus the probability of With only Xe data -and when omitting tritium backgrounds -we find a Bayes factor of about 3 in favour of the solar ALP scenario. This is "barely worth mentioning" on the Jeffreys' scale [109] and corresponds to a Z-score of about 0.6σ. With tritium backgrounds, the ALP model and backgrounds are barely distinguished, with a slight preference for the background only model. Lastly, the Xe data slightly favours the background-plus-tritium hypothesis over a signal without a tritium component.
The R parameter removes the slight preference for the solar ALP model. Indeed, with XENON1T and the R parameter, the solar ALP model is disfavoured by a factor of about 4. The inclusion of WD cooling hints slightly reverses the impact of the R parameter, but even then we find at most a tiny preference for the solar ALP model of about 1.3. Lastly, the partial Bayes factors for the Xe data given the astrophysical data were less than about one; meaning that given we knew the astrophysical data, the XENON1T excess in fact told us little about the solar ALP model. This happened because the astrophysical data forced the solar ALP couplings into regions that could not explain the XENON1T excess, making its predictions for the data observed by XENON1T no better than the background model, and in fact worse than a tritium component.
We emphasize that the Bayes factors that we obtain depend sensitively on our choice of priors. A detailed discussion of how our results would change for different priors is given in appendix B. We find that the maximum Bayes factor obtainable for any priors for the solar ALP couplings is roughly e ∆χ 2 /2 ≈ 1500, which corresponds to a Z-score of about 3.2σ. As an alternative approach for model comparison not directly dependent on the priors, we discuss the Deviance Information Criterion in appendix C. Table 3: Bayes factor in favour of solar ALP versus background, with no tritium background (first row), tritium contributions to the background in both models (second row) and tritium contributions only in the background model (third row). The first three columns show results considering Xe, adding the R parameter and finally adding WD hints. The final two columns show partial Bayes factor for the Xe data given the R parameter, and given the R parameter and WD hints.

Frequentist results
Turning to DM ALPs, our frequentist results for the DM ALP mass, electron coupling and DM fraction -when combining the Xe, R, and WD likelihoods -are shown in figure 4.
In figure 4a, we see that only ALP masses close to the best-fit pointm a = 2.7 keV are favoured. Figure 4b shows that smaller DM fractions permit greater electron couplings, with η 0.1 favoured, and the best-fit point atη = 0.030. While this has been appreciated before [19], here we show the confidence intervals compatible with the Xe likelihood, which might be of particular interest for model builders. Even at 90% CL, ALP DM is not the dominant DM component, where all our results include the uncertainties on the local DM density.
In figure 5, we show the DM ALP frequentist results and the various individual observables considered in this work. If the local DM density consists entirely of ALPs (η = 1) we can derive a bound g ae 10 −13 for any given ALP mass. By allowing the ALP DM abundance η to vary, greater couplings are allowed, and we can obtain a decent fit to the Xe data, while at the same time fitting the combined cooling hints and respecting the R parameter constraint (χ 2 = 43.5). Note that the combination of the four WD cooling hints prefers ALP masses smaller than about 3 keV, even though heavier ALPs can also contribute to WD cooling for larger values of g ae . To illustrate this, we include the profile likelihood for only the WD G117-B15A in figure 5 as an example.
In table 2, we show best-fit χ 2 and ∆χ 2 for the DM ALP model when adding the astrophysical data step-by-step. With only Xe data, the DM ALP model best-fit improves on the background model by ∆χ 2 17, which is slightly greater than in the solar ALP case. Even with a tritium component, the DM ALP model is preferred by ∆χ 2 8.5. For the solar ALP model, we saw severe tension between the R parameter and XENON1T. Here, however, we see that adding the R parameter in fact slightly increases the preference for DM ALPs to ∆χ 2 18. Indeed, the DM ALP model successfully reconciles XENON1T and the R parameter by tuning the ALP DM fraction. Lastly, adding WD cooling hints further increases the preference for DM ALPs, reaching ∆χ 2 23 without tritium, and ∆χ 2 15 with tritium. The success of the model can be seen in figure 2 by the similarity between the best-fit spectra for XENON1T only, and to XENON1T and astrophysical data. Rather than reduce the allowed signal, the astrophysical data in fact push the amplitude of the signal slightly higher.
To make a rough estimate of the significance of the combined signals in the DM ALP model from XENON1T, R likelihood, and WD cooling we use Wilks' theorem [110], which states that in certain cases the log-likelihood ratio test statistic in eq. (4.1) should follow a χ 2 distribution. We compute local significances assuming 2 degrees of freedom, corresponding to the number of independent parameters of the DM ALP model except for the DM ALP mass. When including a tritium component and considering all data simultaneously, the observed value ∆χ 2 = 14.9 corresponds to a local significance of about 3.3σ. With no contribution from tritium the observed value ∆χ 2 = 23.1 corresponds to a local significance of 4.3σ.
However, we warn that although our models are nested, important assumptions in Wilks' theorem are violated [111] -for example, the values we obtain are local significances that do not include a trial factor to account for the look-elsewhere effect for the DM ALP mass. Moreover, the background only model lies on the boundary of the ALP models, meaning that even our computation of the local p-values violate conditions in Wilks' theorem. In appendix D we provide a robust estimate of the global p-values using Monte Carlo simulations. The simulations largely confirm our estimates above, implying that our approximate local p-values are in fact accidentally closer to the global p-values (that account for the look-elsewhere effect) than one might naively have expected.

Bayesian results
Lastly, we present our Bayesian results for the DM ALP model. The posterior probability densities and 68% and 95% credible regions for the DM ALP parameters are shown in figure 6 for XENON1T, the R parameter and WD data. As expected from the profile likelihood results in figure 4, the DM ALP mass is strongly constrained to be around m a ∼ 2.7 keV. However, in contrast to the frequentist results, DM ALP masses are allowed throughout the explored ranges, 1 keV < m a < 30 keV at 95% credibility. Furthermore, the preference for a small DM fraction η < 0.1 is minimal in the Bayesian case (see figures 6b and 6c), compared to the frequentist results. The one dimensional marginalised posterior on η is in fact largest near η ∼ 0.1, although this is barely visible in either of the two-dimensional plots. This can be understood as the marginalisation over either m a or g ae hides the enhancement due to the XENON1T results or the WD cooling hints, respectively. The last figure 6d shows, as we expected from the frequentist results, an enhancement of the posterior probability for the largest allowed ALP-electron coupling g ae ∼ 10 −13 and DM ALP mass around m a ∼ 3 keV.
We show in table 4 the Bayes factors in favour of the DM ALP model versus the background model. Without tritium, the Bayes factors from the Xe data alone in fact slightly disfavours the DM ALP model. In other words, for our choices of prior, the DM ALP model does not predict the observed data better than the background model. The reward that the DM ALP model obtains for being able to in principle explain the excess for specific combinations of couplings does not outweigh the penalty incurred for making much broader, less specific predictions for the data, including predicting potential signals that are much bigger than the one observed. The combination of XENON1T, the R parameter and WD hints, however, favoured the DM ALP model by about 2.  The partial Bayes factors prefer the DM ALP model more than the Bayes factors, as once we take into account the astrophysical constraints, the remaining viable DM ALP couplings make better predictions for the excess observed at XENON1T than the background model. In particular, the extra freedom in the model allows it to reconcile the WD cooling anomaly and XENON1T, leading to a partial Bayes factor of about 3 in favour of DM ALPs over the background with no tritium for XENON1T given astrophysical data. Thus we again see that the DM ALP model is partially successful in explaining the XENON1T and WD anomalies, whilst avoiding R parameter constraints. Table 4: Bayes factor in favour of DM ALP versus background, with no tritium background (first row), tritium contributions to the ALP model and background-only model (second row) and tritium contributions only in the background model (third row). We show in the first three columns the results considering Xe, adding the R parameter and finally adding WD hints. The final two columns show partial Bayes factor for the Xe data given the R parameter, and given the R parameter and WD hints.

Discussion
We now comment on the differences in the preference for ALPs in our Bayesian and frequentist analyses. The approaches are mathematically different; there is no reason for them to agree and the fact that they differ does not imply that either one is wrong. Nevertheless, we see three reasons why the Bayesian approach yields so much lower preference for the DM ALP hypothesis than the frequentist approach. First, we did not correctly account for a look-elsewhere effect in our computation of the p-values for the DM ALP, which might reduce the significance.The Bayesian approach, on the other hand, automatically accounts for the fact that the ALP mass is unknown and hence the signal could have appeared anywhere in the search window.
Second, the Bayesian evidence includes an automatic Occam penalty [112]. We chose a very broad prior range for the coupling strength g ae , which in particular includes values much larger than those favoured by the Xe data. This is penalised by the Bayesian evidence; the fact that the signal is not actually larger than what is observed counts as evidence against the DM ALP model. Of course, XENON1T is not the first experiment to search for ALPs and hence it was already known that g ae cannot be much larger than the XENON1T preferred value. Including this previous information in the prior range for g ae would slightly increase the Bayes factor.
Finally, the Bayesian computations include only the observed data. The frequentist approach, on the other hand, requires that we instead consider data as extreme or more extreme than that observed; see e.g., ref. [113] for further discussion.

Conclusions
The recently observed anomalous signal at low energies in the XENON1T experiment, consisting of an excess of 53 events between 1 keV -7 keV with a significance of 3.5σ above the background only hypothesis, has generated considerable interest due to its possible interpretation as evidence for physics beyond the Standard Model. In this work we focus on two possible explanations: solar ALPs and a DM ALP model in which ALPs constitute a fraction of the local DM density. We used the GAMBIT software to perform global fits of the models to a combination of XENON1T data and astrophysical data 8 using both Bayesian and frequentist statistics to assess the ability of each model to explain the excess.
When the XENON1T data are considered on their own, we find that solar ALPs are favoured by about ∆χ 2 ≈ 15 in our frequentist analysis, and a Bayes factor of about 3 for our choices of prior. When including the lifetime of Horizontal Branch and Red Giant Branch stars, we find that the solar ALP model cannot explain the XENON1T data. Including, in addition, the anomalous cooling of white dwarfs, however, the combined data nevertheless favour solar ALPs over the background only hypothesis by ∆χ 2 ≈ 11 and by a Bayes factor of 1.3. The evidence in favour of the solar ALP is driven largely by the strength of the anomalous white dwarf data, and not by the XENON1T data.
Concerning DM ALPs that constitute a fraction of the local DM density we find that they give a better fit to XENON1T than the solar ALP, with ∆χ 2 ≈ 17. The freedom to lower the DM fraction furthermore allows the model to explain the XENON1T data with a larger axion-electron coupling, which also has the ability to explain the WD cooling hints, resulting in a ∆χ 2 ≈ 23 preference for the DM ALPs. The results of our Bayesian analysis, however, slightly disfavour DM ALPs by about 0.75 with only XENON1T data and slightly favour it but only by about 2 with XENON1T, the R parameter and WD data. For our choices of prior, the background model better predicts the XENON1T data, since, although specific DM ALP couplings could fit the data, the DM ALP model makes much broader, less specific predictions for the data, including predicting potential signals that are much bigger than the one observed. We stress, however, that different choices of prior for the ALP DM fraction and the DM ALP couplings could increase the evidence for the DM ALP model. Moreover we find that the partial Bayes factors in favour of the DM ALP model for the XENON1T data, given known astrophysical constraints and hints, can be as large as 3 even for conservative priors, which, however, still corresponds to an evidence "barely worth mentioning" [109].
The DM ALP model that is favoured by the combination of the XENON1T and WD cooling anomalies consists of a particle massm a = 2.7 keV, an axion-electron couplinĝ g ae = 2.5 × 10 −13 , and constitutes a fraction of the local DM of aboutη = 0.030 (η 0.2 at 95% CL; 1 DOF). Further evidence for or against this model could come from the anticipated XENON-nT data and other electron-recoil direct DM searches, from further study of WD cooling with improved modelling and observations of the period decrease. Another promising strategy is to search for DM ALPs through their inevitable decay into two photons in future x-ray observatories like ATHENA [114].
There is no straight-forward way for the best-fit DM ALP to be the QCD axion, which would require circumventing constraints on hot DM and DM stability. Hence the DM ALP model offers no explanation for the non-observation of the neutron electric dipole moment [115]. Such an ALP is invisible to axion DM haloscopes, such as ADMX [116], which are sensitive only to the ALP-photon coupling, and masses m 1 eV. It is also invisible to the QUAX haloscope [117], which probes the axion-electron coupling, but is also only sensitive to very low ALP masses. If the XENON1T DM ALP exists, however, it could be part of a larger ALP sector.
Our analysis is insensitive to the DM ALP temperature, and is consistent with either a thermal or non-thermal production channel in the early Universe. In the case of a thermal production channel, evidence for the DM ALP could appear in future precision measurements of the matter power spectrum, such as Euclid [118].
Lastly, we consider a possible tritium component in the background. In both the frequentist and Bayesian analyses, a tritium component reduces the preference for the ALP models and is preferred over the background only model by ∆χ 2 ≈ 10 and a Bayes factor of about 5. In the frequentist analysis, however, the combination of XENON1T and the astrophysical data still favours the ALP models by as much as ∆χ 2 ≈ 15, owing to the WD hints. We emphasise that we have allowed the level of tritium to vary by several orders of magnitude. It will be interesting to see what the additional cross checks planned for XENON-nT will reveal about the potential presence of tritium in the detector.
In summary, we have shown that the preference for either a solar or DM ALP explanation of the XENON1T excess strongly depends on the inclusion of astrophysical axion constraints. These generically lower the preference for the solar ALP explanation over the backgroundonly model, and raise it for the DM ALP. Further interesting conclusions result from employing complementary statistical approaches (i.e. both Bayesian and frequentist), which allow one to determine whether it is justified to increase the number of parameters of a theory in order to bring different measurements into better agreement. The combination of growing datasets and advanced statistical methods will inevitably shed more light on the XENON1T anomaly in the near future. This work used the Scientific Computing Cluster at GWDG, the joint data centre of Max Planck Society for the Advancement of Science (MPG) and the University of Göttingen, as well as computing resources of the North-German Supercomputing Alliance (HLRN).
We also acknowledge PRACE for awarding us access to Marconi at CINECA, Italy, and Joliot-Curie at CEA, France. We acknowledge the use of Diver [119], MultiNest [120,121] and T-Walk [119] for the statistical analysis and the use of pippi [122] and the Python packages matplotlib [123], scipy [124], and numpy [125].
We thank Adrian Thompson for discussions about ref. [72], and the GAMBIT community for developing and maintaining the GAMBIT global fitting software as well as for valuable discussions and meetings over the last few years, without which this work would not be possible.

A Bayes factors
To understand the impact of the Xe data on the models, we compute Bayes factors and partial Bayes factors [109]. For a review of Bayes factors, see ref. [126]. The Bayes factor, updates our relative belief in two hypotheses (here M 1 and M 0 ) in light of experimental data, D. The factors in the Bayes factor are Bayesian evidences, which may be found by marginalising the likelihood over a prior for the model's parameters θ i.e.
For a pedagogical discussion of priors, see refs. [127,128]. The dependence of the above integrals on the choice of prior is a major problem in such analyses. Even among adherents of Bayesian statistics, Bayes factors remain controversial. They are considered by e.g., ref. [129] to be the primary and indisputable tool for model testing to the extent that science should abandon p-values and adopt Bayes factors. On the other hand, ref. [130] emphasizes their dependence on the choice of prior and that the choice may be unjustifiable and untestable. In the Bayesian framework, there are alternative tests that use predictive checks and depend only the posterior distribution of the model's parameters, ameliorating prior dependence. We present one such check, the Deviance Information Criterion (DIC, see appendix C). We also consider the impact of varying the priors within classes of alternatives. For our main results, we compute all evidence integrals with MultiNest v3.11 [120,121] inside ScannerBit [119] in GAMBIT v1.4.5. MultiNest is known to efficiently calculate the evidence for up to about 30-dimensional problems [131], whereas our models had at most 7 parameters. We use 5000 live points (nlive), a stopping tolerance (tol) of 0.001 and sampling efficiency (efr) of 0.05. The estimated fractional statistical errors in the evidence estimates were always less than 5%. We furthermore used the state-of-the-art cross-checks recently proposed in ref. [132] and found no evidence of a systematic error in the evidence estimates. To investigate prior dependence in figure 7, the Bayes factors were recomputed by reweighting the evidence estimates, after cross-checks to ensure that the procedure produced reliable results.
We furthermore consider a partial Bayes factor for data D given data D , which is itself a ratio of Bayes factors. The partial Bayes factor tells us how to update our relative belief in two models, supposing that we already knew about the data D . For independent datasets, the factors in the partial Bayes factor may in fact be written as Unlike in eq. (A.2) where we averaged over the prior, here we average over a posterior distribution, weakening the dependence on the prior. Lastly, to compare the Bayesian results with the significances reported by XENON1T (e.g., 3.5σ), we compute Z-scores corresponding to the posterior probability of the background models assuming equal prior odds, where Φ is the standard normal CDF. It is known that they are often less than the frequentist significances [133].

B Choice of priors and prior sensitivity
The priors that we choose are summarised in table 5. For the ALP-photon coupling, we choose logarithmic priors in the range 10 −20 GeV −1 -10 −3 GeV −1 . The lower bound corresponds roughly to the ALP-photon coupling expected from an anomalous global symmetry broken at the GUT scale. Even smaller ALP-photon couplings can be realised if the global symmetry is anomaly-free [80] or if there is a cancellation between different contributions, but such small couplings would be indistinguishable from the background-only hypothesis even for far-future experiments. The upper bound for this range stems from the very conservative assumption that the ALP-photon coupling should be smaller than the pion-photon coupling to satisfy constraints on e + e − → γ + invisible from LEP [134][135][136], which are valid for all sub-MeV ALP masses. Of course, astrophysical constraints place much stronger bounds on the ALP-photon coupling. For instance, for g aγ = 10 −8 GeV −1 the axion luminosity of the Sun would exceed its photon luminosity by an order of magnitude, which would be in clear contradiction with helioseismology. Rather than encoding astrophysical constraints in the prior ranges, we implement them via the likelihoods discussed above and therefore allow much broader prior ranges. The ALP-nucleon coupling can be written as g eff aN = m N C N /Λ, where m N is the nucleon mass, C N is a coupling constant and Λ is the unknown scale where the effective interaction is generated. Assuming Λ to lie between the electroweak scale and the Planck scale, and C N to be of order unity, we choose a logarithmic prior in the range 10 −20 -10 −3 to encode our ignorance of the scale of new physics. Naively, the ALP-electron coupling Table 5: Choices of prior for the solar and DM ALP models and nuisance parameters. We discuss these choices at the beginning of section 4. XENON1T nuisance parameters

Solar
DM ALP nuisance parameters ρ 0 Log-normal, log 10 ρ 0 1 GeV/ cm 3 = log 10 (0.4) ± 0.138 [68] η Uniform, (0, 1) should be smaller than the ALP-nucleon coupling by a factor m e /m N . However, there are many examples of ALP models where one of the two couplings is strongly suppressed, for example, if the ALP couples dominantly to gluons [137]. We therefore treat the two couplings as completely independent and choose the same prior range for g ae as for g eff aN . Again, we include astrophysical constraints in the likelihoods rather than in the prior ranges. Since our Bayesian results will depend on these choices, we later investigate the impact of varying the width of the logarithmic ranges for the couplings.
We picked a uniform prior on the ALP mass in the DM ALP scenario from 1 keV -30 keV, which spans the XENON1T low-energy energy spectrum. Enlarging this range could only weaken the evidence for a DM ALP signal.
The rationale behind the 3 H abundance prior is explained in section 2.3. The prior for the ALP DM fraction, η, is chosen so that ALPs are typically a sizeable fraction of the local DM, even if they could constitute none of it. Our priors for the XENON1T nuisance parameters, α b and , are based on the constraints given by XENON1T, while our prior for the local density of DM [68], ρ 0 , comes from astrophysical estimates. In our frequentist analyses, we instead include these constraints via equivalent likelihood functions. As discussed in section 2.3, we allow for the tritium component to be unconstrained in our frequentist analyses.
To check the dependence of the Bayes factors that we obtain for solar ALPs in section 4.1 The alternative y-axis on the right-hand side of (b) shows the Bayes factor converted into a Z-score by eq. (A.5).
on these choices, we varied the priors for the solar ALP couplings and the uncertainty on the tritium component. We re-computed the Bayes factor for the XENON1T data for logarithmic priors for the solar ALP couplings between 10 c−0.5w and 10 c+0.5w for different choices of width, w and centre, c. 9 The results in figure 7a show that the Bayes factor could favour the solar ALP scenario by as much as about 1000 for a narrow prior centred about the best-fitting electron coupling, 3 × 10 −12 , but that the preference is usually less than about 10. The Bayes factor falls rapidly once the best-fitting couplings lie outside the prior range, tending towards one if the couplings are too small, as the solar ALP model resembles the background model (lower left yellow triangular region), and zero if they are too large (upper left white triangular region).
We find that a tritium component in the background is favoured over the background only by about 5. To check the dependence of this result on the uncertainty in the tritium component, we re-calculated it for alternative choices of prior. We used log-normal priors, log 10 α t = −27 ± σ, for σ in the range 0 to 5. The results in figure 7b show that the Bayes factor depends weakly on the uncertainty on the tritium component, ranging from about one when σ ≈ 0, to about five for σ ≈ 3. The latter is preferred as it accommodates the best-fit tritium level, 5 × 10 −25 mol/mol, without giving much support to much higher levels of tritium that are strongly disfavoured by the data.
Lastly, we checked the dependence of the Bayes factors and subsequent conclusions in section 4.2 on our choices of priors. We found that had we for example reduced the prior range for the DM ALP coupling g ae from 10 −20 -10 −3 to 10 −20 -10 −11 , all Bayes factors would favour the DM ALP model by approximately a factor of two more. We anticipate, however, that the partial Bayes factors, which take the astrophysical constraints as background knowledge, would not be significantly affected, as the astrophysical constraints already strongly disfavour g ae 10 −11 . Likewise, we find that in order to reconcile the WD hints and XENON1T requires an ALP DM fraction η 0.1, which makes our results sensitive to our choice of prior for η. When using a logarithmic prior from 0.01 -1 instead of a uniform prior from 0 -1, we find that the preference for the DM ALP model in table 4 increases from B = 1.8 to B = 2.3.

C Deviance Information Criterion
Given a parameter point θ we define the deviance for a model by where L(θ) is the likelihood function. The deviance measures the model's ability to predict the observed data. We can compare several different models by computing the the Deviance Information Criterion (DIC) [138] for each model, where · indicates the posterior mean: with D denoting the available data and M the model under consideration. This is an estimate of the deviance that would be obtained with new data given the posterior mean estimate of the parameters, θ [130]. The term corrects bias from over-fitting and is motivated by an analytic result for Gaussian posteriors. It may be interpreted as the number of model parameters constrained by the data, such that the DIC is a Bayesian version of the Akaike Information Criterion (see ref. [139] for further discussion). The model with the smallest DIC is expected to best predict future data and is therefore preferred. In contrast to the Bayesian evidence, the DIC depends on the posterior and not directly on the prior. Whilst this could reduce the dependence on the choice of prior that is inherent in Bayesian model comparison, we find that this is not the case in the present context, because our posteriors are highly non-Gaussian. More specifically, the posteriors for the ALP couplings and tritium fraction exhibit fat tails which stem from the fact that vanishing couplings cannot be excluded. As a result, the posterior mean estimate of the parameters θ can be very different from the most probable value and hence D( θ ) can be very different from D min . Moreover, the fat tails lead to an over-estimation in the number of constrained parameters and therefore an excessively large penalty for over-fitting (see e.g., the Cauchy Table 6: Difference in Deviance Information Criterion (DIC) between the solar and DM ALP models and the background only model for the Xe data, when adding the R parameter, and finally when adding the WD hints. Positive (negative) differences indicate that the ALP (background-only) model was favoured. To emphasize the importance of the overfitting term p D we write our results as ∆DIC = ∆D + ∆(2p D ), where ∆D is the difference between the two models in the mean deviance and ∆(2p D ) is the difference in the overfitting penalty.

Xe
(Xe + R) (Xe + R + WD) example in ref. [139]). We therefore find that the DIC generally disfavours models with more parameters, even if they substantially improve the goodness of fit. 10 We present the DIC results for the solar ALP and DM ALP models in table 6. In each case we show the difference in DIC between the background-only and ALP model such that positive differences indicate that the ALP model is preferred. For the reason explained above our findings are however quite counter-intuitive. For example, adding a tritium component typically reduces the preference for the background model or even leads to a preference for the ALP model. We will therefore refrain from attempting a more detailed interpretation of the numbers in table 6.

Solar ALP
Nevertheless, the DIC presented here complements our frequentist and Bayesian results by suggesting that we should not expect ALP models to make better predictions for new data than the background only model. However, given the highly non-Gaussian posteriors in our analysis, our results are very sensitive to the priors and to the specific form of the over-fitting term in the definition of the DIC. We note that alternative forms of the over-fitting term exist; we chose the form in eq. (C.4) because it is always positive, whereas we found that alternatives could become negative for our data and models.

D Monte Carlo simulations for frequentist results
We largely avoid quoting p-values in section 4 and, where we do, we caution that the assumptions of the asymptotic formulae derived from Wilks' theorem are violated in our study. For example, the ALP signals in the XENON1T experiment must be positive, and the DM ALP mass parameter does not exist (is unidentifiable) for g ae = 0.
There are several ways to overcome this issue, such as Monte Carlo (MC) simulations and semi-analytic approaches [140]. Monte Carlo simulations require the fewest assumptions to compute p-values and the coverage of our confidence regions (see ref. [111] for a more general discussion). In particular, MC simulations can account for the look-elsewhere effect, which affects the DM ALP mass parameter. Such simulations are, however, computationally expensive if the ensuing p-values are small.
To estimate the required computational effort, we can use Wilks' theorem, according to which the statistical significances under consideration are between 3σ and 4.5σ (similar for the WD cooling hints). Since p-values encode how extreme an observation is compared to the null hypothesis, we are interested in events that occur with odds between 1 : 400 and 1 : 150, 000. Reliably calculating the probability of such rare events then requires significantly more simulations than the inverse of these odds.
The first step for performing MC simulations is to generate pseudo-data from the background-only hypothesis. Here, we make two simplifying assumptions to render the problem more tractable. First, when considering the DM ALP, we fix the local DM density nuisance parameter ρ 0 to its maximum-likelihood value, which only has a mild effect when η 1. Second, we always set both α b = 1 and = 1 as these nuisance parameters only have a negligible influence on the result.
We do, however, take into account the nuisance parameters for the systematic uncertainties in the R and WD measurements by averaging over them, which corresponds to computing prior predictive p-values and coverages that average over the unknown nuisance parameters (see ref. [141] for further discussion). Since we consider the background-only model, we only need to vary the nuisance parameters and uncertainties associated with the measurements (R parameter and WD pulsation period increase), and not those associated with the theoretical predictions in the ALP models.
The pseudo-data then consists of: counts in each bin for XENON1T (using a Poisson distribution), a measurement of the R parameter (Gaussian with the cut R > 0), and the WDs (Gaussian; not imposing a positivity cut since a period decrease is not inherently unphysical, even though it would be a surprising observation).
To obtain a p-value from the simulations, one needs to compute the ∆χ 2 test-statistic in eq. (4.1) using the pseudo-data. Then, the p-value can be estimated from the fraction of the simulated ∆χ 2 values that are more extreme than the observed ∆χ 2 (quoted in the main text in table 2). 11 We show the relationship between p-value and ∆χ 2 found from simulations in figure 8 for the combinations of both ALP models (left and right panels) and 11 To validate our implementation we have performed a goodness-of-fit test using the pseudo-data for XENON1T, finding that the observed data corresponds to a 2σ fluctuation (for the energy range of 1 keV -30 keV), consistent with the expectation for a χ 2 distribution with 29 degrees of freedom. We have also performed MC simulations for a hypothesis test involving the background only model and the solar ALP model (no 3 H background) for the Xe likelihood alone, where we find a significance of 3.6σ. This agrees well with the 3.4σ stated by the XENON1T Collaboration, who perform an unbinned analysis and take into account additional nuisance parameters.  H hypotheses (red and blue colours), using the Xe + R + WD likelihoods. For reference, we show results from chi-squared distributions with k degrees of freedom (χ 2 k ; grey lines) and indicate the actually observed ∆χ 2 values as vertical dotted lines. We generated and optimised about one million pseudo-data sets for each DM ALP case, and two million pseudo-data sets for each solar ALP case.
We can see that the simulated distribution of the test statistics seems to -at least in part -track some of the reference distributions shown in figure 8. For example, the solar ALP hypothesis (without 3 H) closely follows a χ 2 1 distribution while the DM ALP is similar to a χ 2 3 distribution. Introducing an additional 3 H component for both models increases the p-value for a given value of the test statistic, even though the extra parameter appears not to contribute a full additional degree of freedom.
The fact that the p-value falls more rapidly with increasing value of the test statistic than expected from a naive count of the number of model parameters (three for the solar ALP model) is not unexpected since e.g. Chernoff's theorem [142] predicts a similar deviation for the one-parameter case, where the true value of the parameter of interest lies on the boundary of the parameter space. Similar relations for multi-parameter cases also exist, as e.g. shown in ref. [143]. An additional issue is that g eff aN actually becomes unidentifiable on the background, as a non-zero value of this parameter would not have any effect on the signal in eq. (2.1). Again, this is in violation of the assumptions behind Wilks' theorem.
The situation for the DM ALP scenario is even more involved. Here, we also have three parameters, m a , g ae , and η (plus α t for the 3 H case). For the background hypothesis of g ae = 0 (where these parameters are on the boundary of the parameter space), the ALP mass m a and efficiency η are not identifiable as can be seen from eq. (2.2). For non-zero values of g ae on the other hand, we have to consider the look-elsewhere effect for the parameter m a . 12 Further note that, for Xe data alone, the signal prediction is degenerate in the combination of ηg 2 ae , which could consequently be treated as a single free parameter. Only the inclusion of other data such as the R parameter and WD data breaks this degeneracy.
While the boundaries of the parameter space and the degeneracies are expected to decrease the p-value compared to the naïve expectation, the look-elsewhere effect should increase the p-value. The fact that the p-value curves in the right panel of figure 8 approximately follow a χ 2 3 is therefore to some degree coincidental. Nevertheless, these findings imply that there is no large difference between the local significances estimated in section 4 based on Wilks' theorem and the global significances obtained from MC simulations.
Regarding the ∆χ 2 values in table 2, the calibrated p-values for Xe + R + WD likelihoods from fig. 8 imply a significance of 3.3σ for the solar ALP hypothesis (3.1σ when a 3 H background is included). Recall that the small difference between the p-values is because the outcome of this analysis mostly determined by the WD likelihoods. For the DM ALP hypothesis, we obtain a significance of 4.1σ (2.9σ when a 3 H background is included).
We can also use MC simulations to investigate the accuracy of the confidence regions shown in fig. 4. To do so, we consider the profile likelihood ratio test statistic for different values of m a and generate pseudo-data for the background only hypothesis. Since we are at most interested in the 95% CL region, we only need to accurately estimate p-values above 5% and therefore only need to generate a few thousand pseudo-data sets. For every set of pseudo-data, we first find the values of g ae and η that maximise the likelihood for each value of m a and then find the value of m a that maximises the likelihood globally. We then compare the distribution of profile likelihood ratios for each value of m a to the actually observed profile likelihood ratio as obtained from our scans of the full parameter space. The p-value is then defined as the fraction of simulations that yield a value of the test statistic larger than the one observed. The confidence interval is then given by all values of m a for which p > 1 − CL. Figure 9 shows the resulting confidence interval compared to the one from fig. 4. We find that the confidence intervals (obtained from the red histogram) are slightly wider than their counterparts from Wilks' theorem (dashed black line). This is not surprising given that Wilks' theorem does not take into account the look-elsewhere effect, which is expected to increase p-values, because random fluctuations in all parts of the Xe spectrum can be fitted.
In conclusion, we find various differences between the p-values and confidence regions obtained from MC simulations and those derived from asymptotic results such as Wilks' theorem. This result is expected, given the various conditions for Wilks' theorem -or extensions thereof -to hold. Indeed, it is well-known that the MC simulations may both under-or over-predict asymptotic results even for the simple choice of a likelihood ratio as test statistic (see e.g. ref. [111] for a more detailed discussion). In our concrete case, we deal with parameters on the boundary of the parameters space, unidentifiable parameters, and the look-elsewhere effect. In spite of all these complications, we find no qualitative differences between results obtained from asymptotic expressions and MC simulations and therefore conclude that our results are robust. Indeed, it appears that some of the additional effects included in the MC simulations partially cancel, leading to a surprisingly good agreement between our simple estimates and the results from more elaborate simulations.