1 Introduction

The nature of dark matter is one of the most compelling mysteries in both cosmology and particle physics. One of the foremost possibilities is that dark matter is a new fundamental particle that is yet to be discovered. There is currently an extensive experimental effort to try to detect the particle nature of dark matter through its interactions with Standard Model particles [1, 2]. Among the leading dark matter candidates are Weakly Interacting Massive Particles (WIMPs), a generic class of particles that includes, for example, the supersymmetric neutralino. These particles interact gravitationally and through the weak force, and their expected masses range from \(\mathcal {O}(1)\) GeV to \(\mathcal {O}(10)\) TeV. More than thirty years ago, Refs. [3, 4] proposed a mechanism for detecting weakly interacting particles, including WIMPs, via coherent elastic scattering with nuclei. Soon after, the dark matter detection rates in the context of a Galactic Halo of WIMPs were computed for the first time, and it was proposed that they could be differentiated from background by looking for an annual modulation of the signal [5].

The DAMA collaboration has constructed one of these “direct detection” experiments to search for the annual modulation of nuclear recoils due to dark matter scattering within NaI scintillation detectors. The original detectors were deployed in 1995 and consisted of 100 kg of NaI [6]. The experiment went through an upgrade in 2002 that included increasing the detector mass to 250 kg and improving the photo-multiplier tubes that detect the scintillation light from the nuclear recoils. Through the seven years that the original DAMA/NaI experiment ran, the total exposure was \(1.08\times 10^5\) kg-days [7, 8]. The upgraded DAMA/LIBRA experiment collected a larger exposure of \(3.80\times 10^5\) kg-days over the seven years it took data [9,10,11]. The DAMA collaboration has identified at a 9.3\(\sigma \) confidence level an annual modulation in their event rate that is consistent with a dark matter signal[12,13,14,15,16,17,18,19]. Thus far, no plausible alternative explanation for the modulation signal observed in the DAMA detectors has been universally accepted [20]. Meanwhile, there are three upcoming NaI experiments that will test the DAMA modulation results: SABRE [21], which will consist of two identical experiments located in the northern and southern hemispheres; ANAIS-112 [22, 23], which, as of April 2017, is commissioning at the Canfranc Underground Laboratory; and COSINE-100 [24], a joint effort between the DM-Ice [25] and KIMS [26] collaborations, which has been running at Yangyang Underground Laboratory since September 2016. Furthermore, DAMA/LIBRA has been running in an upgraded phase 2 configuration since January 2011 [27], and may also shed light on past DAMA modulation results [19].

In this work we examine the amplitude of the modulation signal observed by the DAMA collaboration, focusing on the possibility of any time dependence of the amplitude over the 14 years of operation of the DAMA/NaI and DAMA/LIBRA experiments. We find that descriptions of the modulation amplitude as decreasing with time are preferred over a constant modulation amplitude at anywhere between \(1\sigma \) and \(3\sigma \), depending on the phenomenological model for the time dependence and the signal energy range considered.

In Sect. 2 we remind the reader of the expected form for the modulation signal. In Sect. 3 we present our analysis of the modulation amplitudes and their potential time dependence. We discuss the results and draw conclusions in Sect. 4.

2 Annual modulation

If dark matter is composed of a new, as-yet-unknown particle with weak interactions with ordinary matter, then dark matter particles in the galactic halo will pass through the Earth and occasionally scatter on nuclei, causing a nucleus to recoil. Dark matter direct detection experiments aim to observe these rare, low-energy nuclear recoils induced by collisions with a dark matter particle [4]. The diffuse dark matter halo has little bulk rotation so, from the perspective of the Earth as it orbits about the center of the galaxy (along with the rest of the galactic disk) and through the halo, there is a wind of dark matter approaching from the direction of the disk rotation. The form of the expected signal (measured in counts per mass per time per energy) in a direct detection experiment is often written as

$$\begin{aligned} \frac{dR}{dE}(E,t) = S_0(E) + S_m(E) \cos [\omega (t-t_0)] + \ldots , \end{aligned}$$
(1)

where \(\omega \equiv 2\pi \)/year, \(S_0\) is the time-averaged differential recoil rate, \(S_m\) is the annual modulation amplitude, and higher-order terms in the expansion are suppressed. The phase of the expansion \(t_0\) is approximately the time of year at which the Earth is moving fastest into the dark matter wind, around the beginning of June for the Standard Halo Model (SHM), a common first-order approximation to the dark matter halo [5, 28]. For a recent review of status of the search for the annual modulation signal from dark matter including alternatives to the SHM, see Ref. [29].

Table 1 Best-fit modulation amplitudes for the DAMA/NaI, DAMA/LIBRA, and combined exposures. Energy-interval-averaged efficiencies are calculated from the overall efficiencies in Fig. 17 of Ref. [6] and Fig. 26 in Ref. [30]. Modulation amplitudes for the 2–4, 2–5, and 2–6 keVee energy ranges are fits from the DAMA collaboration, taken from Refs. [8, 11, 30]. (\(\dagger \)) Modulation amplitudes for the 4–6, 4–5, and 5–6 keVee energy ranges are estimates derived from the existing analysis intervals, as described in the text (caveats apply)

3 Analysis

3.1 DAMA/NaI vs. DAMA/LIBRA

In this subsection, we gather the best fit modulation amplitudes for the DAMA/NaI and DAMA/LIBRA exposures, as well as the combined best fit amplitudes for the entire 14 years of operation for which the modulation amplitudes have been published. The DAMA collaboration provides the best fit modulation amplitudes for the 2–4, 2–5, and 2–6 keVee energy ranges. However, we would like to explore the signal in more detail, specifically the higher end of the energy range.

An estimation for the modulation amplitudes in the 4–6, 4–5, and 5–6 keVee energy ranges, which are not provided in the DAMA collaboration publications, can be obtained from the known modulation amplitudes available from the collaboration using the following procedure. For each energy interval, there is a modulation amplitude (\(S_m\)), as well as an average efficiency for the detector to detect a recoil (\(\varepsilon \)). If this interval is split into two ranges, one can write

$$\begin{aligned} \varepsilon S_m=f_1\varepsilon _1 S_{m_1}+f_2\varepsilon _2 S_{m_2}, \end{aligned}$$
(2)

with the subscripts 1 and 2 referring to the two energy ranges, and \(f_{1,2}\) representing the fraction of the energy interval that each of the two ranges covers. For example, \(S_m\) and \(S_{m_1}\) could be the published 2–5 and 2–4 keVee amplitudes, and \(S_{m_2}\) could be estimated using Eq. 2. This relationship comes from the fact that the DAMA experiment is fundamentally a counting experiment with the number of events in each range proportional to the total exposure. Similarly, the uncertainty (\(\sigma \)) can be expressed as

$$\begin{aligned} (\varepsilon \sigma )^2=(f_1\varepsilon _1 \sigma _1)^2+(f_2\varepsilon _2 \sigma _2)^2, \end{aligned}$$
(3)

with \(\sigma _{1,2}\) representing the uncertainty in each of the relevant energy ranges. This procedure assumes the data for the different energy ranges modulate with the same period and phase but statistically independent, i.e. the 2–5 keVee fit is the appropriately weighted average of fits to the 2–4 and 4–5 keVee ranges. In the case where the phase and period are fixed, this assumption is valid. When the phase and period are allowed to vary, however, the assumption is not strictly correct. In looking at the fits provided by the collaboration in Tables 3 and 4 of Ref. [11], we see that the largest difference in modulation amplitude between the fixed versus free fits is at the \(\sim 0.5\sigma \) level. We thus conclude that the assumption that the energy bins modulate with the same period and phase is approximately correct to quite good accuracy. This caveat should be kept in mind, however, when examining our results for the derived data in the higher energy ranges.

Table 1 presents the results that have been released to date by the DAMA collaboration along with our estimated data for the higher energy ranges. The best-fit modulation amplitudes are listed under the assumption of a fixed period and phase, as well as allowing for a free period and phase. The modulation amplitudes displayed for the 2–4, 2–5, and 2–6 keVee energy ranges are the fits performed by the collaboration as found in Refs. [8, 11, 30], while the modulation amplitudes for the 4–6, 4–5, and 5–6 keVee energy ranges, as well as the efficiencies, are calculated as explained above, and we use the data in Refs. [6, 30] to calculate the average efficiencies.

In comparing the amplitudes for DAMA/NaI to DAMA/LIBRA as shown in Table 1, we find that the amplitudes decrease in every energy range. This points towards a possible inconsistency in the results between the two incarnations of the experiment. To test the hypothesis of a time-varying modulation amplitude, we begin by examining the compatibility of the DAMA/NaI and DAMA/LIBRA fit results of Table 1, under the assumption of a common modulation amplitude. In Table 2, we present the minimum \(\chi ^2\) and corresponding p-value for a \(\chi ^2\) goodness-of-fit, as well as the best-fit common amplitude. We find the 2–6 keVee energy range is discrepant at the \(\sim 3\sigma \) level for both the fixed and free period and phase. Our estimated data indicates that the majority of the cause of the poor fit is due to events with energies \(\gtrsim 4\) keVee, as the 4–6 keVee interval is similarly discrepant. Note that the best fit modulation amplitudes in Table 2 match very well the values presented by the collaboration for the combined fits in the third column of Table 1, as expected.

Table 2 Compatibility of the DAMA/NaI and DAMA/LIBRA fit results of Table 1, under the assumption of a common modulation amplitude. The minimum \(\chi ^2\) and corresponding p-value for a \(\chi ^2\) goodness-of-fit are shown, as well as the best-fit common amplitude. (\(\dagger \)) Data for these energy intervals are derived as discussed in the text

We also collect the fits for the per-cycle modulation amplitudes as performed by the DAMA collaboration for the free period and phase in Table 3. Unfortunately the fits to the data under the assumption of a fixed period and phase have not been released to the public, and we are thus unable to analyze the data in that case. The mean cycle times and amplitudes are taken from Figure 3 of Ref. [11] and exposures are given in Refs. [8, 11]. The mean time for each cycle is relative to January 1, 1995, the first year of DAMA/NaI’s operation.

Table 3 The per-cycle modulation amplitudes as determined by the DAMA collaboration, allowing the period and phase to be freely fit in addition to the amplitude. Mean cycle times and amplitudes are taken from Figure 3 of Ref. [11] and exposures are taken from Refs. [8, 11]. The mean time for each cycle is given relative to January 1, 1995, the first year of DAMA/NaI’s operation

The DAMA Collaboration has employed “run tests” as a check for consistency of the data between the different years of the experimental data. There are many different possible run tests that can be performed on a given set of data to check for consistency. In this context, a run is a group of consecutive identical elements in a two-valued random sequence constructed from the data. A run test checks for runs in the data, in terms of a two-valued characteristic one can assign to each data point. The collaboration has performed run tests to determine whether all data are consistent with the best fit value for the amplitude, and found lower tail probabilities of 41, 29, and 23% for the 2–4, 2–5, and 2–6 keVee energy ranges, respectively. We have repeated the DAMA collaboration run tests and reproduced their results (see Table 4; Above and Below Best Fit). Here, we are interested in the consistency between the two iterations of the DAMA experiment, NaI and LIBRA, and we perform several other tests of interest as described below.

For the DAMA data, there are a total of 7 data points for each of the two iterations of the experiment (for each energy bin). To check whether the NaI and LIBRA data are consistent with each other, we rank the full 14 measurements from low to high, keeping track of which experiment made the given measurement. The number of runs is then one plus the number of times in this list the subsequent element changes from one version of the experiment to the other. As an example, when the amplitude values for the 2–6 keVee energy range are sorted in this way, the list would be (LIBRA, LIBRA, LIBRA, LIBRA, LIBRA, NaI, LIBRA, LIBRA, NaI, NaI, NaI, NaI, NaI, NaI), giving a total of four runs in this list. In this version of the run test, the null hypothesis would be that the two distributions are equal, i.e. that the DAMA/NaI and DAMA/LIBRA experiments are measuring the same amplitude in a given energy bin. The alternative hypothesis is that the two distributions are not equal.

Under the null hypothesis, i.e. that the two populations of data are drawn from the same distribution, the probability of obtaining a certain number of runs can be calculated using

$$\begin{aligned} P(R=2k)=2\frac{\left( \begin{array}{c} n_1-1\\ k-1 \end{array}\right) \left( \begin{array}{c} n_2-1\\ k-1 \end{array}\right) }{\left( \begin{array}{c} n_1+n_2\\ n_1 \end{array}\right) }, \end{aligned}$$
(4)

where R is the number of observed runs, \(n_1\) is the number of values from one population or experiment, and \(n_2\) is the number of values from the other population or experiment, and k is an integer that gives the appropriate number of runs. Similarly, if the number of runs is odd, the probability of obtaining that number of runs can be written as

$$\begin{aligned} P(R=2k+1)=\frac{\left( \begin{array}{c} n_1-1\\ k \end{array}\right) \left( \begin{array}{c} n_2-1\\ k-1 \end{array}\right) +\left( \begin{array}{c} n_2-1\\ k \end{array}\right) \left( \begin{array}{c} n_1-1\\ k-1 \end{array}\right) }{\left( \begin{array}{c} n_1+n_2\\ n_1 \end{array}\right) }. \end{aligned}$$
(5)

The lower tail probability can then be computed by summing the probability of two runs (there must be at least two runs) up to the actual number of runs observed. To be concrete, there are four runs in the case of the 2–6 keVee energy bin, so the probability would then be summed for obtaining, two, three, and four runs. The results for this run test applied to the data in Table 3 are collected in Table 4 under the heading of Ranked by Experiment. Of all of the possible run tests, this one is the most relevant to the question at hand. This test directly examines if the two iterations of the experiment are consistent with one another. We find that the null hypothesis for the 2–6 keVee energy range would be rejected at the \(2\sigma \) level, while the 2–4 and 2-5 keVee energy ranges have p-values of 30 and 21%, respectively.

Other possible run tests focus on fluctuations above and below a reference point, i.e. the median, mean, best fit point, etc. In this case, the measurements are put in an array in the order in which they were measured. Then a new list is formed where at each element, a ‘+’ sign or ‘−’ is placed depending on if the measurement is above or below the reference point. The number of runs are then counted in this list and the probabilities calculated exactly as in the previous case with \(n_1\) and \(n_2\) now representing the number of points above and below the reference value. The null hypothesis for this run test is that the data points are randomly fluctuating about the reference value. A run test of this type with reference to the best fit point is employed by the DAMA collaboration, and we reproduce their p-values. The results for these run tests are presented in Table 4. These tests all point towards consistency with the assumption of randomly fluctuating about the reference value. Although the Ranked By Experiment test is the most relevant to our discussion, the run tests in aggregate produce inconclusive results.

Table 4 Results for the run test as presented by the collaboration and for the tests we have performed as described in the text. The corresponding p-value is for a hypothesis test where the null hypothesis is taken to be the case where the two iterations of the experiment are drawing measurements from the same distribution
Fig. 1
figure 1

The deviations of the modulation amplitude in each annual cycle for a given energy range compared to the average in that range. \(S_{m,k}\) and \(\sigma _k\) are the amplitude and uncertainty in the kth annual cycle, respectively, and \(\left\langle S_m\right\rangle \) is the average amplitude in the particular energy range for the entire 14 year period

Due to the inconclusive results of the run tests, we also utilized the Kolmogorov–Smirnov (KS) test for the two data sets. The KS test is another non-parametric test in which the null hypothesis is that two data sets are drawn from the same distribution. As can be seen in Table 4, we find that the null hypothesis has p-values of 21 and 5% in the 2–4 and 2–5 keVee energy ranges, respectively, and that it is rejected at the \(2.4\sigma \) level in the 2–6 keVee energy range. The KS test and the most relevant run test both point towards a possible inconsistency in the 2–6 keVee energy range. One possible explanation for these results, which will be explored in the remainder of this paper, is that the modulation amplitude is changing (seemingly decaying) with time.

3.2 Annual cycles

Given the apparent decrease in the modulation amplitude from DAMA/NaI to DAMA/LIBRA (the amplitude decreases in every energy range in Table 1), it is interesting to investigate in detail the time dependence of the modulation amplitude. Here, we postulate several phenomenological models for the time dependence of the modulation amplitude, and perform likelihood fits to investigate whether or not any model is favored by the data.

A visual representation of the deviations of the amplitude in each annual cycle from the average is shown in Fig. 1. For each energy interval considered, we plot the deviation in each annual cycle as \(\frac{S_{m,k}-\left\langle S_m\right\rangle }{\sigma _k}\) where \(S_{m,k}\) and \(\sigma _k\) are the amplitude and uncertainty in the kth annual cycle, respectively, and \(\left\langle S_m\right\rangle \) is the uncertainty-weighted average amplitude in the particular energy range over the entire 14 year period. We observe that all the published data (in the 2–4, 2–5, and 2–6 keVee energy ranges) fall within \(2\sigma \) of the average. A Shapiro–Wilk test of normality of the deviations gives p-values of 73, 10, and 58% for these published energy ranges, respectively. Low p-values point to a non-normal distribution of the published deviations. The most striking feature in the plot is the fact that among the 14 annual cycles, there are three \(>2.8\sigma \) departures from the average in the 5–6 keVee energy range. If the data in this range were gaussian distributed, then the probability of this occurring is \(2\times 10^{-5}\,(4.2\sigma )\). This seemingly unlikely situation is an indication that there may be a problem with the data in this energy range.

To describe a potential time dependence of the modulation amplitude, we have performed likelihood tests for the following models: a constant modulation amplitude, separate constants for NaI and LIBRA, linear in time, exponential in time, and a broken exponential (a common exponential scale, but independent normalizations for NaI and LIBRA). We estimate the per-cycle modulation amplitudes for the 4–6, 4–5, and 5–6 keVee energy ranges using Eqs. (2) and (3) and these values are shown in Table 5. Fits for these models to the per-cycle DAMA modulation amplitude data from Tables 3 and 5 are presented in Table 6. The minimum \(\chi ^2\) and corresponding p-value for a \(\chi ^2\) goodness-of-fit are shown. Best-fit parameters are found in the final column and parameters not shown are profiled over. We have highlighted in bold any hypotheses that are discrepant at \(\ge 2.5\sigma \).

Table 5 The per-cycle modulation amplitudes for the 4–6, 4–5, and 5–6 keVee energy ranges, derived from the data in Table 3 as described in the text. Exposures and mean times for each cycle are the same as in Table 3

Based on the goodness of fit test, we find that all the models we investigated (including the single constant amplitude) provide reasonable fits to each of the energy ranges, with the exception of the 5–6 keVee range. This range is excluded at nearly \(4\sigma \) by the goodness of fit test for every model we studied. This is the quantitative conclusion regarding the odd behavior in the 5–6 keVee range observed in Fig. 1.

Finally, we compared models (hypothesis tests) against a single, constant modulation amplitude taken as the null hypothesis. The results are displayed in Table 6 with the improvement in fit given by the \(\Delta \chi ^2\) along with the corresponding p-value. These tests quantify the level of improvement in the fits above the amount expected due to adding additional parameters, assuming the null hypothesis is correct. The improvements in fit for the 2–4 keVee energy range are marginal (\(1.0\sigma \) to \(1.6\sigma \)), indicating consistency in this energy range. We determine, however, that all time-varying models we investigated are preferred over a constant amplitude in the 2–6 keVee energy range at levels from \(2.3\sigma \) to \(2.7\sigma \). As the 5–6 keVee data may be somewhat suspect, we find that the alternative models are still preferred over a constant amplitude in the 2–5 keVee range at the level of \(2.0\sigma \) to \(2.7\sigma \). Although the data clearly indicate that some form of time dependence is preferred, none of the alternative models is clearly better than the others. Unfortunately, the data is not capable of distinguishing between the different functional forms proposed here to model the time dependence of the modulation amplitude.

In comparing the results from the constant amplitude from Tables 2 and 6, one might come to the conclusion that the results are contradictory. A constant amplitude is inconsistent at the 1–3\(\sigma \) level in Table 2, whereas the inconsistency is much less in 2–4, 2–5 and 2–6 keVee ranges in Table 6. One possible explanation for this result is that the inconsistencies are statistically diluted by the additional degrees of freedom that are included in Table 6. This explanation is supported by the fact that the values for the \(\Delta \chi ^2\) test for broken constant model shown in Table 6 are quite consistent with the values from Table 2. These two results are the most direct comparison between the two tables.

Table 6 Fits of the per-cycle DAMA modulation amplitude data of Table 3 and Table 5. The minimum \(\chi ^2\) and corresponding p-value for a \(\chi ^2\) goodness-of-fit are shown, followed by the \(\Delta \chi ^2\) and corresponding p-value for a hypothesis test where the conventional constant modulation amplitude is taken as the null hypothesis. Best-fit parameters are shown in the final column; parameters not shown are profiled over

4 Discussion and conclusions

In this study, we have analyzed the annual modulation data released by the DAMA collaboration that they collected over 14 annual cycles and performed likelihood analyses for several phenomenological models to explore a possible time dependence in the modulation amplitude. Although a constant modulation amplitude is already a good fit to the data, our results indicate that all of the models we examined are preferred over a constant amplitude at up to \(\sim 3\sigma \). Although the data clearly prefer some form of time dependence, they are currently incapable of distinguishing between the different functional forms for the time dependence we investigated.

Though we have identified the fact that the modulation amplitude is discrepant at the 2–3\(\sigma \) level for each of the two versions of the experiment in all but the 2–4 keVee energy range, we do not propose a physical explanation for the phenomenon. More data will certainly aid in determining whether this is a real effect or a statistical fluctuation. The DAMA/LIBRA experiment has undergone a significant upgrade, and has been taking data in the DAMA/LIBRA-phase 2 configuration since January 2011 [27]. In November of 2010, all of the photo-multiplier tubes (PMT’s) in the DAMA/LIBRA experiment were replaced by high quantum efficiency PMT’s [31]. The anticipated lower thresholds (\(\sim 1\) keVee) will give access to a new signal range that should help to clarify the dark matter interpretation of the DAMA signal, as described in Ref [19].

It is also interesting to note that the half life for the exponential decay model in the 2–5 keVee energy range is about 11 years. If the collaboration releases data again in the near future, both the linear and exponential decay models would predict a noticeable decrease in the modulation amplitude, somewhere in the neighborhood of 50%. If this trend continues, this would have serious implications for the dark matter interpretation of the DAMA modulation.

In this study we have attempted to answer the question of whether the DAMA collaboration data demonstrates any time dependence, and the answer to that question appears to be “yes,” at the \(2-3\sigma \) significance level in all but the 2–4 keVee energy range. A more conclusive answer will of course require additional data, which we look forward to in the near future, from the DAMA collaboration as well as from ANAIS-112, COSINE-100, and SABRE, each of which will test the DAMA modulation results. Furthermore, given the questions raised here as well as the tension of the DAMA results with results from other dark matter direct detection experiments, we urge the DAMA collaboration and other experimental collaborations to make enough of their data public (i.e. providing time-stamped events) in order for researchers to repeat their analyses to encourage scrutiny and, with any luck, a consistent explanation for the observed phenomena, dark matter or otherwise.

The Mathematica code that we have used to complete our analysis is freely available on the arXiv.