A novel approach for nearly-coincident events rejection

We present a novel technique, called DSVP (Discrimination through Singular Vectors Projections), to discriminate spurious events within a dataset. The purpose of this paper is to lay down a general procedure which can be tailored for a broad variety of applications. After describing the general concept, we apply the algorithm to the problem of identifying nearly coincident events in low temperature microcalorimeters in order to push the time resolution close to its intrinsic limit. In fact, from simulated datasets it was possible to achieve an effective time resolution even shorter than the sampling time of the system considered. The obtained results are contextualized in the framework of the HOLMES experiment, which aims at directly measuring the neutrino mass with the calorimetric approach, allowing to significally improve its statistical sensitivity.


Introduction
The demand for increasing sensitivities of nowadays experiments requires the development of complex analysis tools to respond to several demands, according to the design and goals of the experiment. In many experiments, a crucial factor in achieving a high sensitivity is the ability to discriminate spurious events. This is a particularly relevant feature to keep into account for experiments where the statistics of the spurious events might be comparable with, or even overcome, the statistics of the proper events. This is the case, for instance, of the direct measurement of the neutrino mass with the calorimetric approach [1]. So far, few techniques are currently employed for the purpose of discriminating the spurious events from the proper ones, and they all require that events belonging to these two fama e-mail: matteo.borghesi@mib.infn.it (corresponding author) ilies must differ in some way from each other. In this paper we outline a novel technique, called DSVP (Discrimination through Singular Vectors Projections), based on a previous work by Alpert et al. [2]. Besides its effectiveness, one of the main advantage of this method, is that it does not rely on any particular assumption about the structure of the events, thus in principle it can be applied in various scenarios almost in a semi-automatic way. In Sect. 2 we illustrate the DSVP method while in the second part of the article (Sect. 3 and 4 ) we show an application of this technique in the context of a direct calorimetric neutrino mass measurements. In particular, we will focus on the simulated data sets representing the one foreseen for detectors used in the HOLMES experiment [3], for which the expected main source of background will be unrecognized pile-up events.

The DSVP technique
The aim of the DSVP algorithm is to discriminate as many as possible undesirable events (i.e. spurious events which differ from a reference signal) present in a given dataset, using the information about the mean 'morphology' of the events present in the dataset. In order to apply the DSVP technique, the following elements are required: -The measured dataset, M. This n×d matrix consists of the dataset of interest, where each row is an event described by d variables. Namely, the events can be seen as points in the R d space. From now on, we call the good events in the dataset A events, while B events are the ones to be rejected. We assume that the A events are more numerous respect to the B ones (N A > N B ). 1 -The expected number of B events N B that the algorithm should discard at most. -A training dataset, T, such that N A N B . The events of this n × d matrix can be distributed in a different region of R d respect to the events in M. For instance, in the case of microcalorimeter signals, the events in T can lie in a different energy range respect to the events in M.
We will use the training dataset T to define a new vector space which will help us to highlight the features that distinguish an A event from a B one. This new vector space, called from now the projections space, has dimension k, with k d. The events can be represented as points in the R k projection space, so that the A events are distributed differently respect to the B ones. The idea is to find a model (i.e. hypersurfaces) describing the distribution of the A points in M in this new space, so that the B points can be identified as the ones with a larger distance from what predicted by the model. In order to find the model we need to 'clean' the dataset first, obtaining a subset of M, M ⊂ M, which contains mostly A events at the expense of deleting also some A events. The next step is to represent the events in M in the projection space and to find the model parameters which describes the distribution of the M (∼ A) events. We then define the discrimination parameter and its threshold to recognize an A event from a B one in R k . Finally, we take the original dataset M, represent the events in the projection space, find the discrimination parameter and discard all the events that have a value of the discrimination parameter above the threshold found. The procedure (dataset 'cleaning', model and threshold definition and B discrimination) is then repeated with the survived events. At each iteration, the M dataset will contain a smaller fraction of B events. In the following sections, each steps of the algorithm are described in detail.

Raw cleaning with PCA
In the first step, the aim is to create a suitable dataset for modeling the distribution of A events in the M matrix, lowering the ratio N B /N A at the expense of deleting also A events. Knowing that N A > N B , the mean 'morphology' of the events is closer to the A ones. We can define a suitable parameter using the Principal Component Analysis (PCA) [4] to discard mainly B events. The procedure used is equal to the one described in [2], which will be reported for completeness. The singular value decomposition (SVD) [5] is computed for the n×d matrix M, which 1 See Sect. 2.1. is decomposed in a product of three matrices M = UDV T . The columns of U and V are the left and right singular vectors respectively, while the entries of the diagonal matrix D are the singular values. The singular values are ordered from 1 to d in order of importance. Only the first j < d columns of D are non-neglibile. It is convenient to define a new matrix U which contains only the first j columns of U subtracted by their means which is equivalent to centering the data matrix, as required by the PCA. The columns ofÛ are vectors of length n. Basically, they represent the projections of the mean-centered events contained in M on the right singular vectors (i.e. the columns of V, which are called principal vectors in the PCA framework) with the first column ofÛ expressing the projections on the first right singular vector and so on. The columns of V are vectors of dimension d which represent the direction of greatest variance of the data in M. Thanks to the properties of the PCA, an appropriate combination of the projections can be of use to define a parameter, called norm 2 , which indicates how close an event is to the mean 'morphology' of the events in M. The precision matrix (σ 2 ) −1 is computed from the j × j empirical covariance σ 2 =Û TÛ and it is used to evaluate the parameter norm 2 for each event i = 1, . . . , n in the matrix M Suppose that we have a guess of how many B events are expected in the dataset. We call this number N guess B . B events deviate disproportionately from the mean in this covarianceadjusted sense, so we discard those with largest norm 2 and repeat the procedure on the remaining data a total of l times, removing on the l-th iteration a number of events equal to N guess B /2 l with the largest norm 2 . In our tests, we use l = 5. The iterations guarantee that the mean morphology of the events are closer and closer to the A events each cycle, as B events are increasingly eliminated. After the PCA, we have eliminated events, where the B events are the ones predominantly discarded. The remaining events after the PCA are m = n − N PC A del . We call M the m × d the matrix of the survived events, which is mostly composed of A events.

Define a model for the A-events
To discriminate the undesirable events, we now need to define a model which describes the distribution of the A points (the ones belonging to M ) in the projection space.  Fig. 1 An example of distribution of the points in T in the projection space from Sect. 3. In this particular case, the projection space is in R 4 . We decided to set k = 2, thus describing the points distribution in R 4 with two curves: First, we need to define this space. We decompose the T matrix using the SVD. Because the training matrix T is mainly composed by A events, we assume that its first k significant right singular vectors {v 1 , v 2 , . . . , v k } can constitute a base of the projection space.
The events in M are projected onto these vectors. From now on, each event in M will be described by k < d variables, its projections onto the right singular vector of T. We indicate all the coordinates of the M points along the i-th base vector of the projection space as To describe the points distribution in the new vector space, the projections p are classified into two groups: the k independent projections, indicated as p ind and the dependent ones, p dep .
The dependent projections can be expressed as a function of the independent ones. There is no general rule to identify which projection is "independent" and which one is not, since it is related to the specific problem. The training dataset can be used to identify the dependencies among the projections, as shown in Fig. 1. The distribution of the dependent projections can now be easily described in a R k +1 subspace by a set of f curves Knowing precisely the set of curves { f }, we will be able to differentiate between the two distribution of events, because the projection of the B events will not follow the same curves as the one of the A events.
Usually the functional form of the different f is unknown. However, we can approximate each f curve with a Taylor expansion and let a (weighted) linear regression find the best parameters of the expansion. In particular, we use a modified version of the random sample consensus (RANSAC) algorithm [6]. 2 The set of curves { f } which describes the M events in the projection space is what we called the model.

Find a discrimination threshold
The difference between the measured dependent projections and the ones expected from the model is evaluated for each event in the M matrix. A residual norm is defined as In order to discriminate between the A events, the one with the lowest residual norm, and the B, we need to define a threshold value, d thr . Due to the fact that the M dataset is mainly made of A events the threshold is chosen as the highest value of d plus the standard deviation of the d distribution This threshold definition should ensure to include not only the A events in M , but also the A events in the original dataset M which were eliminated by the 'PCA cleaning' described in Sect. 2.1. Nevertheless, this definition of threshold might need to be redefined to account for the specific problem considered.

Apply the model
Now all the components to make the algorithm work are present: a base for the projection space, a set of curves to model the points distribution in that space and a discrimination threshold.
We will now use these tools on the original dataset M, namely: 1. Take the inner product of the events in M with the base of the projection space, determining p k . 2. Evaluate the residual norm d using the curves describing the A projections distributions. 3. The events with a residual norm above the threshold are discarded.
After the third step, we will have discarded N del events. The events deleted by the third step will be almost, if not all,  Figure 2 shows a visual representation of some of the steps of the DSVP technique. As an example the figure reports signals from a TES microcalorimeter, as explained in Sect. 3.2. This particular case was chosen because there are just three non-negligible singular values, therefore the points in the projection space can be easily shown on a 3D plot. The method was implemented in python, taking advantage of many of the fast modules of NumPy and SciPy. The majority of the computational time is taken by the Raw cleaning with PCA part, due to the fact that the SVD on the matrix M is performed five times for each iteration. Nevertheless, the algorithm is quite fast, taking ∼ 7 min to compute nine iterations on a matrix M composed of 1,20,000 rows and 1024 columns of float32 numbers, using only one (six years old) CPU with a base clock of 2.6 GHz.

HOLMES and pile-up discrimination
The algorithm described in Sect. 2 is now applied in the framework of HOLMES. The HOLMES experiment will perform a direct measurement of the neutrino mass with a sensitivity of the order of 1 eV, measuring the energy released in the electron capture (EC) of 163 Ho, as proposed by De Rujula and Lusignoli in [7]. It will also demonstrate the scalability of the calorimetric technique for a next generation experiments that could go beyond the current best expected sensitivity of 0.1 eV [8]. In order to reach this sensitivity, HOLMES The effect of a non-zero neutrino mass on the 163 Ho EC decay spectrum can be appreciated only in a energy region very close to the end point, where the count rate is low and the fraction of nearly-coincident events, called pile-up events, to single events is greater than one. If a pile-up event is composed of two events of energies E 1 and E 2 which occur within a time interval shorter than the time resolution of the detector, it is recorded as a single event with energy E E 1 + E 2 . Thus, if not correctly identified, pile-up events will distort the decay spectrum of 163 Ho, lowering the sensitivity to m ν . The neutrino mass sensitivity of HOLMES has been evaluated through Monte Carlo simulations [9], see Fig. 3. Once fixed the number of recorded events to 3 × 10 13 , the simulations have shown that the sensitivity on neutrino mass is not strongly dependent on the energy resolution of the detector (as long as E < 10 eV), but rather on the pile-up fraction f pp , i.e. the ratio between the number of pile-up events to single events. Its reduction is crucial for the success of the experiment. Using the terminology of Sect. 2, in the HOLMES experiment an A event is a signal caused by a single energy deposition in the microcalorimeter detector, while a B event is a signal caused by nearly coincident events. Each signal is a collection of records I i of the current flowing through the detector sampled at an instant t i = i × t samp , where t samp is the sampling time of the readout system. An example of a microcalorimeter signal is shown in Fig. 2a. With the current setup, the sampling time is fixed at 2 µs. We tested the algorithm robustness and efficiency through many simulations which aim at emulating the results expected by the HOLMES experiment. The HOLMES TES microcalorime-ters do not have the 163 Ho implanted yet, therefore a real data test will be done at later times. 163 Ho decays via electron capture to an atomic excited state of 163 Dy which relaxes mostly emitting atomic electrons (i.e. the fluorescence yield is less than 10 −3 [7]). The deexcitation energy E c spectrum probability density is proportional to

Energy spectrum and ROI definitions
where m ν is the effective neutrino mass. The pulses are generated according to the spectrum in [7] with Q = 2.833 keV [10], m ν = 0 and with energy between 2.650 keV and 2.900 keV. Second order effects like shake-up and shake-off [11] have not been considered in this work. Despite the optimal region of interest (ROI) aimed at determining the neutrino mass will be determined only when actual data will be collected, this energy range can be considered as a reasonable ROI. Each detector must be treated separately with the DSVP technique in order to account for their slightly different characteristics. Thus, to create a statistic expected for a single detector with a target activity of 300 Hz over two years of data taking, we generated 40000 (∼ 80000) single (double) pulse events. The arrival time of the pile-up pulses is uniformly distributed between 0 and 10 µs.

Detector models
For this study we modeled three different TES microcalorimeters with the one-body model [12] or with the two-body dangling model [13]. In both cases the current pulse profile is obtained by solving the system of the electro-thermal differential equations applying the fourth-order Runge-Kutta method (RK4) and considering the transition resistance as proposed by [14] for taking into account the TES non-linear behavior. To these pulses a noise waveform, generated as an autoregressive moving average ARMA (p,q) process with a power spectrum given by the Irwin-Hilton model, is added. To test the DSVP effectiveness with slightly different signal shapes, the physical parameters in the differential equations are chosen to describe three types of detectors (Fig. 4): (a) the detectors in [2] which are characterized by a non linear response and one thermal body. (b) the target detectors of HOLMES [15] have nearly-linear response and behave according to a two thermal body model. The detectors of type (b) are the most promising for the HOLMES goals. Therefore, for this detector three different configurations were tested, in which the rise time was changed, adjusting the inductance of the circuit and keeping the other detector parameters constant.

DSVP and HOLMES
We indicate the ratio between the number of pile-up pulses and the number of single pulses in the ROI as f RO I pp . From simulations, setting a time resolution of 10 µs a value of f RO I pp 2 is expected. In order to apply the algorithm, the M matrix, which contains the ROI events, must have N A > N B , thus f RO I pp needs to be lowered below one. To reduce this ratio many different strategies can be adopted. In the following a non exhaustive list is reported.
-Adding an additional calibration source By adding a source characterized by a monochromatic X-ray emission in the ROI, the number of single pulses in the ROI can be increased while keeping the number of pile-up pulses unchanged. This approach can be very useful because it reshapes the energy spectrum, potentially reducing the probability of discarting single events with energy very Before evaluating their projection on the T right singular vectors, the T and ROI events are normalized to set their amplitude equal to one. Then, we define the region in the k-space in which the T events are distributed. We increase it by a little amount in order to account small non-linearity effects. Finally, we select only the events in the ROI included inside this region. This method can achieve good time resolution, but it works only if the detector response does not depart from linearity too much, so in our simulation in detectors (b), (c) but not (a) . -Filtering Few filtering techniques allow to achieve effective time resolution close to the sampling time. Among these, a particular Wiener filter, as described in [16], is probably the best technique to achieve this goal.
For the HOLMES purposes, in order to fulfill the N A > N B condition in the ROI the most suitable and practical method are the 'wiener filter' and the 'volumetric cuts'. As indicated in Table 1, applying these algorithms to the ROI events, the f RO I pp can be reduced around 0.6. Most of our simulations are therefore aimed to test their applications. Nevertheless, in Sect. 4.2 the performance of the DSVP technique with an external calibration source is also shown. As shown in Fig. 5, we use the events at M1 peak (E ∼ 2 keV) as training region T. This is also the energy range in which the average signal for the Wiener filter is defined. The M1 peak is the most suitable region for two reasons: it is the peak  closest to the ROI, thus reducing the non-linearity effects on the filters and on the discrimination algorithm and it fulfills the condition of N B N A . The f pp in this sector is expected to be of the order of ∼ 10 −3 , which can be further reduced with a raw cleaning with PCA, as described in Sect. 2.1.

Simulation results
To quantify the efficiency of the pile-up discrimination algorithms, we define an effective time resolution τ e f f as the ratio of the number of retained piled-up records to singlepulse records after the algorithm divided by the same ratio referred to raw data, times 10 µs.
Through simulations similar to the ones in [9], [17], we have preliminary estimated that even a small fraction of false negative modifies the single events spectrum and leads to a systematic error on the neutrino mass evaluation. We note that in our simulations no single pulse event was mistaken as pileup. The DSVP technique described in Sect. 2 is designed to leave unaffected the A events. In applications where a more robust discrimination of the B events is required, it is possible to adapt the algorithm toward this goal, for example by adjusting the threshold definition (Eq. 6), at the expenses of increasing the chance of deleting some A events. The energy dependence of the method must be assessed for each specific application. In general, we would say that the events with the energies further away than the mean energy in the dataset M are most likely to become false positive. In the end, the number of false positives is due to the threshold value (Eq. 6), while their nature is due to the mean "morphology" of the events present in the original dataset M.

DSVP with Wiener filter and Volumetric cuts
We have estimated the τ e f f on the simulated data processed with the DSVP after lowering the initial f RO I pp using the 'wiener filter' or the volumetric cut techniques. Furthermore, before being processed by the DSVP algorithm, the signals where also whitened, i.e. transformed to whiten noise by a fast Cholesky-factor backsolve procedure [18]. The results are reported in Table 1. All the simulations showed that the DSVP is able to reach a time resolution lower than the sampling time of the signal. Table 1 shows that the time resolution is strongly dependent on the sampling time, the faster the better, but also on the rise time of the pulse. While the sampling frequency is constrained by the readout resources, there is more scope to change the rise time of the detectors, acting on the electrical time constant of the biasing circuit. Changing the rise time by a factor two may be achieved reducing the inductance of the TES circuit by a similar factor. This change the noise spectrum too but usually it does not worsen the energy resolution. Also, the non-linear detector response generally improves the efficiency of pile-up recognition algorithms. When two near-coincident energy depositions happen inside the TES, the detector will have different starting conditions. The shape of the pile-up pulse will be much more different from the single pulse for a non-linear TES than for a linear one, thus allowing the algorithms to recognize them more efficiently. As we stressed in Sect. 2, the only external parameter required by the DSVP algorithm is the number of events that it should discard at most, N B . To quantify the influence of this parameter on the effectiveness of the algorithm, we fixed the dataset M and varied N B , computing the effective time resolution each time. Figure 6 shows that no false positive was detected even if we get the number of event to eliminate wrong up to 50%.

DSVP with additional calibration source
We have also tested if the performance of the DSVP technique remains unchanged reducing the f RO I pp by adding an external source of single events with energy inside the region of interest instead of using preliminary filters. We added a source from Lα x-ray emission lines of Pd (2.833, 2.839 keV). Figure 7 shows that increasing the number of photons of the Pd source (thus decreasing f RO I pp ) the effective time resolution of the DSVP improves. Moreover, τ e f f always remains below the sampling time even for a pile-up fraction up to 0.9.

Conclusions
The DSVP algorithm represents a very powerful technique to decrease the number of undesirable events in a dataset. In this work we have applied this algorithm for pile-up discrimination, which can lead to major improvement in experimental sensitivity for experiments such as HOLMES (neutrino mass measurement) or CUPID (0νββ) [19]. It can also be useful to recognize single-site events of the 0νββ interactions from multi-site background events in GERDA [20]. We tested the DSVP technique for the HOLMES application and we compared its efficiency, represented in this case by the effective time resolution τ e f f , to more 'classical' discrimination techniques, resulting in a better time resolution. With the target detector of HOLMES, the DSVP techniques allows us to reduce the total fraction of pile-up events from 10 −3 (∼ τ e f f 3 µs) to 10 −4 (∼ τ e f f 1.5 µs), thus improving the neutrino mass sensitivity from 2 eV to about 1.4 eV. To put this result in perspective, achieving the same improvement would require to increase the acquisition time by a factor 4: from 3 to 12 years.

Data Availability Statement
This manuscript has no associated data or the data will not be deposited. [Authors' comment: All the relevant data and the necessary informations are reported in the figures.] Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/. Funded by SCOAP 3 .