We perform a number of signal processing steps on the raw waveforms before passing them to the neural networks for further analysis. The neural network analysis consists of two stages: in the first stage, the autoencoder extracts features from all preprocessed waveforms, without requiring event labels, and stores them in a low-dimensional feature vector. In the second stage, the feature vectors of labeled events are passed to the classifier network.
To train both networks, the total dataset is split into 60% for training and 40% for testing. Both the autoencoder and classifier networks are then trained and evaluated on subsets of these two datasets.
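A minimal sketch of such a split is given below, assuming the waveforms are stored in a NumPy array; the helper name and the fixed seed are illustrative choices, not part of the analysis:

```python
import numpy as np

def split_dataset(waveforms: np.ndarray, train_fraction: float = 0.6, seed: int = 0):
    """Randomly split the events into 60% training and 40% test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(waveforms))
    n_train = int(train_fraction * len(waveforms))
    return waveforms[indices[:n_train]], waveforms[indices[n_train:]]
```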
Preprocessing
The raw charge-waveforms, digitized from the core electrode, span a time of 20 \(\upmu \)s (Fig. 3a). In the first preprocessing step, the baseline is subtracted and the waveforms are normalized to their total charge (Fig. 3b). The normalized waveforms are then aligned in time so that all of them reach a value of 0.5 at the same sample number. We then trim the waveforms to a symmetric 1 \(\upmu \)s window around the alignment point (Fig. 3c). The aligned and trimmed waveforms consist of 256 samples that cover the entire rise of the signal. Finally, a differentiation step is performed to obtain the current-waveform from the charge-waveform (Fig. 3d), where individual peaks correspond to spatially separated energy depositions in the detector. Thus, events with one distinguishable peak are assumed to be SS events, while events with multiple peaks are MS events.
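These steps can be summarized in a short NumPy sketch. The baseline and plateau estimates (means over the first and last 100 samples) and the assumption that the signal rise lies well inside the trace are illustrative choices not specified in the text:

```python
import numpy as np

def preprocess(raw: np.ndarray, window: int = 256) -> np.ndarray:
    """Turn one raw charge waveform into an aligned, trimmed current waveform."""
    wf = raw - raw[:100].mean()           # subtract the baseline (first 100 samples, assumed)
    wf = wf / wf[-100:].mean()            # normalize to the total charge (plateau, assumed)
    t_align = int(np.argmax(wf >= 0.5))   # first sample at which the waveform reaches 0.5
    half = window // 2
    wf = wf[t_align - half : t_align + half]  # symmetric 256-sample window around the alignment point
    return np.gradient(wf)                # differentiate: charge waveform -> current waveform
```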
Autoencoder
After preprocessing, the resulting current-waveforms are used as input to the autoencoder. The autoencoder is a convolutional neural network. Its layout consists of two parts, shown in Fig. 4: the encoder extracts the important features from the waveform, storing them in a low-dimensional feature vector, and the decoder attempts to reconstruct the waveform from the feature vector.
In the encoder, a convolutional layer first performs two convolutions with trainable filters on the 256-dimensional input vector, producing two vectors of the same length. Both filters have a length of nine samples plus a constant bias. The convolutional layer is followed by an activation, applying a rectified linear unit, \(\mathrm {ReLU}(x) \equiv \max (0,x)\), to each value x of its input. Next, a pooling operation quarters the time resolution from 256 to 64 samples by picking the maximum value of every four neighbouring samples across the vectors. A fully connected layer then transforms the reduced vectors into a low-dimensional vector. This operation is implemented as a matrix multiplication where all entries of the \((2 \cdot 64 \times \text {feature vector dimension})\) matrix are trainable. Another ReLU is applied to produce the feature vector. The combination of convolution, activation and pooling has become common in computer vision research, where 2D convolutions on images are employed instead of 1D temporal convolutions [14]. Since key operations of the encoder depend on trainable parameters, the encoding step is flexible and can represent a wide variety of possible mappings. All trainable parameters are randomly initialized before training. It is not possible to predict what information each individual entry of the feature vector will represent after training, and there is no obvious interpretation of the feature representation.
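As an illustration, this encoder could be realized in PyTorch as sketched below. The filter count, filter length, pooling factor and fully connected layer follow the description above; the padding mode (chosen so that the 256-sample length is preserved) and the framework itself are assumptions:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: convolution -> ReLU -> max pooling -> fully connected -> ReLU."""
    def __init__(self, feature_dim: int = 7):  # seven features, as discussed below
        super().__init__()
        self.conv = nn.Conv1d(1, 2, kernel_size=9, padding=4)  # two length-9 filters plus bias
        self.pool = nn.MaxPool1d(4)                            # 256 -> 64 samples
        self.fc = nn.Linear(2 * 64, feature_dim)               # (2*64 x feature dim) trainable matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 256) preprocessed current waveforms
        h = torch.relu(self.conv(x))
        h = self.pool(h)
        h = h.flatten(start_dim=1)        # concatenate the two reduced vectors
        return torch.relu(self.fc(h))     # the low-dimensional feature vector
```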
Trials established that seven parameters in the feature vector are sufficient as input to a lightweight classifier network. Seven parameters are also enough to ensure that all waveforms are reconstructed with sufficient accuracy and that the training converges reliably. The encoder, with only two hidden layers, proves to be powerful enough to capture the underlying structure of the waveforms, as will be discussed in Sect. 4.
The layout of the decoder mirrors the encoder: it consists of a fully connected layer followed by a ReLU activation, a fourfold upsampling operation and a deconvolution. The goal of the decoder during training is to reconstruct the original waveform from the feature vector. The mean squared error (MSE) is used to measure the accuracy of the reconstruction and serves as the loss function to be minimized during training:
$$\begin{aligned} L_{{\text {MSE}}} = \frac{1}{2NM} \sum _{n \in \mathcal {D}} \sum _{i \in \mathcal {S}} \left( x_{n,i} - x_{n,i}^* \right) ^2, \end{aligned}$$ (1)
where \(\mathcal {D}\) denotes the training data containing N events, \(\mathcal {S}\) the set of the \(M=256\) sample indices of each waveform, and \(x\) and \(x^*\) represent a value of the reconstructed and the original waveform, respectively. Both encoder and decoder are trained together as a single network that tries to reproduce the input waveform. This way, the encoder learns to extract the information from the waveform that yields the most faithful reconstruction, focusing on the underlying structure rather than the noise of the signal.
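A matching decoder and a joint training step might look as follows, continuing the encoder sketch above. The upsampling mode, the optimizer and the use of PyTorch's built-in MSE loss (which corresponds to Eq. 1 up to the constant factor 1/2) are assumptions:

```python
class Decoder(nn.Module):
    """Mirror of the encoder: fully connected -> ReLU -> upsampling -> deconvolution."""
    def __init__(self, feature_dim: int = 7):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 2 * 64)
        self.up = nn.Upsample(scale_factor=4)                    # 64 -> 256 samples
        self.deconv = nn.ConvTranspose1d(2, 1, kernel_size=9, padding=4)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc(z)).view(-1, 2, 64)
        return self.deconv(self.up(h))                           # reconstructed current waveform

# Encoder and decoder are trained together to reproduce the input waveform:
autoencoder = nn.Sequential(Encoder(), Decoder())
optimizer = torch.optim.Adam(autoencoder.parameters())           # optimizer choice is an assumption
loss_fn = nn.MSELoss()                                           # Eq. (1) up to the factor 1/2

def training_step(batch: torch.Tensor) -> float:
    # batch: (batch, 1, 256) preprocessed current waveforms
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(batch), batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```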
In principle, all recorded events could be used to train the autoencoder since no labels are required. However, we discard events with energies lower than 1000 keV as they are less relevant for \(0\nu \beta \beta \) decay searches and are more affected by noise.
Because of their more complex pulse shapes, MS events require more information to model than SS events. However, our dataset contains many more similar-looking SS events than MS events. To counteract this, we drop a fraction of SS events to balance the dataset, leaving 725k events. This filter is based on the A/E value and drops a large fraction of SS events, which look almost identical except for noise. The discarded events are chosen randomly to avoid introducing a bias.
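Together with the energy cut, this selection amounts to a random subsampling of the SS-like events, as in the sketch below; the A/E threshold and the fraction of SS events to keep are placeholders, since the text does not quote the values used in the analysis:

```python
import numpy as np

def select_training_events(waveforms, energies, a_over_e, ss_threshold, ss_keep_fraction, seed=0):
    """Apply the 1000 keV energy cut and randomly drop SS-like events to balance the dataset."""
    rng = np.random.default_rng(seed)
    keep = energies > 1000.0                        # discard events below 1000 keV
    is_ss = a_over_e > ss_threshold                 # SS-like according to A/E (threshold is a placeholder)
    keep &= ~is_ss | (rng.random(len(waveforms)) < ss_keep_fraction)
    return waveforms[keep]
```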
Examples of current waveforms and their reconstructions are shown in Figs. 4 and 10. The reconstructions exhibit the same shape as the original waveforms but lack the high-frequency noise, whose high entropy prevents it from being compressed into the feature vector. A quantitative analysis of the reconstruction quality is presented in Sect. 4.
Classifier
The training of the autoencoder is followed by the training of the classifier network. The DEP and FEP events serve as the SS and MS training datasets, respectively (see Sect. 2). Their small difference in energy ensures that the noise level is very similar for the two peaks, an additional safeguard that prevents the training from being influenced by varying signal-to-noise ratios.
The classification network takes the low-dimensional feature vector as input. Two fully connected layers consisting of 10 and 5 neurons with ReLU activation functions process the feature vector, and an output layer produces a single value, \(c \in (0,1)\), which is correlated with the probability that a given event is SS (see Fig. 5).
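A possible PyTorch realization is sketched below; the sigmoid output activation is an assumption consistent with \(c \in (0,1)\) but not named in the text:

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Sketch of the classifier: 7 -> 10 -> 5 -> 1, i.e. 80 + 55 + 6 = 141 trainable parameters."""
    def __init__(self, feature_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 10), nn.ReLU(),
            nn.Linear(10, 5), nn.ReLU(),
            nn.Linear(5, 1), nn.Sigmoid(),          # sigmoid output is an assumption
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)              # c in (0, 1), one value per event
```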
With the described network architecture, the classifier has a total of 141 trainable parameters. The training set contains around 30 times as many events per class as the network has trainable parameters, ensuring that the classifier cannot overfit and memorize individual events from the training dataset. This lightweight network architecture is only possible because the underlying structure of the raw waveforms has already been extracted by the autoencoder. Again, we use the MSE loss (see Eq. 1) for training, but adjust only the parameters of the classifier network, leaving the previously trained autoencoder unchanged.
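Keeping the autoencoder fixed while training the classifier can be sketched as follows, reusing the Encoder and Classifier sketches above; the optimizer choice is again an assumption:

```python
encoder, classifier = Encoder(), Classifier()
encoder.requires_grad_(False)                       # leave the trained autoencoder unchanged
optimizer = torch.optim.Adam(classifier.parameters())
loss_fn = nn.MSELoss()                              # same loss as Eq. (1)

def classifier_step(waveforms: torch.Tensor, labels: torch.Tensor) -> float:
    # labels: 1.0 for SS (DEP) events, 0.0 for MS (FEP) events
    optimizer.zero_grad()
    with torch.no_grad():                           # features come from the frozen encoder
        features = encoder(waveforms)
    loss = loss_fn(classifier(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```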
The output values of the classifier are shown in Fig. 6 for the complete test dataset. They demonstrate that the peaks are classified as expected: DEP events are clustered around 0.9, while events from MS peaks cluster below 0.5.