1 Introduction

Muons have been used as clean probes of new phenomena in particle physics ever since their discovery in cosmic showers [1, 2]. Their detection and measurement enabled many groundbreaking discoveries, from those of heavy quarks [3,4,5] and weak bosons [6] to that of the Higgs boson [7, 8] through its decay into weak bosons; most recently, first evidence for \(\hbox {H} \rightarrow \upmu \upmu \) decays has been reported by CMS [9], highlighting the importance of muons for searches as well as for measurements of standard model parameters. The uniqueness of muons is due to their intrinsic physical properties, which produce a distinctive phenomenology of interactions with matter. With a mass roughly 200 times that of the electron, the muon loses little energy by electromagnetic radiation as it traverses dense media; it behaves as a minimum-ionizing particle over a wide energy range, where it is easily distinguishable from long-lived light hadrons such as charged pions and kaons.

In continuity with their glorious past, muons will remain valuable probes of new physics phenomena in future searches at high-energy colliders. A number of heavy particles predicted by new-physics models are accessible preferentially, and in some cases exclusively, through the detection of their decays to final states that include electrons or muons; in particular, the reconstruction of the resonant shape of dileptonic decays of new \(\hbox {Z}'\) gauge bosons, resulting from the addition of an extra U(1) group or higher symmetry structures to the standard model [10, 11], constitutes a compelling reason to seek the best possible energy resolution for electrons and muons of high energy.

Fig. 1 Mass stopping power for muons in the 0.1 MeV to 100 TeV range, in \(\hbox {MeV}\,\hbox {cm}^2/\hbox {g}\). The rise in radiative loss becomes important above 100 GeV. Image reproduced with permission from Ref. [12]

Unfortunately, the very features that make muons special and easily distinguishable from backgrounds also hinder the precise measurement of their energy in the ultra-relativistic regime. While the energy of electrons is effectively inferred from the electromagnetic showers they initiate in dense calorimeters, muon energy estimates rely solely on the determination of the curvature of their trajectory in a magnetic field. Taking the ATLAS and CMS detectors as a reference, the relative muon transverse-momentum resolution achieved in those state-of-the-art instruments at 1 TeV ranges from 8 to 20% in ATLAS and from 6 to 17% in CMS [13, 14], depending on detection and reconstruction details; by comparison, for electrons of the same energy the resolution ranges from 0.5 to 1.0% in ATLAS and from 1 to 2% in CMS [15, 16]. Clearly, for non-minimum-ionizing particles, calorimetric measurements win over curvature determinations at high energy, due to the different scaling properties of the respective resolution functions: the relative uncertainty of curvature-based estimates grows linearly with energy, while that of calorimetric estimates decreases as \(1/\sqrt{\hbox {E}}\).

However, ultra-relativistic muons do not behave as minimum-ionizing particles; rather, their radiative energy loss [12] rises above roughly 100 GeV (see Fig. 1). The effect is clear, although undeniably very small in absolute terms; for example, a 1 TeV muon is expected to lose a mere 2.3 GeV in traversing the \({25.8}\,{\hbox {X}_0}\) of the CMS electromagnetic calorimeter [17]. For that reason, patterns of radiative losses have never been exploited to estimate muon energy in collider detectors.Footnote 1 It is the purpose of this work to show how low-energy photons radiated by TeV-energy muons and detected in a sufficiently thick and fine-grained calorimeter may be successfully exploited to estimate muon energy even in collider detector applications. Crucially, we will also demonstrate that the input to such a measurement is not only the magnitude, but also the spatial pattern, of the detected energy depositions in the calorimeter cells.

The spatial patterns of calorimeter deposits are a well-known and heavily exploited feature for object identification purposes, e.g. to distinguish electromagnetic showers from hadronic showers by comparing the depth profile of the energy deposits [15, 22]. Recently, in the context of proposals for calorimeters endowed with fine-grained lateral and longitudinal segmentation, it has been shown that this granularity not only improves the identification purity, but also allows for an accurate determination of the energy of hadronic showers, by identifying individual patterns of their electromagnetic and hadronic sub-components [23,24,25,26,27,28]. In parallel, machine learning techniques have proven to be very powerful for reconstructing individual showers [27, 29, 30] as well as multiple, even overlapping showers, while at the same time being adaptable to the particularities of the detector geometries involved [31,32,33]. Pattern-recognition applications for the quick identification of pointing and non-pointing showers at the trigger level have also been proposed [34, 35]. Following the success of such applications, we chose a deep learning approach to the problem, based on convolutional neural networks and loosely inspired by the techniques used for reconstructing hadronic showers in [29, 30].

The plan of this document is as follows. In Sect. 2 we describe the idealised calorimeter we have employed for this study. In Sect. 3 we discuss the architecture of the convolutional neural network we used for the regression of muon energy from the measured energy deposits. In Sect. 4 we detail our results. We offer some concluding remarks in Sect. 5. In Appendix 6 we describe the high-level features we constructed from the energetic and spatial information of each muon interaction event; these features are used as additional input for the regression task. In Appendix 7 we offer an extensive ablation study of the model architecture and loss, the training schedule, and other technical aspects of our approach. Finally, in Appendix 8 we describe the hardware and time requirements of both the study and the regressor.

A public version of the research code is available from Ref. [36]. The pre-processed datasets are available from Ref. [37], and are designed to be used directly with the code-base.

2 Detector geometry and simulation

2.1 Detector geometry

Since our goal in this work is to show the feasibility of muon-energy estimation from energy deposits in a calorimeter, we strip the problem of complications from factors that are ancillary to the task. For that reason, we consider a homogeneous lead tungstate cuboid calorimeter with a total depth in \(\hbox {z}\) of \({2032}\,\hbox {mm}\) and a spatial extent of \({120}\,\hbox {mm}\) in \(\hbox {x}\) and \(\hbox {y}\). The calorimeter is segmented into 50 layers in \(\hbox {z}\), each with a thickness of \({39.6}\,\hbox {mm}\), corresponding to 4.5 radiation lengths. Such a longitudinal segmentation allows electromagnetic showers to be well resolved. Each layer is further segmented in \(\hbox {x}\) and \(\hbox {y}\) into \(32 \times 32\) cells of \({3.73}\,\hbox {mm} \times {3.73}\,\hbox {mm}\) each. This results in 51 200 channels in total.
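For concreteness, the readout of a single event can be thought of as a dense three-dimensional array of cell energies; a minimal sketch of this layout is given below (the variable names are ours and purely illustrative, not part of the simulation code).

```python
import numpy as np

# Illustrative encoding of the calorimeter granularity described above.
N_LAYERS, N_X, N_Y = 50, 32, 32            # longitudinal and transverse segmentation
CELL_XY_MM, LAYER_Z_MM = 3.73, 39.6        # cell size and layer thickness
event = np.zeros((N_LAYERS, N_X, N_Y))     # one event: 50 x 32 x 32 = 51 200 channels
assert event.size == 51_200
```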

We assume that the calorimeter is embedded in a uniform 2-Tesla magnetic field, provided by an external solenoid or dipole magnet. The chosen field strength equals that of the ATLAS detector, and is in the range of what future collider detectors will likely be endowed with. We note that the magnetic bending of muon tracks inside the calorimeter volume is very small in the energy range of our interest (1 TeV and above), and its effect on the regression task is negligible there.Footnote 2 In the studies reported infra we both compare the curvature-based momentum estimate provided by an ATLAS-like detector with the estimate driven by radiative losses, and combine the two to show their complementarity.

Fig. 2 Pair of examples of muons entering the simulated calorimeter in the \(\hbox {z}\) direction. The colour palette indicates the energy of each deposit, relative to the highest-energy deposit for each muon, moving from blue, through green, to yellow in increasing energy

2.2 Data generation

We generate unpolarised muons of both charges with a momentum \(\hbox {P} = \hbox {P}_{\mathrm{z}}\) in the \(\hbox {z}\) direction, of magnitude ranging between \({50}\,\hbox {GeV}\) and \(8\,\hbox {TeV}\). This interval extends beyond the conceivable momentum range of muons produced by a future high-energy electron–positron collider such as CEPC or FCC-ee [38], and therefore enables an unbiased study of the measurement of that quantity in an experimentally interesting scenario.

The generated initial muon position in the \(\hbox {z}\) coordinate is set to \(\hbox {z}=-{50}\,\hbox {mm}\) with respect to the calorimeter front face; its \(\hbox {x}\) and \(\hbox {y}\) coordinates are randomly chosen within \(|\hbox {x}|\le {20}\,\hbox {mm}\) and \(|\hbox {y}|\le {20}\,\hbox {mm}\). The momentum components in \(\hbox {x}\) and \(\hbox {y}\) direction are set to zero. As mentioned supra, to compare the curvature-based and calorimetric measurement we assume that the calorimeter is immersed in a constant \(\hbox {B}=2\hbox {T}\) magnetic field, oriented along the positive \(\hbox {y}\) direction. The detector geometry and the radiation pattern of a muon entering the calorimeter are shown in Fig. 2. Even at a relatively low energy of \({100}\,\hbox {GeV}\), the produced pattern of radiation deposits is clearly visible and we can also see that the multiplicity of deposits grows with the muon energy. The interaction of the muons with the detector material is simulated with Geant  4 [39, 40] using the FTFP_BERT physics list.
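The sampling of the primary-muon kinematics described above can be summarised by the following sketch (the events themselves are of course produced with Geant4; all names here are illustrative, not the actual generation code).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_primary():
    """Sketch of the primary-muon generation described in the text."""
    charge = rng.choice([-1, +1])                 # unpolarised muons of both charges
    p_z = rng.uniform(50.0, 8000.0)               # GeV, momentum along z only
    x0, y0 = rng.uniform(-20.0, 20.0, size=2)     # mm, random entry point in x and y
    z0 = -50.0                                    # mm, upstream of the calorimeter front face
    return charge, (0.0, 0.0, p_z), (x0, y0, z0)
```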

For the training and validation tasks of the regression problem a total of 886 716 muons are generated, sampled from a uniform distribution in the 0.05–8 TeV range. Additional muon samples, for a total of 429 750 muons, are generated at fixed values of muon energy (E = 100, 500, 900, 1300, 1700, 2100, 2500, 2900, 3300, 3700, 4100 GeV) in order to verify the posterior distributions in additional tests discussed infra, Sect. 4. Such a discrete-energy dataset allows us to compute precisely the resolution of the trained regressor at specific muon energies, rather than having to bin muons according to their energy.

Fig. 3 Diagrams illustrating the three types of models used

3 The CNN regression task

Three regressor architectures are considered. Regressors that use only continuous input features (such as the energy sum and other high-level features) pass their inputs through a set of fully connected layers (referred to as the network body), ending with a single-neuron output. When the 3D grid of energy deposits is considered, the body is prepended with a series of 3D convolutional layers (referred to as the head), which act to reduce the size of the grid whilst learning high-level features of the data, prior to passing the outputs to the body. The main model used is a hybrid combining both approaches, in which the energy deposits are passed through the head, and the pre-computed high-level features are passed directly to the body. Layout diagrams for these three models are shown in Fig. 3, and a technical description of each component is included in the following subsection. Models are implemented and trained using PyTorch [41] wrapped by Lumin [42] – a high-level API which includes implementations of the advanced training techniques and architecture components we make use of in the regressor.
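To fix ideas, the hybrid layout can be sketched as follows in PyTorch. The layer sizes, convolutional stack, and pooling in this sketch are illustrative stand-ins (the actual head, with its SE blocks, E-sum layers, running batch normalisation, and max-plus-mean pooling, is described in the remainder of this section); the body and output follow the specification of Sect. 3.1.2, and SiLU coincides with Swish-1.

```python
import torch
import torch.nn as nn

class HybridRegressor(nn.Module):
    """Minimal sketch of the hybrid model: a 3D-convolutional head for the
    energy-deposit grid plus a fully connected body that also receives the
    pre-computed high-level features."""
    def __init__(self, n_hl_feats, n_conv_out=16):
        super().__init__()
        self.head = nn.Sequential(                     # (B, 1, 50, 32, 32) -> (B, n_conv_out)
            nn.Conv3d(1, 8, kernel_size=4, stride=2, padding=2), nn.SiLU(),
            nn.Conv3d(8, n_conv_out, kernel_size=3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),     # paper pools with max and mean instead
        )
        self.body = nn.Sequential(                     # three 80-neuron layers (Sect. 3.1.2)
            nn.Linear(n_conv_out + n_hl_feats, 80), nn.SiLU(),
            nn.Linear(80, 80), nn.SiLU(),
            nn.Linear(80, 80), nn.SiLU(),
            nn.Linear(80, 1),                          # single-neuron output, no activation
        )

    def forward(self, grid, hl_feats):
        return self.body(torch.cat([self.head(grid), hl_feats], dim=1))
```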

3.1 Architecture components

3.1.1 Convolutional head

The head architecture is inspired by domain knowledge and is based on the fact that the sum of the energy deposits is related to the energy of the traversing muon; an accurate correspondence, however, requires that the deposits receive small corrections based on the distribution of surrounding deposits. The convolutional architecture draws on both the DenseNet [43] and ResNet [44] architectures, and is arranged in blocks of several layers. Within each block, new channels are computed from the incoming channels (which include the energy deposits) using a pair of 3D convolutional layers. The channels computed by the convolutional layers are weighted by a squeeze-excitation (SE) block [45]. The convolutional plus SE path can be bypassed via a residual sum with an identity path. At the output of the block, the channel corresponding to the energy deposits is concatenated (channel-wise) to the sum of the convolutional-path and identity-path outputs.Footnote 3 In this way, convolutional layers always have direct access to the energy deposits, allowing their outputs to act as the “small corrections” required.

The architecture becomes slightly more complicated when the energy is downsampled; in such cases, convolutional shortcuts [46] are used on the identity path, and fixed, unit-weighted convolutional layers with strides equal to their kernel size are applied to the energy deposits. These fixed kernels act to sum up the energy deposited within each sub-cube of the detector, and are referred to here as the “E-sum layers”. This approach is strongly inspired by [29, 30]. Additionally, for blocks after the very first one, a pre-activation layout [46] is adopted with regards to the placement of batch normalisation layers. Figure 4 illustrates and discusses the general configurations of the three types of blocks used.
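An E-sum layer can be realised as a fixed-weight convolution, as in the following sketch (the function name is ours); with unit weights and stride equal to the kernel size, it simply adds up the energy in each sub-cube of the downsampled grid.

```python
import torch
import torch.nn as nn

def esum_layer(k=2):
    """Sketch of an 'E-sum layer': a fixed, unit-weighted 3D convolution with
    stride equal to its kernel size, summing the energy deposited in each
    k x k x k sub-cube when the grid is downsampled."""
    conv = nn.Conv3d(1, 1, kernel_size=k, stride=k, bias=False)
    with torch.no_grad():
        conv.weight.fill_(1.0)            # unit weights -> plain summation
    conv.weight.requires_grad_(False)     # kept fixed during training
    return conv
```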

Fig. 4 Diagrams illustrating the three types of blocks used to construct the convolutional heads

Sets of these convolutional blocks are used to construct the full convolutional head. In all cases, the grid is downsampled four times, each time with a reduction by a factor of two. However, non-downsampling blocks (Fig. 4b) may be inserted in between the downsampling blocks in order to build deeper networks. Figure 5 illustrates the layout of the full convolutional head.

Technical specification The kernel sizes of all convolutional and average-pooling layers are set to three, with the exception of the first convolution in downsampling and initial blocks, which uses a kernel size of four to match the stride and padding of the E-sum layer. Zero-padding of size one is used (two when the kernel size is four). Swish activation functions [47] are used with \(\upbeta =1\) (Swish-1). Weights are initialised using the Kaiming rule [48], with the exception of the E-sum layers, which are initialised with ones. No biases are used.

The squeeze-excitation blocks feed the channel means into a fully connected layer of width \(\max \left( 2,\hbox {N}_{\mathrm{c}}//4\right) \) (\(\hbox {N}_{\mathrm{c}}\) = number of channels, // indicates integer division, Kaiming weight initialisation and zero bias initialisation) and a Swish-1 activation, followed by a fully connected layer of width \(\hbox {N}_{\mathrm{c}}\) (Glorot [49] weight initialisation and zero bias initialisation) and a sigmoid activation. This provides a set of multiplicative weights per channel which are used to rescale each channel prior to the residual sum.
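The squeeze-excitation block just described corresponds to the following sketch (initialisation details are omitted for brevity; SiLU again stands in for Swish-1).

```python
import torch
import torch.nn as nn

class SEBlock3d(nn.Module):
    """Squeeze-excitation block as specified above: channel means -> FC(max(2, Nc//4))
    with Swish-1 -> FC(Nc) with sigmoid -> per-channel multiplicative weights."""
    def __init__(self, n_c):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_c, max(2, n_c // 4)), nn.SiLU(),
            nn.Linear(max(2, n_c // 4), n_c), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (batch, channel, z, x, y)
        w = self.fc(x.mean(dim=(2, 3, 4)))     # squeeze: channel means
        return x * w[:, :, None, None, None]   # excite: rescale each channel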

Due to the sparse nature of the data, we found it necessary to use running batch-normalisation [50]. This modifies the batch normalisation layers to apply the same transformation during both training and inference, i.e. during training, the batch statistics are used to update the running averages of the transformation, and the averaged transformation is then applied to the batch (normally only batch-wise statistics are used to transform the training batches, causing potential differences between training and inference computations). Additionally, running averages of the sums and squared sums of the incoming data are tracked, rather than the mean and standard deviation, allowing the true standard deviation to be computed on the fly (normally the average standard deviation is used). Together, these changes with respect to a default batch normalisation implementation provide greater stability during training, and enable generalisation to unseen data. All batch normalisation layers use a momentum of 0.1, meaning that the running average of a statistic \(\uptheta \) is tracked according to \({\bar{\uptheta }}\leftarrow 0.9{\bar{\uptheta }}+0.1\uptheta _{\text {batch}}\).
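A simplified sketch of such a layer is given below; it is not the implementation used in the study (which is provided by Lumin), and the warm-up debiasing of the running averages is omitted for brevity.

```python
import torch
import torch.nn as nn

class RunningBatchNorm3d(nn.Module):
    """Simplified sketch of running batch-normalisation: the same running-average
    transformation is applied during training and inference, and running sums and
    squared sums (rather than means and standard deviations) are tracked."""
    def __init__(self, n_c, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.weight = nn.Parameter(torch.ones(n_c))
        self.bias = nn.Parameter(torch.zeros(n_c))
        self.register_buffer("sums", torch.zeros(n_c))
        self.register_buffer("sqrs", torch.zeros(n_c))
        self.register_buffer("count", torch.tensor(1.0))

    def forward(self, x):                                    # x: (batch, channel, z, x, y)
        if self.training:
            with torch.no_grad():
                dims = (0, 2, 3, 4)
                n = float(x.numel() / x.shape[1])
                self.sums.lerp_(x.sum(dims), self.mom)       # running <- 0.9*running + 0.1*batch
                self.sqrs.lerp_((x * x).sum(dims), self.mom)
                self.count.lerp_(torch.tensor(n), self.mom)
        mean = self.sums / self.count
        var = self.sqrs / self.count - mean * mean
        b = lambda t: t[None, :, None, None, None]           # broadcast per channel
        x = (x - b(mean)) / (b(var) + self.eps).sqrt()
        return x * b(self.weight) + b(self.bias)
```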

3.1.2 Network body and output

The body of the network is relatively simple, and consists of three fully connected layers, each with 80 neurons. Weights are initialised using the Kaiming rule, and biases are initialised with zeros. Swish-1 activation functions are placed after every layer. No batch normalisation is used.

The output layer of the network consists of a single neuron. Weights are initialised using the Glorot rule, and the bias is initialised to zero. No activation function is used.

3.2 Training

3.2.1 Data

Models are trained on simulated data covering the full considered range of true muon energy, 50–8000 GeV. The 3D grid of raw energy deposits does not undergo any preprocessing, nor do the target energies. When used, the measured energy extracted from the curvature fit (V[24], see infra, Appendix 6) is clamped between 0 and 10 TeV.Footnote 4 All high-level features are then standardised by mean subtraction and division by the standard deviation.

The full training dataset consists of 886 716 muons. This is split into 36 folds of 24 631 muons; the zeroth fold is used to provide a hold-out validation dataset on which model performance is compared. During training a further fold is used to provide monitoring validation to evaluate the general performance of the network and catch the point of highest performance.

Prior to using the discrete-energy testing-data to compute the resolution, the continuous-energy validation dataset is finely binned in true energy, allowing us to compute an approximation of the resolution at the central energy of the bin (computed as the median true-energy of muons in the bin).

3.2.2 Loss

Models are trained to minimise a Huberised [51] version of the mean fractional squared error (MFSE):

$$\begin{aligned} {{\,\mathrm{L}\,}}\!\left( \hbox {y},{\hat{\hbox {y}}}\right) =\frac{1}{\hbox {N}}\sum _{\hbox {n}=1}^{\mathrm{N}}\frac{\left( \hbox {y}_{\mathrm{n}}-\hat{\hbox {y}_{\mathrm{n}}}\right) ^2}{\hbox {y}_{\mathrm{n}}}, \end{aligned}$$
(1)

where \(\hbox {y}\) is the true muon-energy, \({\hat{\hbox {y}}}\) is the predicted energy, and \(\hbox {N}\) is the batch size. The form of this loss function reflects the expectation of a linear scaling of the variance of the energy measurement with true energy, as is normally the case for calorimeter showers when the energy resolution is dominated by the stochastic term. In this study, the batch size used for training the models is 256.
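In code, the un-Huberised, unweighted form of Eq. (1) reduces to the following sketch (the function name is ours).

```python
import torch

def mfse_loss(pred, true_e):
    """Eq. (1): mean fractional squared error over a batch, before the
    Huberisation and data weighting described below."""
    return ((true_e - pred) ** 2 / true_e).mean()
```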

Fig. 5 Block layout for the convolutional head. Tensor dimensions are indicated in the form \((\text {channel},\hbox {z,x,y})\). The convention is to increase the number of channels to eight in the first downsample, and then increase the number of channels at each downsample by a factor of 1.5. The number of channels increases by one in each block due to the energy concatenation. Prior to being fed into the network body, the tensor is pooled by computing the maximum and mean of each channel. The data-batch dimension is not shown, for simplicity

Fig. 6 Data weight as a function of true muon energy

Huber loss To prevent non-Gaussian tails of the regressed muon energy distribution from dominating the loss estimate, element-wise losses are first computed as the squared error, \(\left( \hbox {y}_{\mathrm{n}}-\hat{\hbox {y}_{\mathrm{n}}}\right) ^2\), and high-loss predictions above a threshold are modified such that they correspond to a linear extrapolation of the loss at the threshold:

$$\begin{aligned} {{\,\mathrm{L}\,}}_{{\mathrm {Huber,i}}} = \hbox {t}+\left( 2\sqrt{\hbox {t}}\left( \left| \hbox {y}_{\mathrm{i}}-\hat{\hbox {y}_{\mathrm{i}}}\right| -\sqrt{\hbox {t}}\right) \right) , \end{aligned}$$
(2)

where \(\hbox {i}\) are indices of the data-points with a squared-error loss greater than the threshold \(\hbox {t}\). This Huberised element-wise loss is then divided by the true energy to obtain the fractional error, which is then multiplied by element-wise weights (discussed below) and averaged over the data points in the batch.

Since the loss values vary significantly across the true-energy spectrum, data points are grouped into five equally populated bins of true energy, each of which has its own threshold defining the transition to the linear regime. The transition point used for a given bin is the \({68}{\mathrm{th}}\) percentile of the distribution of squared-error losses in that bin (allowing the threshold to remain relevant to the current scale of the loss as training progresses). However, since for a batch size of 256 one expects only about 51 points per bin, the threshold can vary significantly from one batch to another. To provide greater stability, the bin-wise thresholds are actually running averages of the past \({68}{\mathrm{th}}\) percentiles, again with a momentum of 0.1, i.e. for bin \(\hbox {j}\), the threshold is tracked as \(\hbox {t}_{\mathrm{j}}\leftarrow 0.9\,\hbox {t}_{\mathrm{j}}+0.1\,{{\,\mathrm{L}\,}}_{{\mathrm {SE,j}},{68}{\mathrm{th}}}\), where \({{\,\mathrm{L}\,}}_{{\mathrm {SE,j}},{68}{\mathrm{th}}}\) is the \({68}{\mathrm{th}}\) percentile of the squared errors in bin \(\hbox {j}\).
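The element-wise Huberisation of Eq. (2) can be sketched as follows for a single threshold; the per-bin running thresholds and the data weights described in the text are omitted here for brevity, and the function name is ours.

```python
import torch

def huberised_fse(pred, true_e, threshold):
    """Squared errors above the threshold t are replaced by the linear
    extrapolation of Eq. (2), then divided by the true energy as in Eq. (1)."""
    sq_err = (true_e - pred) ** 2
    lin = threshold + 2 * threshold ** 0.5 * ((true_e - pred).abs() - threshold ** 0.5)
    elementwise = torch.where(sq_err > threshold, lin, sq_err)
    return (elementwise / true_e).mean()
```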

Data weighting Models are trained on muons of true energy in the 50–8000 GeV range, but are only evaluated in the range 100–4000 GeV in order to avoid biases due to edge effects: the regressor can effectively learn that no targets exist outside the data range, and so it is more efficient for it to predict only values well within that range. This leads to an overestimation of the energy of low-energy muons, and an underestimation of that of high-energy muons. By training on an extended range and then evaluating on the intended range, these edge effects can be mitigated. We still want the network to focus on the intended range, however; rather than generating data with a pre-defined PDF in true energy, we use a uniform PDF and down-weight data with true muon energy outside the range of interest.

The weighting function used depends solely on the true energy of the muons and takes the form of:

$$\begin{aligned} \hbox {w} = {\left\{ \begin{array}{ll} 1-\hbox {Sigmoid}\left( \frac{\hbox {E}-5000}{300}\right) \quad &{}\hbox {E}\le {5000}\,\hbox {GeV},\\ 1-\hbox {Sigmoid}\left( \frac{\hbox {E}-5000}{600}\right) \quad &{}\hbox {E}>{5000}\,\hbox {GeV}.\\ \end{array}\right. } \end{aligned}$$
(3)

This provides both a quick drop-off above the intended range and a slow tail out to the upper limit of the training range. Figure 6 illustrates this weighting function. It should be noted that the above weights correspond to a comparatively smooth modification of the true-energy prior; for specific applications where the physics puts hard boundaries on the energy spectrum (such as a symmetric electron–positron collider, where one may safely assume that muons cannot be produced with energy larger than the beam energy), a sharper prior may be used instead, and may significantly improve the resolution at the high end of the spectrum.
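Equation (3) corresponds directly to the following sketch (the function name is ours).

```python
import numpy as np

def sample_weight(e_gev):
    """Eq. (3): soft down-weighting of muons above the range of interest,
    vectorised over an array of true energies in GeV."""
    scale = np.where(e_gev <= 5000.0, 300.0, 600.0)
    return 1.0 - 1.0 / (1.0 + np.exp(-(e_gev - 5000.0) / scale))
```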

Fig. 7 Details of a typical training, showing the loss and metric evolution and the associated schedule of the optimiser hyper-parameters

3.2.3 Optimiser

The Adam optimiser [52] is used for updating the model weights. The \(\upvarepsilon \) and \(\upbeta _2\) parameters are kept constant, at \(1\times 10^{-8}\) and 0.999, respectively. The learning rate (LR) and \(\upbeta _1\) (momentum coefficient) are adjusted during training in two stages. For the first 20 epochs of training, the 1cycle schedule [53, 54], with cosine interpolation [55], is used to train the network quickly at high learning rates; this is followed by up to 30 epochs of a step-decay annealing [44], which is used to refine the network at small learning rates. For the 1cycle schedule, training begins at an LR of \(3\times 10^{-7}\) and \(\upbeta _1=0.95\). Over the first two epochs of training the LR is increased to \(3\times 10^{-5}\) and \(\upbeta _1\) is decreased to 0.85. Over the next 18 epochs, the LR is decreased to \(3\times 10^{-6}\), and \(\upbeta _1\) increased back to 0.95. Following this, the best performing model-state, and its associated optimiser state, is loaded and training continues at a fixed LR and \(\upbeta _1\) until two epochs elapse with no improvement in validation loss. At this point, the best performing model-state is again reloaded, \(\upbeta _1\) is set to 0.95, and the LR is halved. This process of training until no improvement, reloading, and halving the LR continues until either all 50 epochs have elapsed, or 10 epochs elapse with no improvement. At this point the best performing model-state is again loaded and saved as the final model. Figure 7 details a typical training with such a schedule.
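The 1cycle phase of this schedule can be summarised by the following sketch, which reproduces the LR and \(\upbeta _1\) values quoted above with cosine interpolation; the subsequent step-decay phase simply halves the LR on plateaus, as described in the text. In practice the schedule is handled by Lumin's callbacks; the function names here are ours.

```python
import math

def cosine_interp(frac, start, end):
    """Cosine interpolation between two values for a phase fraction in [0, 1]."""
    return start + 0.5 * (end - start) * (1.0 - math.cos(math.pi * frac))

def one_cycle_lr_beta1(epoch, warm_epochs=2, total_epochs=20):
    """Sketch of the 20-epoch 1cycle phase: LR 3e-7 -> 3e-5 -> 3e-6,
    beta1 0.95 -> 0.85 -> 0.95."""
    if epoch < warm_epochs:                                    # warm-up: LR up, beta1 down
        f = epoch / warm_epochs
        return cosine_interp(f, 3e-7, 3e-5), cosine_interp(f, 0.95, 0.85)
    f = (epoch - warm_epochs) / (total_epochs - warm_epochs)   # anneal: LR down, beta1 up
    return cosine_interp(f, 3e-5, 3e-6), cosine_interp(f, 0.85, 0.95)
```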

Explicit, tunable regularisation was not found to be required during training. Instead, overtraining is prevented by continual monitoring of the model performance on a separate validation sample, and saving of the model parameters whenever the validation loss improves.

3.2.4 Ensemble training

As mentioned in Sect. 3.2.1, the training dataset is split into 36 folds, one of which is retained to provide a comparison between models, and another is used to monitor generalised performance during training. During development and the ablation study (discussed infra, Appendix 7), it was useful to obtain an averaged performance of the model architecture from five repeated trainings. Since, however, one training on the full dataset takes about one day, we instead ran these trainings on unique folds of the full dataset, using different folds to monitor generalisation, i.e. each model is trained on seven folds and monitored on one fold, and no fold is used to train more than one model (but folds can be used to monitor performance for one model and also to train a different model). This allows us to train an ensemble of five models in just one day, and also to get average performance metrics over the unique validation folds, to compare architecture settings. This method of training is referred to as “unique-fold training”.

For the full, final ensemble, each model is trained on 34 folds and monitored on one fold, which is different for each model. Once trained, the ensemble is formed by weighting the contributions of each model according to the inverse of its validation performance during training. This method of training is referred to as “all-fold training”.
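A minimal sketch of the ensemble prediction, assuming the weights are taken as the inverse validation losses and normalised to unity, is the following (function and argument names are ours).

```python
import torch

def ensemble_predict(models, val_losses, grid, hl_feats):
    """Weighted ensemble prediction: each model contributes in proportion to the
    inverse of its validation loss."""
    w = torch.tensor([1.0 / l for l in val_losses])
    w = w / w.sum()
    preds = torch.stack([m(grid, hl_feats).squeeze(-1) for m in models])  # (n_models, batch)
    return (w[:, None] * preds).sum(dim=0)
```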

4 Results

Unless explicitly specified, all results presented in this section refer to the main regression model, in which both raw energy-deposits and the high-level features are used.

Fig. 8 Raw predictions of the regressor ensemble as a function of true energy. The ideal response is for all points to lie on a straight line along \(\hbox {y}=\hbox {x}\). The green line shows a linear fit to predictions in bins of true energy

4.1 Regressor response and bias correction

Figure 8 shows the predictions of the regression ensemble as a function of true energy for the holdout-validation dataset. Whilst the general trend is linear, some dispersion is visible; Fig. 9 details the fractional error as a function of true energy, along with the trends in the quantiles. From this we can see that the regressor overestimates medium energies and underestimates high energies. Low energies are predicted without significant bias.

We can correct for the bias in the prediction; however, we must do so in a way that does not assume knowledge of the true energy, such that the correction can also be applied to predictions in an actual application. The method used is to fit a function (in this case a linear function – the green line in Fig. 8) to the means of the predictions in bins of true energy, with their uncertainties estimated from bootstrap resampling. Having fitted the function, its inverse can be used to look up the true energy corresponding to a given prediction, resulting in a corrected prediction. Figure 10 illustrates the corrected predictions on the continuous validation data. Although the difference is only slight, as we will see later in Appendix 7, the debiased predictions allow for a better resolution once the residual biases in the predictions are accounted for. To best reproduce an actual application, the debiasing correction is fixed using the validation data, and then applied as is to the testing data.
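The bias-correction procedure amounts to the following sketch (bootstrap uncertainties on the bin means are omitted for brevity; names are ours).

```python
import numpy as np

def fit_debias(pred, true_e, n_bins=20):
    """Fit a linear function to the mean prediction in bins of true energy,
    then invert it to map a raw prediction to a corrected one."""
    edges = np.quantile(true_e, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(true_e, edges) - 1, 0, n_bins - 1)
    xs = np.array([true_e[idx == i].mean() for i in range(n_bins)])
    ys = np.array([pred[idx == i].mean() for i in range(n_bins)])
    a, b = np.polyfit(xs, ys, 1)               # pred ~ a * true + b
    return lambda p: (p - b) / a               # corrected prediction
```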

Fig. 9 Fractional error of predictions as a function of true energy, along with quantile trends. The ideal response is for all points to lie on a straight line along \(\hbox {y}=0\). The one and two sigma lines indicate the \(50\pm {34.1}{\%}\) and \(50\pm {47.7}{\%}\) percentiles, respectively

Fig. 10 Corrected predictions on validation data resulting from the inversion of the fit as a function of the true energy. The black dashed line indicates the ideal response

Figure 11 shows the distributions of the ratios of corrected predictions to true energies on the testing data.

Fig. 11 Distributions of the ratios of corrected predictions to true energies in bins of true energy. The ideal response here would be delta distributions centred at one. Distributions not centred at one are indicative of residual bias in the predictions for that energy range

4.2 Resolution and combination with curvature measurement

From the discussion in Sect. 1 we expect the relative resolution of the calorimetric energy estimate to improve as the energy increases; conversely, we expect the resolution of the magnetic-bending measurement in the tracker to improve as the energy decreases. This difference in energy dependence means that the two measurements are complementary to one another, and in an actual application it would make sense to combine them in a weighted average.

Since our setup only includes a calorimeter, we assume that the resolution of a tracking measurement, performed independently by an upstream or downstream detector, scales linearly with energy and equals 20% at 1 TeV. Figure 12 shows the resolution of both the regressor measurement and the simulated tracker measurement, along with the resolution of their weighted average. The resolution here is the fractional root median squared error, computed in bins of true energy according to:

$$\begin{aligned} \text {Resolution} = \frac{\sqrt{\left( \tilde{\hbox {E}_{\mathrm{p}}}-\tilde{\hbox {E}_{\mathrm{t}}}\right) ^2+\Delta _{68}\left[ \hbox {E}_{\mathrm{p}}\right] ^2}}{\tilde{\hbox {E}_{\mathrm{t}}}}, \end{aligned}$$
(4)

where \(\tilde{\hbox {E}_{\mathrm{p}}}\) and \(\tilde{\hbox {E}_{\mathrm{t}}}\) are the median predicted and true energies in a given bin of true energy (their difference being the residual bias after the correction via the linear fit), and \(\Delta _{68}\left[ \hbox {E}_{\mathrm{p}}\right] \) is the difference between the \({16}{\mathrm{th}}\) and \({84}{\mathrm{th}}\) percentiles of the predicted energy in that bin (the central 68% width). When computing the resolution on the testing data (which are generated at fixed points of true energy), \(\tilde{\hbox {E}_{\mathrm{t}}}\) is instead the true energy at the given point.
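Equation (4), evaluated on the discrete-energy testing data, and the weighted combination of the two measurements correspond to the following sketch; the inverse-variance form of the combination is our assumption of a standard weighted average, and the function names are ours.

```python
import numpy as np

def resolution(pred, true_point):
    """Eq. (4) at a fixed true-energy point: residual median bias combined in
    quadrature with the central 68% width of the predictions."""
    bias = np.median(pred) - true_point
    q16, q84 = np.percentile(pred, [16, 84])
    return np.sqrt(bias ** 2 + (q84 - q16) ** 2) / true_point

def combine(e_calo, res_calo, e_track, res_track):
    """Assumed form of the combination: inverse-variance weighted average of the
    calorimetric and tracker estimates."""
    w_c, w_t = 1.0 / res_calo ** 2, 1.0 / res_track ** 2
    return (w_c * e_calo + w_t * e_track) / (w_c + w_t)
```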

Fig. 12 Resolutions of the energy regression (Calorimeter), the simulated tracker, and their weighted average in a combined measurement. Resolution is computed on testing data at fixed points of true energy. The tracker is assumed to provide a linearly scaling resolution with a relative value of 20% at 1 TeV

It is interesting to note that the regression resolution initially gets worse with energy, rather than starting poor and gradually improving. Good resolution at low energy was not observed in development studies performed prior to the introduction of the magnetic field; we therefore assume that the CNNs are able to make use of the magnetic bending inside the calorimeter to recover performance when there is reduced radiation. As expected, the regressor resolution quickly improves once the energy exceeds a threshold of around 1.5 TeV.

Having established that the calorimeter and tracker measurements are both useful and complementary, for later studies it makes sense to compare models in terms of the performance of the combined measurement. One such metric is the poorest resolution achieved by the combined measurement over the studied energy range (in this case 29.5%; lower is better). This, however, relies on a single point of the response. A more general metric is the improvement of the combined measurement over the tracker-only measurement in bins of true energy, averaged (or summed) over the bins; this characterises the improvement due to the regression across the whole spectrum. We refer to this metric as the Mean Improvement (MI). Considering the 11 points in the range 100–4100 GeV, our mean improvement is 22.1% (higher is better). Computation of the MI on the validation data instead uses 20 bins in the 100–4000 GeV range.
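For illustration, if one takes the improvement in each bin to be the fractional reduction of the combined resolution with respect to the tracker-only one, the MI amounts to the following sketch (this interpretation and the function name are our assumptions).

```python
import numpy as np

def mean_improvement(res_combined, res_tracker):
    """Sketch of the Mean Improvement metric over bins of true energy."""
    res_combined, res_tracker = np.asarray(res_combined), np.asarray(res_tracker)
    return np.mean((res_tracker - res_combined) / res_tracker)
```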

Input comparison: high-level features and raw inputs As discussed in Appendix 6, alongside the recorded calorimeter deposits, a range of high-level (HL) features are also fed into the neural network. To better understand what, if anything, the CNN learns from the raw information beyond these summaries, we study what happens when the inputs are changed. In Table 1 we show the MI metric values for a range of different inputs. In cases where the raw inputs are not used, the neural network (NN) consists only of the fully connected layers. For this comparison, we use the MI computed during training on the monitoring-validation dataset, averaged over the five models trained per configuration.

Table 1 Mean Improvements for a range of different input configurations. The MI is computed on the monitoring-validation data and averaged over the training of five models per configuration. The change in MI is computed as the difference between each configuration and the nominal model (“Raw inputs + HL feats.”), as a fraction of the MI of the nominal model. The energy-sum features are the three features corresponding to the sums of energy in different threshold regions (V[0], V[26], and V[27])

From these results we can see that the CNN is able to extract more useful information from the raw data than our domain expertise provides; however, the model still performs better when we also leverage our knowledge. Moreover, we can see that the additionally computed HL features provide a significant benefit beyond the energy-sum features. The importance of the top features as a function of true energy is illustrated in Fig. 13. It is interesting to note the shift in importance between V[0] and V[26] (the sum of energy in cells above 0.1 GeV and below 0.01 GeV, respectively; see Appendix 6, infra). This is due to the increased chance of high-energy deposits as the energy of the muon increases. The fact that the HL features give access to a finer-grained summation of energy than the energy pass-through connections in the CNN architecture (which sum all the energy in cells within the kernel size, regardless of energy) is potentially why the HL features remain useful; a further extension to the model could therefore be to also perform the binned summation during the energy pass-through.

Fig. 13 Permutation importance of the most important features, as evaluated using the “HL-feats. only” model in bins of true energy. The features, described further in Appendix 6, are: V[0] – E-sum in cells above 0.1 GeV, V[1] – fractional MET, V[3] – overall \({2}{\mathrm{nd}}\) moment of transverse E distribution, V[11] – maximum total E in clustered deposits, V[15] – maximum energy of cells excluded from clustered deposits, V[22] – relative \({1}{\mathrm{st}}\) moment of E distribution along x-axis, V[26] – E-sum in cells below 0.01 GeV

Fig. 14 Resolutions of the models with varying input menus. Resolution is computed on the holdout-validation data in bins of true energy

Figure 14 shows the resolutions of the four different models on the holdout-validation data. From this we can clearly see the benefit of providing access to the raw hit data. The benefits of the high-level features are most prominent in the low-to-medium energy range, where features V[0] and V[26] have very similar importance.

5 Conclusions

As we move towards investigating the potential of the new accelerators envisioned by the recently published “2020 Update of the European Strategy for Particle Physics” [56], we need to ask ourselves how we plan to determine the energy of multi-TeV muons in the detectors with which those machines will be endowed, and beyond. As mentioned supra (Sect. 1), the CMS detector achieves relative resolutions in the range of 6% to 17% at 1 TeV, thanks to its very strong 4-Tesla solenoid. It is important to note that the choice of such a strong magnet for CMS imposed a compact design on the whole central detector; the result proved successful at the LHC, but might be sub-optimal in other experimental situations. Given the linear scaling with momentum of the relative momentum resolution determined from curvature fits, it is clear that complementary estimates of the energy of high-energy muons would be highly beneficial in future experiments.

In this work we investigated, using an idealised calorimeter layout, how spatial and energy information on the emitted electromagnetic radiation may be exploited to obtain an estimate of muon energy. Given the regularity of the detector configuration, processing of the raw data was possible using 3D convolutional neural networks. These allowed us to exploit the granular information of the deposited energy pattern to learn high-level representations of the detector readout, which we could also combine with high-level information produced by physics-inspired statistical summaries. We found the use of deep learning and domain-driven feature engineering to both be beneficial. In Appendix 7 we further explore the CNN architecture and training loss, finding there, too, that using knowledge of the physical task can help inspire more performant solutions.

Our studies show that the fine-grained information on the radiation patterns allows for a significant improvement of the precision of muon energy estimates. For example, for muons in the 1–3 TeV range, which are the ones of highest interest for future applications, the relative resolution improves by approximately a factor of two with respect to what can be achieved using only the total energy release (see Fig. 14). A combination of such information with that offered by a curvature measurement, with a relative resolution of the form \(\updelta \hbox {P}/\hbox {P} = 0.2\,\hbox {P}\) (with P in TeV) such as can typically be provided by tracking in a \(\hbox {B} = 2\,\hbox {T}\) magnetic field, may keep the overall relative resolution of multi-TeV muons below 30% across the spectrum, and achieve values below 20% at 4 TeV (see Fig. 12).