1 Introduction

Cosmic rays are fully-ionised nuclei with relativistic kinetic energies that arrive at Earth. The elements range from proton to iron, with a negligible fraction of heavier nuclei. Cosmic rays originate from unknown sources outside of our solar system and are messengers of the high-energy universe. The cosmic-ray energy flux weighted with \(E^{2.6}\) to compress the enormous scale is shown in Fig. 1. It spans over 11 orders in energy and over more than 30 orders in flux intensity, which can only be covered by multiple experiments using different measurement techniques. The particle with the highest energy ever reported was a cosmic ray (Bird et al. 1995a) with \((320\pm 90)~\text{EeV} = (3.2 \pm 0.9) \cdot 10^{11}~\text{GeV} \) (\(1~\text{EeV} = 10^{9}~\text{GeV} \)). Cosmic rays with energies below \(1~\text{PeV} = 10^{6}~\text{GeV} \) are commonly assumed to be produced by shock acceleration in super-nova remnants (Blasi 2011; Caprioli 2012). The origins of cosmic rays with higher energies are unclear and many mechanisms have been proposed, see e.g. Anchordoqui (2019) for a recent review. Ultra-high energy cosmic rays with energies exceeding \(\text{EeV}\) are of unknown extra-galactic origin.

Fig. 1
figure 1

Selected measurements of the cosmic ray flux as a function of kinetic energy, see Dembinski et al. (2018), Schröder et al. (2019) for the full data set and references. The steeply falling fluxes are scaled by \(E_{\text{kin}}^{2.6}\). Open symbols show the all-particle cosmic-ray flux measured by air shower experiments. Coloured solid symbols show the fluxes of individual elements measured by balloon- and satellite-borne experiments. The line is an empirical fit to these data. The upper axis shows the equivalent centre-of-mass energy of a nucleon-nucleon collision. Determining the elemental composition of the cosmic-ray flux above \(10^{6}~\text{GeV}\) motivates this project, which requires precise measurements of forward hadron production at the LHC

Classic astronomy with cosmic rays has not yet been established. Although cosmic rays are likely produced in well-identifiable point-sources, these sources do not appear point-like in the sky. The incoming flux of cosmic rays is very isotropic (Aab et al. 2014d). The reason is that cosmic rays are charged and scattered by inhomogeneous galactic and extra-galactic fields on their way to Earth. Their movement through space resembles a diffusive flow and their arrival directions at Earth are largely random. The average angle of deflection decreases with energy, however, and evidence of anisotropies has been found above the EeV scale (Aab et al. 2017b,a, 2018; Abbasi et al. 2014).

Up to particle energies of about \(100~\text{TeV}\), cosmic rays are observed directly by space-based experiments, like AMS-02 (Kounine 2012), and high-altitude balloons, like CREAM (Yoon et al. 2011). At higher energies the flux is too low for direct observation and ground-based experiments with huge apertures (up to \(3000~\text{km}^{2}\)) like the Pierre Auger Observatory (Aab et al. 2015b) and Telescope Array (Abu-Zayyad et al. 2013; Tokuno et al. 2012) are used. Ground-based experiments observe cosmic rays indirectly through the particle showers (extensive air showers) produced in Earth’s atmosphere. How air showers arise from cosmic rays and how observable air shower features are linked to the properties of the cosmic ray, its direction, energy \(E\), and nuclear mass \(A\) is described in Sects. 2.1 and 2.2.

In regard to determining the dominant sources of cosmic rays, an important complementary approach to anisotropy studies is to measure the energy-dependent elemental (or mass) composition of cosmic rays. The fluxes of individual elements can be directly measured with suitable satellite- and balloon-borne experiments, but this is not equally possible with indirect air shower observations. The mass has to be inferred from air shower features in the latter case, which change depending on the mass and are subject to stochastic randomisations because of intrinsic fluctuations in the shower. These fluctuations overwhelm the small average shower differences between neighbouring elements. The composition above the PeV scale is therefore often summarised by a single number, the mean-logarithmic mass \(\langle \ln A \rangle \). In Fig. 2 left-hand side, predictions for \(\langle \ln A \rangle \) are shown for several proposed source classes (lines) (Kampert and Unger 2012, and references therein). Precise measurements of \(\langle \ln A \rangle \) can rule out many of these competing theories. In particular, whether the cosmic rays with the highest energies are light or heavy is of crucial importance for the design of the next generation of cosmic ray and cosmic neutrino observatories, see e.g. Aloisio et al. (2011), Alves Batista et al. (2019).

Fig. 2
figure 2

Left: Predictions of the mean-logarithmic mass \(\langle \ln A \rangle \) of cosmic rays as a function of their energy from several theories. Right: Two bands that cover the ranges of measurements, grouped by the mass-sensitive variable used (\(X_{\text{max}}\) or \(N_{\mu }\), explained in the text). The vertical line indicates the equivalent energy of p-p interaction at \(13~\text{TeV}\) at the LHC. The width of the data bands is dominantly caused by theoretical uncertainties of forward hadron production. These uncertainties prevent the exclusion of theories on the origins of cosmic rays. Data and model lines were taken from Kampert and Unger (2012)

Two main features of an air shower are used to estimate the mass, its depth of shower maximum \(X_{\text{max}}\), and the number of muons \(N_{\mu }\) produced in the shower. The two bands in Fig. 2 right-hand side represent an envelope of the measurements carried out by various air shower experiments (Kampert and Unger 2012). The composition estimates derived from measurements of \(N_{\mu }\) have particularly large systematic uncertainties and hardly constrain the models. This is very unsatisfactory, since \(N_{\mu}\) discriminates better between light and heavy primaries shower-by-shower at the EeV scale (Müller et al. 2018) than \(X_{\text{max}}\), and it is useful to collect large statistics especially above \(10^{19.5}~\text{eV}\), where observations of \(X_{\text{max}}\) with fluorescence telescopes run out of statistics as these telescopes can only be operated in dark nights and therefore have a duty cycle of about 15%, while muons can be observed with a duty cycle of 100%.

A closer inspection reveals that most of the uncertainty is not experimental. Hybrid experiments in particular are able to make precise air shower measurements. An overview on these experiments is given in Sect. 2.5. The experimental uncertainty on the measured value of \(N_{\mu }\) is around 10%, which is a factor 2.5 to 4 (depending on the energy) smaller than the width of the band shown in Fig. 2. Instead, most of the uncertainty is of theoretical nature and originates from the air shower simulations that are used to infer \(\langle \ln A \rangle \) from \(X_{\text{max}}\) and \(N_{\mu }\). These simulations are essential for the interpretation of air shower measurements, since there is no astrophysically identifiable source in the sky with a known mass composition that could be used for calibration.

The uncertainty does not originate from the particle transport in the atmosphere itself, which is comparably well understood. A variety of independently developed air shower simulation programs is in use and have been compared against each other. An overview is given in Sect. 2.3. These codes show only small variations at the level of 5% in regard to \(N_{\mu }\). The uncertainty originates from the evolution of the hadronic cascade that drives the air shower evolution and the muon production at the end of that cascade. The cascade is dominated by hadronic collisions with small momentum transfer, which cannot be calculated with perturbative quantum chromodynamics (pQCD). Effective theories and phenomenology are used to predict the rates of these interactions and the spectra of secondary particles produced in them. These take the form of software codes which are called hadronic interaction models or generators. A variety of generators is in use which follow different ideas and which vary in their predictions. An overview is given in Sect. 2.4.

LHC data were used to improve the latest versions of these generators, which would reduce the width of the bands in Fig. 2 if all measurements based on older versions were updated accordingly. However, a detailed comparison of measurements with predictions from the latest generators then revealed that the generators consistently predict a lower muon production in air showers than is observed. This was established with a nearly model-independent measurement for the first time by the Pierre Auger Observatory (Aab et al. 2015a, 2016b). The measurement was recently updated and now includes also a measurement of the intrinsic shower-to-shower fluctuations of \(N_{\mu }\) (Aab et al. 2021), which has been measured for the first time in any air shower experiment. While there are also theoretical uncertainties in the simulation of \(X_{\text{max}}\) that need to be further reduced, there are no such blatant discrepancies between the simulated and observed values of \(X_{\text{max}}\). No large discrepancies are observed between the predicted and measured \(N_{\mu }\)-fluctuations, which disfavours some exotic explanations of the muon deficit.

The Auger publication from 2015 triggered several follow-up measurements and the re-analysis of existing air shower data. In particular, the IceCube Neutrino Observatory was used to perform a nearly model-independent study of \(N_{\mu }\) at lower energies (Dembinski 2017). The wealth of newly published data over a wide range in shower energies made a review necessary and the Working group on Hadronic Interactions and Shower Physics (WHISP) was founded by members of eight experimental collaborations (Dembinski et al. 2019). To facilitate the comparison of very diverse muon measurements, the group defined an abstract \(z\)-scale. The \(z\)-values are proportional to the logarithm of \(N_{\mu }\) and can be computed for each pair of experiment and generator. In a comprehensive meta-analysis, simulations with six hadronic interaction models were tested against the combined data (three from the previous generation and three state-of-the-art). The results are shown in Fig. 3. The data points exceed the expectation for showers above \(10~\text{PeV}\). More precisely, a positive linear slope with \(8~\sigma\) significance was found. This significance is higher than that of any individual measurement, where the significance for a muon deficit does not exceed \(3~\sigma\). More details on the WHISP meta-analysis are given in Sect. 2.6.

Fig. 3
figure 3

Compilation of muon measurements converted to the abstract \(z\)-scale and after cross-calibrating the energy scales of the experiments as described in the text (image from Soldin 2021). Shown for comparison are predicted \(z_{\text{mass}}\)-values based on air shower simulations and \(X_{\text{max}}\)-measurements (grey band). The prediction from the GSF model (Dembinski et al. 2018) for \(z_{\text{mass}}\) is also shown (dashed line)

This muon discrepancy is called the Muon Puzzle, because the authors of the generators have been unable to resolve the discrepancy by parameter tuning. The required changes to existing models would violate either data constraints from accelerators or the consistency between air shower simulations and the other air shower features. This suggests that a physical effect is missing in the generators that governs soft hadronic interactions at high energies. This point is explored in Sects. 2.8 and 3. Also remarkable is the early onset of the muon discrepancy, which starts smoothly above the knee and increases with the logarithm of the shower energy. The onset of the discrepancy is seen in showers where the first interaction of the cosmic ray has a centre-of-mass (cms) energy of about \(8~\text{TeV}\) in the nucleon-nucleon system. This suggests that the source of the Muon Puzzle can be studied at the LHC.

Given the wealth of existing LHC data described in Sect. 4.2, this poses the question why the source of the discrepancy was not yet observed, which is the other aspect that turns the muon discrepancy into a puzzle. One likely answer is that we have not yet looked in the right place. Since the study of soft hadronic interactions was not driving their design, most LHC experiments focus their instrumentation on the mid-rapidity region where new heavy particles such as the Higgs are best observed, while the air shower development is strongly dominated by particles produced in the forward region. The relevant phase-space starts at pseudo-rapidityFootnote 1\(\eta \gtrsim 2\). LHCb is the notable exception in that it is fully instrumented in the pseudo-rapidity range \(2 < \eta < 5\). Important input is further provided by the CMS experiment with its CASTOR forward calorimeter covering \(-6.6 < \eta < -5.2\) and from the TOTEM and LHCf experiments. An overview of the acceptances of LHC experiments is given in Sect. 4.1.

It is well-known that the number of muons \(N_{\mu }\) produced at the end of the hadronic cascade is very sensitive to the energy fraction carried away by photons, which are primarily produced in \(\pi ^{0}\) decays, as explained in Sects. 2.2 and 2.8. A reduction of the \(\pi ^{0}\)-fraction has a compounding effect in the shower cascade, so that only a comparably small change is required to obtain a notable change in the muon number. Why the \(\pi ^{0}\)-fraction cannot be easily changed without introducing new physics is explained in Sect. 3, together with extensions of the standard soft QCD picture in high-multiplicity interactions which achieve just that. Such extensions seem inevitable based on basic QCD arguments and are required to match recent LHC data.

We may have already seen a glimpse of the solution to the puzzle. The ALICE experiment has observed a universal enhancement of strangeness production (Adam et al. 2017b; Vasileiou 2020) in high-multiplicity events at mid-rapidity, an effect that was previously only observed in heavy-ion collisions (Koch et al. 1986). An increase in strangeness leads to a corresponding relative decrease of the pion yield, including the \(\pi ^{0}\) yield. Preliminary studies suggest that this effect could potentially solve the Muon Puzzle (Baur et al. 2019; Anchordoqui et al. 2020), if it is present also in the forward region that drives the air shower development.

One of the most pressing questions is therefore whether there is also a universal enhancement of strange particle production in the forward region \(\eta > 2\). The production cross-sections for strange hadrons need to be measured as a function of charged particle multiplicity in several collisions systems (p-p, p-Pb, and Pb-Pb). The LHCb experiment is in a unique position to perform such measurements with its high-precision particle tracking with particle identification in the forward region. Such studies can be done by analysing already recorded LHC data. More generally speaking, precise measurements of the energy fraction carried by forward photons are needed. The CASTOR component of CMS and the LHCf experiment are ideal for such measurements.

Similarly important would be the study of soft hadronic interactions in future p-O collisions at the LHC, which directly mimic interactions of the cosmic ray with air, which have been proposed in Citron et al. (2019) for the upcoming Run 3 of the LHC. It is clear that soft hadron production at the TeV scale is full of unexpected phenomena and the scaling of hadron production from a p-p system to a p-O system is not well understood. Current generators show a large variation in the charged particle multiplicity of \(\pm 25\%\) in the p-O system despite good agreement in the p-p system, which could be reduced five-fold to 5% or better with LHC data. While it may be possible to interpolate to p-O from reference measurements in p-p and p-Pb, this has to be confirmed with an explicit high-precision measurement. The case for p-O collisions at the LHC is discussed in Sect. 4.4. Finally, there are also opportunities to further improve our knowledge on hadronic interactions below the TeV scale with fixed-target experiments that are described in Sect. 4.3.

A solution for the Muon Puzzle from the LHC would have a large impact on the field of astroparticle physics. The sizes of the \(N_{\mu }\)-bands in Fig. 2 could shrink by factors 2.5 to 4 (depending on the energy). The ambiguity of the cosmic-ray mass composition at the EeV scale would be resolved and the LHC measurements on hadron production would also further decrease the uncertainty of \(X_{\text{max}}\)-predictions. The changes in air shower predictions would have an immediate impact on the interpretation of existing air shower data and trigger the re-evaluation of existing data. The improvements to the generators and the increased accuracy of the estimated cosmic-ray mass composition would in turn make predictions of atmospheric lepton fluxes more precise, the main background for neutrino observatories like IceCube. Air shower measurements connected to the Muon Puzzle are summarised in Sect. 2.7.

2 Cosmic-ray detection via air showers: Muon Puzzle and connected challenges

Cosmic rays with EeV energy produce extensive air showers (EAS) in Earth’s atmosphere with lateral extensions that reach tens of kilometres. The angular spread of the produced particles with respect to the direction of the cosmic ray is initially very small due to relativistic boost, but increases as the average energy per particle drops and eventually reaches a few degrees. These particles define the shower axis. The axis is characterised by two angles, the zenith angle \(\theta \) and the azimuthal angle \(\phi \). The zenith angle measures the inclination of the axis, \(0^{\circ }\) corresponds to a vertical down-going shower. At any given point in time, the shower particles are concentrated in a thin disc with a thickness of a few tens of meters perpendicular to the shower axis that moves with the speed of light. The fastest particles (dominantly muons) form the shower front. The position where the shower axis intersects the ground is called the shower core.

Air showers above PeV energies are highly regular and can be characterised by a few parameters, a feature sometimes referred to as shower universality (Patterson and Hillas 1983; Giller et al. 2005; Lipari 2009; Yushkov et al. 2010). Statistical fluctuations in the first few interactions have an important effect on the development of the rest of the shower, but the later shower stages show no substructure since fluctuations average out due to the large number of particles interactions. This makes it feasible to build cosmic-ray observatories with huge apertures that cover thousands of square-kilometres by building sparse arrays of particle detectors at the ground. In principle, a shower can be fully characterised with measurements at only three spatially separated points. The angles \(\theta \) and \(\phi \) of the shower axis can be accurately triangulated from the arrival times, while the measured local densities determine the shower core and the amplitude of its lateral profile of particle density.

After the first interaction, the shower develops two partially coupled parallel cascades: the hadronic and the electromagnetic cascade. The hadronic cascade is formed by interactions of long-lived hadrons (mostly pions, but also nucleons and strange hadrons), while the electromagnetic cascade is fed by photons which are mostly generated by decays of neutral pions. The electromagnetic cascade develops independently via electron pair-production and bremsstrahlung. There is only a small feedback from the electromagnetic into the hadronic cascade from rare photo-nuclear interactions.

The air shower produces long-lived hadrons, electrons, photons, muons, and neutrinos. The average energy of particles in the shower continuously decreases, since every interaction spreads the energy of the parent onto several children. Most of the muons are produced at the end of the hadronic cascade when the Lorentz factor gets so small that decay becomes more probable than another interaction. A small fraction of muons is produced electromagnetically by direct pair-production. Neutrinos are completely decoupled once produced. A small fraction of muons decays to feed the electromagnetic component and neutrino component. Muons trace the development of the hadronic cascade.

Charged particles produced by the shower ionise air and thereby loose energy. The ionised air molecules eventually fall back to their ground state. Some of that released energy is converted into isotropically emitted fluorescence light; the process has been measured precisely in the laboratory (Ave et al. 2013). This light is used in dark nights to observe air showers at several kilometres distance with sensitive telescopes. The advantage of fluorescence light is that the shower can be seen from the side, which allows a single telescope to observe a large volume of air. Air showers further generate Cherenkov light flashes in a cone around the shower axis. The aperture achieved with Cherenkov light is smaller, so this technique is mostly used for the detection of sub-EeV air showers.

The air shower detection with particle detectors at the ground and with telescopes detecting fluorescence and Cherenkov light has become the established standard in the field. To measure the muon number, particle detectors are required and the telescopes provide important additional information. It was the combination of telescopes with particle detectors in a large scale experiment, the Pierre Auger Observatory, that allowed the field to unambiguously establish the muon production as the source of several discrepancies that had been observed in the last twenty years. A third more recently developed technique is the detection of air showers via radiowave emissions (Falcke et al. 2005; Buitink et al. 2014; Kostunin et al. 2016), which could become an alternative to the detection via fluorescence and Cherenkov light with a 100% duty cycle.

In the following subsections, we give an overview on how air showers are modelled and detected, and highlight key results in regard to the Muon Puzzle.

2.1 Calculating air showers with cascade equations

The transport, production and decay of particles in the atmosphere can be characterised by coupled differential equations. A generic introduction into this subject is given by Gaisser et al. (2016). These equations are conveniently expressed as a function of slant depth \(X\). A slant depth interval is the product of the geometric length interval \(\text{d}s\) travelled and the air density, \(\text{d}X = \rho _{\text{air}} \, \text{d}s\). Let \(n\) be the density of particles of type \(k\) in an infinitesimal energy interval \(\text{d}E\), then the change of that density over a travel distance over a slant depth interval is given by

$$\begin{aligned} \frac{\text{d}n}{\text{d}X} &= -\left ( \frac{1}{\lambda _{\text{int},k}(E)} + \frac{1}{\lambda _{\text{dec},k}(E, X)} \right ) n(E,X) \\ &\quad{}-\frac{\text{d}}{\text{d}E}\big(\mu _{k}(E) \, n(E,X)\big) \\ &\quad{}+\sum _{\ell }{\int _{E}^{\infty }\text{d}E_{\ell }~ \frac{c_{\ell \to k}(E_{\ell },E)}{\lambda _{\text{int}, \ell }(E_{\ell })}} n_{\ell }(E_{\ell },X) \\ &\quad{}+\sum _{\ell }{\int _{E}^{\infty }\text{d}E_{\ell }~ \frac{d_{\ell \to k}(E_{\ell },E)}{\lambda _{\text{dec}, \ell }(E_{\ell }, X)}} n_{\ell }(E_{\ell },X), \end{aligned}$$
(1)

with \(n \equiv n_{k}\), \(E \equiv E_{k}\). This equation describes the longitudinal shower development. The lateral development is integrated out, but can in principle be included as well. The first two terms describe losses; particle loss from interaction and decay with the lengths \(\lambda _{\text{int},k}\) and \(\lambda _{\text{dec},k}\), respectively, and energy loss from ionisation where the specifics are captured by the function \(\mu _{k}(E)\). The last two terms describe gains from interactions and decays of other particle species \(\ell \), which depends on the interaction and decay lengths and the transfer probabilities \(c_{\ell \to k}\) and \(d_{\ell \to k}\). The integrals go formally to infinity, but are cut off by the upper energy limit in \(n_{\ell }(E_{\ell }, X)\).

Cascade equations are primarily used to compute atmospheric lepton fluxes, where the initial densities \(n_{k}(E, 0)\) are given by the energy spectra of cosmic nuclei, but they are also used to calculate particle densities in an average air shower. The air shower solution is obtained with the initial condition \(n_{k}(E, 0) = \delta (E-E_{0})\). The solution for a \(10~\text{EeV}\) proton calculated with the program MCEq (Fedynitch et al. 2015) is shown in Fig. 4. The particle densities initially increase exponentially as the energy of the initial cosmic ray is distributed to more and more secondary particles. The cascading process stops when most particles drop below a critical energy, where decay (in case of hadrons) or absorption (in case of electrons) becomes dominant over re-interaction. At this point the exponential growth is replaced by exponential loss. The loss rate for electrons is much higher than for muons due to the much smaller mass and correspondingly larger pair-production cross section. Muons start off with energies of tens of GeV and are minimum ionising with an energy loss of only a few \(\text{MeV}/(\text{g}\,\text{cm}^{-2})\).

Fig. 4
figure 4

Average particle count (top) and mean energy (bottom) in an air shower initiated by a 10 EeV proton at a zenith angle \(\theta = 45^{\circ }\). In the legend, em corresponds to electrons and photons. The maximum of this component is the observable shower maximum with slant depth \(X_{\text{max}}\)

The interaction length \(\lambda _{\text{int},k}\) is the ratio of the average mass of an air nucleus \(m_{\text{air}}\) and the inelastic cross-section, \(\lambda _{\text{int},k} = m_{\text{air}} / \sigma _{\text{inel},k}(E)\). The decay length \(\lambda _{\text{dec},k}\) depends on the gamma factor \(\gamma = E/m\) of the particle, the decay length in its rest frame \(c \tau \) and the local air density, \(\lambda _{\text{dec}} = E/m \, c \, \tau \, \rho_{\text{air}}(X)\). These losses compete; we have \(\lambda _{\text{int},k} \ll \lambda _{\text{dec},k}\) at high energy, but the situation reverses eventually at low energies.

Hadrons, electrons, and photons are produced dominantly by interactions,

$$ c_{\ell \to k}(E_{\ell },E_{k}) = \frac{\text{d}N_{\ell + air \to k + X}}{\text{d}E} (E_{\ell }, E_{k}). $$
(2)

In hadronic processes, the couplings \(c_{\ell \to h}\) correspond to inclusive differential production cross-sections or decay energy distributions (Fedynitch et al. 2018b). Theoretical uncertainties in regard to hadron production enter here. For electromagnetic processes, the couplings correspond to pair production, annihilation, bremsstrahlung, Moeller, Bhabha and Compton scattering (Bergmann et al. 2007), which are well understood and can be calculated from first principles. Muons and neutrinos are produced dominantly by pion decays. The transfer is given by the decay energy distributions,

$$ d_{\ell \to k}(E_{\ell },E_{k}) = \frac{\text{d}N_{\ell \to k + X}}{\text{d}E} (E_{\ell }, E_{k}), $$
(3)

which are also well known for the decays of long-lived particles. Short-lived particles are not explicitly tracked, they implicitly contribute to these transfers or they are neglected if their production cross-sections are small.

Cascade equations are ideal to study the influence of changes in hadron production on average air shower variables. The equations offer exact solutions if all relevant processes are implemented, and solving them takes only a few seconds with modern programs. Since the mapping between inputs and outputs is analytical, one can also perform error propagation of inputs to outputs. In order to reproduce shower-to-shower fluctuations of quantities like \(X_{\text{max}}\) and \(N_{\mu }\), another approach is required where the cascade equations are not solved numerically, but with Monte-Carlo methods. Some codes use hybrid calculations, where the initial stages of the shower are solved with the Monte-Carlo method, then cascade equations are used when the particle count is large, and finally Monte-Carlo methods are used again in the final stages. An overview of the available codes is given in Sect. 2.3. In the next subsection, we discuss a useful toy model of an air shower, which has simple analytical solutions.

2.2 Heitler model of air shower development

Air showers can be computed accurately using cascade equations, which are solved either numerically or by Monte Carlo methods. To gain basic insights, simplified air shower models of the Heitler type have been very successful. They are not viable alternatives to solving cascade equations, but reveal essential relationships. In some cases, the derived formulas produce accurate predictions if the coefficients are taken from full air shower simulations. We briefly outline how the basic Heitler model is constructed and list key results that are relevant for the Muon Puzzle. For more details, we refer to the accessible treatments by Matthews (2005) and Ulrich et al. (2011).

Heitler (1984) introduced a simplified model of an electromagnetic cascade consisting only of photons and electrons. Starting from a primary particle of energy \(E\), the particle count doubles after travelling a fixed amount of slant depth. The energy of the parent is distributed equally among the children. The doubling stops when the energy per particle falls below a critical energy, which is reached when absorption becomes equally likely to another splitting. At this point, all particles are absorbed in the medium.

The concept was generalised to a cosmic-ray induced air shower by Matthews (2005). The shower is approximated by a pure pion shower, as sketched in Fig. 5. Charged and neutral pions are produced. The total number of pions produced is \(N_{\text{mult}}\). Neutral pions decay immediately into photon pairs, which are further treated with the aforementioned electromagnetic cascade. We call the energy fraction \(\alpha \) that is retained in charged pions. In the basic Heitler-Matthews model, \(\alpha \) is exactly \(2/3\), but it is instructive to keep \(\alpha \) as a parameter in the equations that can take on other values. In a real shower also other long-lived hadrons are produced, which increases \(\alpha \). Charged pions travel though a fixed amount of slant depth and then produce a fixed number \(N_{\text{mult}}\) of new pions. The energy of the parent is distributed equally among the children. The hadronic cascade stops when the energy per pion reaches the critical energy \(\xi _{h}\), which is reached when the decay length of a charged pion becomes equal to the interaction length. At this point, all charged pions decay into muons. Cosmic-ray nuclei can be included with the superposition model, which treats a shower initiated by a nucleus with \(A\) nucleons as \(A\) independent showers with energy \(E/A\).

Fig. 5
figure 5

First three steps of a hadronic cascade in the simplified Heitler-Matthews model (image from Ulrich et al. 2011). Solid (dashed) lines represent charged (neutral) pions. Only one of the splittings is shown after the second step

The Heitler-Matthews model connects basic air shower quantities, the primary energy \(E\) and mass \(A\) of the cosmic ray, the muon number \(N_{\mu }\), and the depth \(X_{\text{max}}\) of the maximum of the electromagnetic cascade, with features of hadronic interactions, the inelastic cross-section \(\sigma _{\text{inel}}\), the hadron multiplicity \(N_{\text{mult}}\), and the energy fraction \(\alpha \) retained in long-lived hadrons.

  • The number of steps \(k\) in the hadronic cascade increases logarithmically with the energy \(E\) as

    $$ k = \frac{\ln (E / \xi _{h})}{\ln N_{\text{mult}}} $$
    (4)

    which gives with \(\xi _{h} \approx 10~\text{GeV} \) and \(N_{\text{mult}}\approx 50\) values between 3 at \(1~\text{PeV}\) and 5 at \(1~\text{EeV}\).

  • The muon number scales sub-linearly with the cosmic-ray energy,

    $$ N_{\mu }(E, A) = A^{(1-\beta )} \left (\frac{E}{\xi _{h}} \right )^{\beta }\text{ with } \beta = \frac{\ln (\alpha \, N_{\text{mult}})}{\ln N_{\text{mult}}}. $$
    (5)

    The fact that \(\beta \approx 0.9\) is close to but less than 1 is the consequence of the energy transfer from the hadronic to the electromagnetic cascade without an equivalent feedback.

  • There is a linear relationship between the mean-logarithmic mass \(\langle \ln A \rangle \) and the mean-logarithmic muon number

    $$ \langle \ln N_{\mu }\rangle (E, A) = \langle \ln N_{\mu }\rangle (E, 1) + (1-\beta ) \, \langle \ln A \rangle , $$
    (6)

    where \(\langle \ln N_{\mu }\rangle (E, 1)\) is the mean-logarithmic muon number for proton showers. This follows from Eq. (5), see Dembinski (2018) for a more detailed discussion. In practice, \(\beta \) is taken from air shower simulations. Iron showers have about 40% more muons than proton showers at the EeV scale.

  • Likewise, there is a linear relationship between the shower depth \(X_{\text{max}}\) and \(\langle \ln A \rangle \)

    $$ \langle X_{\text{max}} \rangle (E, A) = \langle X_{\text{max}} \rangle (E, 1) - D_{p} \, \langle \ln A \rangle , $$
    (7)

    where \(D_{p} = \text{d}\langle X_{\text{max}} \rangle (E,1) / \text{d}\ln E\) is the so-called elongation rate for proton showers, which taken from air shower simulations in practice. Proton showers develop deeper by about \(100~\text{g}\,\text{cm}^{-2}\) on average than iron showers at the EeV scale. A more detailed discussion is given in Kampert and Unger (2012), Abreu et al. (2013).

Further conclusions can be drawn about the dependence of muon production in air showers and microscopic features of hadronic interactions which are listed below without derivation. They are confirmed overall by detailed simulations as described in Sect. 2.8, with minor modifications.

  • Both \(N_{\mu }\) and \(X_{\text{max}}\) depend weakly on the hadron multiplicity \(N_{\text{mult}}\).

  • The muon number \(N_{\mu }\) is independent of the inelastic cross-section \(\sigma _{\text{inel}}\) for pion interactions, while \(X_{\text{max}}\) is very sensitive to it.

  • The muon number is very sensitive to \(\alpha \), while \(X_{\text{max}}\) is (nearly) independent of \(\alpha \). The relative increase in \(N_{\mu }\) from Eq. (5) for a small change \(\Delta \alpha \) is to first order

    $$ \frac{\Delta N_{\mu }}{N_{\mu }} \approx \frac{\ln (E / \xi _{h})}{\ln N_{\text{mult}}} \, \frac{\Delta \alpha }{\alpha } = k \, \frac{\Delta \alpha }{\alpha } , $$
    (8)

    where \(k\) is the number of steps of the hadronic cascade. For an \(\text{EeV}\) air shower, \(k \approx 5\), which implies that a \(10\%\) change in \(\alpha \) introduces a \(50\%\) change in the muon number. This implies that we need to measure \(\alpha \) very precisely over a wide energy range and understand its extrapolation toward higher energies.

The Heitler-Matthews model is useful to build an intuition about air showers, but it is important to keep the approximations and simplifications in mind to not overinterpret the results. We summarise them here.

  • All secondaries receive the same energy fraction. In reality, the energy depends strongly on the pseudorapidity of the particles. Particles produced at forward pseudorapidity in the cms-system of a hadron-air collision carry the largest energies in the lab frame, and therefore quantities like \(N_{\text{mult}}\) and \(\alpha \) should be understood as averages of the subset of forward produced particles.

  • Hadronic interactions produce other long-lived particles in addition to pions. Also important are kaons, protons, and neutrons. The relative fractions of these other hadrons could be the key for solving the Muon Puzzle, since effects which enhance strangeness and baryon production keep more energy in the hadronic cascade and increase \(\alpha \).

  • The hadronic interaction length and the hadron multiplicity \(N_{\text{mult}}\) are not constant but weakly energy dependent. The impact of this was further studied by Montanus (2014).

  • The atmosphere does not have constant density which has an impact on the critical energy \(\xi _{h}\), which depends on the zenith angle of the shower. Vertical air showers develop in denser atmosphere and have a lower critical energy than inclined air showers. This effect is best described by full simulations with a realistic atmosphere, but it can be ignored if only showers with a fixed zenith angle are considered. For an isothermal atmosphere, \(\xi _{h}\) can be calculated analytically, see Kampert and Unger (2012).

  • Since each random process is replaced by its average process, the model describes an average air shower. Extensions of the basic model are needed to describe intrinsic shower fluctuations.

Several authors have refined the Heitler-Matthews model to make its predictions more accurate or performed additional calculations based on the model. Grimm et al. (2018) has investigated the leading particle effect that Matthews (2005) discusses only briefly, which is important in real air showers. Montanus (2014) has studied the effect of the weak energy dependence of certain hadronic interaction features that are set constant in the original treatment. Kampert and Unger (2012) present formulas for the first two moments of the cosmic-ray mass composition from measurements of depth \(X_{\text{max}}\) of the shower maximum, and discuss the energy-dependence of \(N_{\text{mult}}\) and the hadronic interaction length. Dembinski (2018) presents a discussion of the first two moments of the muon number \(N_{\mu }\), and the impact of additional fluctuations in the measurement on these estimates.

2.3 Air shower simulation codes

Accurate predictions of observable air shower features requires solving the cascade equations from Sect. 2.1 numerically or via Monte-Carlo methods. The full complexity with particle collisions, decays, hadronic and electromagnetic physics, elastic scattering, continuous energy losses as well as deflection in magnetic fields can only be handled by computer programs. Monte-Carlo methods are required when the full four-dimensional structure of the shower needs to be known and when shower-to-shower fluctuations are of interest. The field profits from a set of well-maintained and widely used air shower simulation programs that offered robust predictions over long periods of time, providing the foundation to compare data from past and present experiments to model predictions, as discussed in Sect. 2.6.

Special event generators are developed to describe hadron interactions up to the highest energies, but these models loose validity at low (\(\text{GeV}\) level) energies. Simulation codes therefore switch from a high-energy to a low-energy model when the lab energies fall below about \(100~\text{GeV}\). Available choices for the low-energy model are UrQMD (Lang et al. 2016), Fluka (Böhlen et al. 2014), and the now disfavoured Gheisha (Fesefeldt 1985). These generators are mostly limited to lab energies below \(\text{TeV}\) or less.

Both low-energy and high-energy generators have an impact on the resulting air shower cascades. Their impact approximately factorises in high-energy showers, but this is not true for showers below \(\text{TeV}\) energy, where a strong interplay between the high-energy and low-energy model is observed (Parsons and Schoorlemmer 2019; Pastor-Gutiérrez et al. 2021).

The low-energy models have a large impact on the lateral density profile of muons at the ground level (Swordy et al. 2002; Drescher and Farrar 2003b; Drescher et al. 2004), but minor impact on the total muon number. A discrepancy in the low-energy models cannot explain the observed increase in the muon discrepancy with the shower energy that is described in Sect. 2.6. The number of steps in the hadronic cascade calculated with the high-energy model increases with the shower energy, but remains constant for the low energy model. The low-energy model can therefore only contribute a constant to the discrepancy. More details about this point can be found in Sect. 2.2 and Sect. 2.8. Accordingly, this review focusses on the high-energy models, which are discussed in Sect. 2.4, where also the references are given. However, there are still opportunities to further improve the low-energy models with experimental data, which are discussed in Sect. 4.3.

There are only minor variations in the outputs of air shower codes, if the same hadronic interaction models are used. The outputs of several programs have been compared (Knapp et al. 2003; Roh et al. 2013; Bergmann et al. 2007; Ortiz et al. 2005) and generally agree within 5% in the predicted muon number \(N_{\mu }\). It is therefore unlikely that the Muon Puzzle originates entirely from an inaccuracy in the air shower codes, but further reductions of these uncertainties are desirable. Apart from the Muon Puzzle several other deviations between muon measurements and simulations have been observed, which are discussed in Sect. 2.7.

There is an ongoing effort to reduce the uncertainties in the lepton transport codes, especially in regard to simulation for underground detectors. Leptons which propagate through kilometres of dense matter undergo hundreds of interactions so that even small errors have a large compounding effect (Chirkin and Rhode 2004). Proposal is a recent lepton propagation code with the goal to compute atmospheric lepton fluxes to 1% accuracy (Koehne et al. 2013; Dunsch et al. 2019). The calculation includes radiative corrections (Sandrock et al. 2018) to the average energy loss, higher-order corrections to bremsstrahlung and pair-production cross-sections in \(Z\alpha \), and corrections due to the finite size of the nucleus (Sandrock and Rhode 2018). These predictions can be tested to the percent level with large samples of neutrino-induced muon tracks collected by large underground detectors (Soedingrekso et al. 2020). Proposal can be used standalone and as the lepton propagator in the new air shower program Corsika 8 (Engel et al. 2019). EmCa (Meighen-Berger et al. 2019) is another new code for the high-accuracy simulation of electromagnetic cascades. The EmCa cross-sections are also used in MCEq (Fedynitch et al. 2015).

An overview of widely used air shower programs is given in Table 1 and they are briefly introduced below.

  • Aires (Sciutto 1999) is the successor of Mocca (Hillas and Lapikens 1977; Hillas 1997). It was extended in particular to interface with different hadronic interaction models. Aires is an air shower program with a feature set similar to Corsika. It supports a wide range of high-energy hadronic interaction models, but does not allow the user to change the low-energy model. Low-energy hadronic interactions are implemented with the Hillas-Splitting-Algorithm (Hillas 1981) tuned to reproduce Gheisha. The electromagnetic cascades include the Landau-Pomeranchuk-Midgal (LPM) effect (Landau and Pomeranchuk 1953; Migdal 1956) and are simulated with custom code. Radio emission is generated with the ZHAireS extension (Alvarez-Muniz et al. 2012). Aires and Corsika have been used to simulate air showers for the Pierre Auger Observatory and were compared by Knapp et al. (2003). Predictions for the electromagnetic energy deposit and the produced muon number agree within 2–3%. Aires is faster than Corsika by a factor of 3 to 4, probably in part due to its simplified treatment of low-energy hadronic interactions.

  • Conex offers fast simulation of air showers by solving the cascade equations numerically (Bergmann et al. 2007). The computing time is independent of energy and less than a minute per event. Conex does not simulate the shower in four dimensions, only its longitudinal development along the shower axis. This limits the application since air shower experiments typically need full simulations, but Conex is ideal for studying the effect of hadronic interactions on longitudinal shower features, like the profile of electrons and photons and the depth \(X_{\text{max}}\) of shower maximum. Conex was developed in parallel with Corsika and supports the most relevant subset of interaction models implemented in Corsika. It is further able to perform hybrid simulations where a full Monte-Carlo simulation of the first interactions is performed and the output is used to feed cascade equations. This allows one to simulate shower-to-shower fluctuations.

  • Corsika is an air shower program (Heck et al. 1998) originally developed for the detector design and the physics interpretation of the KASCADE experiment (Antoni et al. 2003). It has since been used by most air shower experiments in the last 30 years to simulate showers from cosmic rays and \(\text{PeV}\) gamma rays, owing its widespread use to a significant amount of continuously invested resources to make it the most complete, well documented, and comprehensive tool for air shower simulations. It was designed from the beginning as a tool open to the community for adaptation and improvement. Corsika supports the widest range of low-energy and high-energy hadronic interaction models. Electromagnetic cascades are simulated with the EGS4 code (Nelson et al. 1994) that was extended with the LPM effect. Particle decays are taken from Pythia 6 (Sjostrand et al. 2006). Cherenkov photon emission is simulated with the Bernlöhr package (Bernlöhr 2009) and radio emissions are generated with CoREAS (Huege et al. 2007).

  • Corsika 8 is a modern reinvention of Corsika rewritten from scratch in C++ to allow for even more modularity and to support parallel HPC hardware and accelerators (Engel et al. 2019). It is currently in development and not yet available for physics production. Building on the experience of the Fortran version, it is a community project from the start, open to all collaborations and individuals to contribute. The goal of the project is to build the most comprehensive distribution of models, tools, and techniques, to fill the simulation needs of current and future astroparticle experiments and to make new sophisticated forms of data analyses possible. Authors can replace almost every aspect of the simulation to try out new ideas while relying on the rest of Corsika 8 to work. Through this feature and others, Corsika 8 will make it feasible to investigate systematic uncertainties beyond the existing boundaries of other codes. Corsika 8 simulates electromagnetic interactions with the lepton propagator Proposal (Koehne et al. 2013; Dunsch et al. 2019). Currently implemented hadronic interaction models are Sibyll2.3d and QGSJetII.04.

  • Cosmos is an air shower program (Kasahara and Cohen 2007) developed for the Akeno/AGASA experiments (Chiba et al. 1992; Takeda et al. 2003). It supports the simulation of air showers up to the highest energies with thinning and the simulation of Cherenkov photon emission. It allows users to choose between the high-energy interaction models QGSJet and DPMJet-III and low-energy interaction models JQMD, JAM (Koi et al. 2003), and Bertini (Heikkinen et al. 2003). Electromagnetic cascades are simulated with a custom code that includes the LPM effect. Cosmos and Corsika were compared by Roh et al. (2013). Some differences were found, but the number of muons was consistent within 5%.

  • MCEq is a program to numerically solve the cascade equations (Fedynitch et al. 2015). It was designed as a fast option to calculate atmospheric lepton fluxes (Fedynitch et al. 2018b) for neutrino observatories and can also be used to simulate an average air shower, which requires only a few seconds on modern hardware in both cases. MCEq is completely written in Python and achieves this speed by using sparse matrices and optimised linear algebra libraries. It supports a wide range of interaction models, Sibyll2.1 to 2.3d, QGSJet01c to QGSJetII.04, EPOS-LHC, DPMJet-III 3.0-6 and DPMJet-III 19.1, and also provides a range of models of the cosmic-ray flux as input to the atmospheric lepton flux calculations. MCEq can use electromagnetic cross sections from the EmCa code (Meighen-Berger et al. 2019). It is an ideal tool to study the systematic uncertainties that interaction models introduce in the simulation of atmospheric lepton fluxes and air showers (Aartsen et al. 2020b). Like Conex, MCEq does not simulate air showers in four dimensions. Fluctuations and structures in individual events, for example, muon bundles, are not present in the solutions to the cascade equations.

  • Seneca was a hybrid air shower simulation code (Drescher and Farrar 2003a). The first stages of the shower were fully simulated, then the output was fed into cascade equations, and finally the last stages of the shower were again fully simulated, by sampling particles from the energy distributions produced by the cascade equations. The goal was to greatly reduce the time needed to simulate EeV air showers. While Seneca is not further maintained, the principle ideas have been implemented also in Corsika since then.

Table 1 Comparison of dedicated air shower simulation codes that use full Monte-Carlo simulation (MC) and/or numerical solutions of cascade equations (CE) to compute hadronic and electromagnetic cascades up to ultra-high energies. Model legend. B:Bertini, D: DPMJET, EGS4: Nelson et al. (1994), E: EPOS, F: FLUKA, G: GHEISHA, HS: Hillas-Splitting, J: JAM, JQ: JQMD, LPM: LPM-effect (see text), Q: QGSJet, S: Sibyll, Tsai: Tsai (1974), U: UrQMD. Details and references for the hadronic models are given in the text and in Sect. 2.4. A star (⋆) indicates that development has been discontinued

2.4 Hadronic interaction models

An air shower is mainly driven by the outcomes of relativistic hadron-ion collisions with nitrogen and oxygen atoms under low momentum transfer in the non-perturbative regime of quantum chromodynamics (QCD). Since hadron production under these conditions cannot be calculated directly from first principles, effective theories and phenomenology are used. The codes that simulate hadron production are called generators or hadronic interaction models. They are the largest source of uncertainties in air shower simulations (Engel et al. 2011; Pierog 2018).

Specialised generators are developed for the simulations of extensive air showers (EAS) to address the needs of the astroparticle community, which differ from their pendants in the high-energy physics (HEP) or the heavy-ion collision (HIC) communities. The need to describe interactions at energies beyond the reach of colliders and to handle a variety of projectiles (nuclei, proton, charged pions and kaons) and targets (nitrogen, oxygen, argon) is specific to EAS generators. The production and decay of heavy-flavour (charm, beauty, top) and heavy bosons is important for HEP generators, but largely omitted in EAS generators. The predictive power of the generators is important (d’Enterria et al. 2011), since they are used to extrapolate to cms-energies that are one order of magnitude above the LHC and toward forward rapidities that are not well covered by LHC experiments.

The generators QGSJet (Ostapchenko 2006a,b, 2010, 2011) and Sibyll (Ahn et al. 2009; Riehn et al. 2017, 2020a,b) are focused on air shower simulation. They have a limited set of parameters (of the order of tens of parameters) and only implement physics which are important for shower development. These generators correspondingly focus on a small data set for tuning. The generators DPMJet (Ranft 1995, 1997, 1999; Ranft et al. 2003; Roesler et al. 2000; Fedynitch 2015; Fedynitch and Engel 2015) and EPOS (Werner et al. 2006; Pierog and Werner 2009; Pierog et al. 2015) have a more general focus on minimum-bias p-p and heavy-ion collisions and can be used for EAS simulations. Their parameter sets are larger (of the order of 100 parameters), but the data sets used to constrain them are also larger. In principle, all minimum-bias measurements from collider or fixed-target experiments can be used for tuning and validation.

There is also an effort to bridge the divide between HEP and EAS generators from the HEP side. Pythia (Sjostrand et al. 2006, 2008) is widely used at electron and proton colliders and recently expanded its focus toward heavy-ion collisions by adding the Argantyr model (Bierlich et al. 2018). There is also an interest to support EAS simulation. A technical obstacle is the Pythia’s initialisation scheme, which was not designed to switch collision energies and colliding particles frequently, but there are plans to mitigate this. Nevertheless, Pythia has been used for EAS simulation and compared to other EAS generators in a specialised study by d’Enterria et al. (2019).

The aforementioned generators are compared in Table 2 to give an overview of the physics implemented in these models. Pythia is included for reference. An explanation of the table content is given in the following.

  • Gribov-Regge field theory. To increase their predictive power in particular for the high energy extrapolation, all models used in shower simulation are based on the Gribov-Regge field theory (GRFT) from Gribov (1968), which links the inelastic cross-section (which can be easily constrained by data) to the particle production using a unique amplitude for the Pomeron exchange. In parton models like Pythia, there is no such fundamental link. A Pomeron is a colour-neutral object that can be exchanged between partons, the collective term for quarks and gluons.

    A flaw of classic GRFT is that energy conservation is taken into account only at the particle production level, but not for the calculation of the cross-section (Drescher et al. 2001). This is solved only in EPOS, where energy sharing is consistently used. This leads to a narrower multiplicity distribution, since events with multiple partonic interactions (MPI) are suppressed, in better agreement with data (Pierog 2018) than the Poissonian distribution of MPI given by classical GRFT and the parton model in Pythia.

  • Nuclear collisions. Generators need to support asymmetric nuclear collisions (projectile masses up to iron are relevant, \(A \le 56\); air targets are nitrogen, oxygen, and argon, \(A \le 40\)). Different approaches are used to predict nuclear collisions from elemental nucleon collisions. The classical Glauber approach (Glauber and Matthiae 1970; Miller et al. 2007) treats it as independent binary-pair collisions. In the semi-superposition model (Engel et al. 1992), a collision of a nucleus with \(A\) nucleons on a nucleus with mass \(B\) nucleons is treated as \(A \times {{\textit{p}}-B} \) collisions, where each nucleons carries an equal fraction of the energy of the nucleus, while the p-B cross-section is based on a Glauber calculation. The GRFT-motivated calculations consider the reduction the nucleon-nucleon cross-section with respect to p-p via higher-order Pomeron interactions (Drescher et al. 2001; Werner et al. 2006; Ostapchenko 2011).

  • Pomeron amplitude. In GRFT models, the Pomeron amplitude defines the evolution of the model as a function of energy and impact parameter to a large extent. There are two different approaches. In the soft+hard approach, the amplitude is the sum of a purely soft and a purely hard Pomeron based on external parton distribution function (PDF). In the semi-hard approach, the Pomeron is the convolution of a soft and a hard component based on Dokshitzer Gribov Lipatov Altarelli Parisi (DGLAP) equations (Dokshitzer 1977; Gribov and Lipatov 1972; Altarelli and Parisi 1977).

    In the soft+hard approach, one Pomeron is always connected to valence quarks with a large momentum fraction. The other MPI come from gluons at small momentum fraction \(x\). The elasticity in this approach (energy fraction carried by the most energetic particle) is constant with energy and the pseudorapidity distribution is narrower. LHC data suggests that it is too narrow already at the TeV scale (Pierog 2018).

    In the semi-hard approach, the parameters of the soft component can be tuned to PDF data, which increases predictive power, and the Fock states that are used as initial partons always carry a significant momentum fraction \(x\) of the projectile and target. This leads to a strong correlation between particle production at mid-rapidity due to MPI and beam energy loss which is linked to elasticity (Ostapchenko et al. 2016). It also leads to a higher average mass for each Pomeron, which broadens the pseudorapidity distribution in better agreement with LHC data.

    The Pomeron model (hard or semi-hard) also determines how the non-linear effects (screening or saturation) of the cross-section is handled at high energy. For a purely hard Pomeron, the only parameter which can be adjusted is the minimum integration limit \(Q_{0}\), which is parameterised as a function of energy to reproduce the inelastic cross-section. In the case of a semi-hard Pomeron, \(Q_{0}\) is constant and the non-linearity is introduced via higher-order Pomeron interactions, which are treated explicitly with enhanced graphs in QGSJetII and via an effective amplitude modified with a parametrisation in EPOS, which is obtained by tuning to cross-section and multiplicity data in various systems.

    Finally, in the case of the soft+hard approach, there are no beam remnants since the valence quarks (and diquark) of the projectile are always connected to the valence quarks (and diquark) of the target like in DPMJet-III and Sibyll. In case of semi-hard Pomeron, the beam remnant is used to carry the remaining energy and the parton Fock states. In QGSJetII, only one valence quark is exchanged and a low mass is given to the remnant for hadronization via string fragmentation. In case of EPOS, there is no limitation for the number of exchanged quarks, so that high-mass remnants can be produced with more than two or three quarks. These require a special hadronization scheme (micro-canonical). String fragmentation and resonance decay are also implemented, depending on the mass and quark content of the remnant. This scheme is supported by low-energy data on strange baryon production (Bleicher et al. 2002; Liu et al. 2003).

  • Parton distribution functions. The parton distribution functions (PDFs) are an important ingredient in parton shower models like Pythia. They are not a fundamental element of the GRFT models, which are based on the Pomeron amplitude. The leading-order PDFs can be interpreted as the momentum distributions of the partons inside the nucleon. At high momentum transfer, Pomerons can be connected to partons either using the soft+hard approach based on external PDFs and the standard mini-jet calculations of hard scatterings, or the semi-hard approach in which the PDFs are given by the Pomeron amplitude itself. EPOS and QGSJetII generate their own PDFs through the semi-hard approach, while Sibyll2.3d uses the GRV parametrisation (Gluck et al. 1995) as an external PDF. The models which target the HEP domain, DPMJet-III and especially Pythia 8, give more weight to calculating hard scatterings and use external state-of-the-art PDF parametrisations. In the case of DPMJet-III, CT14 (Dulat et al. 2016) is used, while Pythia 8 allows one to use various leading order and next-to-leading order PDFs (Kasemets and Sjostrand 2010), or a mix, where the former is used to simulate parton showers and the latter to simulate hard scatterings.

  • Diffractive dissociation. Collisions with diffractice dissociation (which is often shortened to diffraction) make up about 20% of the inelastic cross-section at the TeV scale and are an important ingredient for the elasticity of the interaction, which in turn is an important parameter for air shower evolution. No quantum numbers are exchanged between the beam particles in these collisions, but momentum is transferred and new particles are produced. One refers to low-mass diffraction when only a few GeV are transferred. It is handled with Good and Walker (1960) theory using a 2- or 3-channel eikonal, depending on whether only one or two diffractive states are considered. The beam particles are simply excited and hadronised. In high-mass diffraction, the momentum transfer is large enough so that the transferred Pomeron can produce new particles or even a jet, leading to much lower elasticity. To remain consistent with the energy sharing scheme in EPOS, all types of diffractive dissociation are computed with a special diffractive Pomeron which is added to the semi-hard Pomeron amplitude. The beam remnants from diffractive dissociation are usually hadronised by a simple resonance decay or via string fragmentation. The impact of remnant breakup on muon production was studied by Drescher (2008) and found to be potentially significant, but constant with shower energy.

  • Collective effects. The final hadronization of the excited system is very important to get the correct multiplicity and hadron ratios, in particular in regard to the muon production in air showers (Pierog and Werner 2008). All GRFT models are based on the fact that a Pomeron has a cylindrical topology which, once it is cut, produces two strings carrying the colour field between the beam remnant and the partons from the jets. The strings are hadronised based on different schemes; Lund (Sjostrand and Bengtsson 1987) or Area Law (Artru and Mennessier 1974; Drescher et al. 2001) or some custom process. The scheme is less important than the data on which the parameters are tuned. Sibyll and QGSJetII use p-p at various energies and consider only a subset of final-state particles for tuning. DPMJet-III, Pythia, and EPOS use data from \(e^{+}e^{-}\) collisions to tune the parameters based a larger list of particles. EPOS uses also heavy-ion collisions for tuning. Particle multiplicities at high energies and central rapidities are strongly affected by MPI, see Bartalini and Gaunt (2019) for a review. One of the reasons for some tension of Sibyll and DPMJet with high-energy multiplicity distributions are their relatively simple, Poissonian MPI models. Similar models lie at the basis of Pythia (Bartalini and Gaunt 2019) that improved the description of multiplicity and \(\langle p_{\text{T}} \rangle \) by introducing a new mechanism called colour reconnection (Ortiz Velasquez et al. 2013; Bierlich and Christiansen 2015), a form of string-string interaction. Two other new string-string interaction mechanisms in Pythia are string shoving (Bierlich et al. 2018) and rope hadronization (Bierlich et al. 2015; Bierlich and Christiansen 2015). EPOS keeps string fragmentation universal and uses the core-corona approach to describe high-energy p-p data (Werner et al. 2006; Pierog et al. 2015). A core of quark gluon plasma (QGP) droplets is formed in high density regions of the collision. The droplets are hadronised via a micro-canonical statistical decay that produces different particle ratios (more baryons and strangeness), while the corona is the low density region of the collision where strings are fragmented like in \(e^{+}e^{-}\) collisions. With this scheme, EPOS can reproduce all data from p-p to Pb-Pb and from SPS to LHC energy scales without introducing energy-depend parameters for the hadronization (Werner et al. 2014).

  • Charm production and decay. Some models include charm production, which (like heavy-flavour production in general) has no impact on conventional air showers and therefore on the Muon Puzzle, but it is important for the prediction of the prompt atmospheric neutrino flux, a major background for neutrino observatories. DPMJet and Pythia produce charm based on a pQCD calculation and in fragmentation. In Sibyll, central charm production is based on a parametrisation. In addition, intrinsic charm in the nucleon PDF has been introduced to correctly reproduce forward charm production.

Table 2 Comparison of the theoretical approaches in commonly used hadronic interaction models

2.5 Modern air shower detection

A cosmic-ray event is fully characterised by the energy \(E\) and mass \(A\) of the cosmic ray, and its arrival direction expressed by the zenith and azimuthal angles \((\theta ,\phi )\), and its arrival location \((x, y)\) at ground, the shower core. The orientation and location of the cosmic ray directly translates into that of the shower axis. This part of the measurement is relatively straight-forward. The energy is indirectly observed via the number of particles produced in the atmosphere and the mass via the longitudinal evolution of the shower and the number of muons produced.

The latest generation of high-energy air shower experiments are the Pierre Auger Observatory (Aab et al. 2015b) and the Telescope Array (Abbasi et al. 2018a). They use hybrid observations of the longitudinal development of the shower with fluorescence and Cherenkov telescopes and the particle footprint at the ground with particle detectors. An example of a hybrid detection is shown in Fig. 6. Hybrid observation of air showers at ultra-high energy was pivotal for the nearly model-independent observation of the discrepancy in the muon production in air showers.

Fig. 6
figure 6

Hybrid detection of an air shower illustrated with an event recorded by the Pierre Auger Observatory (from Aab et al. 2014a). Dots represent particle detectors on the ground, the four named structures at the edges of the array are fluorescence telescope stations. The shower axis (solid line) with its light intensity profile (grey dots) is indicated, as well the footprint in the ground array (black cylinders). The sizes of the dots and cylinders indicate the logarithm of the local light and particle density, respectively

Air shower observations with distributed detectors at ground were discovered by Kolhörster et al. (1938) and further exploited by Auger (Auger et al. 1939). The advantage of ground detection is the duty cycle of nearly 100%, the accurate measurement of the shower arrival direction, and the ability to measure the particle composition at ground, in particular the muon number \(N_{\mu }\). Muons that are measured with ground arrays have typical energies of 10 to \(100~\text{GeV}\) (Dembinski et al. 2010) and overwhelmingly originate from the last stages of the hadronic cascade. Only a single slice in the longitudinal evolution of the shower is observed. For that reason, measurements of the shower energy are usually fairly model-dependent for ground arrays. This can be compensated by building the ground array at high altitudes so that the slice is close to the shower maximum, which maximises the observable particle density and reduces the systematic uncertainty of the energy measurement.

The observation of showers via fluorescence telescopes was pioneered by the Fly’s Eye experiment (Baltrusaitis et al. 1985). It allows for a near-calorimetric measurement of the shower energy since most of the kinetic energy of an ultra-high energy cosmic ray is deposited into the atmosphere and a known fraction is converted into observable light (Aab et al. 2019; Ave et al. 2013). The observation of the depth of shower maximum \(X_{\text{max}}\) is found to be a very good estimator for the cosmic-ray mass. However, the duty cycle is limited to about 15%, since the detection requires dark nights. The combination of the two complementary techniques offers synergies that made high-precision measurements possible. Observing both \(X_{\text{max}}\) and \(N_{\mu }\) allows one to test hadronic interaction models used in air shower simulations, since both are sensitive to the cosmic-ray mass composition and if the implemented physics is correct, the mass composition inferred from \(X_{\text{max}}\) and \(N_{\mu }\) must be in agreement.

In summary, the gold standard to study hadronic interactions with air showers is the simultaneous and independent observation of the shower energy \(E\), the depth of the shower maximum \(X_{\text{max}}\) to determine the cosmic-ray mass, and the lateral muon density at the ground, \(\rho _{\mu }(r)\), which can be integrated to give the muon number \(N_{\mu }\). The lateral density \(\rho _{\mu }(r)\) depends on the energy \(E\) and mass \(A\), the zenith angle \(\theta \), the shower age (which is the slant depth between the shower maximum and observation level) and the energy threshold \(E_{\mu ,\text{min}}\) of the detector for muons.

It is not necessary to perform all these measurements simultaneously with the same detector, however, if external measurements in the same energy range are available. The average logarithmic muon number \(\langle \ln N_{\mu }\rangle \) depends on the average cosmic-ray mass composition quantified by \(\langle \ln A \rangle \) which can be measured by another experiment in the same energy interval via \(X_{\text{max}}\) observations. If this information is available, it is also possible to measure the muon discrepancy with a detector that can only independently measure the shower energy \(E\) and the muon number \(N_{\mu}\), but not \(X_{\text{max}}\). It is even possible to measure the muon discrepancy with a detector that only measures muons, if both the cosmic-ray mass composition and the energy spectrum are taken from another experiment. We will briefly discuss these different ways in which the muon discrepancy was observed.

To increase the experimental significance and to uncover more information about the origin of the Muon Puzzle, it is important to combine muon measurements from many experiments. The phase space of air shower parameters covered by the experiments that were used in the meta analysis by the WHISP group (discussed in Sect. 2.6 and shown in Fig. 7) are briefly discussed below.

  • The Pierre Auger Observatory (Aab et al. 2015b) is a hybrid experiment comprising 1660 water-Cherenkov stations positioned on a \(1.5~\text{km}\) triangular grid covering an area of \(3000~\text{km}^{2}\) that is overlooked by 27 fluorescence telescopes located at four sites at the periphery of the array. The observatory has the largest collected exposure and the largest collaboration among air shower experiments. The water-Cherenkov stations are very sensitive to muons and, due to their height of 1.2 m, also offer acceptance for horizontal air showers. The first and least model-dependent muon measurement at the EeV scale was obtained with highly-inclined air showers (Aab et al. 2015a) which are naturally muon-rich, followed by a measurement with vertical showers based on shower universality (Aab et al. 2016b). The measurement with highly-inclined showers was recently updated and also the relative shower-to-shower fluctuations in the muon number were measured for the first time in any air shower experiment (Aab et al. 2021). Since the start of regular operation, a number of enhancements have been installed at the site. AMIGA is a denser infill array with \(750~\text{m}\) spacing and buried scintillator detectors which are able to measure the isolated muon component also in vertical showers (Aab et al. 2016a). The infill array is overlooked by high-elevation telescopes optimised for the study of low-energy showers. As part of the on-going AugerPrime upgrade (Castellina 2019), the water-Cherenkov stations are being upgraded with scintillators. The upgraded surface detector array will allow for a model-independent measurement of the muon content in vertical showers with a duty cycle of 100%.

  • Telescope Array (Fukushima 2003) is a hybrid experiment that consists of a \(700~\text{km}^{2}\) array of 507 scintillator detectors with \(1.2~\text{km}\) spacing, overlooked by three telescope stations. Since scintillators have no acceptance for horizontal air showers, stations at large lateral distance to the shower axis have been used to measure the muon density in which the muon purity reaches 70% (Abbasi et al. 2018a). Telescope Array has a low-energy extension called TALE, which has been used to measure the cosmic-ray flux and composition from \(2~\text{PeV}\) to \(2~\text{EeV}\) (Abbasi et al. 2018b, 2021). A non-imaging optical array called NICHE (Bergman et al. 2020) is currently build, and the main array is upgraded to TA*4 to cover a four times larger area to gain more acceptance for energies above 50 EeV (Kido 2018).

  • Akeno Giant Air Shower Array (AGASA) (Chiba et al. 1992, now decommissioned) was a \(100~\text{km}^{2}\) surface array of 111 scintillation and 27 muon detectors with a spacing about \(1~\text{km}\). The array measured the cosmic-ray flux from about \(3~\text{PeV}\) to about \(30~\text{EeV}\). AGASA data on muons (Hayashida et al. 1995; Shinozaki and Teshima 2004) has been re-analysed and compared with modern air shower simulations (Gesualdi et al. 2020, 2021b) for inclusion in the meta-analysis.

  • The Yakutsk EAS Array is a \(8~\text{km}^{2}\) array of 58 scintillator stations, 48 non-imaging Cherenkov light detectors, six large-area muon detectors, and two partially-imaging Cherenkov light detectors (Anatoly 2013). It has been continuously taking data since 1974 (Glushkov et al. 2018) and has published air shower data in the energy range from \(1~\text{PeV}\) up to \(30~\text{EeV}\). The small size of the array (varying over time with a peak size of \(12~\text{km}^{2}\)) is somewhat compensated by an exceptionally long exposure time. The combination of data from the Cherenkov detectors, surface scintillators, and muon detectors allows for full hybrid observation of air showers. The array is located close to sea level (\(0.1~\text{km}\) altitude) and measures air showers well past their maximum. Determining the shower energy independently of the muon content with a surface array at sea-level is challenging. The energy calibration was recently updated by Glushkov et al. (2018) using a combination of shower simulations and data from all Yakutsk detectors. The cosmic-ray flux obtained from surface array data in this way is in agreement with the now excluded AGASA spectrum and incompatible with the flux that Yakutsk measured using only the Cherenkov detectors. This contradiction is unresolved. Glushkov and Saburov (2019) present measurements of the muon density.

  • The IceCube Neutrino Observatory (IceCube) consists of a cubic-kilometre ice-Cherenkov detector comprised of over 5000 optical sensors in the deep Antarctic ice (Aartsen et al. 2017) shielded by \(1350~\text{mwe} \) (vertical incidence), and a \(1~\text{km}^{2}\) surface array called IceTop that consists of 81 ice-Cherenkov detector stations with a spacing of \(125~\text{m}\) (Abbasi et al. 2013a). IceCube does not observe air showers with telescopes, but it has the unique capability to simultaneously measure the shower particles at the surface (electrons, photons, and muons) and high-energy (\(>300~\text{GeV} \)) muon bundles in the deep detector which are sensitive to the first interaction of the cosmic ray and offer unique insights into the hadronic physics of this interaction. The IceTop detector measures air showers at an altitude of \(2.8~\text{km}\) above sea level close to the shower maximum, which allows one to infer the shower energy with low model-dependence (Aartsen et al. 2013, 2019). The muon density at the surface is measured with a statistical technique that distinguishes muon hits far from the shower axis by their characteristic constant signal, while electrons and photons generate a continuum of signals that can be separated and subtracted (Dembinski 2017; Gonzalez 2019a,b).

  • The KASCADE-Grande experiment (Apel et al. 2010, now decommissioned) was a \(0.49~\text{km}^{2}\) array with 252 scintillator stations with \(13~\text{m}\) spacing (KASCADE) and additional 37 stations with an average spacing of \(137~\text{m}\) (Grande). The denser array contained both unshielded and shielded detectors to separately measure show electrons and muons. The array further contained several muon tracking detectors. The experiment was located close to sea level (\(0.1~\text{km}\) altitude) and measured air showers well past their maximum. Like in case of Yakutsk, it is challenging to determine the shower energy independently of the muon content at sea-level. A summary of muon results is given by Apel et al. (2017). The data of the KASCADE-Grande experiment were released to the public (Haungs et al. 2018) and analyses are still forthcoming.

  • The EAS-MSU Array (Fomin et al. 2016, now decommissioned) was a \(0.5~\text{km}^{2}\) array in its latest configuration with 76 unshielded charged-particle detector stations and additional underground stations which measured atmospheric muons under \(40~\text{m}\) water-equivalent with a threshold energy of \(10~\text{GeV}\) for vertical incidence. The array is similar in size and capabilities to the KASCADE-Grande array and also close to sea-level (\(0.15~\text{km}\) altitude). It faced the same challenges in determining the cosmic-ray energy independently from the muon content the shower. Muon data from the array was recently re-analysed by Fomin et al. (2017).

  • The SUGAR Array (Brownlee et al. 1968, now decommissioned) was a \(70~\text{km}^{2}\) array of 54 muon detectors placed at a depth of \((1.5 \pm 0.3)~\text{m}\) with a energy threshold of about \(0.75~\text{GeV}\) for vertical muons at an altitude of about \(250~\text{m}\) above sea level. The cosmic-ray flux in this experiment was measured by using the muon number \(N_{\mu }\) as an energy estimator, using a relationship between energy \(E\) and \(N_{\mu }\) predicted by air shower simulations. If the simulations suffer from a muon deficit, then the inferred cosmic-ray flux turns to be too high. Bellido et al. (2018) re-analysed the data in this regard by comparing the measured flux with a simulation based on the cosmic-ray flux measured near-calorimetrically by the Pierre Auger Observatory, which is not affected by the muon deficit, and two extreme mass composition assumptions of pure proton and pure iron showers. By attributing the discrepancy between measured and simulated flux solely to the muon deficit in simulations, the muon deficit was calculated.

  • The NEVOD-DECOR experiment (Barbashina et al. 2000; Petrukhin 2015) is a single large \(2000~\text{m}^{3}\) water-Cherenkov detector (NEVOD) with a \(70~\text{m}^{2}\) charged-particle detector array on its top (DECOR). The setup measures muon bundles produced by air showers. The density of the bundles and the arrival direction is recorded. Pure muon showers are selected with a cut on highly inclined events with \(\theta > 55^{\circ }\). The shower information collected by the NEVOD-DECOR experiment is even further reduced than in case of the SUGAR array. Neither the shower energy nor the shower core location are known event-by-event, since the detector is point-like compared to the lateral shower extension. A muon deficit in simulations has nevertheless been detected in a similar way as in case of the SUGAR array (Bogdanov et al. 2010, 2018). Expected density spectra of muon bundles were simulated based on a model of the cosmic-ray flux fitted to selected world data, which closely follows fluxes that have been measured near-calorimetrically, and two extreme mass composition assumptions of pure proton and pure iron showers. The muon deficit is then inferred from the discrepancy between the simulated and observed muon density spectra. Assigning a shower energy to the observed discrepancy is a challenge, since the measured local muon density spectra are the product of an integral over a wide range of shower energies.

Fig. 7
figure 7

Air shower experiments have measured the muon density at ground under various conditions, which are shown here (image from Soldin 2021). Points and lines indicate a measurement in a narrow bin of the parameter, while boxes indicate integration over a parameter range. Left: Zenith angle of air showers versus shower energy. Centre: Lateral distance of the muon density measurement versus shower energy. Right: Energy threshold for the muons that are counted in the experiment. Some experiments measure muons below a shielding, which increases the muon energy threshold

In addition to these experiments, other air shower experiments have performed muon measurements or are currently taking data. The HiRes-MIA collaboration performed the first combined measurement of the muon content of air showers together with optical telescopes (AbuZayyad et al. 2000), as a precursor to the Pierre Auger Observatory and Telescope Array. Haverah Park experiment has published a comprehensive series of muon measurements in air showers (Armitage et al. 1987; Coy et al. 1997; Blake and Nash 1995a,b, 1998, 2000) which were used to estimate the cosmic-ray mass composition (Ave et al. 2003) and put limits on the ultra-high photon flux (Ave et al. 2002). Data from Haverah Park and HiRes-MIA were not included in the meta-analysis, since the data were not re-analysed with modern air shower simulation codes. The GRAPES-3 air shower experiment (Hayashi et al. 2005) has a large-area muon tracking detector similar to the KASCADE experiment, but not yet published muon measurements.

2.6 Observation of the muon discrepancy in air showers

The report by Aab et al. (2015a) about a muon deficit in simulations compared to measurements from the Pierre Auger Observatory sparked renewed interest in the muon discrepancy and were followed by new muon measurements and the re-analysis of previously collected data from the EAS-MSU Array (Fomin et al. 2017), the IceCube Neutrino Observatory (Gonzalez 2019b), the KASCADE-Grande experiment Apel et al. (2017), the NEVOD-DECOR detector Bogdanov et al. (2018), the SUGAR array Bellido et al. (2018), Telescope Array Abbasi et al. (2018a), and the Yakutsk array Glushkov and Saburov (2019). The Pierre Auger Observatory also followed up with independent measurements using vertical showers, first using shower universality (Aab et al. 2016b) and then using direct muon measurements at lower energies with the AMIGA sub-array (Müller for the Pierre Auger collaboration 2019; Sánchez 2020). The Pierre Auger Observatory was not the first to report a discrepancy in the number of muons, but it offered the first nearly model-independent measurements with well-controlled systematic effects in comparison with post-LHC models. Discrepancies in the muon number had been reported before by the HiRes/MIA (AbuZayyad et al. 2000) and NEVOD-DECOR experiments (Bogdanov et al. 2010), but not by all experiments. The AGASA experiment, for example, did not report a muon discrepancy (Shinozaki and Teshima 2004).

The newfound wealth of data created the necessity for a meta-analysis of muon measurements. This led to the foundation of the Working group for Hadronic Interactions and Shower Physics (WHISP) formed by members of the aforementioned experiments (no contact could not be established for the HiRes/MIA experiment) with the goal to develop a common framework to compare all measurements, since a direct comparison is usually not possible. The muon density at the ground depends on many parameters which differ from experiment to experiment:

  • Cosmic-ray energy \(E\),

  • Zenith angle \(\theta \),

  • Shower age (depends on altitude of the experiment, local atmosphere, and zenith angle of the shower),

  • Lateral distance \(r\) from shower axis,

  • Energy threshold \(E_{\mu ,\text{min}}\) of the detectors for muons.

The WHISP introduced the abstract muon scale parameter \(z\) (Dembinski et al. 2019; Cazon 2020; Soldin 2021; Gesualdi et al. 2021a) defined as

$$ z = \frac{\ln \langle N_{\mu }\rangle - \ln \langle N_{\mu }\rangle _{p}}{\ln \langle N_{\mu }\rangle _{\text{Fe}}- \ln \langle N_{\mu }\rangle _{p}}, $$
(9)

which can be computed from the data of each experiment, where \(\langle N_{\mu }\rangle \) is the muon number or anything proportional to it averaged over showers in a narrow shower energy interval, and \(\langle N_{\mu }\rangle _{p}\) and \(\langle N_{\mu }\rangle _{\text{Fe}}\) are the corresponding values obtained from simulated air showers which undergo a full detector simulation and the same analysis as the real events. This definition cancels potential biases and is insensitive to a mismodelling of the \(N_{\mu }\) resolution in the experiment (Dembinski 2018). The natural range of \(z\) in absence of a muon discrepancy is \(0 < z < 1\), since proton and iron showers limit the range of observed cosmic-ray masses in practice.

By construction, \(z\) depends on the hadronic interaction model used in air shower simulations and therefore different values are obtained for each model. The simulations account for differences in the experimental conditions. This approach is feasible since simulations mainly differ by a global offset in the number of muons and less in other aspects like the lateral density profile or the zenith angle dependence, as shown in several studies, see e.g. Dembinski et al. (2010), Aab et al. (2014c).

Another important step was to cross-calibrate the energy scales of the participating experiments. Since the muon number \(N_{\mu }\) scales almost linearly with shower energy \(E\) as described in Sect. 2.2, the measured muon number needs to be compared in Eq. (9) with showers simulated at the exact same energy. Unfortunately, the overall calibration of the energy-scale of each experiment is only known with an accuracy of 10–20%. A 20% energy offset between two experiments translates to a shift of about 0.5 in \(z\), half the difference between proton and iron showers. These energy-scale offsets are well-known to affect the cosmic-ray flux measured by different experiments, leading to shifts in the flux that are consistent within the uncertainties of the respective energy scales, see e.g. Hoerandel (2003), Berezinsky et al. (2006, 2005), Dembinski et al. (2018), AbuZayyad et al. (2019). The WHISP removed the energy-scale offsets in \(z\) based on the observed offsets in the measured cosmic-ray fluxes, assuming that the flux offsets are caused entirely by shifts in the respective energy scales. This lead to a remarkable reduction in variance in \(z\).

The cross-calibrated \(z\)-values show an upward trend in \(z\) with increasing shower energy. The post-LHC models EPOS-LHC, QGSJetII.04, and Sibyll2.3 give a better description of air shower data than the pre-LHC models, but the trend is observed for all models. A modulation in \(z\) is observed, which follows that of \(z_{\text{mass}}\), the expected value of \(z\) for a cosmic-ray mass composition that agrees with the independent \(X_{\text{max}} \) measurements. After subtracting this expected modulation, the difference \(\Delta z = z - z_{\text{mass}}\) shows an approximately linear increase with shower energy \(E\), which is shown in Fig. 8.

Fig. 8
figure 8

Data from Fig. 3 after subtracting \(z_{\text{mass}}\), the expected variation in \(z\) due to changes in the cosmic-ray composition (image from Soldin 2021). The expected variations are taken from the GSF model (dashed line), which in turn is derived primarily from \(X_{\text{max}}\) measurements (grey band). Solid lines represent fits that assume different levels of correlated experimental uncertainties. The deviation of the slope \(b\) from zero in standard deviations is shown in the inset as function of the assumed correlation. Also shown in the inset is the value of the slope, scaled by a factor of 10

The quantity \(\Delta z\) can be interpreted as the relative muon deficit in units of proton-iron difference. The value \(\Delta z \approx 1\) at \(20~\text{EeV}\) corresponds to a muon deficit in simulations of about 40%. However, since the cross-calibration of energy-scales only removes relative offsets, a residual global offset of the energy cannot be excluded, which is roughly estimated at the level of 10% (Dembinski et al. 2019). It follows that all points in Fig. 8 can be moved up and down by about \(\pm 0.25\).

A fit of a line model \(\Delta z = a + b \log _{10}(E / \text{eV})\) yields a slope \(b\), which deviates positively from zero with 8 standard deviations or more. The slope is independent of the previously mentioned global shifts. Its fitted value depends only weakly on the model and the assumed correlation coefficient for the reported experimental uncertainties. The latter are known to be largely positively correlated, but the correlation coefficients for each experiment are not generally known. Therefore, a scan is performed over all values of the correlation coefficients from zero to one. To account for residual discrepancies in the data, the raw covariance matrix of the fitted parameters in this analysis is scaled by the \(\chi ^{2}\) of the fit divided by the degrees of freedom \(n_{\text{dof}}\), see Dembinski et al. (2019) for details. This is procedure effectively scales the uncertainties of all data points with a common factor so that \(\chi ^{2} = n_{\text{dof}}\), the same technique is used in similar cases by the Particle Data Group Zyla et al. (2020).

The result of the meta-analysis is remarkable in several ways. It is the first unified statement of this kind from nine air shower experiments and it demonstrates that results are fairly consistent if the well-known systematic effect of the energy-scale uncertainties is removed. We list the main conclusions:

  • The origin of the muon discrepancy can be studied at the LHC. The deviation starts around \(40~\text{PeV} \), which corresponds to a cms-energy \(\sqrt{s_{\text{NN}}} \approx 8~\text{TeV}\) within the reach of the LHC. Below this threshold data and simulations are consistent within uncertainties.

  • The previous point implies that late shower stages in which the hadron energy is below \(40~\text{PeV}\) are sufficiently well described by simulations. The origin of the muon puzzle therefore is likely to be found in the first stages of the shower, not in the late stages.

  • The relative muon deficit increases approximately linearly with \(\ln E\) above this threshold. The fact that the number of shower stages above this energy also increases linearly with \(\ln E\) points toward a compounding effect, as described in Sect. 2.2.

The observed behaviour makes a non-exotic explanation more likely in which a comparably small change in the hadronic interactions causes large changes in \(N_{\mu }\) over several shower stages, see Sect. 2.2, Sect. 2.8, and Sect. 3 for more details.

Although the data taken as a whole shows a muon discrepancy with high significance, one can distinguish two groups of experiments. The experiments which either measure the shower energy near-calorimetrically (Auger, AMIGA, IceCube) or do not measure the shower energy at all (NEVOD-DECOR, SUGAR) show a strong deficit, while the experiments which use an energy estimator that is correlated to the muon number (KASCADE-Grande, Yakutsk, EAS-MSU) show a much weaker deficit. This is still under study, but the discrepancy could be caused by this correlation. In case of AGASA, no muon discrepancy was originally reported, since the original energy scale of the experiment was shifted, but it appeared after the energy-scale offset was adjusted. In regard to the Yakutsk data, there is an unresolved discrepancy between the Cherenkov and surface detector measurements, which may affect these results.

2.7 Results connected to the Muon Puzzle

The Muon Puzzle specifically refers to a deficit in GeV muons that are produced near the end of the hadronic cascade in a simulated air shower. This is not the only discrepancy between muon measurements and simulations, however. We list here other measurements that found discrepancies and results potentially connected to the Muon Puzzle.

  • Fluctuation of the muon number. The shower-to-shower fluctuations of the muon number \(N_{\mu }\) are sensitive to the first interactions in the hadronic cascade and to the cosmic-ray mass composition (Fukui et al. 1960; Cazon et al. 2018). They provide important evidence toward the source of the muon deficit in simulations. The fluctuations have been measured recently by the Pierre Auger Observatory (Aab et al. 2021) for the first time and reasonable agreement was found between the measurement and the post-LHC models EPOS-LHC, QGSJetII.04, and Sibyll2.3d. This limits exotic explanations of the Muon Puzzle, in which an extreme change to the physics of the first interaction is proposed, followed by ordinary QCD for the rest of the shower development. For example, the Chiral Symmetry Restoration toy model by Farrar and Allen (2013) would drastically reduce the muon number fluctuations. A very large sample of \(N_{\mu }\)-measurements could be used to measure the \(\pi ^{0}\) production cross-section (Cazon et al. 2021).

  • Muon production depth. The longitudinal muon density profile cannot be observed optically, since the small muon signal is overwhelmed by the electron signal, the related profile of muon production depths has been observed based on the signal arrival times in the surface detectors of the Pierre Auger Observatory (Aab et al. 2014b) and via the muon tracking detectors of the KASCADE-Grande experiment (Apel et al. 2011). The profile contains sensitive information about the hadronic cascade and about the cosmic-ray mass composition. The muon production depth is correlated to the depth \(X_{\text{max}}\) of the electromagnetic shower maximum, but they are not identical and arise from different shower physics. At the EeV scale, QGSJetII.04 describes both quantities consistently, while EPOS-LHC does not. Deviations were also found at the PeV scale in comparison to simulations with the older model QGSJetII.02.

  • Muon attenuation with zenith angle and mass overburden. Several experiments have reported deviations between simulated and measured the number of muons as a function of the zenith angle and mass overburden. KASCADE-Grande expressed the measurement in form of the effective attenuation length for muons in air showers (Apel et al. 2017). The measurement is independent of assumptions regarding the cosmic-ray flux, since the cosmic-ray flux is isotropic and therefore the intensity is known in each zenith angle interval. Based on this, the attenuation of the muon flux can be computed. Simulations with the models QGSJetII.02, Sibyll2.1, QGSJetII.04, and EPOS-LHC consistently show smaller attenuation lengths than the experimental result, although the deviations is weaker for the two post-LHC models. In other words, the muon number in simulations decreases more rapidly with zenith angle than in the measurement, which could indicate that the muon energy spectrum in simulations is steeper than in data. The measured attenuation also shows a dependence on the lateral distance to the shower axis, which is not reproduced correctly by current air shower simulations.

    Some underground experiments observed the opposite effect for TeV muons that originate primarily from the first interaction in contrast to GeV muons in air shower experiments like KASCADE-Grande. The Fréjus and AMANDA experiments have measured the muon rate as a function of the mass overburden which is computed from the zenith angle (Desiati and AMANDA Collaboration 2001; Berger et al. 1989). Simulations were provided by (Rhode 2002, Chap. 5) and (Schröder 2001, Chap. 6). A muon deficit is seen in simulated showers with vertical incidence which either disappears or turns into an excess at larger zenith angles with large mass overburden, depending on the simulation code. A re-interpretation of these old measurements would require new simulations with recent models of the cosmic-ray flux and composition and hadronic interaction models.

  • High-energy atmospheric muon flux. Only muons with energies above \(300~\text{GeV}\) penetrate the Antarctic ice and reach the deep-ice detector of the IceCube Neutrino Observatory. Shower muons arrive in bundles that appear like a single track. Although individual muons cannot be resolved, the experiment can distinguish between bundles with and without high-energy muons based on the presence of stochastic energy losses from bremsstrahlung. IceCube has measured the high-energy muon flux up to PeV energies in this way (Aartsen et al. 2016a) which provides evidence for a prompt muon flux in air showers. This flux mainly originates from decays of charmed and unflavoured hadrons (Fedynitch et al. 2018b). While the conventional high-energy muon flux from decays of light hadrons above \(10~\text{TeV}\) is well reproduced by Sibyll2.1, discrepancies between simulations and experimental data are found in the zenith angle distribution of muons that are unresolved (Soldin 2018).

  • Simultaneous measurements of GeV and TeV muons. The IceCube Neutrino Observatory is capable of simultaneously measuring an air shower in the deep-ice detector and in the IceTop array and to study the low-energy (GeV) and high-energy (TeV) muon component event-by-event and their correlation. The measurement of muons at two vastly different energies provides information about the energy sharing between low- (late) and high-energy (early) interactions during the shower development and is of particular interest in regard to the Muon Puzzle. Preliminary studies by De Ridder et al. (2018) indicate that the yield of low- and high-energy muons differs among hadronic interaction models. A consistent interpretation of the experimental data is obtained for Sibyll2.3 and QGSJetII.04, while Sibyll2.1 and EPOS-LHC show discrepancies. As described by Riehn et al. (2020a), these measurements can be used to constrain hadronic interaction models and provide unique tests of muon production models in EAS.

  • Lateral separation of TeV muons. The IceCube Neutrino Observatory observes events with well-separated pairs of near-parallel tracks in the deep-ice detector with lateral distances between \(135~\text{m}\) to \(400~\text{m}\) (Abbasi et al. 2013b). Simulations have shown that these events are dominantly caused by the production and decay of hadrons with large transverse momentum \(p_{\text{T}}\) in the first interaction of an air shower (Soldin 2018). The observed distribution of lateral separations (LS distribution) can be related in a model-dependent way to the \(p_{\text{T}}\) distribution of the decaying mesons. The momentum threshold \(p_{\text{T,min}}\simeq 2~\text{GeV}\,c^{-1}\) for the observed mesons is large, which means that their production rates can be calculated in perturbative QCD and since the \(p_{\text{T}}\) distribution is also sensitive to the mass of the cosmic ray, it may offer an alternative way to measure the mass composition of cosmic rays with independent systematic uncertainties (Klein 2008; Gerhardt and Klein 2009; Soldin 2015, 2016). The LS distribution for PeV showers are compatible with predictions by Sibyll2.1, while QGSJetII.04 and EPOS-LHC show deviations. Sibyll2.1 does not describe the zenith-dependency of the LS distribution, however. The tensions between experimental data and simulations are not yet understood. Underground detectors like MACRO (Ambrosio et al. 1999) and Fréjus (Berger et al. 1989) have previously measured the LS distribution at greater depths, but these measurements have not been studied in the context of modern hadronic interaction models.

  • Seasonal variations of atmospheric muon and neutrino fluxes. It has been proposed long ago to use the correlation between the variation in atmospheric temperature and muon flux as a probe of hadronic interactions (Barrett et al. 1952). Seasonal temperature variations cause density variations that modify the mean-free path for mesons in the atmosphere. If the density is lower, mesons are more likely to decay than to hit another target and produce more mesons, which reduces the muon flux. Measurements of the flux variations yield important information about the dynamics in the central stratosphere, such as ozone hole dynamics and the temporal behaviour of the stratospheric temperatures (Tilav et al. 2020). Furthermore, the size of this effect depends on the life-time of the meson. Charged pions have the largest life-times and are more affected by density variations than kaons. This can be exploited to infer the \(K/\pi \)-ratio from the size of the flux variation for a given temperature variation (Grashorn et al. 2010; Desiati and Gaisser 2010). Model-dependent measurements of the \(K/\pi \)-ratio via variations in the muon flux have been performed by MINOS (Adamson et al. 2010, 2014), MACRO (Ambrosio et al. 1997), and IceCube (Tilav et al. 2010; Desiati et al. 2011). The flux-based measurements are compatible with direct measurements of both \(pp\) and in heavy-ion collisions, due to the comparably large uncertainties.

    Another approach to measure the \(K/\pi \) ratio is the study of the electron neutrino and muon neutrino flux as a function of energy and zenith angle (Fedynitch and Yáñez 2020). At energies below \(\sim 80~\text{GeV} \) (for vertical incidence), muon neutrinos follow the flux of parent pions, while at higher energies that of charged kaons. Low-energy electron neutrinos come from muon decays and at higher energy from charged and neutral kaon decays. The features of the observed neutrino spectrum and angular distribution provide constraints on the \(K/\pi \)-ratio. IceCube recently reached sufficient exposure to also observe seasonal variations in the atmospheric neutrino flux (Heix et al. 2020) that provide an additional probe of the \(K/\pi \)-ratio.

  • Inclusive muon and neutrino fluxes. Muons and atmospheric neutrinos are closely related since they are produced in decays of the same ancestors (Gaisser and Honda 2002), but the relative contributions of ancestor particles to the different lepton fluxes vary, as shown in Fig. 9. A muon deficit in simulated air showers is therefore expected to be connected to a corresponding deficit in the conventional atmospheric lepton flux, but the correspondence is not trivial. The atmospheric lepton fluxes are integrals over the contributions from individual showers weighted by the flux of the primary cosmic rays, as discussed in Sect. 2.1. The steep cosmic-ray spectrum emphasises the particle production phase space at \(x_{\text{lab}} > 0.1\), and suppresses the importance of pion-air and kaon-air interactions (Gaisser et al. 2016; Fedynitch et al. 2018b). If the muon deficit in simulated air showers is caused by a compounded effect over several steps of the hadronic cascade, then the atmospheric lepton fluxes should be less affected. Above PeV energies, the prompt component takes over which is not directly linked to the muon deficit in air showers.

    Fig. 9
    figure 9

    Top panel: Conventional and prompt lepton fluxes in \(\text{GeV}^{2}/(\text{cm}^{2}\,\text{s}\,\text{sr})\) and fractional contributions from parent mesons for \(\theta = 60^{\circ }\). The calculation uses MCEq, the Sibyll2.3d hadronic interaction model and the GSF cosmic-ray flux model as described in the text. Rare decays of unflavoured vector mesons and electromagnetic pairs contribute to the difference between prompt muons (dotted line) and neutrinos (solid line). Lower panels: flux fraction by the type of lepton ancestor. Prompt leptons originate predominantly from decays of charm mesons

    Since intrinsic shower fluctuations do not play a significant role in the calculation of atmospheric lepton fluxes, the cascade equations Eq. (1) can be solved directly, either with semi-analytical approximations (Zatsepin and Kuz’min 1962; Volkova 1980; Gaisser et al. 1983; Naumov et al. 1994; Lipari 1993), iterative semi-analytical approaches (Kochanov et al. 2008; Sinegovskaya et al. 2015) or iterative numerical solvers such as MCEq (Fedynitch et al. 2015). The accuracy of these methods is comparable (Gaisser et al. 2020; Morozova et al. 2017), but challenged by the increasing precision of muon flux measurements.

    An alternative approach is to simulate many air showers over a wide range of energies and zenith angles and weight the results with the cosmic-ray flux (Battistoni et al. 2000; Barr et al. 2004; Honda et al. 2007; Fedynitch et al. 2012). Honda et al. (2007) showed that simulations of the atmospheric muon flux with the DPMJet-III model underestimate measurements by up to 20% at \(1~\text{TeV}\), while showing no discrepancy at \(10~\text{GeV}\). Recent computations with post-LHC interaction models confirm this observed deficit (Fedynitch et al. 2018b), which remains after considering systematic uncertainties of modern muon flux and charge-ratio measurements (Yáñez et al. 2020). As in the case of the interpretation of air shower data, significant uncertainties arise from a lack of data on forward particle production and in general the hadronic phase-space coverage at high energies (Barr et al. 2006; Honda et al. 2019). At high energies (\(>100~\text{GeV} \)), uncertainties of the models for the cosmic-ray spectrum and mass composition (Gaisser 2012; Gaisser et al. 2013; Fedynitch et al. 2012; Dembinski et al. 2018; Evans et al. 2017) introduce errors up to several tens of percent (Barr et al. 2006; Fedynitch et al. 2018a). The GSF model (Dembinski et al. 2018) helps in breaking the degeneracy between uncertainties in the cosmic-ray flux and hadronic interactions. It is an nearly model-independent parametrization (GSF stands for Global Spline Fit) of the world data on cosmic rays and provides a parameter covariance matrix that accounts for experimental systematic uncertainties through cross-calibration between experiments.

    A viable strategy to improve the accuracy of the calculated atmospheric neutrino flux is to exploit the link between muon and neutrino fluxes by calibrating against muon flux data (Honda et al. 2007, 2019). However, this ansatz is challenged by a lack of accurate muon flux data above PeV energies in the range relevant for IceCube and future astrophysical neutrino observatories. A further complication in this energy range is the onset of the prompt component (Volkova and Zatsepin 1983; Gaisser et al. 1983; Inazawa et al. 1986; Volkova and Zatsepin 1985; Bugaev et al. 1985; Gondolo et al. 1996; Pasquali et al. 1999; Costa 2001; Martin et al. 2003; Enberg et al. 2008). Prompt leptons originate from semi-leptonic decays of charm and bottom mesons. In case of muons, additional prompt components are expected from decays of unflavored vector mesons (Illana et al. 2009) and from electromagnetic muon pairs (Gámez et al. 2020; Meighen-Berger and Li 2020). The prompt atmospheric neutrino flux has not yet been conclusively measured. The best limit is set by IceCube (Aartsen et al. 2016b) to 1.04 times the central prediction of Enberg et al. (2008) based on a dipole model. The prompt atmospheric flux is a significant background for the measurement of astrophysical neutrinos in track-based methods (Aartsen et al. 2016b; Stettner 2020). The veto-based analyses by IceCube (Abbasi et al. 2020; Aartsen et al. 2020a) are less affected, since atmospheric backgrounds including prompt ones are efficiently suppressed.

    Modern calculations (Garzelli et al. 2015; Bhattacharya et al. 2015; Gauld et al. 2015; Garzelli et al. 2017; Bhattacharya et al. 2016; Sinegovsky and Sorokovikov 2020; Zenaiev et al. 2020) aim to reduce the theoretical uncertainties of the prompt neutrino flux with LHC data. The production cross-sections for charm hadrons are not as tightly constrained theoretically as they are experimentally by direct measurements. The theoretical calculations are limited by the uncertainties in the scale, charm mass, and the nuclear PDFs. A non-perturbative intrinsic charm component proposed by Brodsky et al. (1980) may also contribute (Bhattacharya and Cudell 2018). Prompt lepton production is not directly linked to the Muon Puzzle in air showers and primarily constrained by measurements of heavy-flavour production, but there is an overlapping interest in the analysis of proton-oxygen collisions at the LHC to better understand potential nuclear effects in the production of light and heavy flavour.

2.8 Impact of changes in hadronic interaction features on air shower features

A first connection between features of microscopic hadronic interactions and air shower features was made in Sect. 2.2 with the Heitler-Matthews model. Due to the involved simplifications and approximations, these can only guide the research. It is essential to study these connections with full air shower simulations, but such studies are challenging due to the complexity of air showers and the computational cost of simulating a large number showers with varied parameters in the hadronic interaction model. Major accomplishments in connecting microscopic and macroscopic shower physics are works on the impact on emission of Cherenkov light from a gamma shower (Fortson et al. 1999), on the asymmetries of muon footprints on the ground (Ave et al. 2000), on the time structure of signals in the water-Cherenkov detectors of the Pierre Auger Observatory (Abraham et al. 2008), on the impact of basic features of hadron production in a hadronic shower on the moments of \(X_{\text{max}}\) and \(N_{\mu }\) (Ulrich et al. 2011), and on the corresponding impact on atmospheric lepton fluxes (Fedynitch et al. 2012).

The Heitler-Matthews model has shown the following basic parameters of interest with a large impact on air showers: the inelastic hadronic cross-section \(\sigma _{\text{inel}}\), the multiplicity \(N_{\text{mult}}\) of secondary particles, the elasticity \(E_{\text{leading}}/E_{0}\) defined as the energy fraction carried by the most energetic particle, and the energy ratio \(R = E_{\gamma }/E_{\text{long-lived hadrons}}\) of energy carried by photons from short-lived hadrons like the \(\pi ^{0}\) to the energy in long-lived hadrons (Baur et al. 2019).

These parameters need to be well-known especially for the most common \(\pi \)-air interaction in air shower. The interactions of other long-lived particles such as the \(K\) meson are less important for the air shower development and the total number of muons produced in air showers (Maris et al. 2009) due to their lower mean multiplicity, also compare Fig. 9. For forward production in high-energy collisions, the quark flavour in the projectile is not important, since most quarks are produced from the vacuum and strong interactions are flavour-blind. This also means that high-energy \(\pi \)-air interactions can be constrained with collider measurements of p-air.

Ulrich et al. (2011) introduced an ad-hoc model for air shower simulations in which these parameters are changed during the run-time of an air shower simulation as a function of the energy \(E\) of the colliding hadron in the frame where the target is at rest. It uses the original predictions of a particular event generator as the baseline (in this case Sibyll2.1, but any generator can be used), which are scaled with an energy-dependent factor \(f(E)\). The factor is 1 below a chosen energy threshold of \(1~\text{PeV}\) and grows logarithmically above. This is motivated by the fact that generators are fairly constrained by accelerator data at low energies, but diverge logarithmically when they are extrapolated to higher energy where accelerator data is missing. The size of the modification is governed by the parameter \(f_{19}\),

$$ f(E) = 1 + (f_{19}-1) \cdot \textstyle\begin{cases} 0 & E < 1~\text{PeV} \\ \frac{\log _{10}\left (\frac{E}{1~\text{PeV}}\right )}{\log _{10}\left (\frac{10~\text{EeV}}{1~\text{PeV}}\right )} & E \ge 1~\text{PeV}, \end{cases} $$
(10)

which is the size of the modification for a hadron with \(10~\text{EeV} = 10^{19}~\text{eV} \) (an arbitrary scale). Equation (10) is applied in an air shower simulation to each individual hadron collision to modify the respective parameter, as listed above. Very large distortions of \(f(E)\) would correspond to exotic modifications of QCD and may conflict with more recent LHC measurements, while small deviations may be within the realm of conventional scenarios and compatible with LHC data.

The results of the study by Ulrich et al. (2011) for the mean and standard deviation of the muon number \(N_{\mu }\) and the depth of shower maximum \(X_{\text{max}}\) for a \(10^{19.5}~\text{eV} \) proton shower are shown in Fig. 10 as a function of the modification factor at the LHC energy scale of nucleon-nucleon collisions at \(13~\text{TeV}\) which corresponds to a projectile energy of \(90~\text{PeV}\) in the fixed-target system. We note that the fraction of neutral pions among all pions was modified in the ad-hoc model in the original study and not the energy ratio \(R\), but the effect of modifying this fraction and modifying \(R\) are numerically similar. The most effective way to increase the muon number in air showers is to decrease the \(\pi ^{0}\)-fraction. A 10% reduction increases \(N_{\mu }\) by 13%. This is less than predicted by the Heitler-Matthews in Sect. 2.2, but can be understood by the fact that the modification in this study only affects the shower evolution above \(1~\text{PeV}\), while in Sect. 2.2 a modification of the whole hadronic cascasde was considered. The muon number also increases with the multiplicity, but the effect is much weaker. A 30% increase would increase the muon number by only 9%. Changes to the inelastic cross-section and elasticity have negligible impact on the muon number.

Fig. 10
figure 10

Impact of changing basic parameters of hadronic interactions (see text for details) on the means and standard deviations of the logarithm of the muon number \(N_{\mu }\) (top row) and the depth \(X_{\text{max}}\) of the shower maximum (bottom row) for a \(10^{19.5}~\text{eV} \) proton shower simulated with Conex using Sibyll2.1 as the baseline model, as described in the text. Relative shifts to the mean values are shown on the left-hand side. Fluctuations are shown on the right-hand side. The original data from Ulrich et al. (2011) was refitted for this plot with monotonic cubic splines and are shown as a function of the modification in the nucleon-nucleon system at a cms-energy \(\sqrt{s_{\text{NN}}} = 13~\text{TeV} \), which is extrapolated logarithmically towards higher energies as described in the text. The shaded bands highlight a \(\pm10\%\) and \(\pm30\%\) modification, respectively

The impact on the standard deviation of the muon number is also important, which has been measured recently for the first time by the Pierre Auger Observatory (Aab et al. 2021). Reasonable agreement between the measurement and the post-LHC models EPOS-LHC, QGSJetII.04, and Sibyll2.3d was found. This puts strong constraints on changes to the elasticity, which is the only one of the four considered parameters with a large impact on the \(N_{\mu }\)-fluctuations. The measured \(N_{\mu }\)-fluctuations could be used to severely constrain the elasticity. A reduction of the \(\pi ^{0}\)-fraction by 10% would only change the \(N_{\mu }\)-fluctuations by one percentage point.

Since air shower simulations with post-LHC models give a reasonable description of the depth of the shower maximum, \(X_{\text{max}}\), it is important to also consider the impact of changes on \(X_{\text{max}}\). Air shower simulations for proton and iron showers bracket the measurements over a wide range of shower energies and the mass composition inferred from \(X_{\text{max}}\) is astrophysically plausible. This suggests that the parameter values that influence \(X_{\text{max}}\) cannot deviate too much from those in current models without destroying the consistency. The depth of the shower maximum is most sensitive to the inelastic cross-section which has been measured very precisely in proton-proton collisions at the LHC. A remaining theoretical uncertainty arises from the extrapolation of these data to the p-air and \(\pi \)-air cross-sections. Modifications of the multiplicity, elasticity, and \(\pi ^{0}\)-fraction all have a similar impact on \(X_{\text{max}}\).

The standard deviation of \(X_{\text{max}}\) is even more sensitive to the inelastic cross-section than its average. It weakly depends on changes to the elasticity and is unaffected by changes to multiplicity and \(\pi^{0}\)-fraction. The standard deviation has been measured precisely by the Pierre Auger Observatory (Aab et al. 2014a), and it is found that is challenging for QGSJetII.04 to simultaneously describe the mean and the fluctuation of \(X_{\text{max}}\), while EPOS-LHC gives a consistent interpretation. The sensitivity of the \(X_{\text{max}}\)-fluctuations to the inelastic cross-section has been further exploited to infer the inelastic p-air cross-section from air shower measurements by the Pierre Auger Observatory (Abreu et al. 2012), although the analysis uses the shape of the tail towards large \(X_{\text{max}}\) values instead of the standard deviation, which is beneficial in presence of a mixed composition of cosmic rays.

These individual results can be combined to reveal an interesting point illustrated in Fig. 11 (Citron et al. 2019; Baur et al. 2019), where the impact of changes in the hadron multiplicity \(N_{\text{mult}}\) and the energy ratio \(R\) on the means of the logarithm of the muon number and \(X_{\text{max}}\) is shown for \(10^{19}~\text{eV} \) showers and compared to a measurement by the Pierre Auger Observatory. In this double-logarithmic scale, any possible mass composition of cosmic rays between the two extremes of pure proton showers (bottom right) and pure iron showers (top left) produces a point on a straight line. The standard prediction by EPOS-LHC is indicated by the grey line (hard to see under the other colored lines). The muon deficit in simulations is the reason why this line does not overlap with the data point.

Fig. 11
figure 11

Impact of modifying the average hadron multiplicity \(N_{\text{mult}}\) and the ratio \(R\) of electromagnetic to hadronic energy in an air shower on predictions for the depth of shower maximum \(X_{\text{max}}\) and the number of muons \(N_{\mu }\) for \(10^{10}~\text{GeV}\) showers (image from Baur et al. 2019). The representative data point is taken from the Pierre Auger Observatory (Aab et al. 2015a)

In addition to the standard prediction by EPOS-LHC, also ad-hoc modified predictions are shown. The multiplicity \(N_{\text{mult}}\) (blue and red lines) and the ratio of energy going into neutral pions \(R\) (yellow and green lines) are changed, respectively, by up to \(\pm 20\%\). While the former shifts the lines along themselves and has no power to resolve the muon mystery, the change of \(R\) has an effect perpendicular to the lines and has a large impact on the interpretation of the data. This strongly suggests that the solution to the Muon Puzzle lies in a modification of the energy ratio \(R\).

3 Possible solutions for the Muon Puzzle

The muon deficit in air shower simulations starts to appear when the cms-energy in the nucleon-nucleon system reaches \(\sqrt{s_{\text{NN}}} \approx 8~\text{TeV} \). Since there is no large deficit observed in post-LHC models at lower energies, it seems that a new phenomenon in unbiased hadron collisions becomes important at this energy scale that can be neglected at lower energies. As discussed in Sect. 2.2 and Sect. 2.8, the only plausible way to increase the muon number sufficiently is to decrease the energy fraction lost to photon production (mainly from \(\pi ^{0} \) and \(\eta \) decay) in hadron collisions.

A comprehensive summary of rejected attempts to explain the Muon Puzzle in some other way is given by Farrar and Allen (2013). The effect must come from soft-QCD, since hard scatterings that produce new heavy particles are too rare to significantly change the hadron composition. The standard model of soft low-energy hadronic interactions leaves no room to change the hadron composition. It uses string fragmentation and remnant excitation to produce hadrons. Both effects are believed to produce an universal hadron composition independent of the collision energy. A string is a colour-neutral flux tube made of virtual quarks and gluons that spans between two colour charges. As the colour charges move away from each other, the string gains potential energy, which is subsequently converted into quark/anti-quark pairs created from the vacuum. String fragmentation produces the same composition of hadrons however the strings are created (Andersson 2005). However, basic arguments suggest that this picture has to be modified at high \(\sqrt{s_{\text{NN}}} \) and in nuclear collisions with high string density.

Before following this thought further, we want to mention two earlier proposals for solving the Muon Puzzle that are disfavoured by the early onset of the muon discrepancy around \(\sqrt{s_{\text{NN}}} \approx 8~\text{TeV}\). The proposals have in common that an extreme change in the first or the first few generations of the hadronic cascade is introduced at cms-energies outside the reach of colliders, while the rest of the shower develops normally. Farrar and Allen (2013) proposed a toy model based on Chiral Symmetry Restoration to suppress pion production, which effectively replaces mesons with baryons in a fraction of the high energy events generated by EPOS and some minor modifications of the average multiplicity and the inelastic cross-section. The modified model reproduces \(X_{\text{max}}\) measurements with proton cosmic rays and a range of other measurements available at the time while increasing the muon number by a factor 2. Anchordoqui et al. (2017) assume that cosmic rays at the highest energies are heavy or at least medium mass nuclei, and proposes that the first interaction creates quark gluon plasma (QGP), which we discuss further below. They then argue that the high baryochemical potential of this initial state (the excess of \(u\) and \(d\) quarks) leads to a shifted equilibrium with enhanced strangeness. This correspondingly leads to a reduction of the \(\pi^{0}\)-fraction and an increase in the muon content by 40%. To achieve this with a modification of only the first interaction requires an extreme combination of temperature and baryochemical potential not reproduced in heavy ion collisions at the LHC. LaHurd and Covault (2018) investigated the impact of QGP formation with conventional parameters if it affects only the first interaction and found only a small increase in the muon number, not sufficient to explain the Muon Puzzle.

The issues with these approaches are avoided if the responsible effect is weaker, but potentiates over several steps in the hadronic cascade and starts well below the EeV scale. Recent LHC data and theoretical considerations suggest a mechanism which fits this description qualitatively. We start by noting that the parton density function (PDF) of the gluon inside the nucleon rises rapidly with decreasing momentum fraction \(x\), as shown in Fig. 12. Producing hadrons requires a minimum 4-momentum squared \(Q_{\text{min}}^{2} \propto x_{1} \, x_{2} \, s\), which means that the growth in hadron multiplicity as a function of \(\sqrt{s}\) is driven by the increase in gluon density at small \(x\). Partons in high-energy collisions are dominantly produced by gluon fusion, while forward produced hadrons originate dominantly from quark-gluon scattering.

Fig. 12
figure 12

Parton density functions (PDF) times \(x\) of quarks and gluons in the proton as obtained from CT18 global analysis by Hou et al. (2021), evaluated at the scale \(Q = 2~\text{GeV} \). Bands indicate one standard deviation. Values were taken from APFEL Web by Carrazza et al. (2015)

At large \(\sqrt{s}\), the high gluon density leads to simultaneous multi-parton interactions and high string densities. Since strings are color-neutral, they do not interact at a distance, but lattice calculations show that strings are flux tubes with a finite radius (Cea et al. 2014, 2016). It follows that strings must eventually overlap in space, since the string density grows much faster with \(\sqrt{s}\) than the radii of the colliding hadron discs. In hadron-nucleus collisions this effect happens at lower \(\sqrt{s}\) due to the larger parton densities in the nuclei, but it is expected to happen in any system at sufficiently high energy. If the string interactions lead to a sufficiently strong modification in the hadron composition of forward produced particles, it can solve the Muon Puzzle.

Heavy-ion collisions have already demonstrated a modifications of the hadron composition with respect to classic string models. Strangeness and baryon enhancement have been observed at the SPS, RHIC, and the LHC (Capella 1995; Arsene et al. 2005; Adcox et al. 2005; Adams et al. 2005; Becattini and Manninen 2008). These collisions are very successfully described with models that assume the formation of a new state of matter, the QGP, which hadronises collectively after reaching thermal equilibrium.

It was commonly believed that QGP does not form in small systems like p-p and p-A collisions, since these systems do not seem to offer enough time for the colliding partons to thermalize into a QGP. However, several QGP-like signatures were also found in these systems. Recent overviews on these collective phenomena and theories to explain them are given by Dusling et al. (2016), Nagle and Zajc (2018), Citron et al. (2019). A range of models have been developed that explain these effects with and without the formation of QGP. The latter explain the observations through new properties of the initial state of the colliding hadrons at high \(\sqrt{s_{\text{NN}}}\), which can be described as a colour glass condensate (CGC), or by string interactions. Mixed forms are also explored, for example, a CGC that decays into a QGP.

The potential key to the Muon Puzzle was discovered by the ALICE collaboration, which found a universal strangeness and baryon enhancement in p-p, p-Pb and Pb-Pb (Abelev et al. 2014b; Adam et al. 2016c, 2017b; Vasileiou 2020). As shown in Fig. 13, the enhancement only depends on the multiplicity of the event at mid-rapidity, and not on the details of the collision system like its size or \(\sqrt{s_{\text{NN}}} \). This is highly remarkable: if this universality holds, it allows one to predict the hadron composition in average collisions of p-air at \(\sqrt{s_{\text{NN}}} \gg 10~\text{TeV} \) well beyond the reach of colliders based on reference measurements in central Pb-Pb collisions at current LHC energies.

Fig. 13
figure 13

Yield ratios of strange hadrons and pions in p-p, p-Pb, and Pb-Pb as a function of the multiplicity at mid-rapidity (image from Adam et al. 2017b)

Most of the data on collective effects was obtained with particles emitted at mid-rapidity \(|y| < 2\). This region is not directly relevant for air showers as shown in Fig. 15. It has not been shown experimentally yet that these effects can also be seen in hadrons produced at forward rapidities. Theoretical calculations suggest that hadrons produced from QGP decay reach well into the forward region, in p-p at \(13~\text{TeV}\) to \(y \simeq 6.5\) (Busza et al. 2018; Baur et al. 2019). It is therefore important to search for collective effects also at forward rapidities \(y > 2\) as a function of multiplicity in small systems, p-p, p-Pb, p-O, and O-O. The most important effect to search for is a strangeness and baryon enhancement.

We briefly summarise the other known signatures of QGP and why do they do not relate to the Muon Puzzle directly.

  • Short-range and long-range correlations of hadron momenta and azimuthal modulation of multiplicities. Because of collective flow in the QGP, hadron momenta become correlated and modulated along the azimuthal angle. This has been observed experimentally in systems ranging from p-p to Pb-Pb (Khachatryan et al. 2010a; Chatrchyan et al. 2013b; Khachatryan et al. 2017b; Aaboud et al. 2019a; Acharya et al. 2021b). The correlations are a key signature of collective hadronization, but there is a wide range of models with and without QGP formation that describe the data. Flow has no direct consequences for the Muon Puzzle or for the air shower development in general. The effects would average out over the shower development and are negligible compared to intrinsic shower fluctuations.

  • Jet quenching. A key signature of QGP formation is the energy loss of high-\(p_{\text{T}}\) partons traversing the medium. Jets are produced by partons generated in a hard scattering. If a parton has to traverse a QGP medium, it looses energy compared to free propagation. This effect was first discovered at RHIC in Au-Au collisions and has been extensively studied in Pb-Pb and Xe-Xe at the LHC (Adcox et al. 2002; Arsene 2013; Balek 2014; Jung 2014; Sirunyan et al. 2018d, 2017b; Aaboud et al. 2019b; Sirunyan et al. 2018a, 2019b; Khachatryan et al. 2017a; Acharya et al. 2019d, 2018; Adam et al. 2016b,a; Festanti 2014; Grelli 2013). It is measured with the nuclear modification factor \(R_{AA}\), which compares the \(p_{\text{T}}\)-dependent yields for particle production in A-A collisions with the corresponding p-p collision. This effect is absent in p-Pb (Abelev et al. 2013c; Balek 2014; Aad et al. 2015; Jung 2014; Li 2014; Adam et al. 2015e,b; Khachatryan et al. 2017a; Sirunyan et al. 2017d; Balek 2017; Aaij et al. 2017; Sun 2019; Acharya et al. 2021a, 2018,c), which implies it is also absent in p-air and therefore irrelevant for the Muon Puzzle. Some modification is observed at low \(p_{\text{T}}\) in p-Pb collisions, but that is attributed to the initial state, such as modifications of the parton density functions of the nucleon inside a nucleus compared to free protons. These so-called cold-nuclear matter effects are relevant for air showers and need to be measured for oxygen in p-O collisions.

  • Enhancement of \(\langle p_{\text{T}} \rangle \) for hadrons with higher mass. When hadrons form in the rest frame of a moving fluid, they receive a shift towards higher momentum as a function of their mass that can be observed in their \(p_{\text{T}}\) distribution. This effect is not relevant for the Muon Puzzle, since the \(p_{\text{T}}\) distribution in the first few interactions has no impact on the later shower development.

In the following, we give an overview of different approaches to explain QGP-like phenomena. Most theoretical works focus on the mid-rapidity region. For a potential solution to the Muon Puzzle, more theoretical investigations into the forward hadron production in hadron-ion collisions at energies \(\sqrt{s} > 10~\text{TeV} \) are needed. So far, the only predictions for air showers come from the EPOS model which assumes QGP formation in small systems.

3.1 Quark gluon plasma in small systems

Quark gluon plasma (QGP) is a high temperature state of quark and gluon matter in which the partons are no longer bound into colour-neutral hadrons. The formation of QGP is the standard model for high-energy heavy-ion collisions in very good agreement with observations. Recent overviews on QGP and related phenomena are given by Dusling et al. (2016) and Busza et al. (2018). The QGP medium behaves like a super-fluid with zero viscosity; its equation of state has been computed with lattice QCD. The expansion is described hydrodynamically. QGP evolution is therefore theoretically well-understood and the thermalization (at least in heavy-ion collisions) hides details in the initial state which are less well understood. Both aspects give QGP models high predictive power. Because of the universal aspects, the remaining parameters of QGP models can be tuned in reference systems measured at colliders and then should work for any other collision system.

QGP models enjoy great success in describing heavy-ion collisions, but there is a controversy whether QGP also forms in small systems. To illustrate this we describe the standard picture of a heavy-ion collision. When two nuclei collide, they appear to each other as Lorentz-contracted thin discs, which have a short traversal time \(\ll 1~\text{fm}/c\). So the initial state of the collision is a pre-equilibrium with a very inhomogeneous distribution of deposited energy in the plane perpendicular to the beam direction. Some time is required to equilibrate into the QGP fluid, during which the matter expands in the longitudinal direction and radially in the transverse plane. The QGP fluid expands and cools until it reaches the QGP freeze-out temperature \(T \simeq 170~\text{MeV} \) (see Aoki et al. 2006) and then breaks up into hadrons. The hadron density is initially large and scattering is frequent. This phase is called a hadron gas. As the gas expands and the density drops, it experiences a chemical freeze-out when inelastic interactions stop and, shortly after, a kinetic freeze-out when elastic interactions stop. At this point the hadrons have the final-state momenta that are measured experimentally.

It has been argued that small systems expand too fast for the initial state to equilibrate into a QGP. While small systems display several phenomena that are QGP-like, there are also other plausible mechanisms that produce these signatures. One of the key signatures of QGP formation is jet quenching, which is particularly difficult to explain without QGP, but jet quenching is not observed in p-Pb collisions. This seems at variance with QGP formation. Sievert and Noronha-Hostler (2019) have simulated O-O collisions (a small system) and found that the initial parton density and thus temperature fluctuations in O-O are larger than in Pb-Pb. Locally, very high temperatures can be reached. This suggests that only local QGP droplets form in small systems instead of a large QGP medium and could explain the lack of jet quenching. How small QGP droplets can become has been investigated theoretically by Chesler (2015, 2016), with the conclusion that the minimum radius is given by the inverse of the temperature. QGP droplets as small as the size of a proton in p-A collisions seem theoretically possible.

EPOS (Pierog and Werner 2009; Pierog et al. 2015) was the first generator that combined a string model with QGP formation to describe both small and large collision systems. In EPOS, strings are formed in all systems. If the string density surpasses a threshold, strings are merged to form QGP, while they hadronise classically below that threshold. In general, this gives rise to two hadronization regimes, called core and corona (Becattini and Manninen 2008). The core consists of particles from QGP hadronization, while the corona is dominated by string fragmentation and contributions from the excited remnant. The core dominates at mid-rapidity but extends well into the forward region which is important for air showers. Baur et al. (2019) showed that there is still a significant contribution at \(\eta \simeq 7\) in p-p collisions at \(13~\text{TeV}\).

How large the core contribution is depends on tuning parameters in the EPOS model. In the widely used version EPOS-LHC, the core contribution seems to be underestimated in light of the recent ALICE data on strangeness enhancement. Larger contributions would be compatible with LHC data and quantitative projections from Baur et al. (2019) have shown that an enhancement of the core is a potential solution to the Muon Puzzle. Measurements of strangeness and baryon enhancement at forward rapidities as a function of multiplicity are a stringent test for this theory, as well as measurements of the ratio of electromagnetic to hadronic energy as a function of the multiplicity in the forward region, which have been performed in p-p collisions at \(13~\text{TeV}\) by CMS Sirunyan et al. (2019c) and should be repeated for p-Pb collisions.

3.2 Color glass condensate and glasma

The colour glass condensate (CGC) model is an effective field theory which offers an alternative explanation for many phenomena attributed to QGP formation. For introductions to the CGC see Iancu and Venugopalan (2003), Gelis et al. (2010), Dusling et al. (2016).

Colliding hadrons in a high-energy collision appear to each other as flat Lorentz-contracted discs that are densely packed with gluons. Gluons with a small momentum fraction \(x\) are much more numerous and have short life-times compared to gluons at large \(x\) whose life-times are time dilated. The small \(x\) gluons can be approximated by static classical fields for which the large \(x\) gluons act as sources. The large \(x\) gluons then appear weakly coupled because of the screening provided by the small \(x\) gluon fields. The fields have the form of Lorentz-boosted Coulomb fields of electrodynamics with random colour, polarisation and density. The spectrum of fluctuations can be computed in the CGC framework and has a universal solution for small \(x\) for any hadron. This gives CGC models high predictive power.

When two sheets of coloured glass collide, the classical colour fields change from transverse orientation and, being confined to the thin sheets, to longitudinal colour electric and colour magnetic fields that span longitudinally along the direction of motion between the sheets. The initial distribution of these longitudinal fields and their evolution until thermalization is referred to as the Glasma. The glasma eventually decays into quarks and gluons which are close in description to that of a QGP.

CGC calculations can reproduce the flow in heavy ion collisions observed at the LHC and RHIC (Dumitru et al. 2011; Dusling and Venugopalan 2012; Schenke et al. 2012; Mäntysaari et al. 2017), and recently have been shown to also reproduce strangeness enhancement in p-p collisions (Siddikov and Schmidt 2021).

3.3 Enhanced parton models

QGP-like phenomena have been also reproduced by parton shower models that are extended with string-string interactions or with parton scattering. The former idea was already described in the introduction to this section. Strings are colour-neutral and thus do not interact at a distance, but they have a finite radius and therefore should influence each other when they overlap. The overlap can be neglected in low-multiplicity events, but not in high-multiplicity events. Parton transport models instead implement parton-parton scattering to achieve a similar effect, gluons and quarks are treated as quasi-particles which scatter during the collision and effectively produce hydrodynamic-like flow patterns. It is a weakly interacting scenario in contrast to the strongly coupled hydrodynamics. Notable parton transport models are BAMPS (Xu and Greiner 2005) and AMPT (Lin et al. 2005). Whether these ideas are a dual picture to hydrodynamics or a real alternative is an open question. The details of these interactions are not well understood, which leaves freedom for model builders.

Several kinds of string interactions have been proposed. Colour reconnection (Ortiz Velasquez et al. 2013; Bierlich and Christiansen 2015) is implemented in Pythia 8 and produces flow-like effects. It can occur in events with multi-parton interactions, when several hard scatterings produce parton pairs simultaneously. Partons that cross existing strings created by other partons are colour reconnected in such a way that the total string length becomes as short as possible. This introduces momentum correlations for the hadrons produced by the two independent hard scatterings and enhances baryon production, but not strangeness. String shoving in Pythia 8 (Bierlich et al. 2018) is another approach to produce correlated flow. A repulsive force is assumed between overlapping strings which introduces a transverse velocity that is transferred to the hadrons after string fragmentation.

Rope hadronization (Bierlich et al. 2015; Bierlich and Christiansen 2015) is a mechanism that produces strangeness and baryon enhancement. It assumes that parton pairs which form next to each other in geometric space act coherently to form a colour rope instead of two independent strings. The rope has a higher effective string tension, which results in more strange quarks and diquarks produced in its fragmentation. Rope hadronization is implemented in the DIPSY model that partially reproduces the strangeness enhancement in ALICE data (Adam et al. 2017b).

4 LHC measurements: status and prospects

The Large Hadron Collider (LHC) accelerated protons up to \(13~\text{TeV}\) and lead ios up to \(5.02~\text{TeV}\) per nucleon during Run 1 and 2 to study proton-proton, proton-lead, and lead-lead collisions. A pilot run with Xe-Xe collisions followed in 2017 and not-fully-ionised lead nuclei in 2018. Due to the success of the LHC heavy ion program in Run 1 and Run 2, an extension towards lighter ions has been proposed in Citron et al. (2019) for the upcoming Run 3 and Run 4. A pilot run with oxygen beams has been proposed for the end of the upcoming Run 3 in 2023. These collision systems are visualised in Fig. 14.

Fig. 14
figure 14

Collision systems with data recorded at the LHC in Run 1 and 2 (black). Also shown are possible configurations with oxygen beams that have been proposed for Run 3 (red). The nucleon number is shown in log-scale since typically the relevant physics varies on a logarithmic scale

The LHC as a p-p and p-A collider with the highest available cms-energies \(\sqrt{s_{\text{NN}}} \) in the nucleon-nucleon system from 0.9 to \(14~\text{TeV}\) is an essential source of data for the modelling and understanding of extensive air showers. For a proton cosmic ray with energy \(E_{0}\) hitting an air nucleus at rest, the conversion between \(E\) and \(\sqrt{s_{\text{NN}}} \) is in the ultra-relativistic limit

$$ \sqrt{s_{\text{NN}}} = \sqrt{2 E_{0} m} \quad \leftrightarrow \quad E_{0} = \frac{s_{\text{NN}}}{2m}, $$
(11)

where \(m\) is the nucleon mass. LHC data is used to provide anchor points for the tuning of parameters of hadronic interaction models used for air shower simulations. The models are essential to extrapolate into the uncharted phase-space of typical hadronic interactions in air showers since LHC data does not directly mimic these interactions. The following extrapolations are involved.

  • Towards higher centre-of-mass energies. The LHC energies from 0.9 to \(14~\text{TeV}\) cover projectile lab energies in air showers from 0.4 to \(104~\text{PeV}\). The cosmic-ray energy spectrum extends at least three orders of magnitude further. The highest-energy cosmic-ray event ever recorded by the Fly’s Eye experiment had an energy of \((320\pm 90)~\text{EeV} \), see Bird et al. (1995b), corresponding to \(\sqrt{s_{\text{NN}}} = (780 \pm 110)~\text{TeV} \). The Pierre Auger Observatory also has recorded events that exceed \(100~\text{EeV}\) (Aab et al. 2020).

  • Towards hadron-nuclear collision systems. The ability to describe hadron-nuclear collisions is very important for generators used in air shower simulations. The most common interaction in an air shower is \(\pi \)-N and the most important first interaction is p-N, since nuclear projectiles in air showers behave in good approximation like a superposition of elementary nucleon interactions as far as the projectile is concerned (the situation is kinematically different for the target, which cannot be approximated by a superposition of nucleons). These systems are far away from both p-p and p-Pb, as indicated in Fig. 14. Generators mostly extrapolate from h-p to h-N without using the p-Pb data. Sibyll2.3d is strict about this limitation and rejects projectiles heavier than iron and targets heavier than argon.

  • Towards forward rapidities. The mid-rapidity region \(|\eta | < 2\), which is most precisely measured at LHC, is only indirectly relevant for air showers. This is illustrated in Fig. 15, which shows that the muons in an air shower are dominantly produced by long-lived hadrons emitted in the forward region at \(\eta > 2\). Except for LHCb, the LHC experiments were not designed to perform precision tracking and PID at forward rapidities \(\eta > 2\), also since there are considerable technical challenges for forward measurements at high luminosity conditions at the LHC. The radiation damage can become severe.

    Fig. 15
    figure 15

    Simulated densities of prompt particles (solid lines) in high-energy p-p, p-Pb, and p-O collisions. Dashed lines show the estimated number of muons produced by the secondaries if they were propagated through the atmosphere, assuming \(N_{\mu}\propto E_{\text{lab}}^{0.93}\), where \(E_{\text{lab}}\) is the energy of the secondaries in the boosted system

Because of the limited ability of most current hadronic interaction models to describe heavy-ion collisions, the LHC configurations with p-Pb, Pb-Pb, and Xe-Xe currently cannot be fully used for parameter tuning and model validation. This is a severe drawback in regard to the rich results obtained with the heavy ion program of the LHC and the recent discovery of QGP-like effects in light systems including p-Pb from p-p. LHC collisions with lighter nuclei are needed to resolve this limitation.

Basic features of hadronic interactions which are important for air shower simulation are shown in Fig. 16, compare with Sect. 2.8. The inelastic cross-section has a high impact on the predicted value of the depth of shower maximum \(X_{\text{max}}\) but not on the muon number \(N_{\mu }\). LHC measurements have constrained the inelastic p-p cross-section to very high precision and resolved the \(1.9\sigma \) ambiguity in earlier Tevatron data (Abe et al. 1994; Amos et al. 1992). The measurement of the p-Pb inelastic cross section (Khachatryan et al. 2016) at \(5.02~\text{TeV}\) is also important, since it validated the standard Glauber model to better than \(\approx 10\%\). This had a noted impact on the systematic uncertainty of \(X_{\text{max}}\) predictions. There is still a remaining uncertainty in the extrapolation of the inelastic cross-section from p-p to p-air, which could be reduced with future data from p-O collisions. The \(\pi \)-air cross-section remains weakly constrained.

Fig. 16
figure 16

Basic features of hadronic interactions as a function of \(\sqrt{s_{\text{NN}}} \) (images from Pierog 2018). From left to right: p-p, p-air, and \(\pi \)-air. From top to bottom: total inelastic cross-section, hadron multiplicity at mid-rapidity \(|\eta | < 2.5\), and inelasticity (complement of the energy fraction carried by the most energetic particle). The data points shown in the two upper left plots are taken from several experiments, see Pierog (2018) for details

The charged particle multiplicity has a significant impact on both \(X_{\text{max}}\) and \(N_{\mu }\). The multiplicity in p-p collisions at mid-rapidity is well constrained by LHC data up to \(13~\text{TeV}\) to the level of a few percent; the most recent data is not shown in Fig. 16. We emphasise that the multiplicity at mid-rapidity is important for tuning and a benchmark point for models, but has no direct impact on air shower development. Of direct interest is the model variance of the hadron multiplicity in the forward region \(\eta \gg 2\), since forward-produced particles carry the highest energies, in particular if the system is boosted to the fixed target air shower frame, thus, they produce the largest sub-showers in the next step of the hadronic air shower cascades. This forward region can be constrained with LHC data up to \(\eta < 6.4\) with TOTEM.

The inelasticity is the complement of the energy carried away by the most energetic particle in an inelastic collision. The inelasticity is an important quantity in air shower physics with a high impact on \(X_{\text{max}}\), but small impact on \(N_{\mu }\). It is not directly measurable at the LHC, but measurements related to inelasticity are the diffractive cross-section and the far-forward production cross-section of photons and neutrons.

4.1 LHC experiments

The acceptances of the LHC experiments discussed here are compared in Fig. 17. We will briefly discuss the advantages of each experiment in regard to measurements for air shower physics in the following.

  • ALICE was designed for heavy-ion physics at the LHC (Abelev et al. 2014d). Its strength is the high-resolution tracking system at mid-rapidity with excellent hadron particle identification (PID) capabilities. The tracking system can handle even central Pb-Pb collisions, which can have more than 10000 tracks at \(|\eta | < 0.5\). The electromagnetic calorimeter (ECal) partially covers the same region as the PID system, it captures photons and provides electron identification. ALICE has no hadronic calorimeter (HCal), since it would not provide significant additional information. A small single-arm muon system in the backward region capture muons from decays of charm and beauty. A system of charged-particle counters without tracking capabilities extends the acceptance of ALICE into the forward and backward region. A sophisticated system of zero-degree calorimeters (ZDCs) detects protons, neutrons, and photons emitted at small beam angles.

    ALICE has provided a wealth of high-precision data of identified hadron spectra at mid-rapidity, differential production cross-sections as a function of \(\eta \), \(p_{\text{T}}\), and multiplicity, which have been one of the main sources for parameter tuning and validation of hadronic interaction models. The models are tightly constrained at mid-rapidity by these measurements; they fix the multiplicities of various light hadrons, the average charged particle multiplicity, the multiplicity spectrum. Various QGP-like effects at mid-rapidity have also been discovered, the strangeness enhancement is most important for air showers which may have a profound impact on the models beyond tuning. The backward muon system has been used to measure the differential production cross-sections of \(D\) and \(B\) mesons, which are inputs to constrain the heavy-quark parton density functions of the free proton and the bound nucleon. These in turn are used to predict the prompt atmospheric lepton flux which forms the principal background for high-energy neutrino observatories.

  • ATLAS is a general purpose symmetric spectrometer with wide acceptance (Aad et al. 2008, 2009). The systems ordered by increasing pseudorapidity are: the central tracker with \(|\eta | < 2.5\), the muon system with \(|\eta | < 2.7\), the ECal with \(|\eta | < 3.2\), and the HCal with \(|\eta | < 4.9\). Notable is the ALFA Roman Pot system to precisely measure the total and elastic cross-section via the optical theorem. ATLAS profits from combined measurements with the LHCf experiment, which is described further below.

    Relevant strengths of ATLAS are its wide acceptance, the precisely measured luminosity, and its ability to combine measurements with LHCf. The collaboration has measured charged particle spectra in the central region as a function of \(p_{\text{T}}\) with high precision at the level of 1% at mid-rapidity and the forward energy flow, which constrain the hadron multiplicity in models. The wide acceptance was used to measure the cross-section for diffractive events, interactions in which the beam particles exchange no colour. The inelasticity of a hadronic interaction is sensitive to these interactions, which produce very few but high-energy particles far forward. Thanks to its ALFA detectors, ATLAS provided the most precise measurements of the inelastic p-p cross-section so far, a direct input for hadronic models which extrapolate this to p-air and higher energies.

  • CMS is the second general purpose symmetric spectrometer at LHC (Bayatian et al. 2006). The systems ordered by increasing pseudorapidity are: the central tracker and the muon system with \(|\eta |< 2.4\), the ECal with \(|\eta | < 3\), and the HCal with \(|\eta | < 5\). A system of zero-degree calorimeters (Surányi et al. 2021) detects neutral particles emitted at extremely small beam angles. Notable is also the single-arm very forward calorimeter system CASTOR (Khachatryan et al. 2021), which covers \(-6.6 < \eta < -5.2\). Moreover, CMS profits from combined measurements with the TOTEM experiment, described further below.

    Particular strengths of CMS are its wide acceptance covering parts of the forward phase-space, single-particle identification at central rapidities, as well as high-accuracy luminosity measurements. CMS has measured, for example, charged particle spectra and forward energy flow to constrain the hadron multiplicity in models, the cross-section for diffractive dissociation and total inelastic cross-sections. CMS has used particle identification capabilities of its tracker (via specific energy loss \(\text{d}E/\text{d}x\), see Sirunyan et al. (2017a) for a recent analysis) for low momentum particles at mid-rapidity. The method is technically limited to small momenta \(p \lesssim 2~\text{GeV} \,c^{-1}\), measurements are therefore restricted to the mid-rapidity region \(|y| < 1\) and \(p_{\text{T}}\lesssim 2~\text{GeV} \,c^{-1}\). This acceptance for particle identification overlaps with ALICE.

    Thanks to the CASTOR system, CMS has the widest nearly continuously covered acceptance in \(\eta \) over 13.2 units of rapidity. CASTOR has only a single-arm, but that is not a limitation for most measurements; the recorded collisions are either symmetric or both beam configurations were recorded like in the case of p-Pb. CASTOR is especially powerful for the measurement of diffractive events, low-\(x\) parton physics, and forward energy flow. The ability of the CASTOR system to distinguish between hadronic and electromagnetic energy deposit is particularly useful. CASTOR is currently decommissioned as part of the LHC-wide upgrade for Run 3 due to changes in the beam pipe geometry and shielding at CMS.

  • LHCb is a single-arm general purpose forward spectrometer designed for the study of flavour physics (Alves et al. 2008; Aaij et al. 2015a). It features a tracking and muon system, an ECal and HCal system, and a ring-imaging Cherenkov detector for particle identification over an acceptance \(1.9 < \eta < 4.9\). The tracker has acceptance also at \(-3.5 < \eta < -1.5\), but without magnetic field. The single-arm geometry is not a caveat for most measurements, since the recorded collisions are either symmetric or both beam configurations are recorded. The LHCb experiment also features a set of forward scintillators around the beam pipe called HeRSCheL (Akiba et al. 2018) located upstream and downstream of the main experiment. The HeRSCheL system is designed as a veto for diffractive and inelastic events for the study of central-exclusive production \(p+p \to p + X + p\). This system is decommissioned as part of the LHC-wide upgrade for Run 3. LHCb features an integrated gas target to study fixed-target collisions at \(\sqrt{s_{\text{NN}}} = 68\) to \(110~\text{GeV} \) with the LHC beams, which described in more detail in Sect. 4.3.

    LHCb is the only fully instrumented general-purpose spectrometer in the forward region \(1.9 < \eta < 4.9\) with tracking and particle identification. This range is of direct interest for air shower physics since it covers the onset of the forward region where most of the energy is directed to. LHCb was designed to measure the forward production cross-sections of prompt \(D\) and \(B\) mesons. The time-dilation that forward produced particles experience together with the high resolution tracker allows one to distinguish prompt and non-prompt production and to tag \(b\)-decays. LHCb data uniquely allows one to constrain heavy-quark parton density functions of the free proton and the bound nucleon at very low \(x\), which are used in turn to predict the prompt atmospheric lepton flux which forms the principal background for high-energy neutrino observatories. Further opportunities in the forward region are the study of identified charged hadron spectra, energy flow (separated by electromagnetic and hadronic flow), production cross-sections for photons and \(\pi ^{0}\), and production cross-sections for \(V^{0}\) particles important for air shower physics like \(\rho ^{0}\), \(K^{0}\), and \(\Lambda \). The HeRSCheL system could be used to measure the diffractive cross-section and improve the LHCb measurement of the inelastic cross-section.

Fig. 17
figure 17

Acceptances of the LHC experiments ALICE, ATLAS, CMS, LHCb, TOTEM, and LHCf, see Abelev et al. (2014d), Aad et al. (2008, 2009), Bayatian et al. (2006), Surányi et al. (2021), Alves et al. (2008), Aaij et al. (2015a), Khachatryan et al. (2021), Adriani et al. (2008), Anelli et al. (2008) for details. In the legend, a tracker follows individual particles in a magnetic field, while a counter measures particle densities in \(\eta\)-intervals. Muon refers to a special muon tracker, PID refers to the ability to identify individual particles, ECal and HCal refer to electromagnetic and hadronic calorimeters, respectively. The inner part \(|\eta | < 1\) of the CMS tracker is partially marked red to indicate its PID capabilities for particles with \(p_{\text{T}}\lesssim 2~\text{GeV} \,c^{-1}\). Similarly, the acceptance \(|\eta | > 8.4\) of the LHCf experiment around ATLAS is partially marked brown to indicate its capability to measure neutrons

In addition to the four large general purpose LHC experiments there two specialised experiments with a high impact on air showers, LHCf and TOTEM. Both experiments can measure particles emitted in the very-forward region.

  • LHCf is a system of electromagnetic zero-degree sampling calorimeters located up- and downstream of the ATLAS experiment (Adriani et al. 2008). It was designed to measure the production cross-sections of photons and \(\pi ^{0}\) in the far forward region \(|\eta | > 8.4\) with energy resolution of a few percent. It is also capable of measuring neutrons with an energy resolution of about \(40\%\) and detection efficiency of better than \(\approx 50\%\) (Adriani et al. 2018b). It is an independent experiment with its own trigger, but the time-stamps are partially synchronised with ATLAS so that some events may be merged offline. This has been used to study diffractive events, which had no activity in the ATLAS tracker. A highlight of LHCf is the measurement of neutron spectra at zero-degree, which is beyond the design goals of the calorimeter (Adriani et al. 2015, 2018b).

  • TOTEM is a system of charged particle trackers surrounding the CMS experiment, designed to measure the total p-p cross-section at the TeV-scale to very high precision via the optical theorem and to study elastic scattering and diffractive dissociation (Anelli et al. 2008). The experiment consists two large trackers T1 and T2 which are installed near the CMS experiment to provide particle tracking in the forward region \(3.1 < |\eta | < 4.7\) and \(5.3 < |\eta | < 6.5\). These trackers are used by TOTEM to tag inelastic collisions and offer the most forward data on charged particle densities at the LHC. Further small trackers are installed in special movable beam-pipe insertions (called Roman Pots) at several locations in large distances from the interaction point. The trackers inside the Roman Pots are moved within \(1~\text{mm}\) of the beam to measure elastic and diffractive scattering under extremely small angles. The TOTEM measurements of the inelastic cross-section of p-p collisions are among the most accurate to date. Also the shape of differential \(|t|\)-distribution for elastic events has been measured at unprecedented accuracy, illustrating higher-order correction to the assumed Gaussian shape of the proton-proton overlap region (Antchev et al. 2019a). Another unique highlight is the measurement of the Pomeron-photon interference region at extremely small scattering angles (Antchev et al. 2016), which was only done by TOTEM at LHC energies.

4.2 Relevant LHC measurements

The quantities that need to be measured at the LHC to improve the simulation of air showers are primarily average properties of light-flavor hadron production at low momentum transfer in the realm of semi-hard- and soft-QCD. Rare events like the production of heavy and/or high-\(p_{\text{T}}\) particles do not significantly influence air shower development and are not of prime interest, with the exception of forward heavy-flavour production, which is the source of the so-called prompt atmospheric neutrino flux. The most important measurements at LHC for air shower physics are

  • the inelastic cross-section, \(\sigma _{\text{inel}}\),

  • the hadron multiplicity over a wide range of rapidity,

  • the diffractive cross-section (cross-section for events with rapidity gaps to occur),

  • the composition and spectra of light hadrons: pions, kaons, protons, \(\text{K}^{0}\), and \(\Lambda \),

  • the forward production cross-sections of certain key particles: \(\pi ^{0} \), \(\eta \), \(\rho ^{0} \), the lightest \(D\) mesons and \(B\) mesons,

  • and the forward energy flow \(\text{d}E/\text{d}\eta \).

The total energy flow is a more effective quantity, while the ratio of hadronic to electromagnetic energy flow is a direct measurement of the energy fraction \(\alpha \) which remains in the hadronic cascade to which the number of low energy muons produced in an air shower is very sensitive to. The energy fraction \(1-\alpha \) is lost via photon production in each step through decays of short-lived mesons. These are mostly \(\pi ^{0}\) and \(\eta \) mesons. In \(\pi \)-air interactions, the forward production of \(\rho^{0}\) mesons plays an important role in increasing \(\alpha \) (Drescher 2008; Ostapchenko 2013; Riehn et al. 2020a) and generally the production of baryons (Grieder 1973; Pierog and Werner 2008; Riehn et al. 2020a). The relevant process for \(\rho ^{0}\) production in pion-projectile interactions cannot be directly measured with the LHC beams, but the ratio of \(\rho ^{0}\) to \(\pi ^{0}\) production in p-p and p-Pb collisions is an important benchmark for the hadronic generators. Forward baryon production can be measured, and anti-baryons are excellent proxies for baryons newly formed in interactions as opposed to being created in the disintegration/fragmentation of the projectile.

The measurement of \(D\) and \(B\) is not motivated by the Muon Puzzle, but by the wish to accurately predict inclusive atmospheric lepton fluxes above \(1~\text{PeV}\), which is the main background for high-energy neutrino observatories. Air showers provide the conventional component of this flux, the amount is directly linked to the muon production and thus the Muon Puzzle, since most muons and neutrinos are produced in pairs. At neutrino energies above \(1~\text{PeV}\), the prompt component becomes dominant which arises from the production and decay of charm and beauty in the first interactions of a cosmic-ray nucleus with air. The production cross-sections for \(D\) and \(B\) mesons in p-p, p-Pb, and p-O collisions constrain the corresponding nuclear parton density functions, which are key ingredients for the flux calculations.

On the one hand, the published data already constrain well the inelastic cross-sections, the hadron multiplicity, the diffractive cross-section, and the production cross-sections for \(D\) and \(B\) mesons, also to some part in the forward region. On the other hand, we identify a lack of data on identified hadron spectra and on strangeness production in the forward region especially in p-A collisions, which are expected to have a high impact on the Muon Puzzle. The relevant measurements that have been published in refereed journals so far are listed in Table 3 and are further discussed in the following.

  • Inelastic and elastic cross-sections. Two complementary techniques are used to measure the inelastic cross-section. One is based on counting empty events; the probability to observe empty events decreases as the inelastic cross-section increases. It requires wide acceptance, a precise measurement of the beam luminosity, and theory input to extrapolate the measured fiducial cross-section to the full inelastic cross-section. The other technique is employed by TOTEM and ATLAS/ALPHA and is based on observing elastic scattering down to very low momentum transfer. From the forward amplitude of elastic scattering, the total cross section can be calculated using the optical theorem, which is based on conservation of probability. This also requires some small model-dependent phase-space corrections as well as an independent measurement of luminosity. However, when inelastic event counting is combined with the elastic measurement, it is possible to perform a luminosity-independent cross section measurement. This is a particular advantage of the experimental setup of TOTEM and also the ATLAS/ALPHA system. Furthermore, from elastic scattering further important physics parameters can be extracted. The most important one is the shape of the differential elastic cross section in \(|t|\), which corresponds to the size and shape of the actual collision region in impact parameter space. For small \(|t|\) the shape of this distribution is exponential to very good approximation with small higher-order corrections (Antchev et al. 2019b). Another quantity that is measured by TOTEM in dedicated data taking and special analysis is the ratio of the real to imaginary part of the elastic forward scattering amplitude. This can be extracted from very high-\(\beta ^{*}\) data by observing the Coulomb-nuclear interference region. These parameters are further important ingredients in modelling of cross section, e.g. via Glauber or Gribov-Regge theory.

    Measurements of the inelastic cross-section of p-p collisions have been performed from 0.9 to \(13~\text{TeV}\) by multiple experiments. The most precise measurements have been obtained by TOTEM. CMS has also measured the inelastic cross-section of p-Pb collisions at \(5.02~\text{TeV}\), which is very interesting to test predictions based on Glauber models directly. There are currently no published measurements for Pb-Pb and Xe-Xe.

    These p-p measurements are very precise, and significantly improve the extrapolation toward the highest energies as they occur in air showers. This increase in accuracy had a significant impact on the predictions of the depth of shower maximum \(X_{\text{max}}\) in air shower simulations, but it is less relevant for the number of muons \(N_{\mu }\) produced in the shower.

  • Diffractive cross-sections and rapidity gaps. Diffractive collisions are a subclass of inelastic events (about 20% of the inelastic cross-section at the TeV scale) in which no colour is exchanged between the beam particles, but momentum is transferred and new particles are produced. The experimental signature of a diffractive event are large rapidity gaps, regions in rapidity devoid of particles. Rapidity gaps were first studied at HERA (Derrick et al. 1993; Ahmed et al. 1994). Important for air showers are single diffractive events, in which only one of the beam particles is dissociated, and double diffractive event, in which both are dissociated. The measurement is ideally performed with a detector with wide acceptance and very-forward detectors to detect the dissociation.

    The differential rapidity-gap cross section measurement by ATLAS and CMS are particularly useful. In this case the diffractive cross-section is given as a function of the size of the gap in pseudorapidity, which allows one to directly study the transition from inelastic to diffractive event topologies. This new type of measurement had an impact on the development of EPOS and revealed tension to air shower data that could only be overcome by an individual tuning of pion-projectiles in EPOS (Pierog et al. 2016). This underlined the importance of pion-projectiles in air showers, and the related potential modelling uncertainties.

    Diffractive cross-sections were measured in p-p collisions from 0.9 to \(7~\text{TeV}\) by ALICE, CMS, and TOTEM. Measurements in p-Pb collisions are under study by CMS (Sosnov 2020) and would help to understand how diffractive dissociation is modified in an ion collision.

  • Charged particle spectra. Charged particle spectra can be measured with particle tracking as a function of \(\eta \) and \(p_{\text{T}}\) or without tracking as a function of \(\eta \) only. Both types of measurements are of interest for air shower physics, the loss of the \(p_{\text{T}}\) information is acceptable at the TeV scale, since most of the lateral spread of particles perpendicular to the air shower axis is generated at lower energies. At high energies, the boost factor \(1/\gamma \) suppresses the transverse spread of the particles.

    Measurements at mid-rapidity \(|\eta | < 2.5\) have been performed in p-p collisions from 0.9 to \(13~\text{TeV}\), in p-Pb from 2.76 to \(8.16~\text{TeV}\), in Pb-Pb from 2.76 to \(5.02~\text{TeV}\), and in Xe-Xe collisions at \(5.44~\text{TeV}\). Forward measurements that cover the relevant \(\eta \)-region for air showers have also been performed up to \(\eta = 6.4\) in p-p collisions from 0.9 to \(8~\text{TeV}\) and in Xe-Xe collisions at \(5.44~\text{TeV}\) by ALICE, TOTEM, and LHCb. There are currently no published forward measurements for p-Pb and Pb-Pb. The forward measurements are especially valuable for air shower simulations, since the evolution of the hadronic cascade is dominated by forward-produced hadrons.

  • Forward energy flow. The energy deposits of particles in a electromagnetic or hadronic calorimeter are measured instead of tracking or counting charged particles. The advantage is that also neutral particles are captured. The disadvantage is the loss of the \(p_{\text{T}}\) information, but as mentioned previously this is not a major concern at the TeV scale. Calorimeters are preferred in the forward region over charged-particle counters or even trackers, since they better withstand high radiation levels closer to the beam. The CASTOR calorimeter of the CMS experiment has the most forward acceptance up to \(\eta = 6.6\), apart from zero-degree calorimeters (ZDCs), which cover \(\eta > 8.2\) but can only measure neutral particles. A particular ZDC is the LHCf experiment, which is specifically designed for tuning models employed in air shower simulations.

    Forward measurements up to \(|\eta | < 6.6\) were performed in p-p collisions from 0.9 to \(13~\text{TeV}\) by CMS and LHCb. And by CMS also in p-Pb collisions at \(5.02~\text{TeV}\) (Sirunyan et al. 2019a) as well as in Pb-Pb collisions at \(2.76~\text{TeV}\) (Chatrchyan et al. 2012a).

    Of particular interest for air shower physics is also the ratio \(R\) of electromagnetic and hadronic energy flows, which so far has been only measured once in p-p collisions at \(13~\text{TeV}\) by CMS using the CASTOR calorimeter. Of great interest would be energy flow ratio in p-Pb collisions as a function of charged particle multiplicity.

  • Identified hadron spectra. Experiment with a tracker and a particle identification system (based on specific energy loss, time-of-flight, Cherenkov cone angle, etc.) can discriminate individual hadron species track-by-track. Measured are then either yield ratios or identified hadron spectra, the product of relative yields and charged particle spectra. Only ALICE and LHCb have been specifically designed with particle identification capabilities, but CMS has successfully used the energy deposits in its tracker to identify particles with \(p < 2~\text{GeV}\,c^{-1}\).

    Pions, kaons, and protons were measured at mid-rapidity \(|\eta | < 1\) in p-p, p-Pb, and Pb-Pb collisions from 0.9 to \(13~\text{TeV}\) by CMS and ALICE. Measurements in the forward region \(2.5 < \eta < 4.5\) were performed in p-p collisions at 0.9 and \(7~\text{TeV}\) by LHCb. The anti-proton flux was measured in p-He collisions at \(\sqrt{s_{\text{NN}}} = 110~\text{GeV} \) with LHCb in fixed-target mode. We emphasise that there are currently no forward data available for p-Pb collisions. The hadron composition in the forward region and its potential nuclear modification plays an important role for the muon production in an air shower.

  • Inclusive photon, neutral pion, and \(\eta \) spectra. The inclusive flux of photons, produced dominantly from the decay of \(\pi ^{0}\) and \(\eta \) mesons, is the electromagnetic complement to the hadronic energy produced in a collision. The muon production in air showers is very sensitive to this ratio and therefore measurements are of great interest in regard to the Muon Puzzle. The spectra of neutral and charged pions are linked to first order by isospin symmetry, which means that measurements of charged pions also constrain neutral pions, but the yields are not exactly identical because of additional effects, for example, pions produced in decays of short-lived particles.

    ALICE has measured inclusive photon, neutral pion, and \(\eta \) production in p-p collisions from 0.9 to \(7~\text{TeV}\), in p-Pb at \(5.02~\text{TeV}\) and in Pb-Pb collisions at \(2.76~\text{TeV}\). The LHCf experiment has measured photon and neutral pion spectra in the very forward rapidity range \(8.4 < \eta \) in p-p collisions from 0.9 to \(13~\text{TeV}\), and in p-Pb at \(5.02~\text{TeV}\). The LHCf measurements are particularly important for the Muon Puzzle, since an unexpectedly low flux of neutral pions in the very forward region would have been a sign of new physics at the LHC energy scale with a direct impact on air showers. Since large deviations were not observed in the very forward region nor at mid-rapidity, the search is now focused on a smaller modification that applies to a wide rapidity range.

  • Very forward neutron spectra and inelasticity. The LHCf experiment has measured forward neutron spectra in p-p collisions at 7 and \(13~\text{TeV}\). While the actual inelasticity remains a more theoretical concept not directly accessible by measurements, an analysis based on the forward neutrons can be performed (Adriani et al. 2020) exploiting the fact that the forward neutron is often the most energetic particle, and subsequently the inelasticity is the complement of the energy fraction carried by the neutrons. The inelasticity as a conceptual parameter has a large impact on the depth of shower maximum \(X_{\text{max}}\), while the number of produced muons in the shower depends only weakly on it. The ratio of the very forward neutral pion and neutron spectra is an important parameter for muon production in air showers and are constrained by these measurements. An analysis of p-Pb collisions at \(5.02~\text{TeV}\) with LHCf is needed to further complete the picture.

  • Strangeness production. The enhanced strangeness production in events with high multiplicities is understood as a high-density QCD effect (related to QGP, or colour glass phases, etc.). Measurements by ALICE suggest that strangeness enhancement is to a large degree independent of the collision system or the collision energy and even appears in small collision systems like p-p collisions – it mainly depends on hadron multiplicity. This is a potential hint on how QCD physics could be extended to resolve the Muon Puzzle in air showers, as described in Sect. 3. Experimentally, strangeness production is measured by reconstructing decays of strange mesons and baryons, \(K_{S}^{0}\), \(\phi \), \(\Lambda \), \(\Xi ^{-}\), \(\Omega ^{-}\) and their antiparticles. This requires a high-precision tracker, since the momenta of the decay products need to be known and the impact parameter of candidate tracks to reduce combinatorial background. Particle identification is not required but beneficial to reduce combinatorial background. Multi-strange hadrons are particularly sensitive probes for a strangeness enhancement.

    ALICE has measured a strangeness enhancement at mid-rapidity \(|y| < 1\) as a function of multiplicity at mid-rapidity \(|y|<0.5\) using data from p-p collisions at \(7~\text{TeV}\), p-Pb at \(5.02~\text{TeV}\), and Pb-Pb at \(2~\text{TeV}\). CMS has measured strangeness in p-p, p-Pb, and Pb-Pb collisions up to \(|y|<2.4\). Earlier measurements were performed without splitting the data into multiplicity classes. ATLAS has measured \(\text{K}^{0}_{S}\) and \(\Lambda \) in p-p collisions at 0.9 and \(7~\text{TeV}\) up to \(|y| < 2.5\). LHCb has measured \(\phi \) production in p-p collisions at \(7~\text{TeV}\) in the range \(2.44 < y < 4.06\).

    The strangeness enhancement was observed at mid-rapidity, but it has not been experimentally demonstrated in the forward region. In order to solve the Muon Puzzle in air showers, the strangeness enhancement has to be present also in the forward region and it has to be sufficiently large to make an impact on the shower development. In order to search for this, high-precision measurements of strangeness production as a function of the particle multiplicity need to be performed in the forward region \(y > 2.5\) in p-p and p-Pb, and optionally Pb-Pb collisions with LHCb. This will address the key question whether the universal strangeness enhancement observed at mid-rapidity is also present in the forward region.

  • D and B meson production cross-section. The cross-sections for \(D\) and \(B\) meson production have no impact on air shower development, but their decays give rise to the prompt component of the atmospheric neutrino flux, which dominates the overall flux above about \(10~\text{PeV}\), see Fig. 9. The production via charm decays is dominant. The neutrino yield from decays of \(D\) and \(B\) mesons is similar, but the production cross-section for \(c\bar{c}\) is an order of magnitude higher than for \(b\bar{b}\) (Martin et al. 2003; Bhattacharya et al. 2016). \(B\) decays contribute less than 10% of the prompt component. The production cross-sections for \(D\) and \(B\) mesons are also important inputs for the (nuclear) parton density functions (PDF) at high momentum transfer (Ball et al. 2015; Hou et al. 2021; Zenaiev et al. 2020). The neutrino flux calculations are sensitive to the PDFs at small \(x \lesssim 10^{-5}\), which can only be constrained by forward measurements at large \(\eta \).

    Production cross-sections for \(D\) and \(B\) mesons have been measured at mid-rapidity in p-p collisions from \(2.76~\text{GeV}\) to \(13~\text{TeV}\), in p-Pb at 5.02 and \(8.16~\text{TeV}\), and in Pb-Pb at \(2.76~\text{TeV}\) with ALICE, ATLAS, and CMS. Forward measurements with LHCb have been performed in p-p collisions from 7 to \(13~\text{TeV}\) and in p-Pb collisions at 5.02 and \(8.16~\text{TeV}\). \(D\) mesons were also measured in fixed target mode with LHCb for p-He collisions at \(\sqrt{s_{\text{NN}}} = 86~\text{GeV} \) and p-Ar collisions at \(\sqrt{s_{\text{NN}}} = 110~\text{GeV} \).

Table 3 Peer-reviewed papers on LHC measurements related to air shower physics for ALICE (A), ATLAS (T), CMS (C), LHCb (B), TOTEM (O), and LHCf (F). Collision energies are in \(\sqrt{s_{\text{NN}}}/\text{TeV}\)

4.3 Fixed target experiments

This review focuses on the input from the LHC toward a solution of the Muon Puzzle in air showers, based on the results discussed in Sect. 2.6 that show a distinct onset of the muon discrepancy at energies which correspond to the TeV scale in the nucleon-nucleon cms-system. There are, however, also opportunities to further improve our knowledge on hadronic interactions in later stages of an air shower. A fixed-target setup can cover a large phase space for the measurement of particle productions relevant for air showers as discussed by Meurer et al. (2006), as shown in Fig. 18, and allows for great flexibility in the regard to the target nucleus, to study nuclear effects as a function of the system size. In this section, we discuss the two fixed-target experiments with high relevance for air shower physics at cms-energies that are one to two orders below the TeV scale, the NA61/SHINE experiment and the LHCb experiment in fixed-target mode.

  • NA61/SHINE. The main part of the NA61/SHINE experimentFootnote 2 (Abgrall et al. 2014c) is a set of large-acceptance time projection chambers (TPCs) with two superconducting magnets that have a combined bending power of \(9~\text{Tm}\), resulting in a precise measurement of particle momenta and excellent particle identification capabilities via the specific energy loss in the TPC volumes. The setup is very flexible in regard to projectiles and targets, which allowed it to collect a large variety of data. Of particular interest for air shower physics are measurements of identified hadron spectra in p-C interactions at \(31~\text{GeV}\,c^{-1}\) (Abgrall et al. 2011, 2012, 2016), in p-p interactions from 20 to \(158~\text{GeV}\,c^{-1}\) (Aduszkiewicz et al. 2020a,b, 2017b, 2016; Abgrall et al. 2014a,b), and in \(\pi \)-C interactions at 158 and \(350~\text{GeV}\,c^{-1}\) (Aduszkiewicz et al. 2017a; Prado 2018, 2019; Unger 2020). Resonances like \(\text{K}^{0}_{S}\), \(\text{K}^{*0}\), \(\Lambda \), \(\omega \) and \(\rho ^{0}\) were also measured. In the case of the data on \(\pi \)-C and in the p-C interactions, carbon is a very good proxy for air.

    A key result for air showers is shown in Fig. 19, the total energy fractions transferred to anti-protons and \(\rho ^{0}\) mesons in \(\pi \)-C interactions. These fractions were obtained by integrating the \(p \, \text{d}n/\text{d}p\) spectra including an extrapolation up to the full beam momentum (Prado 2018). The muon production in an air shower scales with the energy retained in the hadronic cascade. Since baryon number is conserved in subsequent interactions, this energy increases with the amount of baryon production. The forward \(\rho ^{0}\) production is important since it is an alternative to the charge exchange reaction \(\pi ^{-} + p \to \pi ^{0} + n + X\); leading \(\pi ^{0}\) production incurs a large loss of hadronic energy, since there is a high probability of \(\pi ^{0} \rightarrow \gamma \gamma \). On the other hand, the measured anti-proton fraction constrains the production of \(p\), \(\bar{p}\), \(n\), and \(\bar{n}\). The sum of the energy fractions of these particles is about at the same level as the one going into \(\rho ^{0}\) mesons. The comparison of the NA61/SHINE data to predictions of hadronic models reveals that none of the existing attempts to describe interactions in air showers succeeds in reproducing both energy fractions at the same time. These discrepancies get potentiated by the fact that a hadronic cascade initiated by a \(10^{11}~\text{GeV} \) particle will traverse all beam energies shown in Fig. 19. Measurements at energies beyond the SPS would be highly desirable to constrain the full air shower development. While baryon production can in principle also be measured in p-p collisions if universality in regard to the projectile can be assumed, a study of \(\rho ^{0}\) production in the charge exchange reaction requires a \(\pi ^{-}\) beam and accelerating pion beams at the LHC is not foreseen. Petrov et al. (2010) has investigated the prospect to measure the \(\pi^{+}\)-\(p\) cross-section indirectly by tagging pion-exchange interactions between colliding protons, which is technically feasible. It would require tagging events with a neutron in a zero-degree calorimeter and a rapidity gap next to the neutron. One could use such events to infer the inelastic cross-section and to study inclusive hadron production.

  • LHCb in fixed-target mode. The SMOG (System for Measuring the Overlap with Gas) is a device to inject small amounts of noble gases directly into the LHC beam pipe around the LHCb collision point (Aaij et al. 2014b). It was designed to precisely measure the radial profile of the LHC beams to accurately compute the luminosity of colliding proton bunches, as an alternative to van-der-Meer scans (Aaij et al. 2012a, 2014b). A combination of both techniques yielded the most accurate luminosity measurements so far at a bunched-beam collider (Aaij et al. 2014b).

    This system has been used to study fixed-target interactions with a variety of noble gases. So far, interactions of proton and lead beams with helium, neon, and argon have been recorded with centre-of-mass energies \(\sqrt{s_{\text{NN}}} \) of 68, 87, and \(110~\text{GeV}\). LHCb has access to the highest energies ever obtained in a fixed-target experiment and closes the energy gap between previous fixed target experiments and the TeV scale. Measurements of the charm production in p-He and p-Ar (Aaij et al. 2019a) and anti-proton production in p-He have been published (Aaij et al. 2018a). The forward acceptance of LHCb in collider mode corresponds to an acceptance at mid-rapidity \(-2.5 < \eta _{\text{cms}} < 0.5\) in the nucleon-nucleon cms frame when running in fixed-target mode. The fixed-target measurements of p-A collisions in the mid-rapidity region offer important opportunities to study the hadron composition and its modification as a function of the nuclear target.

    The SMOG device is currently upgraded (Barschel et al. 2020) by installing a gas storage cell with open windows upstream of the vertex locator. This upgrade allows for injecting non-noble gasses, in particular hydrogen, nitrogen, and oxygen, and at higher densities to increase the interaction rates by two orders of magnitude. The new device was been designed explicitly to allow fixed-target experiments with LHC beams and is also motivated by air shower physics. The increased luminosity will improve the accuracy of studies of charm and beauty production, while air shower physics will profit from the study of the hadron composition in different projectile-target system combinations. A particularly interesting opportunity with SMOG2 for air shower physics will arise when the LHC is run with oxygen beams.

Fig. 18
figure 18

Overview of data on pion spectra in p-p, p-Pb collisions as a function of the momentum of the incoming proton (Adare et al. 2011; Agakishiev et al. 2012; Chatrchyan et al. 2012b; Abelev et al. 2014e; Abgrall et al. 2014a; Aguilar-Benitez et al. 1991; Adam et al. 2015d; Aaij et al. 2012c; Adriani et al. 2016; Aamodt et al. 2011). The proposed p-O collisions are indicated. In case of LHCb, only hadron yield ratios were measured and not full spectra

Fig. 19
figure 19

Energy fraction transferred to anti-protons (left) and \(\rho ^{0}\)-mesons (right) in \(\pi \)-C collisions as measured by NA61/SHINE (data points) and as predicted by hadronic interaction models over the whole range of beam energies relevant for air showers (images from Prado 2019; Unger 2020)

4.4 Prospects with oxygen beams in the LHC

Citron et al. (2019) discuss the science case for and technical feasibility of a pilot run with oxygen beams at the LHC at the end of Run 3 to study p-O and O-O collisions at 10 and \(7~\text{TeV}\), respectively, in response to a long-standing request by the astroparticle community. The pilot run has been approved and is expected in 2023 or 2024. Details about the planned run are given by Bruce et al. (2021).

The primary motivation to request oxygen beams is the fact that measurements in p-p and p-Pb are both not representative of the first interaction in an air shower, while p-O and O-O are both suitable references (Citron et al. 2019, Sect. 11.3). To identify simple laws and universal features of proton-ion collisions, an intermediate system between p-p and p-Pb collisions is needed as indicated by Fig. 14, since an interpolation based on two points is ambiguous. Caution is particularly warranted in light of the recent surprising findings of collective effects in high multiplicity p-p and p-Pb collisions that were previously discussed in Sect. 3.

There is a considerable theoretical uncertainty in the extrapolation of hadron production parameters from p-p to p-O, as demonstrated in Fig. 20. Most hadronic interaction models (with the exception of EPOS) are only tuned to p-p collisions and then extrapolated to p-air collisions. Models agree to better than 5% at \(13~\text{TeV}\) in p-p collisions at mid-rapidity, but the spread in p-O collisions at \(10~\text{TeV}\) is 50%. Similar divergence is found for other relevant aspects of hadronic collisions, for example, the p-air cross-section previously shown in Fig. 16. All measurements discussed in Table 3 should be performed in p-O and compared to an interpolation from p-p to p-Pb, especially measurements at forward rapidities.

Fig. 20
figure 20

Charged particle spectra from different hadronic generators in p-p at \(13~\text{TeV}\) and in p-O at \(10~\text{TeV}\). The data points are from CMS (Khachatryan et al. 2015c)

Some measurements also become more precise in p-O. The LHCf experiment has measured the forward photon and \(\pi ^{0} \) flux in p-Pb, but the accuracy of the measurement suffers from a large background from ultra-peripheral collisions (UPCs), in which the proton interacts with a virtual photon emitted by the lead nucleus, to form e.g. a \(\Delta (1232)\) resonance, which in turn decays into a proton and a \(\pi ^{0}\) or a neutron and a charged pion. This process occurs with a similar or even larger rate in p-Pb collisions compared to strong interactions. Thus, LHCf uses a model-dependent correction which introduces a leading systematic uncertainty (Adriani et al. 2014). Since the UPC rate scales with \(Z^{2}\), it is completely negligible in p-O. Khachatryan et al. (2016) faced a related issue in their measurement of the p-Pb inelastic cross-section, but the measurement was not as severely affected.

Interesting opportunities with oxygen beams also arise for the LHCb experiment running in fixed-target mode. If oxygen gas is injected, it would allow LHCb to study O-O collisions at \(\sqrt{s_{\text{NN}}} = 81~\text{GeV} \) at mid-rapidity, \(-2.5 < \eta _{\text{cms}} < 0.5\). If hydrogen gas is injected, the O-\(p\) system at \(\sqrt{s_{\text{NN}}} = 115~\text{GeV} \) probes the mid to near forward region \(-0.5 < \eta _{\text{cms}} < 2.5\) in the nucleon-nucleon system. The acceptance further forward is an advantage compared to the mirrored system in which the proton beams collide with oxygen gas.

5 Summary

The muon deficit in air shower simulations has been experimentally established with \(8~\sigma\) evidence Dembinski et al. (2019), Cazon (2020), Soldin (2021). Measurements with small model-dependence were important to establish the muon deficit and were provided by the Pierre Auger Observatory and the IceCube Neutrino Observatory. The muon deficit has an onset at \(\sqrt{s_{\text{NN}}} \approx 8~\text{TeV} \) in the nucleon-nucleon cms-system, followed by a linear increase with logarithm of the energy. The source of the deficit therefore should be observable in high-energy collisions at the LHC. The most likely explanation consistent with all available data is a small modification to the hadron production that reduces the energy fraction carried by photons, which originate mostly from \(\pi ^{0}\) decays, in soft hadronic collisions. Such a modification has a compounded effect on the hadronic cascade, so that only a comparably small modification is required. A small modification would not destroy the consistency of current air shower simulations with other air shower data; the first two moments of the depth of shower maximum and the intrinsic fluctuations of the muon number in air showers that were recently measured by the Pierre Auger Observatory for the first time.

An enhancement of strangeness production that qualitatively matches this desired behaviour was observed at the LHC by ALICE in the mid-rapidity region (Adam et al. 2017b). There is no consensus in regard to the theoretical explanation of the phenomenon, but the most important point for air showers is its apparent universality (Anchordoqui et al. 2020). The modification does not depend on the collision energy and the collision system, only on the multiplicity of the event. A preliminary study (Baur et al. 2019) suggests that the effect could resolve the muon discrepancy. Since hadron production at mid-rapidity does not have a direct impact on air showers, the ALICE result by itself does not yet solve the Muon Puzzle. Required are measurements of the forward hadron production at a pseudo-rapidities \(\eta > 2\) in the region relevant for air showers. Further needed are studies of light hadron production in the forward region and of the hadron composition, and in particular strangeness production, as a function of the charged particle multiplicity in the collision systems from p-p and p-Pb, and possibly Pb-Pb. The LHCb experiment is in a unique position to perform these so far missing measurements. Also important are direct measurements of the ratio of the electromagnetic and hadronic energy flow with the CASTOR experiments in proton-lead collisions in addition to those in proton-proton collisions.

In light of these findings and in regard to the current divergence of generators of hadronic interactions when applied to p-O collisions, studies of p-O collisions at the LHC as proposed in Citron et al. (2019) are essential to understand the evolution of hadron production between p-p and p-Pb, which should be performed with all LHC experiments to characterise hadron production over a wide acceptance in rapidity. To improve the accuracy of air shower predictions for the depth of shower maximum, also a precise measurement of the p-O cross-section is highly desired which could be provided by TOTEM or ATLAS-ALFA and the very forward photon production, which could be measured with LHCf, if the runs with oxygen beams are not postponed to Run 4 of the LHC.

A solution to the muon deficit from the LHC would have a large impact on the astroparticle community. Research on high-energy cosmic rays would directly benefit, but also research with gamma-ray and neutrino observatories, since cosmic rays generate the background for these observatories. In the field of cosmic-ray experiments, future upgrades will provide more muon data. The Pierre Auger Observatory is upgrading its array (AugerPrime) to give each surface detector muon separation capabilities (Aab et al. 2016c). The radio observation of highly-inclined air showers will combine a calorimetric measurement of the shower energy using radio data with muon measurements in the surface detectors. The IceCube Neutrino Observatory is working on a study of the GeV and TeV muon component in air showers, which can be measured simultaneously by combining data from the surface array and the deep in-ice detector (De Ridder et al. 2018). An extension of IceTop, the surface array of IceCube, with scintillator and radio detectors (Haungs 2019; Schröder 2020) will provide a calorimetric measurement of the shower energy and will further enhance the sky coverage, energy range, and accuracy of hybrid muon measurements. The future measurements with AugerPrime and IceCube will significantly contribute to the solution of the Muon Puzzle in EAS.