1 Introduction

Many existing models beyond the Standard Model (BSM), such as the two-Higgs doublet model [1] and its extensions, incorporate scalars with masses lower or higher than that of the SM Higgs boson (\(m_h = 125\) GeV) [2], with model parameters heavily constrained by existing experimental data and theoretical limits. The multi-lepton anomalies seen in Run 1 data at ATLAS and CMS are explained in a two-Higgs doublet model with an additional real singlet scalar (2HDM+S) [3,4,5,6,7,8]. For a recent review of anomalies see Ref. [9]. In this model the mass of the heaviest CP-even scalar H is considered in the interval \(2m_h \le m_H < 2m_t\), where \(m_t\) is the mass of the top quark. The 2HDM+S model with different scalar mass ranges is also well motivated by BSM theories [10,11,12,13,14], by the possible existence of BSM scalars in Large Hadron Collider data [15, 16] and at future \(e^+e^-\) colliders [17, 18], and by attempts to explain the dark matter abundance [19,20,21], di-Higgs production [22], the excess seen at 96 GeV [23, 24] and the recent CDF [25] W-mass measurement [26]. Searches for heavy scalars in the WW/ZZ channels have been considered at CMS and ATLAS [27,28,29]. The discovery potential of a heavy Higgs boson through resonant di-Higgs production at the HL-LHC and FCC-hh has been studied with the \(4\tau \) and \(bb\gamma \gamma \) channels in the “xSM” model [30]. Even the physics of dark matter and axions or axion-like particles can be connected with CP-even or CP-odd scalars [31,32,33].

In this work we investigate the possibility of probing H at the proposed future electron-proton colliders via the deep-inelastic scattering charged-current (CC) process. The proposed Large Hadron electron Collider (LHeC) facility at CERN provides a sufficient center-of-mass energy, \(\sqrt{s} \approx 1.3\) TeV for electron (proton) energies of \(E_{e(p)} = 60\) GeV (7 TeV), to explore the allowed mass range of H. Interestingly, in this mass range one can explore the resonance H via its decay to \(W^\pm \) and \(Z\) bosons. In this work we consider \(H\rightarrow W^+W^-\), where the \(W^\pm \) decay to hadronic final states. The mass reconstruction of H through this final state is challenging because (a) the \(W^\pm \) bosons emanating from the heavy H are boosted with respect to the laboratory frame, and hence the jets coming from the \(W^\pm \) are collimated, and (b) in the \(e^- p\) production process, the scattered jet from the proton line is not easily distinguishable from the jets coming from the \(W^\pm \) (Fig. 1). However, the high pseudo-rapidity (\(\eta _j\)) region of the scattered jets can be exploited to reconstruct the signal. We also employ a machine learning approach to distinguish the signal from potential backgrounds.
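As a quick cross-check of the quoted \(\sqrt{s} \approx 1.3\) TeV, the following minimal sketch evaluates the head-on collision formula \(\sqrt{s} = 2\sqrt{E_e E_p}\), which neglects the beam-particle masses. The function name and the approximate top-quark mass of 173 GeV are assumptions introduced for illustration; they are not taken from the text.

```python
import math


def ep_sqrt_s(e_electron_gev: float, e_proton_gev: float) -> float:
    """Centre-of-mass energy (GeV) of a head-on e-p collision, masses neglected."""
    return 2.0 * math.sqrt(e_electron_gev * e_proton_gev)


# Beam energies quoted in the text for the LHeC.
sqrt_s = ep_sqrt_s(60.0, 7000.0)

# Mass window 2*m_h <= m_H < 2*m_t considered for the heavy scalar.
m_h = 125.0  # GeV, from the text
m_t = 173.0  # GeV, approximate value assumed here (not quoted in the text)
mass_window = (2.0 * m_h, 2.0 * m_t)

print(f"sqrt(s) = {sqrt_s:.0f} GeV")
print(f"m_H window: [{mass_window[0]:.0f}, {mass_window[1]:.0f}) GeV")
```

The whole window lies well below \(\sqrt{s}\), which is the sense in which the collider energy is sufficient for this mass range.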

In Sect. 2 we discuss the framework needed to perform this analysis. The event simulation and the tools used are described in Sect. 3. The mass reconstruction methods are described in Sect. 4. A summary and discussion of this work are presented in Sect. 5.

Fig. 1

Leading order diagram for the signal process \(p e^- \rightarrow \nu _e H j\), \(H \rightarrow W^{+}W^{-}\), \(W^\pm \rightarrow jj\). Here, \(q \equiv u, c, \bar{d}, \bar{s}\) and \(q' \equiv d, s, \bar{u}, \bar{c}\)

2 Model

To investigate the discovery potential of a heavy Higgs boson with mass \(2m_h \le m_H < 2m_t\) in the \(e^- p\) environment, we consider a model where H corresponds to a real singlet scalar field \(\Phi _H\) which mixes with the SM SU(2) doublet Higgs field \(\Phi \). The Higgs-boson Lagrangian is then modified and can be written as [34,35,36]:

$$\begin{aligned} \mathcal{L}_\textrm{Higgs}&= (D_\mu \Phi )^2 + (\partial _\mu \Phi _H)^2 + \mu _h^2 \left| \Phi \right| ^2 - \lambda _h \left| \Phi \right| ^4 \nonumber \\&\quad + \mu _H^2 \left| \Phi _H\right| ^2 - \lambda _H \left| \Phi _H\right| ^4 + \xi \left| \Phi \right| ^2\left| \Phi _H\right| ^2. \end{aligned}$$
(1)

In general, the parameters \(\mu _h, \mu _H, \lambda _h\) and \(\lambda _H\) are all positive in order to have a stable potential, but \(\xi \) is not required to have any particular sign. We assume that the scalar fields in the above Lagrangian acquire vacuum expectation values, and hence the component fields can be written as:

$$\begin{aligned} \Phi = \frac{1}{\sqrt{2}} \begin{pmatrix} G^\pm \\ \phi + \textrm{v} + i G^0 \end{pmatrix}, \Phi _H = \frac{1}{\sqrt{2}}\left( \phi _H + \textrm{v}_H + i G^\prime \right) . \end{aligned}$$
(2)

Here the fields G are Goldstone bosons absorbed by the vector bosons, so no physical pseudoscalar states are left in the spectrum. The scalar spectrum, however, has two physical states h and H rather than the single state of the SM. Also, since the singlet does not couple to the \(SU(2)_L \times U(1)_Y\) gauge bosons, it does not contribute to \(m_W\) and \(m_Z\), and hence v must take the SM value \(\textrm{v} = 246\) GeV. We can also redefine the coefficients of Eq. (1) such that \(\textrm{v}_H = 0\). Note that we do not impose any extra symmetries such as \(\mathbb {Z}_2\) in the scalar sector, and in general \(\phi \) will mix with \(\phi _H\) to form the mass eigenstates. As before, we assume \(m_h < m_H\), where \(m_h = 125\) GeV is the mass of the SM Higgs boson and \(m_H\) is the mass of the heavy scalar singlet. The mass eigenstates h and H are related to the gauge eigenstates \(\phi \) and \(\phi _H\) by a \(2\times 2\) unitary matrix \(V\):

$$\begin{aligned} \begin{pmatrix} \phi \\ \phi _H \end{pmatrix} = V \begin{pmatrix} h \\ H \end{pmatrix}. \end{aligned}$$
(3)

Hence the couplings of the gauge bosons and fermions to h are the same as in the SM if \(\left| V_{11}\right| = 1\), which implies \(\left| V_{12}\right| = \sqrt{1 - \left| V_{11}\right| ^2} = 0\). In this work, however, we consider \(\left| V_{11}\right| \ne 1\) and \(\left| V_{12}\right| \ne 0\). The production rates of h and H are then suppressed by a factor \(\left| V_{1i} \right| ^2\) (\(i = 1\) for h, \(i = 2\) for H) relative to the SM h production rates. The branching ratios (BRs) of h to SM particles are identical to the SM BRs, while the BRs of the heavy H depend on whether the channel \(H \rightarrow h h\) is kinematically accessible. For our analysis we scale the \(HW^+W^-\) coupling with respect to the SM Higgs boson \(hW^+W^-\) coupling.
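To make the scaling concrete, the sketch below assumes that \(V\) is a real rotation by a single mixing angle \(\theta \), which is one possible parametrisation of a \(2\times 2\) unitary matrix and is not stated in the text. The function names and the value \(\theta = 0.3\) are purely illustrative. It checks that \(\left| V_{11}\right| ^2 + \left| V_{12}\right| ^2 = 1\) and returns the factors by which the h and H production rates are scaled.

```python
import math


def mixing_matrix(theta: float) -> list:
    """Real 2x2 rotation used as an assumed parametrisation of V in Eq. (3)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]


def production_factors(v: list) -> tuple:
    """Return (|V_11|^2, |V_12|^2), the rate factors for h and H."""
    return abs(v[0][0]) ** 2, abs(v[0][1]) ** 2


theta = 0.3  # hypothetical mixing angle in radians, not a value from the text
V = mixing_matrix(theta)
k_h, k_H = production_factors(V)

print(f"|V_11|^2 = {k_h:.4f}  (h rate relative to the SM)")
print(f"|V_12|^2 = {k_H:.4f}  (H rate relative to an SM-like scalar)")
print(f"sum      = {k_h + k_H:.4f}")
```

In the limit \(\theta \rightarrow 0\) the factors reduce to (1, 0), which is the SM-like case \(\left| V_{11}\right| = 1\) described above.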

3 Event simulation and tools

The charged-current (CC) signal process for heavy scalar H production proceeds through \(p e^{-} \rightarrow \nu _{e} H j\), where \(\nu _e\) is the electron neutrino (the source of missing energy) and j represents the jet emanating from the proton line (we refer to this j as the scattered or forward jet in the text). The decays \(H \rightarrow W^+ W^-\) and \(W^\pm \rightarrow j j\) are included at the matrix-element level for this signal process (see Fig. 1). Note that H can also be produced at tree level in the neutral-current process through the fusion of Z bosons, \(p e^{-} \rightarrow e^- H j\), but for an unpolarized \(e^-\) beam its cross-section is sub-dominant, approximately 5.5 times smaller than that of the CC process, which proceeds through \(W^\pm \) fusion.

To generate event samples for the signal and potential backgrounds we use the Monte Carlo generator MadGraph5 [37], interfaced with a customised Pythia-PGS [38] for parton showering and hadronization (for details see Ref. [39]). The detector simulation is performed using Delphes [40] with parameters optimised for the LHeC detector. Jets are clustered using FastJet [41] with the anti-\(k_{T}\) algorithm [42] and distance parameter \(R = 0.4\). The factorisation and renormalisation scales for the signal simulation are fixed to the heavy Higgs boson mass \(m_{H}\). The background simulations use the default MadGraph5 dynamic scales. The polarization of the electron beam is assumed to be \(-80\)%. This enhances the cross-sections by a factor of \(\sim 1.8\) with respect to an unpolarized \(e^-\) beam for both signal and background.
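The quoted factor of \(\sim 1.8\) agrees with the standard scaling of a pure charged-current \(e^-\) cross-section with longitudinal beam polarization \(P_e\), namely \(\sigma (P_e) = (1 - P_e)\,\sigma (0)\), which holds because only left-handed electrons couple to the W boson. The sketch below simply evaluates that factor. The function name is illustrative, and the formula is assumed to apply only to processes that proceed purely through W exchange.

```python
def cc_polarization_factor(p_e: float) -> float:
    """Ratio sigma(P_e) / sigma(0) for a pure charged-current e- process."""
    if not -1.0 <= p_e <= 1.0:
        raise ValueError("polarization must lie in [-1, 1]")
    return 1.0 - p_e


# Beam polarization of -80% quoted in the text.
factor = cc_polarization_factor(-0.8)
print(f"enhancement factor = {factor:.1f}")
```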

Table 1 Total cross-sections (in fb) for signal production (see text) and potential backgrounds with \(E_e = 60\) GeV and \(E_p = 7\) TeV. The polarisation of \(e^-\) is taken to be \(-80\)%. The first row represents the signal process and the other four rows are for the dominant background processes
Table 2 Summary of the event selections. The first column gives the selection criteria. The second column contains the weight of the signal process \(p e^- \rightarrow \nu _e H j\), \(H \rightarrow W^+W^-\), \(W^\pm \rightarrow jj\) for \(m_H = 270\) GeV. Columns three to six give the weights of the dominant backgrounds, and the seventh column gives the weighted total number of background events. All weights are calculated with \(\mathcal{L}\) = 1 ab\(^{-1}\). The significance of the signal over the total background is given in the eighth column. The last column gives the estimated significance with \(\delta _{sys} = 2\%\)
Fig. 2

a Multiplicity of jets in signal and backgrounds. b The pseudo-rapidity distribution of the forward jet after the five-jet selection in signal and backgrounds

The cross-sections for the signal and potential background processes are calculated at leading order using MadGraph5, with minimal cuts on the jet transverse momentum, \(p_{T_j} > 20\) GeV, and the jet pseudo-rapidity, \(-1< \eta _j < 5\), and with no requirement on the missing transverse energy \(E^{miss}_T\). They are presented in Table 1 for a benchmark value of \(m_H = 270\) GeV. Before reconstructing the mass of H with appropriate methodologies, we apply preliminary selection criteria to estimate the significance: (a) since the final state of the signal (Fig. 1) contains five jets at the matrix-element level (four from the decays of the \(W^\pm \) bosons and one scattered jet), we require at least five leading \(p_T\)-ordered jets in the simulated events, and (b) \(E^{miss}_T > 20\) GeV. Table 2 presents the number of weighted signal (S) and background (B) events at a luminosity of \(\mathcal {L}\) = 1 ab\(^{-1}\) after these selection criteria, where the significance of signal over background is calculated with the formula \(\sigma = S/\sqrt{B}\). It is interesting to note that there is a slight increase (\(\approx 2.3\)%) in \(\sigma \) after the selection of five leading jets, whereas \(E^{miss}_T > 20\) GeV reduces \(\sigma \) by \(\approx 7\)% in comparison with the initial weighted events. To account for systematic uncertainties in the shape of the signal and background distributions due to detector resolution, \(E_T^{miss}\) measurement, reconstruction efficiency etc., as well as in the expected number of events, we calculate the significance as a function of a systematic factor \(\delta _{sys}\), \(\sigma (\delta _{sys}) = S/\sqrt{B+(\delta _{sys}\cdot B)^2}\), and add this estimate to Table 2.
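The two significance estimators used here can be written as a single function, since \(\sigma (\delta _{sys})\) reduces to \(S/\sqrt{B}\) for \(\delta _{sys} = 0\). The following is a minimal sketch. The yields in the example are placeholder numbers chosen for easy arithmetic, not values from Table 2, and the function name is illustrative.

```python
import math


def significance(s: float, b: float, delta_sys: float = 0.0) -> float:
    """S / sqrt(B + (delta_sys * B)^2); equals S / sqrt(B) when delta_sys = 0."""
    if b <= 0.0:
        raise ValueError("background yield must be positive")
    return s / math.sqrt(b + (delta_sys * b) ** 2)


# Placeholder yields (NOT taken from Table 2), chosen for easy arithmetic.
S, B = 500.0, 10000.0

z_stat = significance(S, B)        # statistical only
z_sys = significance(S, B, 0.02)   # with a 2% systematic factor

print(f"S/sqrt(B)           = {z_stat:.2f}")
print(f"with delta_sys = 2% = {z_sys:.2f}")
```

Because the systematic term grows as \(B^2\), its impact is largest for selections with a large background yield, which is why tightening the selection can raise \(\sigma (\delta _{sys})\) even when \(S/\sqrt{B}\) changes little.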

These observations must be accounted for in the mass reconstruction procedure of H, which is discussed in the next section.

4 Reconstruction of the invariant mass

In order to reconstruct \(m_H\) it is important to select appropriate hadronic jets in our signal and to compare their features with those of the dominant backgrounds. To begin the procedure we must isolate and identify the hadronic jets after detector simulation. Figure 2a shows the number of hadronic jets, constructed with \(\Delta R = 0.4\). It is clear that the number of hadronic jets from the ZZ background is comparable to that of the signal. A similar feature can be observed in the pseudo-rapidity of the forward jets, \(\eta _j\), as shown in Fig. 2b. Therefore, the ZZ background needs to be suppressed with the help of the missing transverse energy cut \(E^{miss}_T > 20\) GeV (see Fig. 3); the corresponding significant reduction in weighted events can be seen in Table 2.

To compare the reconstructed invariant \(m_H\) with the truth-level mass, the hadronic jets originating from the \(W^+\) and \(W^-\) bosons are selected using the truth-level information (note that in the signal the \(W^\pm \) come from the decay of H). The invariant mass of the two jets, \(m_{jj}\), from the \(W^+\) (\(W^-\)) is shown in Fig. 4a (Fig. 4b). Note that along with the signal we show only backgrounds with \(W^\pm \) final states, as no information is stored for Z bosons at truth level.

After analysing these observables, we apply three different methodologies to reconstruct \(m_H\) in the channel considered and compare the significance. In Method 1, the four leading \(p_T\)-ordered jets are selected. Method 2 selects four hadronic jets excluding the most forward jet (the one with the largest \(\eta _j\)), while Method 3 uses high-level machine learning (ML) techniques.

Fig. 3

The missing transverse energy distribution after applying the \(E^{miss}_T > 20\) GeV requirement

4.1 Method 1: selection of four \(p_T\)-ordered leading jets

In this method, all jets are sorted according to their \(p_T\) and the four leading (\(p_T\)-ordered) jets out of five are selected from the weighted signal and background events. We expect an inherent uncertainty in this method from the forward jet (which may not originate from either the \(W^+\) or the \(W^-\)), and this may contaminate the reconstruction of \(m_H\) in the signal. The invariant mass distribution of the four selected jets, \(m_{4j} \equiv m_H\), obtained with this method is shown in Fig. 5a. The corresponding significances \(\sigma \) are shown in Table 3 (second column). Here \(\sigma _{m_{4j}}\) represents the significance in the full available range of \(m_{4j}\), and \(\sigma _{max}\) is the maximum significance, achieved in a restricted range of \(m_{4j}\). With the initial events, this method yields a maximum of \(4.0 \sigma \) within the invariant mass range \(m_{4j}\in [190, 540]\) GeV, an improvement of 2.5% over the full range of \(m_{4j}\). After selecting \(E^{miss}_T > 20\) GeV, the significance improves to \(4.9 \sigma \) in \(m_{4j}\in [190, 540]\) GeV (a 4.1% improvement over the full range). This is an improvement of \(\sim 16\)% in comparison with the significance shown in Table 2.
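The search for the \(m_{4j}\) range that maximises \(\sigma \) can be illustrated with a brute-force scan over contiguous bins of binned signal and background yields. This is a sketch of one way such a scan could be done, not necessarily the procedure used for Table 3. The histogram contents below are toy numbers, the function name is hypothetical, and windows with zero background are skipped by choice.

```python
import math
from itertools import accumulate


def best_mass_window(edges, sig, bkg):
    """Scan all contiguous bin ranges; return (max S/sqrt(B), low edge, high edge)."""
    if not (len(edges) == len(sig) + 1 == len(bkg) + 1):
        raise ValueError("need len(edges) == number of bins + 1")
    # Cumulative sums make each window yield an O(1) difference.
    cum_s = [0.0] + list(accumulate(sig))
    cum_b = [0.0] + list(accumulate(bkg))
    best = (0.0, None, None)
    for lo in range(len(sig)):
        for hi in range(lo + 1, len(sig) + 1):
            s = cum_s[hi] - cum_s[lo]
            b = cum_b[hi] - cum_b[lo]
            if b <= 0.0:
                continue  # significance undefined without background
            z = s / math.sqrt(b)
            if z > best[0]:
                best = (z, edges[lo], edges[hi])
    return best


# Toy binned yields (NOT from the analysis), m_4j bin edges in GeV.
edges = [150, 200, 250, 300, 350, 400]
sig = [5, 40, 60, 10, 2]
bkg = [400, 300, 300, 400, 500]

z_full = sum(sig) / math.sqrt(sum(bkg))
z_max, m_lo, m_hi = best_mass_window(edges, sig, bkg)
print(f"full range: {z_full:.2f} sigma")
print(f"best window [{m_lo}, {m_hi}] GeV: {z_max:.2f} sigma")
```

In this toy case the scan keeps the two bins around the peak and drops the background-dominated sidebands, which is the same mechanism by which \(\sigma _{max}\) exceeds \(\sigma _{m_{4j}}\).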

From the distribution of \(m_{4j}\) in Fig. 5a, the invariant mass peak is seen to be wide, and a likely reason is the contamination from forward jets discussed above. Removing the forward jet is therefore expected to narrow the width and give a better mass reconstruction; this is discussed in the next subsection.

Fig. 4

Invariant di-jet mass distribution \(m_{jj}\) from truth-level information of a \(W^{+}\) and b \(W^{-}\), where \(H \rightarrow W^+ W^-\) with \(m_H = 270\) GeV

Fig. 5

a Invariant mass distribution of four \(p_T\)-ordered leading jets [Method 1 (Sect. 4.1)]. b Invariant mass distribution of four \(p_T\)-ordered jets by removing the forward jet [Method 2 (Sect. 4.2)]

Table 3 The significance is calculated at each stage of the optimised selection criteria using \(\sigma = S/\sqrt{B}\) and \(\sigma (\delta _{sys} = 2\%) = S/\sqrt{B+(\delta _{sys}\cdot B)^2}\), where S and B are the expected signal and background yields, respectively, at a luminosity of 1 ab\(^{-1}\). Here \(\sigma _{m_{4j}}\) represents the significance in the full available range of \(m_{4j}\), and \(\sigma _{max} (m_{4j})\) is the maximum significance that can be achieved; the corresponding range \(m_{4j} \in [m_{4j}^{min},~m_{4j}^{max}]\) is specified for each approach (the corresponding S and B are given in the next row)

4.2 Method 2: elimination of forward jet

Method 1 only slightly improved the significance of the \(m_H\) measurement using \(m_{4j}\) from the four leading \(p_T\)-ordered jets (compare with the significance in Table 2). We therefore employ a second approach, in which the forward jet, corresponding to the largest \(\eta _j\), is eliminated and the remaining four \(p_T\)-ordered jets are selected. In addition, we verified that the selected jets originate from \(W^\pm \) bosons using the truth-level information. The corresponding invariant mass distribution is shown in Fig. 5b. Clearly the \(m_{4j}\) distribution is narrower than in Method 1 (Fig. 5a), so this approach should improve the accuracy of the \(m_H\) measurement. This approach uses the same number of initial weighted events as Method 1. When reconstructing the invariant mass of H, this method achieves a maximum significance of 5.0\(\sigma \) before applying the missing energy cut. After selecting events with \(E^{miss}_{T}\) > 20 GeV, a maximum significance of \(6.1\sigma \) is attained, a 24% improvement. The significance obtained for Method 2 is shown in Table 3 (third column). Overall, this method improves the significance by about 33% in comparison with that obtained by selecting at least 5j with \(E^{miss}_T> 20\) GeV, as in Table 2.
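The difference between the two jet selections can be sketched on a single toy event. The code below builds jet four-vectors from \((p_T, \eta , \phi , m)\), selects four jets according to Method 1 (leading \(p_T\)) and Method 2 (drop the jet with the largest \(\eta \), then take the leading \(p_T\) jets), and computes \(m_{4j}\). The `Jet` type, the function names and the toy jet kinematics are all illustrative assumptions; this is not the analysis code.

```python
import math
from typing import List, NamedTuple


class Jet(NamedTuple):
    pt: float          # transverse momentum in GeV
    eta: float         # pseudo-rapidity
    phi: float         # azimuthal angle in radians
    mass: float = 0.0  # jet mass in GeV


def invariant_mass(jets: List[Jet]) -> float:
    """Invariant mass of the summed jet four-vectors."""
    e = px = py = pz = 0.0
    for j in jets:
        jx = j.pt * math.cos(j.phi)
        jy = j.pt * math.sin(j.phi)
        jz = j.pt * math.sinh(j.eta)
        e += math.sqrt(jx * jx + jy * jy + jz * jz + j.mass ** 2)
        px, py, pz = px + jx, py + jy, pz + jz
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def select_method1(jets: List[Jet]) -> List[Jet]:
    """Method 1: the four leading pT-ordered jets."""
    return sorted(jets, key=lambda j: j.pt, reverse=True)[:4]


def select_method2(jets: List[Jet]) -> List[Jet]:
    """Method 2: remove the jet with the largest eta, then take four leading pT."""
    forward = max(jets, key=lambda j: j.eta)
    rest = [j for j in jets if j is not forward]
    return sorted(rest, key=lambda j: j.pt, reverse=True)[:4]


# Toy five-jet event (NOT simulated data): the last jet is hard and forward.
event = [
    Jet(90.0, 0.5, 0.2), Jet(70.0, -0.3, 2.5), Jet(60.0, 1.0, -1.2),
    Jet(40.0, 0.1, -2.8), Jet(80.0, 3.5, 1.0),
]
m1, m2 = select_method1(event), select_method2(event)
print(f"Method 1: m_4j = {invariant_mass(m1):.1f} GeV")
print(f"Method 2: m_4j = {invariant_mass(m2):.1f} GeV")
```

In this toy event the forward jet has the second-highest \(p_T\), so Method 1 picks it up in place of a softer central jet while Method 2 excludes it, which illustrates the contamination discussed for Method 1.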

Fig. 6

Invariant mass distribution \(m_{4j}\) of the trained signal and evaluated background samples using the BDTG, DNN and LD methods

Fig. 7

Comparison of \(m_{4j}\) (signal only) for three different masses: \(m_{H} = 250, 270\) and 300 GeV following a Method 1 (Sect. 4.1), b Method 2 (Sect. 4.2) and c DNN method (Sect. 4.3)

4.3 Method 3: machine learning technique

Although Method 2 yields a higher significance of about \(6\sigma \), which shows the efficacy of this approach for reconstructing \(m_H\), we also analyse the event samples using a high-level machine learning technique as Method 3 and compare the significance. For our analysis we employ the Toolkit for Multivariate Data Analysis (TMVA) package [43], in which all multivariate methods are supervised-learning methods, i.e., the input information is mapped in feature space to the desired outputs.

To start with, the four-momentum information of the jets from the signal and background event samples is used to construct low-level observables such as the jet transverse momentum \(p_{T_j}\), pseudo-rapidity \(\eta _j\), azimuthal angle \(\phi _j\), energy \(E_j\) and mass \(m_j\). The signal samples with these observables are split into two equal parts, for training and testing respectively, to reconstruct \(m_{4j}\). We include three different analysis routines: Boosted Decision Trees with gradient boosting (BDTG), Deep Neural Network (DNN) and Linear Discriminant (LD). The details of all three procedures are documented in Ref. [43]. All background samples are evaluated with the default parameters of the TMVA regression application for the BDTG, DNN and LD routines. The combined outputs are shown in Fig. 6. The default parameters are later tested and tuned to give the maximum significance with a target mass of 270 GeV. In Table 3, the significances obtained with all three analysis techniques are presented. All three routines give a maximum significance of \(\sim 5\sigma \), which is slightly lower than that of Method 2 and similar to that of Method 1, although the improvements after the \(E^{miss}_T > 20\) GeV requirement are larger than for Method 1. Among the three routines, the DNN performs best, with a maximum significance of \(4.9\sigma \) in \(m_{4j} \in [210, 270]\) GeV.

The \(m_{4j}\) distributions in Fig. 6 suggest that the ML algorithms used here concentrate both the signal and the background regions towards the target mass. Nevertheless, the significances are consistent with those of the other two methods, and with the DNN even better than that of Method 1, as shown in Table 3.

Table 4 Same as Table 3 for \(m_H = 250\) and 300 GeV in comparison with \(m_H = 270\) GeV

4.4 Scanning \(m_H\)

Among the three methods, Method 2 (elimination of the forward jet corresponding to the largest \(\eta _j\)) is the most efficient for reconstructing \(m_H\). We therefore use this technique for two further masses, \(m_H =\) 250 and 300 GeV, and compare the significance with that of the benchmark \(m_H = 270\) GeV to understand how other masses affect the sensitivity of the reconstruction method(s). This allows us to investigate such masses at the LHeC with the considered \(\sqrt{s} \approx 1.3\) TeV. For completeness we also analyse and compare the significance with Method 1 and the DNN routine (as the latter gives the highest significance in comparison with BDTG and LD).

In Fig. 7a–c we compare \(m_{4j}\) (signal only) for \(m_H = 250\), 270 and 300 GeV using Method 1, Method 2 and the DNN routine, respectively. In Table 4, the maximum significances obtained using Method 1, Method 2 and the DNN are shown in the same format as Table 3. A comparison with \(m_H = 270\) GeV shows a \(\sim 1\sigma \) difference in significance for both masses. Since the cross-section for \(m_H = 250~(300)\) GeV is higher (lower) than that for \(m_H = 270\) GeV, the enhancement (suppression) in significance is expected.

5 Discussion and summary

Heavy particles commonly arise in BSM physics, and strategies to search for such particles at colliders are very important. This is especially so in the scalar sector, since these particles are responsible for the mass generation of several bosons and fermions in the SM as well as in BSM models. In this article we prescribe mass reconstruction methods for a heavy scalar boson in the mass range \(m_H \in [2 m_h, 2 m_t)\), where H decays to hadronic jets through \(W^\pm \) and is produced through the charged-current process in the LHeC environment.

As a benchmark, we considered a heavy scalar of mass \(m_H = 270\) GeV produced in the CC channel at the LHeC with \(E_e = 60\) GeV and \(E_p = 7\) TeV. We considered the \(H \rightarrow W^+W^-\), \(W^\pm \rightarrow jj\) channel to develop a prescription for mass reconstruction. In doing so we explained the possible methods of selecting the final-state hadronic jets, as the scattered jets in this channel are a source of contamination. Overall, Method 2 gives a significance of about \(6\sigma \) using \(m_{4j}\), which is better than the other two methodologies discussed. We also note that \(E^{miss}_T > 20\) GeV significantly improves the significance only when a proper selection of four hadronic jets is made out of at least five jets. Similarly, the significant results for the mass reconstruction of \(m_H = 250\) and 300 GeV, with \(7\sigma \) and \(5\sigma \) respectively, indicate the potential to discover such heavy scalars at the future LHeC. Accounting for the 2% systematic effect mentioned above, the significance reduces from 6.1\(\sigma \) to 5.4\(\sigma \) for \(m_H = 270\) GeV in Method 2.

Future opportunities: A similar analysis can be performed with \(H \rightarrow ZZ\), \(Z \rightarrow \ell ^+\ell ^-, jj\), in addition to the neutral current channel \(p e^- \rightarrow e^- H j\). These studies can also be carried forward at the HL-LHC and the proposed FCC facilities.