Beyond $M_{t\bar{t}}$: learning to search for a broad $t\bar t$ resonance at the LHC

A resonance peak in the invariant mass spectrum has been the main feature of a particle at collider experiments. However, broad resonances that do not exhibit such a sharp peak are generically predicted in new physics models beyond the Standard Model. Without a peak, how do we discover a broad resonance at colliders? We use machine-learning techniques to explore answers beyond common knowledge. By applying a deep neural network to the case of a $t\bar{t}$ resonance, we learn that the invariant mass $M_{t\bar{t}}$ is still useful, but that additional information from the off-resonance region, angular correlations, $p_T$, and the top-jet mass is also significantly important. As a result, the improved LHC sensitivities do not depend strongly on the width. The results may also imply that the additional information can be used to improve narrow-resonance searches too. We also detail how we assess the machine-learned information.


I. INTRODUCTION
Discovering new physics through a new resonance is one of the most exciting opportunities. A "narrow" resonance peak, being sharply localized in the energy spectrum, allows for the most efficient discovery above continuum backgrounds as well as for precision measurements of the particle mass, width and other properties. However, the widths of new (and presumably heavier) resonances in new physics can easily be much larger than those of the Standard Model (SM) particles. The width generally grows with the mass of a resonance, and a new strong coupling may induce rapid decays, as in composite Higgs models [1][2][3][4] or warped extra-dimensional models [5,6]. In addition, more decay channels to lighter beyond-SM particles may open up, which further increases the width.
The large width causes several difficulties in collider experiments. Above all, without a sharp peak, discovery becomes challenging, as the signal is spread over a large range of energy above continuum backgrounds. For example, the ATLAS result based mostly on the invariant mass distribution [7] shows that for an $M = 1$ TeV Kaluza-Klein gluon, the measured (expected) upper limit on the cross section $\sigma(pp \to g_{KK} \to t\bar{t})$ increases from 1.4 (1.2) pb to 4.7 (2.7) pb when the width-to-mass ratio $\Gamma/M$ varies from 10% to 40%. In addition, the phenomenological study in Ref. [4] shows that for the minimal composite Higgs model with the third-generation left-handed quark doublet $q_L = (t_L, b_L)^T$ being fully composite, a vector $t\bar{t}$ resonance as light as $M = 1$ TeV is still allowed by the direct search in the $\Gamma/M \gtrsim 20\%$ region.
Secondly, a broad resonance shape is more susceptible to the energy dependence of the parton luminosity and the width, to interference with backgrounds or other resonances, and to mixing and overlap with nearby resonances. These effects make discoveries even more challenging and complicated. In particular, the complex interference (the one with imaginary parts in the amplitudes) in supersymmetric or two-Higgs-doublet models can make broad heavy Higgs bosons decaying to $t\bar{t}$ generally appear not as a pure resonance peak [8][9][10][11] but even as a pure dip or nothing [11]. In addition, nearly degenerate heavy Higgs bosons can overlap significantly, producing complicated resonance shapes [12][13][14].
* sunghoonj@snu.ac.kr † dongsub93@snu.ac.kr ‡ kpxie@snu.ac.kr
Many of these new broad resonances are just beyond the current reach of the LHC. Thus, it is imperative to study the physics of broad resonances and develop efficient discovery methods. However, broad-resonance searches have been studied only in limited cases, e.g., phenomenologically in Refs. [4] (third-generation quark pair, $\ell^+\ell^-$), [5] ($\mu^+\mu^-$), and experimentally in Refs. [7,15] ($t\bar{t}$), [16,17] ($jj$), [18] ($\ell^+\ell^-$). In all these cases, the invariant mass has still been used as the main observable, but the question of "how do we (best) discover a broad resonance without a peak?" has not been answered thoroughly$^1$.
This question may be well suited to deep neural network (DNN) techniques, because the answer is not obvious a priori and even small improvements will be significant. Machine learning has indeed been applied to various problems in particle physics. For example, bump-hunting resonance searches were improved with DNNs [20,21]. The DNN is one class of machine learning algorithms. Coming with various network structures such as fully-connected networks [22][23][24][25][26][27], convolutional neural networks [28][29][30][31] and others [32][33][34][35][36], DNNs have shown remarkable performance in the exploration of physics beyond the SM, often better than other machine learning algorithms such as boosted decision trees (BDT). We refer to Refs. [37,38] and references therein for reviews of DNN applications in LHC physics.
In this paper, we consider a spin-1 broad $t\bar{t}$ resonance at the LHC (Sec. II). Being the heaviest particle in the SM, the top quark has been regarded as an important portal to new physics.
As a first step toward a more general study of broad resonances, we ignore any interference effects and nearby resonances (Sec. III). We use a fully-connected DNN to explore answers beyond common knowledge (Sec. III). Finally, we assess whether and what the DNN can learn, even beyond what we know well (Sec. IV).

II. BENCHMARK MODEL
For simplicity, we consider a gauge-singlet vector resonance $\rho$ interacting strongly with the SM right-handed top quark $t_R$. The relevant Lagrangian is given in Eq. (1), where $\rho_{\mu\nu} = \partial_\mu \rho_\nu - \partial_\nu \rho_\mu$ and $g_1$ is the SM hypercharge gauge coupling. This model is also considered in Refs. [2,19,39]. Note that $\rho_\mu$ mixes with the SM gauge field $B_\mu$ in Eq. (1). Given $g_\rho \gg g_1$, the mixing angle is $\sin\theta \approx g_1/g_\rho$ before electroweak symmetry breaking (EWSB). Therefore, after transforming to the mass eigenstates, the couplings of the $\rho$ resonance to SM fermions are $\sim g_\rho$ for $t_R$ and $\sim Y g_1^2/g_\rho$ for the other fermions (including $t_L$ and the light quarks), with $Y$ being the hypercharge of the corresponding fermion. The physical mass of $\rho$ is $M_\rho = m_\rho$. EWSB gives $\mathcal{O}(v^2/m_\rho^2)$ corrections to the above picture; the details can be found in Appendix C of Ref. [19].
Due to the large coupling $g_\rho$, the $\rho$ resonance decays to $t\bar{t}$ with a branching ratio $\sim 100\%$, and the width-to-mass ratio is
$$\frac{\Gamma_{\rho\to t\bar{t}}}{M_\rho} \simeq \frac{g_\rho^2}{8\pi} \qquad (2)$$
for $M_\rho \gg m_t$. For $g_\rho = 3$ and 4, this ratio reaches 36% and 64%, respectively. Thus a broad $\rho$ is easily realized in the model described by Eq. (1). Note that $\Gamma_{\rho\to t\bar{t}} \leq \Gamma_\rho$: if $\rho$ has other strong dynamical decay channels, such as decays to low-mass top partners (which are not included in our simplified model), $\Gamma_\rho$ is typically several times larger than $\Gamma_{\rho\to t\bar{t}}$, so a large $\Gamma_\rho/M_\rho$ can be obtained even for smaller $g_\rho$.
We consider $M_\rho = 1$ and 5 TeV as two benchmarks, and for each mass point the width-to-mass ratios $\Gamma_\rho/M_\rho = 10\%$, 20%, 30% and 40% are considered. The corresponding benchmark cases are labeled M$i$Γ$j$, with $i = 1$ or 5 denoting the mass (in units of TeV) and $j = 1, 2, 3, 4$ being $10 \times \Gamma_\rho/M_\rho$. For example, M1Γ4 is the benchmark for $M_\rho = 1$ TeV and $\Gamma_\rho/M_\rho = 40\%$.
At the LHC, the $\rho$ resonance can be produced via the Drell-Yan process ($q\bar{q} \to \rho$) through the $\rho$-light-quark interaction. Among the various decay channels of the $t\bar{t}$ pair, we choose to focus on the semi-leptonic final state
$$pp \to \rho \to t\bar{t} \to \ell^\pm \nu b\bar{b} jj. \qquad (3)$$
The dominant background is then the SM $t\bar{t}$ process, which contributes 81% to 88% of the total background [7]. For simplicity, we only consider this background. It should be emphasized that although we provide a benchmark model as physical motivation here, our results are general for all heavy singlet spin-1 resonances with a top-quark portal.
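As a quick numerical cross-check of the quoted widths, the sketch below assumes the massless-top limit $\Gamma/M \simeq g_\rho^2/(8\pi)$ for a vector coupled to $t_R$ only, which reproduces the 36% and 64% quoted above. The function names are illustrative, and the label helper simply encodes the M$i$Γ$j$ naming convention.

```python
import math

def width_to_mass_ratio(g_rho):
    """Gamma/M for rho -> t tbar, assuming the massless-top limit g^2 / (8 pi)."""
    return g_rho ** 2 / (8.0 * math.pi)

def benchmark_label(mass_tev, ratio):
    """Label in the MiΓj convention: i = mass in TeV, j = 10 * Gamma/M."""
    return f"M{mass_tev}Γ{round(10 * ratio)}"

for g in (3.0, 4.0):
    print(f"g_rho = {g}: Gamma/M = {width_to_mass_ratio(g):.0%}")

# The eight benchmark cases considered in the text
labels = [benchmark_label(m, r) for m in (1, 5) for r in (0.1, 0.2, 0.3, 0.4)]
print(labels)
```

Top-mass corrections would lower these ratios slightly, more so at $M_\rho = 1$ TeV than at 5 TeV.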

III. SEARCHING FOR A BROAD $t\bar{t}$ RESONANCE
In this section, we describe technical details of our work and show final cross section limits. First, we describe how we parameterize a broad resonance, and how we build learning datasets and train DNN for each benchmark signal case. Then we derive improved cross section upper limits.

A. Breit-Wigner description
We assume a single, isolated broad resonance far away from any other resonances and thresholds, and ignore any interference effects. We then use the following Breit-Wigner description of the propagator of a broad resonance,
$$\frac{1}{s - M_\rho^2 + i M_\rho \Gamma_\rho}, \qquad (4)$$
where the nominal resonance mass $M_\rho$ and the width $\Gamma_\rho$ are fixed constants. The energy dependence of the mass $\hat{M}_\rho(s)$ from the real part of the self-energy correction is higher-order, hence small irrespective of the large width. On the other hand, the energy dependence of the width $\hat{\Gamma}_\rho(s) \propto \sqrt{s}$ from the imaginary part can induce corrections as large as $\sim 100$ (10)% for the broad resonances considered in this paper, $\Gamma_\rho/M_\rho \sim 40$ (20)%. However, within this range of the width, the resonance shape remains relatively undistorted, albeit with some shifts of the peak position and height [4,5,40]. Also, the fixed mass and width have been used in LHC searches for broad resonances [7,15]. Thus, we use Eq. (4) with fixed $M_\rho$ and $\Gamma_\rho$, both for simplicity and for comparison purposes.
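To illustrate how a fixed-width Breit-Wigner flattens as the width grows, the following sketch evaluates the squared modulus of the standard propagator $1/(s - M^2 + iM\Gamma)$. The function name and the 200 GeV offset are illustrative choices, and parton luminosities are deliberately not included.

```python
def breit_wigner(sqrt_s, mass, width):
    """|1 / (s - M^2 + i M Gamma)|^2 for a fixed-width Breit-Wigner propagator."""
    s = sqrt_s ** 2
    return 1.0 / ((s - mass ** 2) ** 2 + (mass * width) ** 2)

mass = 1000.0  # GeV
fractions = []
for ratio in (0.1, 0.2, 0.3, 0.4):
    width = ratio * mass
    # Lineshape 200 GeV above the pole, relative to its value at sqrt(s) = M
    frac = breit_wigner(mass + 200.0, mass, width) / breit_wigner(mass, mass, width)
    fractions.append(frac)
    print(f"Gamma/M = {ratio:.0%}: relative height at M + 200 GeV = {frac:.2f}")
```

The relative height away from the pole grows with $\Gamma/M$, which is the sense in which the signal spreads over a large energy range.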

B. Preparing training data
The model described by Eq. (1) is implemented in the Universal FeynRules Output (UFO) format [41]. We generate parton-level events for the signals and the background using the 5-flavor scheme within the MadGraph5_aMC@NLO [42] package. All spin correlations of the final-state $\ell^\pm \nu b\bar{b}jj$ objects are kept. The phase-space integration region is set to $|\sqrt{s} - M_\rho| \leq 15\,\Gamma_\rho$, which is large enough to simulate the full on- and off-shell effects of the $\rho$ resonance. The interference between $pp \to \rho \to t\bar{t}$ and the SM $t\bar{t}$ background is negligible [7] and is thus not considered here. We normalize the SM $t\bar{t}$ cross section to the next-to-next-to-leading order calculation with next-to-next-to-leading logarithmic soft-gluon resummation from the Top++2.0 package [43][44][45][46][47][48]; the K-factor is 1.63. The parton-level events are matched to the +1 jet final state and then interfaced to Pythia 8 [49] and Delphes [50] for parton shower and fast detector simulation. For the detector setup, we mainly use the CMS configuration, with the following modifications: the isolation $\Delta R$ parameters for electrons, muons and jets are set to 0.2, 0.3 and 0.5, respectively. The b-tagging efficiency (and the mis-tag rates for c-jets and light-flavor jets) is set to 0.77 (and 1/6 and 1/134) according to Ref. [51]. We generate $5 \times 10^6$ events for the background and for each signal benchmark.
We define two kinematic regions. The first is the resolved region, in which the decay products of the top quarks (i.e. $\ell^\pm \nu b\bar{b}jj$) are identified as individual objects. This region is defined as follows [...] $> 300$ GeV and $|\eta_{j_{\rm top}}| < 2.0$, and satisfies $\Delta\phi(j_{\rm top}, \ell^\pm) > 2.3$. The top-jet is reconstructed with an $R = 1.0$ cone in the anti-$k_t$ algorithm, and is trimmed with $R_{\rm cut} = 0.2$ and $f_{\rm cut} = 0.05$ [52]. We use a simplified top-tagging procedure in the event selection. The top-tagging efficiency and the mistag rate are set to 80% and 20%, respectively, based on Ref. [53], which makes use of the jet invariant mass and N-subjettiness [54][55][56][57][58][59].
4. Exactly one selected jet with $p_T^{j_{\rm sel}} > 25$ GeV and $|\eta_{j_{\rm sel}}| < 2.5$. In addition, the selected jet should have $\Delta R(j_{\rm sel}, j_{\rm top}) > 1.5$ and $\Delta R(j_{\rm sel}, \ell) < 1.5$.
The cuts here are again mainly based on Ref. [7], and the cut flows for the signals and background are listed in Table II. In this region, we consider both the $M_\rho = 1$ and 5 TeV signals. To increase the generation efficiency of the background events, in this region we require the SM $pp \to t\bar{t} \to \ell^\pm \nu b\bar{b}jj$ process to have at least one final-state parton (including the b-partons) with $p_T > 150$ GeV. This is done by setting xptj = 150 in MadGraph5_aMC@NLO. We have checked that this setup does not lose generality, while improving the event generation efficiency by a factor of $\sim 6$. The background cross section after cuts is 2.88 pb, taking the K-factor into account.
The events passing the cuts are collected to make training and validation/test datasets. For the resolved region, the final state contains one $\ell^\pm$, $\not\!\!E_T$ and 4 jets, i.e. 6 reconstructed objects in total, and 26 low-level kinematic observables can be used as input features: $E_\ell$, $p_T^\ell$, $\eta_\ell$ and $\phi_\ell$ from the charged lepton; $\not\!\!E_T$ and $\phi_{\not E_T}$ from the missing transverse momentum; and $E_{j_i}$, $p_T^{j_i}$, $\eta_{j_i}$, $\phi_{j_i}$ and $b_{j_i}$ from the 4 leading jets, with $i = 1, 2, 3, 4$. Here $b_j$ is the b-tagging observable, which is 1 for a b-tagged jet and 0 otherwise. Some examples of the distributions of the low-level observables are shown in Fig. 1(a). For each benchmark case (i.e. M1Γ1 to M1Γ4), we build a training dataset and a validation/test dataset. Each of the two datasets has 1,000,000 events, with nearly equal numbers of signal and background events.
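The counting of the 26 resolved-region inputs can be made explicit with a short sketch. The feature names and the dictionary-based event format are hypothetical, chosen only for illustration.

```python
def resolved_feature_names(n_jets=4):
    """Names of the low-level inputs in the resolved region (naming is illustrative)."""
    names = ["E_lep", "pT_lep", "eta_lep", "phi_lep"]    # charged lepton
    names += ["MET", "phi_MET"]                           # missing transverse momentum
    for i in range(1, n_jets + 1):                        # 4 leading jets, with b-tag flag
        names += [f"E_j{i}", f"pT_j{i}", f"eta_j{i}", f"phi_j{i}", f"btag_j{i}"]
    return names

def event_to_vector(event, names):
    """Flatten one event (a dict keyed by feature name) into an ordered input vector."""
    return [float(event[name]) for name in names]

names = resolved_feature_names()
toy_event = {name: 0.0 for name in names}   # placeholder values, not real data
toy_event["btag_j1"] = 1                    # 1 for a b-tagged jet, 0 otherwise
vector = event_to_vector(toy_event, names)
print(len(names), len(vector))
```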
For the boosted region, one $\ell^\pm$, $\not\!\!E_T$, one top-jet and one selected jet, i.e. 4 objects in total, are reconstructed, and we can extract 15 low-level observables as input features: the first 6 are from $\ell$ and $\not\!\!E_T$, the same as in the resolved region; the other 9 consist of $E_{j_{\rm sel}}$, [...]. Some examples of the distributions of the low-level observables are illustrated in Fig. 1(b). For each benchmark case (i.e. M1Γ1 to M1Γ4, and M5Γ1 to M5Γ4), we randomly mix equal numbers of signal and background events to get 800,000 events for training and another 800,000 events for validation/test.
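A minimal sketch of the balanced mixing described above, assuming events are held in plain Python lists. The helper name `build_balanced_datasets`, the label convention (1 for signal, 0 for background) and the toy inputs are illustrative and not taken from the analysis code.

```python
import random

def build_balanced_datasets(signal, background, n_per_set, seed=0):
    """Return (train, test), each with n_per_set labeled events, half signal and half background."""
    rng = random.Random(seed)
    half = n_per_set // 2
    sig, bkg = list(signal), list(background)
    if len(sig) < 2 * half or len(bkg) < 2 * half:
        raise ValueError("not enough events for two disjoint balanced sets")
    rng.shuffle(sig)
    rng.shuffle(bkg)
    # Disjoint slices, so no event is shared between training and validation/test
    train = [(x, 1) for x in sig[:half]] + [(x, 0) for x in bkg[:half]]
    test = [(x, 1) for x in sig[half:2 * half]] + [(x, 0) for x in bkg[half:2 * half]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy stand-ins for events: integers instead of feature vectors
train, test = build_balanced_datasets(range(100), range(100, 200), n_per_set=80)
print(len(train), sum(label for _, label in train))
```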

C. Training the DNN
The DNN classifier is implemented using the Keras [60] package (with TensorFlow [61] as the backend). The architecture of the DNN is
$$\text{input layer } (N_{\rm in}) \;\to\; \text{hidden layers } (N_{\rm hidden} \times N_{\rm node}) \;\to\; \text{output layer } (2), \qquad (5)$$
where $N_{\rm hidden}$ and $N_{\rm node}$ are the number of hidden layers and the number of neurons per hidden layer, respectively. The number of input features is $N_{\rm in} = 26$ (15) for the resolved (boosted) region. All the input features are rescaled to have mean 0 and standard deviation 1 before training. We label the events with column matrices to match the two neurons in the output layer:
$$y_{\rm signal} = (0, 1)^T, \qquad y_{\rm background} = (1, 0)^T. \qquad (6)$$
The Rectified Linear Unit (ReLU) activation function is used for all the hidden layers, while the softmax activation function is adopted for the output layer. The loss function is the categorical cross-entropy, and the optimizer is Adam. To find the best configuration of the DNN, we try the combinations of hyper-parameters listed in Eq. (7), where $L_r$ is the initial learning rate, $D_r$ is the dropout rate, and $B_s$ is the batch size. For each benchmark case, there are in total 48 different DNN configurations, among which we select the best one based on the learning curves, with the following criteria:
1. If the validation/test accuracy curve reaches its maximum where it crosses the training accuracy curve, and meanwhile the validation/test loss curve reaches its minimum and crosses the training loss curve, we select that configuration and stop the training at that epoch. This early stopping prevents over-fitting.
2. If more than one configuration shows the behavior described above, we select the one with the higher validation/test accuracy and lower validation/test loss; if more than one network still remains, we choose the one whose learning curves fluctuate less.
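The first criterion can be approximated programmatically. The sketch below is a simplified proxy, not the procedure used for the selection: it takes the epoch of minimum validation loss and checks whether the training and validation loss curves cross in its immediate neighborhood. The helper name and the toy learning curves are hypothetical.

```python
def pick_stop_epoch(train_loss, val_loss):
    """Return (epoch, crosses): the 0-based epoch of minimum validation loss and
    whether the validation and training loss curves cross around it."""
    best = min(range(len(val_loss)), key=lambda e: val_loss[e])
    lo, hi = max(best - 1, 0), min(best + 1, len(val_loss) - 1)
    gap_lo = val_loss[lo] - train_loss[lo]
    gap_hi = val_loss[hi] - train_loss[hi]
    # A sign change of (validation - training) between the neighbors signals a crossing
    crosses = gap_lo * gap_hi <= 0
    return best, crosses

# Toy curves: validation loss bottoms out while training loss keeps decreasing
train_loss = [0.70, 0.60, 0.52, 0.46, 0.41, 0.37]
val_loss = [0.65, 0.57, 0.50, 0.47, 0.48, 0.50]
print(pick_stop_epoch(train_loss, val_loss))
```

In practice the accuracy curves would be checked in the same way, and the second criterion would compare the surviving configurations by validation accuracy, loss and fluctuation.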
The details of the training and the chosen configurations are listed in Tables III and IV of the Appendix. For the $M_\rho = 1$ TeV models, the DNN reaches a classification accuracy of 80% in the resolved region and 65% in the boosted region, while for the $M_\rho = 5$ TeV case the accuracy is 76% in the boosted region.
The softmax activation function for the output layer guarantees that the output responses of the 0th neuron ($r_0$) and the 1st neuron ($r_1$) satisfy
$$r_0 + r_1 = 1. \qquad (8)$$
Therefore, we can consider $r_1$ only, and denote it as $r$.
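The following pure-Python sketch mirrors the structure described above (standardized inputs, ReLU hidden layers, two-neuron softmax output) and shows that the two outputs sum to one. It is not the Keras implementation used in the analysis: the random weights, the toy values of `n_hidden` and `n_node`, and all function names are assumptions made for illustration.

```python
import math
import random

def standardize(column):
    """Rescale one input feature to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column)) or 1.0
    return [(x - mean) / std for x in column]

def dense(x, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    shift = max(v)                      # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, layers):
    """ReLU hidden layers followed by a two-neuron softmax output."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(dense(x, w, b))
    return softmax(dense(x, w_out, b_out))

def random_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(1)
n_in, n_hidden, n_node = 26, 2, 8       # n_hidden and n_node are toy values
sizes = [n_in] + [n_node] * n_hidden + [2]
layers = [random_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
r0, r1 = forward(x, layers)
r = r1                                  # the single observable used downstream
print(r0 + r1)
```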
Due to the label definition in Eq. (6), if the DNN is well trained, the distribution of $r$ should peak around 1.0 (0.0) for the signal (background), for both the training data and the validation/test data. Figure 2 shows the distributions for the validation/test data for the benchmark cases with $\Gamma_\rho/M_\rho = 40\%$ as an illustration. The DNN for M1Γ4 performs worse in the boosted region than in the resolved region, because the two peaks in the neuron output from the signal and the SM background are not well separated. In fact, this is a generic feature of all $M_\rho = 1$ TeV benchmark cases. It is mainly due to the boosted-region cuts, which require a top-jet with $p_T^{j_{\rm top}} > 300$ GeV. As a result, most of the SM $t\bar{t}$ background events lie around this value. However, for an $M_\rho = 1$ TeV resonance, its decay products $t/\bar{t}$ acquire a transverse momentum of $\sim 500$ GeV, quite similar to the cut threshold. Therefore, the signal and background look similar (see the $p_T^{j_{\rm top}}$ distribution in Fig. 1(b)).
[Figure caption residue: "... Table I and II are applied."]

D. Setting bounds for the signal
We treat the neuron output $r$ as an observable, and fit the shape of its distribution to obtain the upper limit on the cross section of $pp \to \rho \to t\bar{t}$ for a given integrated luminosity. For the $M_\rho = 1$ TeV benchmark cases, we use a binned $\chi^2$ fit, dividing the range $0 < r < 1$ into 50 bins. For the $M_\rho = 5$ TeV benchmarks, as the signal cross sections are expected to be tiny, we use the un-binned fitting method described in Refs. [62,63] to improve the efficiency. In each case, we include the statistical uncertainty and assume a 12% systematic uncertainty for the background. To include the effect of other subdominant backgrounds besides $t\bar{t}$ (i.e. $W$+jets, multi-jet, etc.), we further rescale the background cross section by a factor of 1.23 = 1/0.81 and 1.14 = 1/0.88 for the resolved and boosted regions, respectively. Those factors come from the fact that $t\bar{t}$ contributes 81% (88%) of the total background in the resolved (boosted) region [7]. This simple rescaling could overestimate the final contributions from subdominant backgrounds, and result in somewhat conservative estimates of the cross section bounds.
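A binned fit of this kind can be sketched as follows. The exact $\chi^2$ used in the analysis is not spelled out in the text, so the form below (background-only expectation, statistical plus a 12% per-bin uncorrelated systematic uncertainty, and a $\Delta\chi^2 = 3.84$ threshold for a 95% confidence level limit) is an assumption, as are the toy yields and function names.

```python
import math

def histogram(values, n_bins=50):
    """Count values of r in [0, 1] in n_bins equal-width bins."""
    counts = [0.0] * n_bins
    for r in values:
        counts[min(int(r * n_bins), n_bins - 1)] += 1.0
    return counts

def chi2(mu, signal, background, syst=0.12):
    """Assumed form: sum_i (mu * s_i)^2 / (b_i + (syst * b_i)^2)."""
    return sum((mu * s) ** 2 / (b + (syst * b) ** 2)
               for s, b in zip(signal, background) if b > 0)

def upper_limit(signal, background, syst=0.12, threshold=3.84):
    """Signal strength mu at which chi2(mu) reaches the threshold."""
    return math.sqrt(threshold / chi2(1.0, signal, background, syst))

# Toy expected yields per bin of r (not from the paper)
signal = [1.0, 5.0, 20.0]
ttbar = [1000.0, 100.0, 10.0]
background = [b / 0.81 for b in ttbar]   # resolved-region rescaling 1/0.81

mu_up = upper_limit(signal, background)
print(f"toy upper limit on signal strength: {mu_up:.2f}")
```

Because the assumed $\chi^2$ is quadratic in the signal strength, the limit follows in closed form; a correlated systematic or a profile-likelihood treatment would require a numerical scan instead.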
The signal strength upper limits are derived for the unfolded parton-level cross section $\sigma(pp \to \rho \to t\bar{t})$, which can be compared with the final results in experimental papers, e.g. Refs. [7,15]. Our results are shown in Fig. 3, in which the expected and measured upper limits of Ref. [7] are also plotted for reference, as they use the same final state and similar selection cuts. One can see that the DNN results are rather insensitive to the width of the $\rho$ resonance compared to the traditional approach, achieving better constraints in the large-width region$^2$. For the $M_\rho = 1$ TeV benchmark, the result is obtained from a combined fit of both the resolved and boosted regions. Individually, the resolved and boosted regions yield cross section limits of $\sim 3$ pb and $\sim 1$ pb, respectively. Although the networks in the resolved region have a higher accuracy ($\sim 80\%$) than those in the boosted region ($\sim 65\%$) in Table III, they actually give a worse cross section bound. This is because the boosted cuts remove many background events and hence improve the fitting performance. That is also the reason why we only consider the boosted region for $M_\rho = 5$ TeV: the production rate for such a heavy $\rho$ is so small that we have to use the boosted region to suppress the background. The DNN bounds for the 5 TeV signal benchmarks are comparable to the experimentally measured ones, but still better than the experimentally expected ones.
As the training uses random numbers for the initialization of weights and biases, even for a given DNN configuration the final results differ slightly from run to run. To take this training uncertainty into account, we repeat the training 15 times with the chosen DNN configuration for each benchmark case. For the $M_\rho = 1$ TeV case, the relative fluctuation is small and thus not shown, while for the $M_\rho = 5$ TeV case, the standard deviations of the runs are shown as vertical error bars in Fig. 3.

IV. FIGURING OUT WHAT THE MACHINE HAD LEARNED
In this section, we attempt to assess the information learned by the DNN using three methods, each discussed in its own subsection. As a result, we can figure out not only which information has been learned, but also which information is most important.
$^2$ We also checked that the DNN results are better than those from the more traditionally used BDT.

A. Testing high-level observables
It is important to know whether a DNN has learned well-known, useful but complicated features. In fact, it has been argued that some machine learning methods, such as jet images [30], do not efficiently capture invariant-mass features [29].
Our approach is to train another set of DNNs using additional high-level observables whose features we want to test. By comparing the performance of these new DNNs with that of the original DNNs trained with only low-level observables, we can test whether those particular high-level (i.e. physically motivated) features have been effectively learned$^3$ or not. This "saturation approach" has been widely used in particle physics research [23,64].
To construct high-level observables, we first reconstruct the $t$ and $\bar{t}$. The longitudinal momentum of the neutrino is solved for by requiring the leptonically decaying $W$ to be on-shell, i.e. $M_{\ell\nu} = M_W$. For the resolved region, the assignment of the 4 reconstructed jets is done by minimizing the quantity defined in Eq. (9) over the various jet permutations, where $\sigma_W = 0.1 \times M_W$ and $\sigma_t = 0.1 \times M_t$. For the boosted region, one top quark is identified with the top-jet and the other is reconstructed from the combination of $\ell^\pm \nu j_{\rm sel}$. Once the $t$ and $\bar{t}$ are reconstructed, we define the following 7 high-level observables for the signal $pp \to \rho \to t\bar{t}$:
1. The invariant mass $M_{t\bar{t}}$ of the $t\bar{t}$ system.
2. The polar and azimuthal angles in the Collins-Soper frame [65]. We label the leptonically and hadronically decaying tops with subscripts "tl" and "th", respectively. Hence we have $\cos\theta^{\rm CS}_{tl}$, $\cos\theta^{\rm CS}_{th}$, $\phi^{\rm CS}_{tl}$ and $\phi^{\rm CS}_{th}$, 4 observables in total.
3. The polar angles in the Mustraal frame [66], $\cos\theta^{\rm Mus.}_1$ and $\cos\theta^{\rm Mus.}_2$.
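The on-shell $W$ constraint used in the reconstruction leads to a quadratic equation for the neutrino longitudinal momentum. The sketch below solves it for a massless lepton; the choice of the smaller-$|p_z|$ root and the use of the real part when the discriminant is negative are common conventions assumed here, not choices stated in the text, and the function names are illustrative.

```python
import math

M_W = 80.4  # GeV, approximate W mass

def neutrino_pz_solutions(lep, met_x, met_y, m_w=M_W):
    """Solve M(l, nu) = m_w for the neutrino pz; lep = (px, py, pz) of a massless lepton."""
    px, py, pz = lep
    e_lep = math.sqrt(px * px + py * py + pz * pz)
    pt2_lep = px * px + py * py
    met2 = met_x * met_x + met_y * met_y
    mu = 0.5 * m_w ** 2 + px * met_x + py * met_y
    a = mu * pz / pt2_lep
    disc = a * a - (e_lep ** 2 * met2 - mu * mu) / pt2_lep
    if disc < 0.0:
        # No real solution (e.g. mismeasured missing momentum): keep the real part only
        return (a,)
    root = math.sqrt(disc)
    return (a - root, a + root)

def neutrino_pz(lep, met_x, met_y, m_w=M_W):
    """Pick the solution with the smaller |pz| (an assumed convention)."""
    return min(neutrino_pz_solutions(lep, met_x, met_y, m_w), key=abs)

# Toy kinematics in GeV (not from the paper)
print(neutrino_pz((30.0, 10.0, 40.0), -20.0, 25.0))
```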
The first observable reveals the resonance feature, while the latter 6 observables reflect the spin-1 nature of the $\rho$ resonance. For the boosted region, to take into account the features of the top-jet, we introduce 3 additional high-level observables, i.e.
1. The invariant mass $M_{j_{\rm top}}$ of the top-jet.
Those observables have been shown to be important in identifying the color structure of the hard process [59,67,68]. In our scenario, the signal results from a color-singlet resonance, while the background comes from QCD processes, and the jet mass and N-subjettiness can help to reveal this difference [67]. Moreover, such jet-substructure observables can be more independent of the resonance characteristics and kinematics.
Some distributions of these high-level observables are shown in Fig. 4. Note that the spin correlations as well as the jet-substructure observables are rather insensitive to the width of $\rho$, as expected. For the 5 TeV resonance, the mass peak at $M_{t\bar{t}} \sim 5$ TeV almost disappears for $\Gamma_\rho/M_\rho \gtrsim 10\%$; instead, there is a peak at $\sim 1$ TeV, due to the parton-distribution support of off-shell effects and the hard $p_T$ cuts. Most identified top-jets in both signal and background originate correctly from a top quark, so the differences in the distributions of $M_{j_{\rm top}}$ and $\tau_{32}$ come from the color structure of the hard process. For example, the background's $M_{j_{\rm top}}$ distribution is slightly broader and its $\tau_{32}$ is slightly larger than the signal's. This is because the top-jets from QCD $t\bar{t}$ are color-connected with the initial state and consequently have more radiation.
Using these "all observables" (i.e. the union of low- and high-level observables) as inputs, we train a new set of DNNs; the best network configurations are again surveyed and detailed in Tables III and IV of the Appendix. We compare the performance of the original and new DNNs using receiver operating characteristic (ROC) curves, with the area under the curve (AUC) as the performance metric. Some of the comparisons are shown in Fig. 5. First, in the resolved region, shown in the top panel, we find that adding high-level observables changes the ROC curves only slightly: not only the AUC but also the background efficiencies change little. This means that including the high-level observables does not improve the accuracy; the original DNN had already learned those high-level features successfully from the low-level inputs.
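For reference, the AUC used in such a comparison can be computed directly from the classifier outputs as the probability that a randomly chosen signal event scores higher than a randomly chosen background event. The sketch below uses this pairwise definition, which is quadratic in the number of events and therefore only suited to small samples; the toy scores are placeholders.

```python
def roc_auc(signal_scores, background_scores):
    """AUC as P(score_signal > score_background), counting ties as one half."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))

# Toy outputs r for two hypothetical networks evaluated on the same events
sig_low, bkg_low = [0.9, 0.4], [0.6, 0.1]       # "low-level only" network
sig_all, bkg_all = [0.9, 0.7], [0.6, 0.1]       # "all observables" network
print(roc_auc(sig_low, bkg_low), roc_auc(sig_all, bkg_all))
```

A saturation test then amounts to checking whether the second AUC is significantly larger than the first.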
In the boosted region, while $M_{t\bar{t}}$, $M_{j_{\rm top}}$ and the spin correlations can be derived from the four-momenta of the reconstructed objects, the N-subjettiness cannot be inferred from the low-level inputs. Therefore, adding high-level features can bring improvements. As shown by the ROC curves in the bottom two panels of Fig. 5, the improvement is sizable for M1Γ4 but relatively small for M5Γ4. This may be because the event topology of the M5 boosted cases becomes so simple that many features are more correlated.

B. Ranking input observables by importance
Which information is used most by the DNN in distinguishing a broad resonance from the continuum background? To answer this, we attempt to identify which connections between neurons and layers carry the most weight. Following Ref. [69], we define the learning speed of the $j$-th hidden layer as
$$v^{(j)} = \left\| \frac{\partial L_{\rm loss}}{\partial b^{(j)}} \right\|, \qquad (10)$$
where $b^{(j)}$ is the bias vector of the $j$-th hidden layer and $L_{\rm loss}$ is the loss function. As the target of machine learning is to find the global minimum of $L_{\rm loss}$, $v^{(j)}$ approximately reflects the training sensitivity of a specific layer: when training the DNN, the larger the $v^{(j)}$ a layer acquires, the more important it is. We find that for all individual benchmark cases M$i$Γ$j$, the first hidden layer has the highest learning speed, several times larger than that of the other layers. For example, for the M1Γ4 case in the resolved region, the learning speeds are $v^{(1)} = 0.457$, $v^{(2)} = 0.086$, $v^{(3)} = 0.033$, $v^{(4)} = 0.016$ and $v^{(5)} = 0.008$. This means that good features are typically learned most efficiently in the first hidden layer.
For our DNN architecture described in Eq. (5), the weights of the first hidden layer form an $N_{\rm in} \times N_{\rm node}$ matrix, whose elements are denoted $w^{(1)}_{mn}$ with $m = 1, \cdots, N_{\rm in}$ and $n = 1, \cdots, N_{\rm node}$. As all the input features are rescaled to have mean 0 and standard deviation 1, the magnitude of the weight $w^{(1)}_{mn}$ reflects the correlation strength between the $m$-th input and the $n$-th neuron in the first hidden layer. Motivated by this, we further define $W_m$ [Eq. (11)], built from the magnitudes of the weights $w^{(1)}_{mn}$ connected to the $m$-th input, as a measure of the importance of the $m$-th input feature. The normalization $N$ is such that $\sum_{m=1}^{N_{\rm in}} W_m = 1$.
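A weight-based importance measure of this kind can be sketched as below. The precise definition of $W_m$ is an assumption here: the sketch sums the absolute values of the first-layer weights attached to each input and normalizes the result to unit total. The toy weight matrix and feature names are placeholders; in Keras, the kernel of a `Dense` layer has shape (inputs, units), so its rows correspond to input features as assumed.

```python
def input_importance(first_layer_weights):
    """W_m from an N_in x N_node weight matrix (one row per input feature).

    Assumed definition: W_m is proportional to sum_n |w_mn|, normalized so sum_m W_m = 1.
    """
    raw = [sum(abs(w) for w in row) for row in first_layer_weights]
    norm = sum(raw)
    return [r / norm for r in raw]

def rank_inputs(names, first_layer_weights):
    """Input features sorted by decreasing importance."""
    scores = input_importance(first_layer_weights)
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

# Toy 3 x 2 first-layer weights (not trained values)
names = ["M_tt", "phi_lep", "eta_lep"]
weights = [[0.5, -1.5], [0.1, 0.1], [-0.4, 0.4]]
for name, score in rank_inputs(names, weights):
    print(f"{name}: {score:.3f}")
```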
Figure 6 shows the $W_m$ of each input observable from the DNN trained using both low- and high-level observables. Above all, $M_{t\bar{t}}$, which we expected to be less useful for a broad resonance, is still one of the most important observables even when the resonance is broad. This is particularly true for a low-mass broad resonance in the resolved region (upper panel). In the case of a heavy resonance in the boosted region (lower panel), its importance is relatively reduced, partly because some invariant-mass information has already been used in the selection of the boosted region. In such cases, the top-jet mass and transverse momentum, which are somewhat correlated with $M_{t\bar{t}}$ and the width, can significantly complement the search, as shown in the bottom panel. In addition, the invariant mass of the top-jet is another important input feature because it reflects the color-flow difference between signal and background. On the other hand, the N-subjettiness observables again turn out to be relatively less useful.
Remarkably, there is much other useful information, particularly from the angular distributions $\eta_{\ell,j}$ and $\cos\theta^{\rm Mus.}_{1,2}$. From Figs. 1 and 4, we can see that these observables are relatively uncorrelated with the resonance width. We have indeed checked that the cross entropies [70] between these observables and $M_{t\bar{t}}$, which can quantify their correlations, are not so high. As we will see in the next subsection, this information is useful even in the off-shell region away from the resonance, and hence is less correlated with the width. Thus, these features are useful in searches for broad resonances. This may also imply that narrow-resonance searches can be improved by adding off-resonance information, partly because a large fraction of the signal still comes from the low-energy off-resonance region, where the parton-luminosity support is much larger (although buried under larger backgrounds). We leave this for a future study.

C. Planing away $M_{t\bar{t}}$
We have observed that $M_{t\bar{t}}$ is still important, but that there is indeed useful information uncorrelated with it. How much of the discovery capability is attributable to this uncorrelated (whether known or unknown) information? Using the data planing method [29,71], we plane away the features in the invariant mass spectrum: we attach a weight to each event so that the weighted distribution of $M_{t\bar{t}}$ becomes flat for both signal and background. The details of the chosen network configurations and more results are given in Table V of the Appendix. A new set of DNNs trained with such planed data must learn information uncorrelated with $M_{t\bar{t}}$, and the difference between the performance with and without $M_{t\bar{t}}$ offers a quantitative answer to the question "how much information is there beyond the invariant mass?". In practice, to avoid large fluctuations, we use only the $M_{t\bar{t}} \in [0.5, 3]$ TeV region with a 20 GeV bin size for all signal cases. This means that for the 5 TeV signals, we consider only off-resonance events; note that the majority of the signal is from the low-energy region supported by larger parton luminosities.
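The planing step can be sketched with per-event weights equal to the inverse of the population of the event's $M_{t\bar{t}}$ bin, so that every populated bin carries the same weighted content. The 0.5 to 3 TeV range and 20 GeV bins follow the text; the function name, the zero weight for events outside the range, and the toy masses are illustrative assumptions. The weights would be computed separately for signal and background.

```python
def planing_weights(mtt_values, lo=500.0, hi=3000.0, bin_width=20.0):
    """Per-event weights that flatten the M_tt histogram (masses in GeV)."""
    n_bins = int(round((hi - lo) / bin_width))

    def bin_index(m):
        if not (lo <= m < hi):
            return None                      # outside the planed range
        return min(int((m - lo) / bin_width), n_bins - 1)

    counts = [0] * n_bins
    for m in mtt_values:
        i = bin_index(m)
        if i is not None:
            counts[i] += 1

    weights = []
    for m in mtt_values:
        i = bin_index(m)
        # Inverse bin population: each populated bin ends up with total weight 1
        weights.append(0.0 if i is None else 1.0 / counts[i])
    return weights

# Toy M_tt values in GeV (not real events)
mtt = [510.0, 515.0, 519.0, 1000.0, 2990.0, 400.0]
print(planing_weights(mtt))
```

The resulting weights can then be passed as per-sample weights during training.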

After $M_{t\bar{t}}$ is planed away, the classification accuracies drop from 80% to 73% for $M_\rho = 1$ TeV in the resolved region and from 65% to 62% in the boosted region. For the $M_\rho = 5$ TeV cases, the accuracies drop from 76% to 63% in the boosted region. As the accuracies are still significantly higher than a random guess (i.e. 50%), we conclude that the DNNs retain some capability to distinguish signal from background, even though they are blind to $M_{t\bar{t}}$ and most events are from the off-resonance region (for the 5 TeV cases). Clearly, on top of $M_{t\bar{t}}$ and the width, the original DNNs had learned extra information (such as the aforementioned angular correlations).
Indeed, we have checked that the weights $W_m$ for various angular and angular-correlation observables, after planing away $M_{t\bar{t}}$, are relatively high. From Figs. 1 and 4, one can also see that they are largely independent of the width. Helicity conservation (hence, angular correlations) can hold somewhat independently of the invariant mass, as the range of the invariant mass considered is always much larger than the top mass. Thus, we conclude that much of the angular information can come from the off-resonance region, and such off-resonance information (although buried under larger backgrounds) can enhance the discovery power. As a result, as shown in Fig. 3, the final performance is not only improved but also becomes rather insensitive to the resonance width.
A final remark is that there could still be useful information, unknown to us, that is not identified in our analysis.

V. CONCLUSION
We have found that, in an attempt to develop methods to discover broad $t\bar{t}$ resonances, $M_{t\bar{t}}$ is still one of the most important observables, but additional information from both on- and off-resonance regions can significantly enhance the discovery capability. As a result, the cross section upper limits can be improved by $\sim 60\%$ for $\Gamma_\rho/M_\rho \sim 40\%$, and the improved LHC sensitivities do not depend strongly on the width of a resonance. As resonances in new physics beyond the SM are easily broad, our lessons and techniques can be used to search for them efficiently.
The most useful observables turn out to be $M_{t\bar{t}}$ (even for broad resonances), $p_T^{j_{\rm top}}$, $M_{j_{\rm top}}$, angular distributions and color correlations. The usefulness of $M_{t\bar{t}}$ even for broad-resonance searches is not necessarily obvious a priori, and correlated observables such as $p_T^{j_{\rm top}}$ are found to complement it further. Angular information (some of which comes from the off-resonance region) and $M_{j_{\rm top}}$ (which can measure color-flow structures irrespective of the resonance characteristics) are relatively uncorrelated with the width and $M_{t\bar{t}}$, making the improved LHC sensitivities less dependent on the width. Lastly, as we trained using only low-level inputs, our results also show that high-level observables such as $M_{t\bar{t}}$ are effectively learned by the DNN.
We have assessed this machine-learned information in three ways: by explicitly testing those high-level observables, by ranking input (low- and/or high-level) observables using the weights of the network, and by planing away features correlated with $M_{t\bar{t}}$. Notably, after all this, there can still be unknown useful information that is not easily identified in our analysis. Thus, being able to communicate more efficiently with networks will enable better explorations of nature, beyond what we know.

ACKNOWLEDGMENTS
We would like to thank Shawn Jia, Jinmian Li, Hui Luo, Tao Xu, Daneng Yang and Zhao-Huan Yu for discussions and the anonymous referee for useful suggestions. SJ and KPX are supported by Grant Korea NRF 2015R1A4A1042542, NRF 2017R1D1A1B03030820, SJ also by POSCO Science Fellowship, and DL by NRF 0426-20170003, NRF 0409-20190120.

APPENDIX
The selected DNN configurations for $M_\rho = 1$ and 5 TeV are listed in Table III and Table IV, respectively. The selection criteria are described in Section III C. The epochs at which we stop the training are listed in the fourth columns. One can see that for an individual signal benchmark in a given kinematic region, the DNN with low-level observables usually requires more training epochs than the DNN with all observables, if they have the same configuration. That is because the DNN needs more time to learn about the physics of the signal process if no hint is given to it. The classification accuracies (on the validation/test data) of the networks are given in the fifth columns.
Table V shows the accuracies reached by the DNNs before and after planing away the key observable $M_{t\bar{t}}$. The data in the second row, i.e. the accuracies before planing, are taken from the fifth columns of Tables III and IV, while the accuracies after planing, listed in the third row, are obtained by the weighted training described in Section IV C.