Statistics is a tool to quantify how accurate a measurement is. Particle physicists use statistics to determine, for example, how probable it is that the events under study are signal-like in a search for a new particle, how much a new measurement improves on past measurements, how powerful a new analysis method is, and how strongly a certain theoretical prediction is constrained. Statistics is also useful to discriminate signal events from background events and to estimate the numbers of signal and background events properly. Therefore, the basics of statistics are essential for experimentalists.

As an example of how statistics is used in particle physics, let us explain how the cross section of a certain physics process (\(\sigma _{\textrm{phys}}\)) can be measured. As described in Chap. 2, \(\sigma _{\textrm{phys}}\) can be extracted from five experimental observables: the number of observed events (\(N_{\textrm{obs}}\)), the number of estimated background events (\(N_{\textrm{bkgd}}\)), the acceptance of the event selection (A), the detection efficiency (\(\varepsilon \)), and the integrated luminosity (\(L_{\textrm{int}}\)) using

$$\begin{aligned} \sigma _{\textrm{phys}} = \frac{(N_{\textrm{obs}} - N_{\textrm{bkgd}})}{L_{\textrm{int}} {A} \varepsilon }. \end{aligned}$$
(4.1)

\(N_{\textrm{obs}}\) is measured by counting the number of events after the signal event selections. \(N_{\textrm{bkgd}}\) is estimated from data and/or Monte Carlo simulation samples (MC samples). When data are used, it is often estimated from a fit to the distribution of a physics observable such as an invariant mass of particles. The geometrical acceptance and the detection efficiency for the signal events are often determined using a large number of MC samples. Signal and background separation can be improved by selections based on the likelihood method or on multivariate statistical analyses. In the above example, there are five observables; four of them (\(N_{\textrm{bkgd}}\), \(L_{\textrm{int}}\), A, and \(\varepsilon \)) have both statistical and systematic uncertainties, while \(N_{\textrm{obs}}\) has only a statistical uncertainty. A systematic uncertainty is an uncertainty that arises from the methods used in the detector calibrations, the data analyses, etc. In the end, the total uncertainty of the cross-section measurement is determined by propagating and combining the statistical and systematic uncertainties of the five observables.
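As a minimal numerical sketch of Eq. (4.1), the cross section could be computed as follows (none of these numbers come from the text; they are invented for illustration):

```python
# Toy illustration of Eq. (4.1); every number here is hypothetical.
def cross_section(n_obs, n_bkgd, lumi, acceptance, efficiency):
    """sigma_phys = (N_obs - N_bkgd) / (L_int * A * eps)."""
    return (n_obs - n_bkgd) / (lumi * acceptance * efficiency)

# e.g. 1200 observed events, 200 expected background,
# L_int = 100 pb^-1, A = 0.5, eps = 0.8
sigma_phys = cross_section(1200, 200, 100.0, 0.5, 0.8)
print(sigma_phys)  # 25.0 (pb)
```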

All of the above requires a good knowledge of statistics. This chapter describes the basics of statistics: uncertainties, probability distributions, the propagation of uncertainties, basic techniques for fitting a distribution, and the basics of the maximum likelihood method.

4.1 Uncertainty

One can never know the true values of nature, but one believes that such true values exist. This idea comes from a frequentist’s viewpoint and is adopted in many analyses of collider physics. All we can obtain is an estimator of the true value based on the outcome of the experiment, i.e. the measurement. Since perfect measurements can never be performed, the result of a measurement is always represented by a central value and its uncertainty. The central value is often determined by the most probable value or the expectation value of the measurement. The spread of the probability distribution for the estimator of the central value is often used as the uncertainty, which usually consists of two kinds: the statistical uncertainty and the systematic uncertainty. Therefore, the measurement is usually expressed as

$$\begin{aligned} \mathrm{(measurement)} = \mathrm{(central \ value)} \pm \mathrm{(stat. \ uncertainty)} \pm \mathrm{(syst. \ uncertainty)} . \end{aligned}$$
(4.2)

In this section, the basic concepts of the uncertainties are explained using the example of the cross-section measurements shown in Eq. (4.1).

4.1.1 Statistical Uncertainty

The statistical uncertainty arises from stochastic fluctuations of random processes. If an observed event is uncorrelated with events observed in the past and in the future, the statistical uncertainty follows an appropriate probability distribution. \(N_{\textrm{obs}}\) obeys the Poisson distribution (or the normal distribution if \(N_{\textrm{obs}}\) is large enough), which is explained in the next section. The acceptance (A) and efficiency (\(\varepsilon \)) follow the binomial distribution, which is also explained in the next section. The mean and the uncertainty can be obtained from these probability distributions.

4.1.2 Systematic Uncertainty

Suppose charged leptons are selected in measuring the cross section of a certain physics process. We must know the selection efficiency of the charged leptons to extract the cross section (Eq. (4.1)). The event selection efficiency, which includes the efficiencies of all selection stages, is usually estimated from MC samples. Ideally, the efficiency obtained from MC samples is the same as the one in real experimental data (in short, real data). But is this really true? In reality, it is impossible to have a perfect correspondence between MC samples and real data.

The differences must be evaluated, and corrected if needed. Let’s keep using the physics process selected by the requirement on charged leptons as an example, where the selection efficiency is estimated purely from MC samples (defined as \(\varepsilon ^\textrm{MC}\)). We must know the selection efficiency of the leptons in real data, not in the MC samples. We correct the efficiency of the MC simulation if \(\varepsilon ^\textrm{MC}\) is not consistent with the one in real data. This correction can be obtained using a well-controlled and/or well-known physics process (so-called control data), which is collected by an event selection without the lepton requirements. The caveat here is that, when collecting the control data, we cannot use the selection which we want to evaluate. For example, \(Z \rightarrow \ell \ell \), where \(\ell \) represents a charged lepton, can be selected by requiring the invariant mass of two charged tracks to be around 90 GeV. Assuming for brevity that the purity of this control data is high enough, we can extract the efficiency of selecting the lepton for both the MC simulation (\(\varepsilon _{\textrm{cont}}^\textrm{MC}\)) sample and real data (\(\varepsilon _{\textrm{cont}}^\textrm{data}\)).

Once we obtain \(\varepsilon _{\textrm{cont}}^\textrm{data}\) and \(\varepsilon _{\textrm{cont}}^\textrm{MC}\), a so-called Scale Factor (SF) is defined as \(\displaystyle \textrm{SF}=\varepsilon _{\textrm{cont}}^\textrm{data} / \varepsilon _{\textrm{cont}}^\textrm{MC}\). Using \(\textrm{SF}\), the efficiency used in Eq. (4.1) is corrected as \(\displaystyle \varepsilon = \varepsilon ^\textrm{MC} \times \textrm{SF}\). The uncertainty of \(\textrm{SF}\), which usually comes from the limited statistics of the control data, is taken into account as a systematic uncertainty of the cross-section measurement. In addition, if \(\varepsilon ^\textrm{MC}\) is different from \(\varepsilon _{\textrm{cont}}^\textrm{MC}\) (an indication that the lepton selection efficiency depends on the physics process), the difference must also be taken into account as another systematic uncertainty.
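The scale-factor correction can be sketched in a few lines (the efficiency values below are invented purely for illustration):

```python
# Hypothetical efficiencies; in practice these come from MC and control data.
eps_mc = 0.85           # selection efficiency from signal MC
eps_cont_data = 0.80    # lepton efficiency measured in control data
eps_cont_mc = 0.82      # same efficiency evaluated in control-sample MC

sf = eps_cont_data / eps_cont_mc   # SF = eps_cont^data / eps_cont^MC
eps = eps_mc * sf                  # corrected efficiency used in Eq. (4.1)
print(round(sf, 4), round(eps, 4))
```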

One typical example has been shown here. Usually, we need to consider several different types of systematic uncertainties. The sources of systematic uncertainty are, for example, poor understanding of the reconstruction of jets, electrons, muons, and charged tracks, mismodelling of the fitting function used to discriminate the background events from the signal events, imperfection of the theoretical model in the Monte Carlo simulation, and mismeasurement of the luminosity. The study of systematic uncertainties is essential not only to estimate the proper uncertainty of the measurement but also to understand the details of the detector response and the dependence of what we are measuring on other physics parameters. The study of systematic uncertainties sometimes also reveals weaknesses of the analysis procedure.

4.2 Probability Distribution

Let’s assume you roll a die which is a perfect cube. The number n, an integer from 1 to 6, appears randomly, and one can only predict the probability of the number on the die being n, which is denoted by P(n). In this case, the probability distribution is flat: \(P(n)=1/6\).

Now assume that you throw two such dice; the sum of the numbers on the two dice is a random integer from 2 to 12. The probability P(n) that the sum of the numbers on the two dice is n is obtained as a function of n as shown in Table 4.1. We cannot say which n will occur in the next throw, but we know how large the probability is for each n.
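The two-dice probabilities of Table 4.1 can be reproduced by enumerating all 36 equally likely outcomes; a small sketch:

```python
from collections import Counter

# Count each possible sum of two fair dice over all 36 equally likely outcomes.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
P = {n: c / 36 for n, c in counts.items()}   # P(n) for n = 2..12

print(P[7])   # 6/36: the most probable sum
print(P[2])   # 1/36: only (1,1) gives a sum of 2
```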

Similarly, in collider physics, it is impossible to predict what kind of event (physics process) will happen in the next collision. A human being can only know the probability of having a certain event in the next collision. The number of observed events (n) in a fixed number of collisions follows a certain probability distribution (P(n)).

Not only the number of events but also many other observable quantities, such as the energy or angle of a particle created in a collision and the invariant mass reconstructed from particles, follow certain probability distributions. Typical probability distributions often used in particle physics experiments are introduced in the following subsections.

Table 4.1 The probability that the sum of the number on two dice is n

4.2.1 Basics of Probability Distributions

In the example of the dice described previously, the random number n is discrete and P(n) is the probability of obtaining n. For a discrete probability distribution such as that of the dice, the probability of obtaining a value from \(n_j\) to \(n_k\) is given by \(\sum _{i=j}^{k} P(n_i)\), if all possible random numbers are \(n_0, n_1, n_2, \ldots , n_j, \ldots , n_k, \ldots , n_\ell \). If x is continuous, P(x) is a probability density function of x but is often simply called a probability. The probability of finding x in the interval from x to \(x+\textrm{d}x\) is given by \(P(x)\textrm{d}x\).

The probability distribution is normalised to 1, so that the probability is defined to be between 0 and 1. If \(n_i\) is discrete,

$$\begin{aligned} \sum _{i=0}^{k} P(n_i) = 1 \end{aligned}$$
(4.3)

or, if x is continuous,

$$\begin{aligned} \int P(x) \textrm{d} x = 1, \end{aligned}$$
(4.4)

where the integral is over all eligible x.

The expectation value (\(\mu \)) and the standard deviation (\(\sigma \)) are often used as the measured value and its statistical uncertainty, respectively, in particle physics. So the result of a measurement is usually represented as \(\mu \pm \sigma \).

The expectation value corresponds to the probability-weighted average of all possible outcomes. It is defined as

$$\begin{aligned} \mu = \sum _{i=0}^{k} n_i P(n_i) \end{aligned}$$
(4.5)

if \(n_i\) is discrete, or

$$\begin{aligned} \mu = E[x] = \int x P(x) \textrm{d} x \end{aligned}$$
(4.6)

if x is continuous (the integral is over all eligible x).

The standard deviation indicates how much a certain measurement varies statistically around the average. The square of the standard deviation is called the variance and is defined as

$$\begin{aligned} \sigma ^2 = \sum _{i=0}^{k} (n_i - \mu )^2 P(n_i) \end{aligned}$$
(4.7)

if \(n_i\) is discrete, or

$$\begin{aligned} \sigma ^2 = E[(x-E[x])^2] = \int (x-\mu )^2 P(x) \textrm{d} x = E[x^2] - (E[x])^2 \end{aligned}$$
(4.8)

if x is continuous (the integral is over all eligible x).

4.2.2 Binomial Distribution

When a particle passes through a certain readout channel of the detector, it usually provides a “hit" signal but occasionally a “miss" signal. The reason for a “miss" may be, for example, geometrical dead regions, the dead time of the readout, or an unexpectedly small gain of the charge from the interaction between the particle and the detector material. The probability of “hit" or “miss" is given by the binomial distribution. Let’s assume that N particles pass through the detector. The probability of n “hits" and \((N-n)\) “misses" in N trials is given by

$$\begin{aligned} P(n) =\frac{N!}{n!(N-n)!} p^{n}(1-p)^{N-n} \end{aligned}$$
(4.9)

where p and \((1-p)\) are the probabilities of “hit" and “miss" in a single trial, respectively. The sum of P(n) from \(n=0\) to \(n=N\) is given by the binomial expansion of \([(1-p) + p]^N=1\). This means that the sum of Eq. (4.9) is normalised to 1. Figure 4.1 shows the distributions for \(N=20\) with \(p=0.1\), 0.5, and 0.8.

Fig. 4.1
figure 1

The Binomial distributions represented by Eq. (4.9) for N=20 with \(p=0.1\), 0.5, and 0.8

The expectation value (\(\mu \)) and variance (\(\sigma ^2\)) can be extracted by substituting Eq. (4.9) into Eqs. (4.5) and (4.7):

$$\begin{aligned} \mu= & {} Np, \end{aligned}$$
(4.10)
$$\begin{aligned} \sigma ^2= & {} Np(1-p). \end{aligned}$$
(4.11)

The detailed calculation can be found in Appendix A.1. When N is larger than a few tens and p is small, such as \(p \le 0.1\), the binomial distribution is approximated by the Poisson distribution with \(\mu \sim Np\). In contrast, for large N and moderate p, such as \(p \sim 0.5\), the binomial distribution is approximated by the normal distribution with the mean and variance given by Eqs. (4.10) and (4.11). The Poisson and normal distributions are explained in the following subsections.
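Equations (4.9)–(4.11) are easy to verify numerically; a sketch using only the standard library:

```python
from math import comb

def binom_pmf(n, N, p):
    """Eq. (4.9): P(n) = N!/(n!(N-n)!) p^n (1-p)^(N-n)."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

N, p = 20, 0.1
pmf = [binom_pmf(n, N, p) for n in range(N + 1)]
mean = sum(n * P for n, P in enumerate(pmf))              # Eq. (4.5)
var = sum((n - mean)**2 * P for n, P in enumerate(pmf))   # Eq. (4.7)
print(sum(pmf), mean, var)  # ~1, Np = 2.0, Np(1-p) = 1.8
```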

4.2.3 Poisson Distribution

When a theory expects \(\mu \) events of a certain physics process for some integrated luminosity at a collider experiment, the probability of having n events obeys the Poisson distribution, expressed as

$$\begin{aligned} P(n) = \frac{\mu ^n \textrm{e}^{-\mu }}{n!} \;\; . \end{aligned}$$
(4.12)

Simply put, this distribution can be used in the case where rare events occur. In general, the Poisson distribution describes the probability of n events occurring in a unit interval of time if the events occur with a known average rate \(\mu \) and independently of the time since the last event. Figure 4.2 shows the Poisson distributions for \(\mu =1\), 5, 10, and 20. Using the Maclaurin expansion \(\displaystyle \textrm{e}^x = \sum _n \frac{x^n}{n!}\), it can be shown that the sum of Eq. (4.12) is normalised to 1.

Fig. 4.2
figure 2

The Poisson distributions represented by Eq. (4.12) for \(\mu =\) 1, 5, 10, and 20

By substituting Eq. (4.12) into Eqs. (4.5) and (4.7), both the expectation value and the variance of the Poisson distribution are expressed with the single parameter \(\mu \) (see Appendix A.2):

$$\begin{aligned} \mu= & {} \mu , \end{aligned}$$
(4.13)
$$\begin{aligned} \sigma ^2= & {} \mu . \end{aligned}$$
(4.14)

Then the standard deviation is expressed as \(\sigma = \sqrt{\mu }\). This means that you can use the square root of the number of events as the statistical uncertainty in counting experiments. In fact, the mean and the square of the standard deviation of the distributions in Fig. 4.2 are close to \(\mu \) in Eq. (4.12). But the mean is not exactly the same as the peak position because of the asymmetric shape of the Poisson distribution. If \(\mu \) becomes large, for example larger than around 10, the distribution is relatively symmetric and is approximated by the normal distribution.
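The properties \(\mu = E[n]\) and \(\sigma^2 = \mu\) of Eqs. (4.13)–(4.14) can also be checked numerically (a sketch; the sum is truncated where the terms are negligible):

```python
from math import exp, factorial

def poisson_pmf(n, mu):
    """Eq. (4.12): P(n) = mu^n e^{-mu} / n!."""
    return mu**n * exp(-mu) / factorial(n)

mu = 5.0
pmf = [poisson_pmf(n, mu) for n in range(100)]  # terms beyond n ~ 100 are negligible
mean = sum(n * P for n, P in enumerate(pmf))
var = sum((n - mean)**2 * P for n, P in enumerate(pmf))
print(sum(pmf), mean, var)  # ~1, ~5.0 (= mu), ~5.0 (= mu)
```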

4.2.4 Normal Distribution (Gaussian Distribution)

The normal distribution, very often called the Gaussian distribution, appears commonly in nature and is used not only in particle physics but almost everywhere in science. The Gaussian function is symmetric and continuous. It is expressed by two parameters, \(\mu \) and \(\sigma \),

$$\begin{aligned} P(x) = \frac{1}{\sqrt{2 \pi } \sigma } \exp {\left( - \frac{(x-\mu )^2}{2 \sigma ^2} \right) } \;\; . \end{aligned}$$
(4.15)

The coefficient \(\displaystyle \frac{1}{\sqrt{2\pi }\sigma }\) is a normalisation factor ensuring that the integral of Eq. (4.15) from \(\displaystyle -\infty \) to \(\displaystyle \infty \) is normalised to 1. The Gaussian distributions for \(\mu =100\) and \(\sigma =\) 10, 20, and 30 are shown in Fig. 4.3.

Fig. 4.3
figure 3

The normal distribution represented by Eq. (4.15) for \(\sigma =\) 10, 20, and 30

By substituting Eq. (4.15) into Eqs. (4.6) and (4.8), we find that the parameter \(\mu \) is the expectation value and \(\sigma ^2\) is the variance:

$$\begin{aligned} \int _{-\infty }^{\infty } x \cdot \frac{1}{\sqrt{2 \pi } \sigma } \exp {\left( - \frac{(x-\mu )^2}{2 \sigma ^2} \right) } \textrm{d}x = \mu \end{aligned}$$
(4.16)
$$\begin{aligned} \int _{-\infty }^{\infty } (x-\mu )^2 \cdot \frac{1}{\sqrt{2 \pi } \sigma } \exp {\left( - \frac{(x-\mu )^2}{2 \sigma ^2} \right) } \textrm{d}x = \sigma ^2. \end{aligned}$$
(4.17)

For experimental measurements, the values \(\mu \) and \(\sigma \) are taken from the measured values and the uncertainty.

The integral of the Gaussian distribution in range \(\mu \pm \sigma \) is

$$\begin{aligned} \int _{\mu -\sigma }^{\mu +\sigma } \frac{1}{\sqrt{2 \pi } \sigma } \exp {\left( - \frac{(x-\mu )^2}{2 \sigma ^2} \right) } \textrm{d}x = \textrm{erf}\left( {\frac{1}{\sqrt{2}}}\right) = 0.6827 \end{aligned}$$
(4.18)

where the \(\textrm{erf}(x)\) is called the error function defined by Eq. (4.19)

$$\begin{aligned} \textrm{erf}(x) = \frac{2}{\sqrt{\pi }} \int _0^{x} \textrm{e}^{-t^2} \textrm{d} t \end{aligned}$$
(4.19)

and is shown in Fig. 4.4. Equation (4.18) shows that in a measurement of x, the probability of having \(|x-\mu | \le \sigma \) is about 68%. In other words, the probability of \(|x-\mu | > \sigma \) is \(1-0.6827 = 0.3173\) (about 32%). Several examples of the probability of having \(|x-\mu | > \delta \) are shown in Table 4.2.
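The coverage probabilities of Eq. (4.18) and Table 4.2 can be reproduced with the standard-library error function; a quick sketch:

```python
from math import erf, sqrt

def coverage(k):
    """Probability that |x - mu| <= k*sigma for a Gaussian (Eq. (4.18))."""
    return erf(k / sqrt(2))

print(round(coverage(1), 4))      # 0.6827
print(round(1 - coverage(1), 4))  # 0.3173
print(1 - coverage(5))            # ~5.73e-7: the two-sided 5 sigma probability
```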

Particle physicists use expressions such as “a certain measurement has an excess of 5\(\sigma \) over the background-only hypothesis". This means that the number of observed events is larger than the number of events expected from background alone by 5\(\sigma \); that is, such a measurement would occur with a probability of \(5.73 \times 10^{-7}\) (for both sides) if only background were present. This is very rare, so we call it an “observation" or “discovery". In particle physics, we claim “evidence" and “discovery" of something new for 3\(\sigma \) and 5\(\sigma \) excesses, respectively. The full width at half maximum (FWHM) is also often used as the uncertainty instead of \(\sigma \). It can easily be translated to \(\sigma \) with the relation FWHM=\( 2 \sqrt{2 \ln {2}} \sigma = 2.355 \sigma \).

Fig. 4.4
figure 4

The error function

Table 4.2 The probability outside a certain range expressed in units of \(\sigma \)

4.2.5 Uniform Distribution

The uniform distribution which represents the fixed probability in a certain interval of x is defined as

$$\begin{aligned} P(x) = \left\{ \begin{array}{ll} \frac{1}{b-a} &{} (a \le x \le b) \\ 0 &{} (\textrm{otherwise}). \end{array} \right. \end{aligned}$$
(4.20)

We can calculate the expectation value and variance of the uniform distribution by substituting Eq. (4.20) into Eqs. (4.6) and (4.8):

$$\begin{aligned} \mu = \int x P(x) \textrm{d} x = \int ^{b}_{a} \frac{x}{b-a} \textrm{d} x = \frac{1}{2}(a+b), \end{aligned}$$
(4.21)
$$\begin{aligned} \sigma ^2 = \int (x-\mu )^2 P(x) \textrm{d} x = \int ^{b}_{a} \left\{ x - \frac{1}{2}(a+b) \right\} ^2 \frac{1}{b-a} \textrm{d} x = \frac{1}{12}(b-a)^2. \end{aligned}$$
(4.22)

An important application of the uniform distribution is position measurements. The position where a particle passes through is determined by position-sensitive sensors. Let’s consider a detector with strip-shaped sensors aligned perpendicular to the x axis, which allows you to determine the particle position along x. If a certain sensor with a width d, which has a sensitive area from \(x=a\) to \(x=b\) (\(d=b-a\)), provides a hit signal, the expectation value and uncertainty of the position where the particle passed through are estimated to be \(\displaystyle \mu = \frac{1}{2}(a+b)\) and \(\displaystyle \sigma =\frac{b-a}{\sqrt{12}} = \frac{d}{\sqrt{12}}\), respectively.
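The \(d/\sqrt{12}\) resolution can be checked with a toy Monte Carlo, assuming hit positions are uniform across the strip (a sketch; the strip width is made up):

```python
import math
import random

# Toy MC: particles cross a strip of width d = 1 (arbitrary units)
# at uniformly distributed random positions.
random.seed(42)
d = 1.0
hits = [random.uniform(0.0, d) for _ in range(200_000)]

mean = sum(hits) / len(hits)
sigma = math.sqrt(sum((x - mean)**2 for x in hits) / len(hits))
print(mean, sigma, d / math.sqrt(12))  # ~0.5, ~0.289, 0.2887 (Eq. (4.22))
```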

4.2.6 Breit-Wigner Distribution

The Breit-Wigner distribution is used to express the probability density for the energy of an unstable particle with a mass M and a decay width \(\varGamma \) (and mean lifetime of \(\displaystyle \tau =1/\varGamma \)). The Breit-Wigner distribution is defined as

$$\begin{aligned} \textrm{BW} (x; M, \varGamma ) = \frac{1}{\pi } \frac{\varGamma / 2}{(M - x)^2 + (\varGamma /2 )^2}, \end{aligned}$$
(4.23)

and shown in Fig. 4.5. The expectation value and variance of the Breit-Wigner distribution are not well-defined, since the integrals of Eqs. (4.6) and (4.8) diverge. Instead, the peak position M and the FWHM \(\varGamma \) are used to characterise the distribution.

Fig. 4.5
figure 5

Breit-Wigner distribution for Z boson (\(M=91.2\) GeV, and \(\varGamma = 2.5\) GeV)

4.2.7 Exponential Distribution

The exponential distribution is used to express the probability density of the decay time of an unstable particle with a mean lifetime \(\tau \). The exponential distribution for a continuous variable \(0<x<\infty \) is defined as

$$\begin{aligned} P(x) = \frac{1}{\tau } \mathrm{{e}}^{-\frac{x}{\tau }} \;\; , \end{aligned}$$
(4.24)

using one parameter \(\tau \). The expectation value and variance of x are derived as

$$\begin{aligned} \mu = \int x P(x) \textrm{d} x = \frac{1}{\tau } \int ^{\infty }_{0} x \mathrm{{e}}^{-\frac{x}{\tau }} \textrm{d} x = \tau , \end{aligned}$$
(4.25)
$$\begin{aligned} \sigma ^2 = \int (x-\mu )^2 P(x) \textrm{d} x = \frac{1}{\tau } \int ^{\infty }_{0} (x-\tau )^2 \mathrm{{e}}^{-\frac{x}{\tau }} \textrm{d} x = \tau ^2 . \end{aligned}$$
(4.26)
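A toy Monte Carlo check of Eqs. (4.25)–(4.26), sampling decay times with Python's `random.expovariate` (the value of \(\tau\) is invented):

```python
import random

# Sample decay times from an exponential with mean lifetime tau = 2.0
# (arbitrary units); expovariate takes the rate 1/tau as its argument.
random.seed(1)
tau = 2.0
xs = [random.expovariate(1.0 / tau) for _ in range(200_000)]

mean = sum(xs) / len(xs)                        # expect ~tau   (Eq. (4.25))
var = sum((x - mean)**2 for x in xs) / len(xs)  # expect ~tau^2 (Eq. (4.26))
print(mean, var)
```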

4.2.8 \(\chi ^2\) (Chi-Square) Distribution

In case n observables \(x_i\) independently obey the normal distributions \(\displaystyle N_i(\mu _i, \sigma _i)=\frac{1}{\sqrt{2 \pi } \sigma _i} \exp {\left( - \frac{(x_i-\mu _i)^2}{2 \sigma _i^2} \right) }\), the \(\chi ^2\) value defined as

$$\begin{aligned} \chi ^2 = \sum ^{n}_{i=1} \frac{(x_i - \mu _i)^2}{\sigma _i^2} \end{aligned}$$
(4.27)

is used as a test of a hypothesis, indicating how well the expectation matches the experimental data. If the hypothesis describes nature properly, \((x_i - \mu _i)^2\) is expected to be of the size of the variance of the experiment, i.e. \(\sigma _i^2\). Thus, \(\chi ^2/n\) is expected to be about 1.

If \(\chi ^2/n\) shows significant deviation from unity, either the hypothesis or the estimation of the \(\sigma \)’s is wrong. The probability density function of this \(\chi ^2\) distribution with n degrees of freedom (dof) can be written as

$$\begin{aligned} f(z; n) = \frac{z^{n/2-1} \mathrm{{e}}^{-z/2}}{2^{n/2}\varGamma (n/2)} \ \ \ \ \ \ \ (z>0) , \end{aligned}$$
(4.28)

where \(\varGamma \) is the gamma function. Figure 4.6 shows the \(\chi ^2\) distribution \(f(z; n)\) for dof \(n=\) 1 to 5. For large n, the probability density function of the \(\chi ^2\) distribution approaches the normal distribution with mean \(\mu =n\) and variance \(\sigma ^2 = 2n\).
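A toy check of the \(\chi^2\) construction in Eq. (4.27): sampling n standard Gaussians many times and confirming mean \(\approx n\) and variance \(\approx 2n\) (a sketch; the dof is chosen arbitrarily):

```python
import random

# Build chi^2 values from n = 4 independent standard Gaussian variables.
random.seed(7)
n, trials = 4, 100_000
chi2 = [sum(random.gauss(0.0, 1.0)**2 for _ in range(n)) for _ in range(trials)]

mean = sum(chi2) / trials
var = sum((c - mean)**2 for c in chi2) / trials
print(mean, var)  # ~n = 4 and ~2n = 8
```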

Fig. 4.6
figure 6

\(\chi ^2\) distributions. See the main text for details

4.3 Error Propagation

As the cross section is determined by the values of the five parameters in Eq. (4.1), a physical quantity is often derived from several parameters determined by measurements. Naturally, the uncertainties of the parameters carry over into the physical quantity. Let’s assume a physical quantity u depends on n parameters \(x_i\) (\(i=1, 2, \ldots , n\)); that is, u can be written as a function of the \(x_i\): \(u = f(x_1, x_2, \ldots , x_n) = f({\boldsymbol{x}})\). The expectation values and uncertainties of \({\boldsymbol{x}}\) are known to be \({\boldsymbol{\mu }} = (\mu _1, \mu _2, \ldots , \mu _n) \) and \({\boldsymbol{\sigma }} = (\sigma _1, \sigma _2, \ldots , \sigma _n)\), respectively. A first-order expansion of the function \(f({\boldsymbol{x}})\) around the expectation value \({\boldsymbol{\mu }}\) can be written as

$$\begin{aligned} f({\boldsymbol{x}}) \approx f({\boldsymbol{\mu }}) + \left. \sum _{i=1}^{n} \frac{\partial f ({\boldsymbol{x}})}{ \partial x_i} \right| _{\boldsymbol{x}=\boldsymbol{\mu }} (x_i - \mu _i). \end{aligned}$$
(4.29)

Because \(E[x_i-\mu _i]=0\), the expectation value of u is represented at the first order to be

$$\begin{aligned} E[f({\boldsymbol{x}})] \approx f({\boldsymbol{\mu }}), \end{aligned}$$
(4.30)

and the expectation value of \(u^2\) is

$$\begin{aligned} E[f^2(\boldsymbol{x})]\approx & {} f^2({\boldsymbol{\mu }}) \nonumber \\+ & {} E\left[ \left( \sum _{i=1}^{n} \left. \frac{\partial f ({\boldsymbol{x}})}{ \partial x_i} \right| _{\boldsymbol{x}=\boldsymbol{\mu }}(x_i - \mu _i) \right) \left( \sum _{j=1}^{n} \left. \frac{\partial f ({\boldsymbol{x}})}{ \partial x_j} \right| _{\boldsymbol{x}=\boldsymbol{\mu }}(x_j - \mu _j)\right) \right] \nonumber \\= & {} f^2({\boldsymbol{\mu }}) + \sum _{i=1}^{n} \left( \left. \frac{\partial f ({\boldsymbol{x}})}{ \partial x_i} \right| _{\boldsymbol{x}=\boldsymbol{\mu }} \right) ^2 E[(x_i - \mu _i)^2] \nonumber \\+ & {} \sum _{i \ne j}^{n} \left. \frac{\partial f ({\boldsymbol{x}})}{ \partial x_i} \right| _{\boldsymbol{x}=\boldsymbol{\mu }} \left. \frac{\partial f ({\boldsymbol{x}})}{ \partial x_j} \right| _{\boldsymbol{x}=\boldsymbol{\mu }} E[(x_i - \mu _i)(x_j - \mu _j)]. \end{aligned}$$
(4.31)

In case \(x_i\) and \(x_j\) are not correlated, the third term of Eq. (4.31) is 0. Because \(E[(x_i - \mu _i)^2] = \sigma _i^2\), the variance of the u can be calculated to be

$$\begin{aligned} \sigma _u^2 = E[f^2({\boldsymbol{x}})] - (E[f({\boldsymbol{x}})])^2 \approx \sum _{i=1}^{n} \left( \left. \frac{\partial f ({\boldsymbol{x}})}{ \partial x_i} \right| _{\boldsymbol{x}=\boldsymbol{\mu }} \right) ^2 \sigma _i^2. \end{aligned}$$
(4.32)

Suppose that in the cross-section measurement, \(N_{\textrm{obs}}\), \(N_{\textrm{bkgd}}\), L, A, and \(\varepsilon \) can be measured independently. In this case, the square of the uncertainty of the cross section (\(\sigma ^2_{\sigma _{\textrm{phys}}}\)) can be expressed as

$$\begin{aligned} \sigma ^2_{\sigma _{\textrm{phys}}}= & {} \left( \frac{\partial \sigma _{\textrm{phys}}}{\partial N_{\textrm{obs}}} \right) ^2 \cdot \sigma ^2_{N_{\textrm{obs}}} + \left( \frac{\partial \sigma _{\textrm{phys}}}{\partial N_{\textrm{bkgd}}} \right) ^2 \cdot \sigma ^2_{N_{\textrm{bkgd}}} + \left( \frac{\partial \sigma _{\textrm{phys}}}{\partial L} \right) ^2 \cdot \sigma ^2_{L} \nonumber \\+ & {} \left( \frac{\partial \sigma _{\textrm{phys}}}{\partial A} \right) ^2 \cdot \sigma ^2_{A} + \left( \frac{\partial \sigma _{\textrm{phys}}}{\partial \varepsilon } \right) ^2 \cdot \sigma ^2_{\varepsilon }. \end{aligned}$$
(4.33)

Imagine you measure the particle detection efficiency of a detector. When N particles passed through the detector, the detector gave a “hit” signal \(n_1\) times and a “miss” signal \(n_2\) times (\(N=n_1 + n_2\)). In this case, the detection efficiency is given by \(\displaystyle \varepsilon = \frac{n_1}{n_1 + n_2}\). If \(n_1\) and \(n_2\) are large enough to consider them uncorrelated, and their uncertainties are estimated as \(\sqrt{n_1}\) and \(\sqrt{n_2}\), respectively, you can show the uncertainty of the efficiency, \(\sigma _\varepsilon \), to be \(\displaystyle \sqrt{\frac{\varepsilon (1 - \varepsilon )}{N}}\). A similar discussion can be made for the asymmetry \(\displaystyle A= \frac{n_1 - n_2}{n_1+n_2}\) instead of \(\varepsilon \). Show for yourself the uncertainty of the asymmetry, \(\sigma _{A}\).
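The quoted result \(\sigma_\varepsilon = \sqrt{\varepsilon(1-\varepsilon)/N}\) follows from Eq. (4.32); a numerical sketch with invented counts (the asymmetry case is left as the exercise):

```python
import math

# Check sigma_eps via Eq. (4.32) for eps = n1/(n1+n2), with sigma_n = sqrt(n).
n1, n2 = 800.0, 200.0          # hypothetical hit/miss counts
N = n1 + n2
eps = n1 / N

# Partial derivatives of eps with respect to n1 and n2
d_dn1 = n2 / N**2
d_dn2 = -n1 / N**2

sigma_eps = math.sqrt(d_dn1**2 * n1 + d_dn2**2 * n2)
print(sigma_eps, math.sqrt(eps * (1 - eps) / N))  # both ~0.01265
```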

4.4 Maximum Likelihood Method

Although we can never know the true values of physical quantities, we can estimate them from a set of measurements. Consider that we made n independent measurements and obtained n measured quantities \({\boldsymbol{x}} = (x_1, x_2, ... , x_n)\). Suppose that the measured quantities \({\boldsymbol{x}}\) are distributed according to a probability density function \(f(x_i; {\boldsymbol{\theta }})\) (\(i=1,2,...,n\)), where \({\boldsymbol{\theta }} = (\theta _1, \theta _2, ... , \theta _m)\) are unknown physical quantities. The likelihood function \(L({\boldsymbol{x}};{\boldsymbol{\theta }})\), which is regarded as the probability of having the set of measurements \((x_1, x_2, ... , x_n)\), is defined as

$$\begin{aligned} L({\boldsymbol{x}};{\boldsymbol{\theta }}) = f(x_1; {\boldsymbol{\theta }}) f(x_2; {\boldsymbol{\theta }}) \cdots f(x_n; {\boldsymbol{\theta }}) = \prod ^{n}_{i=1}f(x_i; {\boldsymbol{\theta }}). \end{aligned}$$
(4.34)

If the hypothesis constructing the probability density function \(f(x; {\boldsymbol{\theta }})\) and the parameter values \({\boldsymbol{\theta }}\) are correct, one expects L to be maximal. To estimate the most probable values of \({\boldsymbol{\theta }}\), the maximum likelihood estimators for \({\boldsymbol{\theta }}\) are defined as the values which maximise the likelihood function. As long as the likelihood function is a differentiable function of the parameters \({\boldsymbol{\theta }}\), and the maximum is not at a physical boundary, the estimators are given by solving the simultaneous equations

$$\begin{aligned} \frac{\partial L}{\partial \theta _i} = 0, \ \ \ \textrm{or} \ \ \ \frac{\partial \ln {L}}{\partial \theta _i} = 0, \ \ \ \ \ \ \ i=1, 2, ..., m. \end{aligned}$$
(4.35)

Because of the properties of the logarithm, the maximum log-likelihood estimators, which are equivalent to the maximum likelihood ones, are often used. To distinguish the true values of the physical quantities (\({\boldsymbol{\theta }}=( \theta _1, \theta _2, ... \theta _m)\)) from their estimators, the parameters satisfying Eq. (4.35) are written as \(\hat{\boldsymbol{\theta }} =( \hat{\theta _1}, \hat{\theta _2}, ... \hat{\theta _m})\).

As an example, let’s consider that a variable x, which obeys a Gaussian distribution with unknown \(\mu \) and \(\sigma ^2\), is measured n times. The log-likelihood function is

$$\begin{aligned} \ln {L(\mu , \sigma ^2)}= & {} \sum ^{n}_{i=1} \ln {f(x_i; \mu , \sigma ^2)} = \sum ^{n}_{i=1} \ln {\frac{1}{\sqrt{2 \pi } \sigma } \exp {\left( - \frac{(x_i-\mu )^2}{2 \sigma ^2} \right) }} \nonumber \\= & {} \sum ^{n}_{i=1} \left( -\ln {\sqrt{2 \pi }} - \frac{1}{2} \ln {\sigma ^2} - \frac{(x_i - \mu )^2}{2 \sigma ^2} \right) . \end{aligned}$$
(4.36)

By solving \(\displaystyle \frac{\partial \ln {L}}{\partial \mu } = 0\), \(\hat{\mu }\) is obtained as

$$\begin{aligned} \hat{\mu } = \frac{1}{n} \sum ^{n}_{i=1} x_i . \end{aligned}$$
(4.37)

The expectation value of \(\hat{\mu }\) equals \(\mu \), i.e. \(\hat{\mu }\) is an unbiased estimator for \(\mu \):

$$\begin{aligned} E(\hat{\mu })= & {} \mu . \end{aligned}$$
(4.38)

This calculation can be found in Appendix A.3. Similarly, solving \(\displaystyle \frac{\partial \ln {L}}{\partial \sigma ^2} = 0\) gives \(\hat{\sigma ^2}\)

$$\begin{aligned} \hat{\sigma ^2} = \frac{1}{n} \sum ^{n}_{i=1} (x_i - \mu )^2 = \frac{1}{n} \sum ^{n}_{i=1} (x_i - \hat{\mu })^2 . \end{aligned}$$
(4.39)

Because \(\mu \) is an unknown parameter, \(\hat{\mu }\) is actually used to estimate \(\sigma ^2\). Computing the expectation value of \(\hat{\sigma ^2}\) gives

$$\begin{aligned} E[\hat{\sigma ^2}] = \frac{n-1}{n} \sigma ^2, \end{aligned}$$
(4.40)

which means that the estimator \(\hat{\sigma ^2}\) is biased, because using \(\hat{\mu }\) instead of \(\mu \) reduces the number of dof by 1. Instead of \(\hat{\sigma ^2}\),

$$\begin{aligned} s^2 = \frac{n}{n-1} \hat{\sigma ^2} = \frac{1}{n-1}\sum ^{n}_{i=1} (x_i - \hat{\mu })^2 \end{aligned}$$
(4.41)

may be used as a more correct (unbiased) estimator, but the difference between the two is negligible when n is large enough.
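The estimators of Eqs. (4.37), (4.39), and (4.41) can be tried out on simulated Gaussian data (a sketch; the true values are invented):

```python
import random

# Draw n samples from a Gaussian with (made-up) true mu = 10, sigma = 2.
random.seed(3)
mu_true, sigma_true, n = 10.0, 2.0, 100_000
xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]

mu_hat = sum(xs) / n                               # Eq. (4.37)
sig2_hat = sum((x - mu_hat)**2 for x in xs) / n    # Eq. (4.39), biased
s2 = n / (n - 1) * sig2_hat                        # Eq. (4.41), unbiased
print(mu_hat, sig2_hat, s2)  # ~10, ~4, ~4
```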

4.5 Least Squares Method

Suppose \(y=f(x; \theta _1, \theta _2, \ldots , \theta _m)\) is a function of x and you want to determine the m parameters \({\boldsymbol{\theta }} = (\theta _1, \theta _2, \ldots , \theta _m)\) by measuring \(y_i\) at n points \(x_i\) (\(i=1, 2, \ldots , n\)). When the uncertainties of the measurements \(y_i\) are given by \(\sigma _i\), the parameters can be estimated by finding the values of \({\boldsymbol{\theta }}\) that minimise the following quantity:

$$\begin{aligned} \chi ^2 = \sum _{i=1}^{n} \frac{(y_i - f({x_i; {\boldsymbol{\theta }}}))^2}{\sigma _i^2}. \end{aligned}$$
(4.42)

This method is called the least squares method and is closely related to the maximum likelihood method described in Sect. 4.4: when the measurements \(y_i\) are Gaussian-distributed around \(f(x_i; {\boldsymbol{\theta }})\) with standard deviations \(\sigma _i\), \(\chi ^2\) is identical to \(-2\ln {L({\boldsymbol{\theta }})}\) up to an additive constant, so minimising \(\chi ^2\) is equivalent to maximising the likelihood.
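For a model that is linear in its parameters, the \(\chi ^2\) minimum is given in closed form by the weighted normal equations. The following pure-Python sketch fits a straight line \(y=\theta _1+\theta _2 x\); the data points and their uncertainties are assumed toy values, not from the text:

```python
# Toy data roughly following y = 2 + 3x, measured with uncertainty sigma_i.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.9, 8.2, 10.9, 14.1]
sigmas = [0.1, 0.1, 0.1, 0.1, 0.1]

# Weighted sums entering the normal equations, with weights w_i = 1/sigma_i^2.
w = [1.0 / s ** 2 for s in sigmas]
S = sum(w)
Sx = sum(wi * x for wi, x in zip(w, xs))
Sy = sum(wi * y for wi, y in zip(w, ys))
Sxx = sum(wi * x * x for wi, x in zip(w, xs))
Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))

# Closed-form solution minimising chi^2 of Eq. (4.42) for a straight line.
d = S * Sxx - Sx ** 2
intercept = (Sxx * Sy - Sx * Sxy) / d
slope = (S * Sxy - Sx * Sy) / d
chi2 = sum(((y - (intercept + slope * x)) / s) ** 2
           for x, y, s in zip(xs, ys, sigmas))
print(intercept, slope, chi2)
```

The resulting \(\chi ^2\) value at the minimum, compared with the \(n-m\) degrees of freedom, also indicates how well the model describes the data.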

4.6 Statistical Figure of Merit

When we discuss the statistical significance of observed events, the following figure of merit is often used,

$$\begin{aligned} \frac{N_{\textrm{signal}}}{\sqrt{N_{\textrm{obs}}}} = \frac{N_{\textrm{signal}}}{\sqrt{N_{\textrm{signal}}+N_{\textrm{background}}}}. \end{aligned}$$
(4.43)

Because \(\sigma = \sqrt{N_{\textrm{obs}}}\) is the statistical uncertainty of the total number of observed events, the figure of merit above indicates how significant the signal is over the background in units of \(\sigma \). For example, suppose 10000 events are observed and 9500 events are expected as background after a certain event selection; the figure of merit is \((10000-9500)/\sqrt{10000} = 5 \sigma \). If \(N_{\textrm{background}}\) is much larger than \(N_{\textrm{signal}}\), one can use

$$\begin{aligned} \frac{N_{\textrm{signal}}}{\sqrt{N_{\textrm{background}}}}. \end{aligned}$$
(4.44)

The higher the statistical figure of merit, the more sensitive the measurement is expected to be. Note that in this discussion, only the statistical uncertainty is taken into account. If systematic uncertainties need to be considered, the figure of merit becomes more complicated.
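The worked example above can be reproduced directly, together with the background-dominated approximation of Eq. (4.44):

```python
import math

# Numbers from the worked example in the text.
n_obs, n_bkgd = 10_000, 9_500
n_signal = n_obs - n_bkgd

fom = n_signal / math.sqrt(n_obs)        # Eq. (4.43): 500/100 = 5 sigma
fom_bkgd = n_signal / math.sqrt(n_bkgd)  # Eq. (4.44), background-dominated limit
print(fom, fom_bkgd)
```

The two forms differ only slightly here because the sample is background-dominated, as assumed by Eq. (4.44).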

4.7 Hypothesis Test

A hypothesis test is a method to describe how well the data agree or disagree with a given hypothesis. The hypothesis under consideration is called the null hypothesis \(H_0\). This hypothesis \(H_0\) is compared with a so-called alternative or test hypothesis \(H_1\) in order to quantify the compatibility of the data with \(H_0\). In practice, \(H_1\) is a hypothesis we would like to establish, for example, the presence of a new particle. In other words, \(H_0\) is a hypothesis we would like to reject.

Fig. 4.7 Probability density distributions for the \(H_0\) (left) and \(H_1\) (right) hypotheses. Given a threshold on the number of events (“thres”), the regions for the significance level \(\alpha \) and the power \(1-\beta \) are shown

4.7.1 Discovery and Exclusion

According to the Neyman–Pearson lemma, the likelihood ratio \(L(H_0)/L(H_1)\) is the optimal discriminator for the hypothesis test of \(H_0\) versus \(H_1\), so the likelihood ratio is often used as a test statistic t. In this section, however, to understand the p-value and related concepts intuitively, we use the number of selected events as a test statistic and assume it follows Gaussian distributions (Eq. (4.15)); we discuss discovery and exclusion in these terms. Suppose one Gaussian distribution \(N_\textrm{B}(x)\) describes the background-only case (the SM) and another \(N_\mathrm {S+B}(x)\) describes signal+background (the signal being new physics beyond the SM). Figure 4.7 shows these two Gaussian distributions. Since the x-axis is the number of events, the \(N_\mathrm {S+B}(x)\) distribution lies to the right of \(N_\textrm{B}(x)\). Here, we take the background-only model as the null hypothesis \(H_0\) and signal+background as \(H_1\). For a hypothesis test, we determine a threshold \(x^\textrm{thres}\) to define a significance level \(\alpha \). In this case, \(\alpha \) is defined as

$$\begin{aligned} \alpha = \int ^\infty _{x^\textrm{thres}} N_\mathrm {H_0:B}(x) dx, \nonumber \end{aligned}$$

which is shown in Fig. 4.7. Then, if \(H_0\) is false and \(H_1\) is true, the probability to reject \(H_0\) correctly is called the power \(1-\beta \), where \(\beta \) is defined as

$$\begin{aligned} \beta = \int ^{x^\textrm{thres}}_{-\infty } N_\mathrm {H_1:S+B}(x) dx, \nonumber \end{aligned}$$

which is also shown in Fig. 4.7. These \(\alpha \) and \(\beta \) correspond to the type I error (false positive) and the type II error (false negative), respectively: the former is the probability to reject \(H_0\) wrongly and the latter the probability to reject \(H_1\) wrongly. Now suppose we obtain the number of events \(x^\textrm{obs}\) from the data. We define a p-value p, a probability that quantifies the compatibility with \(H_0\):

$$\begin{aligned} p = \int ^\infty _{x^\textrm{obs}} N_\mathrm {H_0:B}(x) dx. \nonumber \end{aligned}$$

When the value of p is smaller than \(\alpha \), we say that \(H_0\) is rejected at the significance level \(\alpha \).

Figure 4.8a shows an example of the discovery, where \(H_0\) is the background-only and \(H_1\) is the signal+background. We use the p-value \(p_0\) under \(H_0\) to claim the discovery of a new particle.Footnote 2 Conventionally, if \(p_0\) is smaller than \(2.87\times 10^{-7}\), what we observe is very rare under \(H_0\), so we consider \(H_0\) rejected. The p-value can be transformed into a z-value, which is defined using a standard Gaussian distribution as

$$\begin{aligned} p = \int ^\infty _{z} \frac{1}{\sqrt{2\pi }}e^{-\frac{x^2}{2}} dx. \nonumber \end{aligned}$$

For \(p=2.87\times 10^{-7}\), z-value corresponds to \(5\sigma \), which is shown in Table 4.2. When we investigate new physics models using MC samples (MC studies), the observed \(x^\textrm{obs}\) is replaced with the median of the \(N_\mathrm {H_1:S+B}(x)\) distribution. To claim the so-called “evidence” instead of “discovery”, we conventionally use \(p=1.35\times 10^{-3}\) (\(3\sigma \)).
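The conversion between p-value and z-value involves only the standard Gaussian tail integral, which can be written with the complementary error function. A minimal sketch (the bisection inversion is an illustrative choice, not a prescribed method):

```python
import math

def p_from_z(z):
    # One-sided tail probability: p = integral from z to infinity of N(x; 0, 1).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p(p):
    # Invert p_from_z by bisection (p_from_z is monotonically decreasing in z).
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if p_from_z(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(p_from_z(5.0))      # the 5 sigma "discovery" threshold, ~2.87e-7
print(z_from_p(1.35e-3))  # the "evidence" threshold, ~3 sigma
```

This reproduces the conventional thresholds quoted in the text: \(z=5\) gives \(p\approx 2.87\times 10^{-7}\) and \(p=1.35\times 10^{-3}\) gives \(z\approx 3\).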

Fig. 4.8 a \(p_0\) for discovery and b \(1-p\) for exclusion. They are evaluated for the number of observed events (“obs”). In case of MC studies, the number of observed events is replaced with the median of signal+background and background-only for (a) and (b), respectively

Figure 4.8b shows an example of the exclusion of a model, where \(H_0\) is the signal+background and \(H_1\) is the background-only. When the value of \((1-p)\) under \(H_0\) is smaller than 0.05, \(H_0\) is rejected; we call this “95% exclusion.”Footnote 3 In case of MC studies, the \(x^\textrm{obs}\) is replaced with the median of the \(N_\mathrm {H_1:B}(x)\) distribution. The \((1-p)\) is denoted \(\textrm{CL}_{s+b}\) in high-energy experiments since it corresponds to the compatibility with the signal+background hypothesis. Furthermore, in the LHC experiments, a \(\textrm{CL}_{s}\)-based exclusion is often used instead of \(\textrm{CL}_{s+b}\). \(\textrm{CL}_{s}\) is defined as

$$\begin{aligned} \textrm{CL}_{s} = \frac{\textrm{CL}_{s+b}}{\textrm{CL}_{b}} = \int ^{x^\textrm{obs}}_{-\infty } N_\mathrm {H_0:S+B}(x) dx / \int ^{x^\textrm{obs}}_{-\infty } N_\mathrm {H_1:B}(x) dx. \end{aligned}$$
(4.45)

In case of MC studies, the denominator is 0.5 because \(x^\textrm{obs}\) is the median of the \(N_\mathrm {H_1:B}(x)\) distribution; then \(\textrm{CL}_{s} = 2\,\textrm{CL}_{s+b}\), so the 95% exclusion using \(\textrm{CL}_{s}\) corresponds to \((1-p) = 0.025\) for \(\textrm{CL}_{s+b}\). The \(\textrm{CL}_{s}\) is not a probability, but the LHC experiments often use it to claim exclusions in order to avoid incorrect exclusions, which could occur when the expected signal is small.
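Under the Gaussian approximation used in this section, \(\textrm{CL}_{s+b}\), \(\textrm{CL}_{b}\), and \(\textrm{CL}_{s}\) are simple cumulative integrals. A minimal sketch with assumed toy counts (background mean \(b=100\), expected signal \(s=30\), widths \(\sqrt{\textrm{mean}}\)), taking \(x^\textrm{obs}\) at the background median as in an MC study:

```python
import math

def gauss_cdf(x, mean, sigma):
    # Cumulative distribution of a Gaussian, written via the error function.
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

# Assumed toy numbers: background mean b = 100, expected signal s = 30,
# Gaussian widths taken as sqrt(mean), as for large Poisson counts.
b, s = 100.0, 30.0
x_obs = b  # observe the background median, as in an MC (expected-limit) study

cls_b = gauss_cdf(x_obs, s + b, math.sqrt(s + b))  # CL_{s+b} = 1 - p
cl_b = gauss_cdf(x_obs, b, math.sqrt(b))           # CL_b
cls = cls_b / cl_b                                 # Eq. (4.45)
print(cls_b, cl_b, cls)
```

As stated in the text, \(\textrm{CL}_{b}=0.5\) at the background median, so \(\textrm{CL}_{s}\) is exactly \(2\,\textrm{CL}_{s+b}\); with these toy numbers both are below 0.05, so this model would be excluded.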

4.7.2 Profile Likelihood Fit

Suppose we count the number of observed events n after applying our event selection. This count n follows a Poisson distribution with an expectation value of \(\mu s+b\), where s is the expected number of events from a signal model and b is the expected number from background processes. The likelihood function can be defined as

$$\begin{aligned} L(\mu ,s,b) = \frac{(\mu s+b)^n}{n!}e^{-(\mu s+b)}. \nonumber \end{aligned}$$

The parameter \(\mu \) is a scale factor of the signal and is called the signal strength. Given s and b, we can extract a \(\mu \) value from a fit to the data, which provide the value of n, using the maximum likelihood technique explained in Sect. 4.4. The value of \(\mu \) should be around unity if the data follow the assumed signal model, while it is close to zero if the data follow the SM, that is, if the data contain background only.
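For this single-bin likelihood the maximum can be found analytically (\(\hat{\mu }=(n-b)/s\)); the following sketch with assumed toy inputs instead locates it by a scan of the negative log-likelihood, which generalises to cases without a closed form:

```python
import math

def nll(mu, n, s, b):
    # Negative log-likelihood of Pois(n | mu*s + b), dropping the ln(n!) constant.
    lam = mu * s + b
    return lam - n * math.log(lam)

# Assumed toy inputs: expect s = 20 signal and b = 10 background events,
# and observe n = 32 events after the selection.
n_obs, s, b = 32, 20.0, 10.0

# Scan the signal strength and take the minimum of the negative log-likelihood.
mus = [i / 1000.0 for i in range(0, 3001)]
mu_hat = min(mus, key=lambda mu: nll(mu, n_obs, s, b))
print(mu_hat)  # the analytic maximum likelihood estimate is (n - b) / s = 1.1
```

The scan recovers the analytic result because the dropped \(\ln {n!}\) term is constant in \(\mu \) and does not shift the minimum.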

We modify this likelihood function by adding more terms. Since s corresponds to \(L\sigma _\textrm{phys}A\varepsilon \) of Eq. (4.1), we can consider systematic uncertainties from these parameters (L, \(\sigma _\textrm{phys}\), A, \(\varepsilon \)). For example, the uncertainty on the integrated luminosity, the scale uncertainties (factorisation and renormalisation scales) on \(\sigma _\textrm{phys}\), the uncertainties from the jet energy scale, etc. on \(\varepsilon \), and so on, can be systematic uncertainties for \(\mu \). We often use Gaussian termsFootnote 4 to constrain the signal term and also other terms in b. Furthermore, in most data analyses, we estimate the background b in the signal region, which is defined by our (signal) event selection, by using so-called control or sideband regions in data with the help of MC samples. In this case, the b in the signal region can be described by \(\eta _\textrm{tf}(\alpha _\textrm{tf})b\), where b is the number of events in the control region and \(\eta _\textrm{tf}\) is a scale (transfer) factor from the control region to the signal region. The \(\eta _\textrm{tf}\) is obtained from both data and MC samples, so an additional constraint term (with a parameter \(\alpha _\textrm{tf}\)) is introduced. In the end, one example of a final likelihood function can be written as

$$\begin{aligned} \begin{aligned} L(\mu ,\boldsymbol{\theta })&= Pois(n|\mu \eta _s(\alpha _s)s+\eta _\textrm{tf}(\alpha _\textrm{tf})b)\cdot N(\alpha _s|0,1) \cdot \\&\quad Pois(m|\eta _b(\alpha _b)b)\cdot N(\alpha _b|0,1) \cdot \\&\quad N(\alpha _\textrm{tf}|0,1), \nonumber \end{aligned} \end{aligned}$$

where \(\boldsymbol{\theta } = (b,\alpha _s,\alpha _b,\alpha _\textrm{tf})\), \(\eta _i(\alpha )=\mu _i+\sigma _i\alpha \), \(Pois(n|\mu )=\mu ^n e^{-\mu }/n!\), and \(N(x|\mu ,\sigma )=1/(\sqrt{2\pi }\sigma )\cdot \exp (-(x-\mu )^2/(2\sigma ^2))\). Here m is the number of observed events in the control region. The \(\boldsymbol{\theta }\) is called a set of nuisance parameters, which are determined by the likelihood fit together with \(\mu \). The \(\eta _i\) is a scale parameter for the signal, background, and so on. The \(\alpha _i\) is a parameter to adjust \(\eta _i\) through a Gaussian constraint. The parameters \(\mu _i\) and \(\sigma _i\) for \(\eta _i\) describe the central values and their uncertainties and are evaluated from other studies before the likelihood fit. If the \(\alpha _i\) value is 0, the value of \(\eta _i\) becomes \(\mu _i\); if not, \(\eta _i\) is varied from its central value. Practically, \(\mu _i\) is close to 1. Then, the effect on the signal strength \(\mu \) from each constraint is determined in the maximum likelihood (ML) fit. This means that the systematic uncertainties on \(\mu \) from each constraint term are determined simultaneously with the \(\mu \) value itself. We call this procedure a “profiled” fit. When the pre-studies on \(\mu _i\) and \(\sigma _i\) are appropriate, the values of \(\alpha _i\) are expected to be close to \(0\pm 1\). If, for example, the error of \(\alpha _i\) is smaller than 1 (say 0.3 or 0.4), it means that the value of \(\sigma _i\) given by the pre-studies is tightly constrained by the data used in the ML fit, for example, the data of the control regions. If this is unexpected, additional studies might be required to understand such small values.

4.7.3 Profile Likelihood Ratio

We introduce the following likelihood ratio as a test statistic \(t_\mu \):

$$\begin{aligned} t_\mu = -2\ln \lambda (\mu ), \nonumber \end{aligned}$$
$$\begin{aligned} \lambda (\mu ) = \frac{L(\mu ,\hat{\hat{\boldsymbol{\theta }}})}{L(\hat{\mu },\hat{\boldsymbol{\theta }})}, \nonumber \end{aligned}$$

where the denominator of \(\lambda (\mu )\) is maximised over both \(\mu \) and \(\boldsymbol{\theta }\) (an unconditional ML fit) while the numerator is maximised over \(\boldsymbol{\theta }\) for a specified \(\mu \) value (a conditional ML fit). Since the denominator corresponds to the best fit to the data, \(\lambda (\mu )\) satisfies \(0 < \lambda (\mu ) \le 1\), so the value of \(t_\mu \) is zero or positive. When the numerator with a specified \(\mu \) value describes the data well, \(t_\mu \) is small; if not, \(t_\mu \) becomes large. The p-value is defined as

$$\begin{aligned} p_\mu = \int ^\infty _{t_{\mu ,\textrm{obs}}} f(t_\mu |\mu ') dt_\mu , \nonumber \end{aligned}$$

where \(t_{\mu ,\textrm{obs}}\) is the value of \(t_\mu \) observed in the data and \(f(t_\mu |\mu ')\) is the probability density distribution of \(t_\mu \) under the assumption of the signal strength \(\mu '\). The advantage of using this test statistic is that the distribution of \(t_\mu \) asymptotically follows a \(\chi ^2\) distribution with one degree of freedom, \(f(t_\mu |\mu ) \sim \chi ^2_{\textrm{dof}=1}(t_\mu )\), so that we can evaluate the p-value without toy Monte Carlo.Footnote 5 We explain the overall idea of discovery and exclusion using this test statistic below; the technical details of the hypothesis test with this test statistic can be found in Ref. [1].
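A minimal sketch of \(t_\mu \) for the two-region counting model used in Fig. 4.9, \(L(\mu ,b)=Pois(n|\mu s+b)\cdot Pois(m|b)\): the conditional and unconditional ML fits are done by simple grid scans over the nuisance parameter b and over \(\mu \) (an illustrative choice; real fits use numerical minimisers), and the p-value uses the asymptotic \(\chi ^2_{\textrm{dof}=1}\) formula. The observed counts below are assumed toy values:

```python
import math

def ln_pois(n, lam):
    # log of the Poisson pmf, up to the constant -ln(n!)
    return n * math.log(lam) - lam

# Toy model following the likelihood of Fig. 4.9 (assumed inputs):
# signal-region count n, control-region count m, expected signal s.
n, m, s = 32, 10, 20.0

def nll(mu, b):
    # -ln L(mu, b) = -ln[ Pois(n | mu*s + b) * Pois(m | b) ]
    return -(ln_pois(n, mu * s + b) + ln_pois(m, b))

def profiled_nll(mu):
    # Conditional ML fit: minimise over the nuisance parameter b by a grid scan.
    return min(nll(mu, 0.05 * i) for i in range(1, 1001))

# Unconditional ML fit: additionally scan the signal strength mu.
mus = [0.01 * i for i in range(0, 301)]
mu_hat = min(mus, key=profiled_nll)

t_1 = 2.0 * (profiled_nll(1.0) - profiled_nll(mu_hat))  # t_mu at mu = 1
p_1 = math.erfc(math.sqrt(t_1 / 2.0))  # asymptotic p-value from chi^2(dof=1)
print(mu_hat, t_1, p_1)
```

With these counts the best fit is \(\hat{\mu }=(n-m)/s=1.1\), close to the tested \(\mu =1\), so \(t_1\) is small and the p-value large: the data are compatible with \(\mu =1\).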

In high-energy experiments, we search for a new signal particle by checking for an excess over the expected events of the background-only assumption. The existence of a signal corresponds to \(\mu > 0\). For this case, an alternative test statistic \(\tilde{t}_\mu =-2\ln \tilde{\lambda }(\mu )\) is introduced,

$$\begin{aligned} \tilde{\lambda }(\mu ) = {\left\{ \begin{array}{ll} \frac{L(\mu ,\hat{\hat{\boldsymbol{\theta }}}(\mu ))}{L(\hat{\mu },\hat{\boldsymbol{\theta }}(\hat{\mu }))} \quad &{}(\hat{\mu }\ge 0) \\ \frac{L(\mu ,\hat{\hat{\boldsymbol{\theta }}}(\mu ))}{L(0,\hat{\boldsymbol{\theta }}(0))} \quad &{}(\hat{\mu }< 0), \end{array}\right. } \end{aligned}$$
(4.46)

where the best-fit \(\mu \) value with a deficit (\(\hat{\mu }< 0\)) is replaced with \(\mu =0\).

4.7.3.1 Discovery

We test \(\mu =0\), that is, we reject the null hypothesis \(H_0\) of \(\mu =0\) (background-only). We use a special notation \(q_0=\tilde{t}_0\) for this case. From Eq. (4.46), we use

$$\begin{aligned} q_0 = {\left\{ \begin{array}{ll} -2\ln \lambda (0) = -2\ln \frac{L(0,\hat{\hat{\boldsymbol{\theta }}}(0))}{L(\hat{\mu },\hat{\boldsymbol{\theta }}(\hat{\mu }))} \quad &{}(\hat{\mu }\ge 0) \\ 0 \quad &{}(\hat{\mu }< 0). \nonumber \end{array}\right. } \end{aligned}$$

We get a single value of \(q_0\) from data, \(q^\textrm{obs}_0\), and evaluate p-value \(p_0\) using

$$\begin{aligned} p_0 = \int ^\infty _{q^\textrm{obs}_0} f(q_0|0)dq_0, \nonumber \end{aligned}$$

where \(f(q_0|0)\) is the distribution of \(q_0\) made under the assumption of \(\mu =0\). Figure 4.9a shows distributions of \(q_0\) for the assumptions of \(\mu =0\) and 1: \(f(q_0|0)\) and \(f(q_0|1)\). Once we obtain the distribution \(f(q_0|0)\), we can evaluate \(p_0\) using \(q^\textrm{obs}_0\). Then, when the \(p_0\) value is smaller than \(2.87\times 10^{-7}\), we can claim a discovery.Footnote 6

We can approximate the \(f(q_0|0)\) distribution as follows:

$$\begin{aligned} f(q_0|0) = \frac{1}{2}\delta (q_0)+\frac{1}{2}\frac{1}{\sqrt{2\pi }}\frac{1}{\sqrt{q_0}}e^{-q_0/2}. \nonumber \end{aligned}$$

By using this equation, a z-value Z can be obtained as

$$\begin{aligned} Z = \sqrt{q^\textrm{obs}_0}. \nonumber \end{aligned}$$

The \(5\sigma \) discovery corresponds to \(q^\textrm{obs}_0=25\). In Fig. 4.9a, the value of \(q^\textrm{obs}_0\) is 23 (just as an example), so we cannot claim a “discovery.” In case of MC studies, as shown in Fig. 4.9a, we can use the median of the \(f(q_0|1)\) distribution as \(q^\textrm{obs}_0\). In this case, \(q^\textrm{med}[f(q_0|1)]\) is smaller than 25, so we cannot claim a “discovery”Footnote 7 for a physics model with \(s=20\).
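For a counting experiment with known background b and no nuisance parameters, \(q_0\) has a closed form, \(q_0 = 2\,(n\ln (n/b) - (n-b))\) for \(n>b\) and 0 otherwise, so \(Z=\sqrt{q^\textrm{obs}_0}\) can be sketched directly (the counts below are assumed toy values):

```python
import math

def q0(n, b):
    # q0 for a counting experiment with known background b and no nuisance
    # parameters: mu = 0 in the numerator, and the unconditional fit gives
    # mu_hat*s + b = n, so q0 = 2*(n*ln(n/b) - (n - b)) for n > b, else 0.
    if n <= b:
        return 0.0
    return 2.0 * (n * math.log(n / b) - (n - b))

# Assumed toy input: b = 10 expected background events.
b = 10.0
for n in (15, 25, 35):
    print(n, math.sqrt(q0(n, b)))  # z-value Z = sqrt(q0_obs)
```

With \(b=10\), observing \(n=35\) gives \(Z>5\) (discovery level), while \(n=25\) gives \(Z\approx 4\), enough for “evidence” but not “discovery.”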

Fig. 4.9 a \(f(q_0|0)\) with \(f(q_0|1)\) for discovery; b \(f(q_1|1)\) with \(f(q_1|0)\) for exclusion (upper limit). \(f(*|0)\) and \(f(*|1)\) show distributions for background-only and signal+background events, respectively. A likelihood function of \(L(\mu ,\theta =b) = \frac{(\mu s+b)^n}{n!}e^{-(\mu s+b)} \cdot \frac{b^m}{m!}e^{-b}\) is used, where the variables n and m are the numbers of events observed in the signal and control regions and \(s=20\) and \(b=10\) are used in this example; b in the signal region is estimated from the value of b in the control region. Dashed curves are central (blue) and noncentral (red) \(\chi ^2\) distributions of one degree of freedom. For the noncentral cases, so-called Asimov data, defined as data produced with the expectation values of the inputs (s, b, and \(\mu \)), is used to evaluate a width required in an approximate formula of \(f(q_\mu |\mu ')\) [1]

4.7.3.2 Exclusion or Upper Limit

We test \(\mu (\ne 0)\), that is, we try to reject the null hypothesis \(H_0\) of the signal+background model. When the \(\hat{\mu }\) of the unconditional ML fit is larger than the specified \(\mu \) value, we set \(q_\mu \) to 0. This means that the exclusion of models is performed only for \(\mu \) values larger than the observed best-fit \(\mu \). We define \(q_\mu \) as

$$\begin{aligned} q_\mu = {\left\{ \begin{array}{ll} -2\ln \lambda (\mu ) \quad &{}(\hat{\mu }\le \mu ) \\ 0 \quad &{}(\hat{\mu }> \mu ). \nonumber \end{array}\right. } \end{aligned}$$

We evaluate p-value \(p_\mu \) using

$$\begin{aligned} p_\mu = \int ^\infty _{q^\textrm{obs}_\mu } f(q_\mu |\mu ')dq_\mu , \nonumber \end{aligned}$$

where \(f(q_\mu |\mu ')\) is the distribution of \(q_\mu \) made under the assumption of \(\mu '\). Figure 4.9b shows distributions of \(q_{\mu =1}\) (simply \(q_1\)) for the assumptions of \(\mu =0\) and 1. When the \(p_\mu \) value is smaller than 0.05, we can claim a 95% \(\textrm{CL}_{s+b}\) exclusion. This corresponds to \(q^\textrm{obs}_\mu >2.69(=1.64^2)\). In Fig. 4.9b, the observed \(q_1\) (23 as an example) allows the “95% \(\textrm{CL}_{s+b}\) exclusion,” meaning that a model with \(s=20\) is excluded. Practically, we need a scan of \(\mu \) values to find the \(\mu \) value having \(p_\mu =0.05\); this corresponds to \(\mu \sim 0.4\) in the case of Fig. 4.9b. For a 95% \(\textrm{CL}_{s}\) exclusionFootnote 8, we need the distribution \(f(q_\mu |0)\) to evaluate \(\textrm{CL}_{b}=\int ^\infty _{q^\textrm{obs}_\mu } f(q_\mu |0)dq_\mu \). In case of MC studies, as shown in Fig. 4.9b, we can use the median of the \(f(q_\mu |0)\) distribution as \(q^\textrm{obs}_\mu \), giving \(\textrm{CL}_{b}=0.5\). For a 95% \(\textrm{CL}_{s}\) exclusion in MC studies, we can use \(q^\textrm{obs}_\mu >3.84(=1.96^2)\).
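For known background and no nuisance parameters, \(q_\mu \) also has a closed form, and the \(\mu \) scan for the 95% \(\textrm{CL}_{s+b}\) upper limit reduces to finding where \(q_\mu \) crosses \(1.64^2\). A minimal sketch with assumed toy inputs:

```python
import math

def q_mu(mu, n, s, b):
    # q_mu for a counting experiment with known background and no nuisance
    # parameters; q_mu = 0 when mu_hat > mu, since only mu values above the
    # best fit are tested for exclusion.
    mu_hat = (n - b) / s
    if mu_hat > mu:
        return 0.0
    lam = mu * s + b
    return 2.0 * (n * math.log(n / lam) + lam - n)

# Assumed toy inputs: s = 20, b = 10, n = 12 observed events.
n, s, b = 12, 20.0, 10.0

# Scan mu upward until p_mu < 0.05, i.e. until q_mu exceeds 1.64^2 = 2.69.
mu = 0.0
while q_mu(mu, n, s, b) < 1.64 ** 2:
    mu += 0.001
print(round(mu, 3))  # 95% CL_{s+b} upper limit on the signal strength
```

With this slight upward fluctuation over the background expectation, the scan yields an upper limit of \(\mu \approx 0.43\); signal strengths above this value are excluded at 95% \(\textrm{CL}_{s+b}\).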

For the case where we consider models with \(\mu \ge 0\), we can define and use an alternative test statistic \(\tilde{q}_\mu \):

$$\begin{aligned} \tilde{q}_\mu = {\left\{ \begin{array}{ll} -2\ln \frac{L(\mu ,\hat{\hat{\boldsymbol{\theta }}}(\mu ))}{L(0,\hat{\boldsymbol{\theta }}(0))} \quad &{}(\hat{\mu }< 0) \\ -2\ln \frac{L(\mu ,\hat{\hat{\boldsymbol{\theta }}}(\mu ))}{L(\hat{\mu },\hat{\boldsymbol{\theta }}(\hat{\mu }))} \quad &{}(\mu \ge \hat{\mu }\ge 0) \\ 0 \quad &{}(\hat{\mu }> \mu ). \nonumber \end{array}\right. } \end{aligned}$$

A procedure similar to the case of \(q_\mu \) can be applied [1].