## Abstract

Given certain measurements made on an individual, classification problems arise when one wishes to determine to which one of several categories (populations) this individual belongs. The cases of two populations with known distributions and two Gaussian populations having unknown parameters are initially considered. In the latter instance, the classification problem is extended to *k* categories. The maximum likelihood approach is also investigated for certain cases.


## 12.1. Introduction

We will use the same notations as in the previous chapters. Lower-case letters *x*, *y*, … will denote real scalar variables, whether mathematical or random. Capital letters *X*, *Y*, … will be used to denote real matrix-variate mathematical or random variables, whether square or rectangular matrices are involved. A tilde will be placed on top of letters such as \(\tilde {x},\tilde {y},\tilde {X},\tilde {Y}\) to denote variables in the complex domain. Constant matrices will for instance be denoted by *A*, *B*, *C*. A tilde will not be used on constant matrices unless the point is to be stressed that the matrix is in the complex domain. The determinant of a square matrix *A* will be denoted by |*A*| or det(*A*) and, in the complex case, the absolute value or modulus of the determinant of *A* will be denoted as |det(*A*)|. When matrices are square, their order will be taken as *p* × *p*, unless specified otherwise. When *A* is a full rank matrix in the complex domain, then \(AA^{*}\) is Hermitian positive definite where an asterisk designates the complex conjugate transpose of a matrix. Additionally, d*X* will indicate the wedge product of all the distinct differentials of the elements of the matrix *X*. Thus, letting the *p* × *q* matrix \(X=(x_{ij})\) where the \(x_{ij}\)'s are distinct real scalar variables, \(\mathrm {d}X=\wedge _{i=1}^p\wedge _{j=1}^q\mathrm {d}x_{ij}\). For the complex matrix \(\tilde {X}=X_1+iX_2,\ i=\sqrt {(-1)}\), where \(X_1\) and \(X_2\) are real, \(\mathrm {d}\tilde {X}=\mathrm {d}X_1\wedge \mathrm {d}X_2\).

Historically, classification problems arose in anthropological studies. By taking a set of measurements on skeletal remains, anthropologists wanted to classify them as belonging to a certain racial group such as being of African or European origin. The measurements might have been of the following type: \(x_1\) = width of the skull, \(x_2\) = volume of the skull, \(x_3\) = length of the thigh bone, \(x_4\) = width of the pelvis, and so on. Let the measurements be represented by a *p* × 1 vector *X*, with \(X'=(x_1,\ldots ,x_p)\) where a prime denotes the transpose. Nowadays, classification procedures are employed in all types of problems occurring in various contexts. For example, consider the situation of a battery of tests in an entrance examination to admit students into a professional program such as medical sciences, law studies, engineering science or management studies. Based on the *p* × 1 vector of test scores, a statistician would like to classify an applicant as to whether or not he/she belongs to the group of applicants who will successfully complete a given program. This is a 2-group situation. If a third category is added such as those who are expected to complete the program with flying colors, this will become a 3-group situation. In general, one will have a *k*-group situation when an individual is classified into one of *k* classes.

Let us begin with the 2-group situation. The problem consists of classifying the *p* × 1 vector *X* into one of two groups, classes or categories. Let the categories be denoted by population \(\pi _1\) and population \(\pi _2\). This means *X* will either belong to \(\pi _1\) or to \(\pi _2\), no other options being considered. The *p* × 1 vector *X* may be taken as a point in a *p*-space \(R_p\) or *p*-dimensional Euclidean space \(\Re ^p\). In a two-group situation when it is decided that the candidate either belongs to the population \(\pi _1\) or the population \(\pi _2\), two subspaces \(A_1\) and \(A_2\) within the *p*-space \(R_p\) are determined: \(A_1\subset R_p\) and \(A_2\subset R_p\), with \(A_1\cap A_2=O\) (the empty set), and a decision rule can be symbolically written as \(A=(A_1,A_2)\). If *X* falls in \(A_1\), that is, \(X\in A_1\), the individual is classified into population \(\pi _1\), and if \(X\in A_2\), the individual is classified into population \(\pi _2\). The regions \(A_1\) and \(A_2\) or the rule \(A=(A_1,A_2)\) are not known beforehand. These are to be determined by employing certain decision rules. Criteria for determining \(A_1\) and \(A_2\) will be subsequently put forward.

Let us now consider the consequences. When a decision is made to classify *X* as coming from \(\pi _1\), either the decision is correct or the decision is erroneous. If the population is actually \(\pi _1\) and the decision rule classifies *X* into \(\pi _1\), then the decision is correct. If *X* is classified into \(\pi _2\) when in reality the population is \(\pi _1\), then a mistake has been committed or a misclassification has occurred. Misclassification will involve penalties, costs or losses. Let such a penalty, cost or loss of classifying an individual into group *i* when he/she actually belongs to group *j* be denoted by *C*(*i*|*j*). In a 2-group situation, *i* and *j* can only equal 1 or 2. That is, *C*(1|2) > 0 and *C*(2|1) > 0 are the costs of misclassifying, whereas *C*(1|1) = 0 and *C*(2|2) = 0 since there is no cost or penalty associated with correct decisions. The following table summarizes this discussion:

| Decision | True population \(\pi _1\) | True population \(\pi _2\) |
|---|---|---|
| Classify into \(\pi _1\) | \(C(1\mid 1)=0\) | \(C(1\mid 2)>0\) |
| Classify into \(\pi _2\) | \(C(2\mid 1)>0\) | \(C(2\mid 2)=0\) |

## 12.2. Probabilities of Classification

The vector random variable corresponding to the observation vector *X* may have its own probability/density function. The real scalar variables as well as the observations on them will be denoted by the lower-case letters \(x_1,\ldots ,x_p\). When dealing with the probability/density function of *X*, *X* is taken as a vector random variable, whereas when looked upon as a point in the *p*-space \(R_p\), *X* is deemed to be an observation vector. Let the probability/density function of the *p* × 1 vector *X* be *P*(*X*). In a 2-group or two-class situation, *P*(*X*) is either \(P_1(X)\), the population density of \(\pi _1\), or \(P_2(X)\), the population density of \(\pi _2\). For convenience, it will be assumed that *X* is of the continuous type, the derivations in the discrete case being analogous. What is then the probability of achieving a correct classification under the rule \(A=(A_1,A_2)\)? If the sample point *X* falls in \(A_1\), we classify the candidate as belonging to \(\pi _1\), and if the true population is also \(\pi _1\), then a correct decision is made. In that instance, the corresponding probability is

\[Pr\{1|1,A\}=\int _{A_1}P_1(X)\,\mathrm {d}X \tag{12.2.1}\]

where \(\mathrm {d}X=\mathrm {d}x_1\wedge \mathrm {d}x_2\wedge \ldots \wedge \mathrm {d}x_p\), \(A=(A_1,A_2)\) denoting one decision rule or one given set of subspaces of the *p*-space \(R_p\). The probability of misclassification in this case is

\[Pr\{2|1,A\}=\int _{A_2}P_1(X)\,\mathrm {d}X. \tag{12.2.2}\]

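Similarly, the probabilities of correctly selecting and misclassifying \(P_2(X)\) are respectively given by

\[Pr\{2|2,A\}=\int _{A_2}P_2(X)\,\mathrm {d}X \tag{12.2.3}\]

and

\[Pr\{1|2,A\}=\int _{A_1}P_2(X)\,\mathrm {d}X. \tag{12.2.4}\]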

In a Bayesian setting, there is a prior probability \(q_1\) of selecting the population \(\pi _1\) and \(q_2\) of selecting the population \(\pi _2\), with \(q_1+q_2=1\). Then, what will be the probability of drawing an observation from \(\pi _1\) and misclassifying it as belonging to \(\pi _2\)? It is \(q_1\times Pr\{2|1,A\}=q_1\int _{A_2}P_1(X)\mathrm {d}X\) and, similarly, the probability of drawing an observation from \(\pi _2\) and misclassifying it as coming from \(\pi _1\) is \(q_2\times Pr\{1|2,A\}=q_2\int _{A_1}P_2(X)\mathrm {d}X\), with the respective costs of misclassification being *C*(2|1) = *C*(2|1, *A*) and *C*(1|2) = *C*(1|2, *A*). What is then the expected cost of misclassification? It is the sum of the costs multiplied by the corresponding probabilities. Thus,

\[\text{expected cost}=q_1\,C(2|1)\,Pr\{2|1,A\}+q_2\,C(1|2)\,Pr\{1|2,A\}=q_1\,C(2|1)\int _{A_2}P_1(X)\,\mathrm {d}X+q_2\,C(1|2)\int _{A_1}P_2(X)\,\mathrm {d}X. \tag{12.2.5}\]

So, an advantageous criterion to rely on when setting up \(A_1\) and \(A_2\) would consist in minimizing the expected cost as given in (12.2.5). A rule could be devised for determining \(A_1\) and \(A_2\) accordingly; this actually corresponds to Bayes’ rule. How can one interpret this expected cost? For example, in the case of admitting students to a particular program of study based on a vector *X* of test scores, it is the cost of admitting potentially incompetent students or students who would not have successfully completed the program of study and training them, plus the projected cost of losing good students who would have successfully completed the program of study.
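The expected cost described above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_cost`, its argument names and the numerical inputs are assumptions introduced here, not part of the text.

```python
def expected_cost(q1, q2, c21, c12, pr21, pr12):
    """Expected cost of misclassification: sum of cost x probability.

    q1, q2 : prior probabilities of pi_1 and pi_2 (must sum to 1)
    c21    : C(2|1), cost of classifying a pi_1 observation into pi_2
    c12    : C(1|2), cost of classifying a pi_2 observation into pi_1
    pr21   : Pr{2|1, A}, the integral of P_1 over A_2
    pr12   : Pr{1|2, A}, the integral of P_2 over A_1
    """
    if abs(q1 + q2 - 1.0) > 1e-12:
        raise ValueError("prior probabilities must sum to 1")
    return q1 * c21 * pr21 + q2 * c12 * pr12


# Hypothetical inputs: equal priors, unit costs, and two assumed
# misclassification probabilities.
cost = expected_cost(q1=0.5, q2=0.5, c21=1.0, c12=1.0, pr21=0.5, pr12=0.25)
print(cost)
```

Raising one of the costs, say `c21`, increases the weight of the corresponding error, which is what pushes a cost-minimizing rule to enlarge the region \(A_1\).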

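If prior probabilities \(q_1\) and \(q_2\) are not involved, then the expected cost of misclassifying an observation from \(\pi _1\) as coming from \(\pi _2\) is

\[E_1(A)=C(2|1)\,Pr\{2|1,A\}=C(2|1)\int _{A_2}P_1(X)\,\mathrm {d}X\]

and the expected cost of misclassifying an observation from \(\pi _2\) as coming from \(\pi _1\) is

\[E_2(A)=C(1|2)\,Pr\{1|2,A\}=C(1|2)\int _{A_1}P_2(X)\,\mathrm {d}X.\]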

We would like to have \(E_1(A)\) and \(E_2(A)\) as small as possible. In this case, a procedure, rule or criterion \(A=(A_1,A_2)\) corresponds to determining suitable subspaces \(A_1\) and \(A_2\) in the *p*-space \(R_p\). If there is another procedure \(A^{(j)}=(A_1^{(j)},A_2^{(j)})\) such that \(E_1(A)\le E_1(A^{(j)})\) and \(E_2(A)\le E_2(A^{(j)})\), then procedure *A* is said to be as good as \(A^{(j)}\), and if at least one of the inequalities above is strict, then *A* is preferable to \(A^{(j)}\). If procedure *A* is preferable to all other available procedures \(A^{(j)},\ j=1,2,\ldots \), *A* is said to be *admissible*. We are seeking an admissible class {*A*} of procedures.

## 12.3. Two Populations with Known Distributions

Let \(\pi _1\) and \(\pi _2\) be the two populations. Let \(P_1(X)\) and \(P_2(X)\) be the known *p*-variate probability/density functions associated with \(\pi _1\) and \(\pi _2\), respectively; they are fully known in the sense that all their parameters are known in addition to their functional forms. Consider the Bayesian situation where it is assumed that the prior probabilities \(q_1\) and \(q_2\) of selecting \(\pi _1\) and \(\pi _2\), respectively, are known. Suppose that a particular *p*-vector *X* is at hand. What is the probability that this given *X* is an observation from \(\pi _1\)? This probability is \(q_1P_1(X)\) if *X* is discrete or \(q_1P_1(X)\mathrm {d}X\) if *X* is continuous. What is the probability that the given vector *X* is an observation vector either from \(\pi _1\) or from \(\pi _2\)? This probability is \(q_1P_1(X)+q_2P_2(X)\) or \([q_1P_1(X)+q_2P_2(X)]\mathrm {d}X\). What is then the probability that the vector *X* at hand is from \(P_1(X)\), given that it is an observation vector from \(\pi _1\) or \(\pi _2\)? As this is a conditional statement, it is given by the following in the discrete or continuous case:

\[\frac {q_1P_1(X)}{q_1P_1(X)+q_2P_2(X)}\quad \text{or}\quad \frac {q_1P_1(X)\,\mathrm {d}X}{[q_1P_1(X)+q_2P_2(X)]\,\mathrm {d}X}=\frac {q_1P_1(X)}{q_1P_1(X)+q_2P_2(X)} \tag{12.3.1}\]

where d*X*, which is the wedge product of differentials and positive in this case, cancels out. If the conditional probability that a given *X* is an observation from \(\pi _1\) is larger than or equal to the conditional probability that the given vector *X* is an observation from \(\pi _2\) and if we assign *X* to \(\pi _1\), then the chance of misclassification is reduced. Our main objective is to minimize the probability of misclassification and then come up with a decision rule. This statement is equivalent to the following: If

\[\frac {q_1P_1(X)}{q_1P_1(X)+q_2P_2(X)}\ge \frac {q_2P_2(X)}{q_1P_1(X)+q_2P_2(X)}, \tag{12.3.2}\]

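then we assign *X* to \(\pi _1\), meaning that our subspace \(A_1\) is specified by the following rule:

\[A_1:\ q_1P_1(X)\ge q_2P_2(X)\ \Rightarrow \ \frac {P_1(X)}{P_2(X)}\ge \frac {q_2}{q_1}\quad \text{and}\quad A_2:\ q_1P_1(X)<q_2P_2(X)\ \Rightarrow \ \frac {P_1(X)}{P_2(X)}<\frac {q_2}{q_1}. \tag{12.3.3}\]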

Note that if \(q_1P_1(X)=q_2P_2(X)\), then *X* can be assigned to either \(\pi _1\) or \(\pi _2\); however, we have assigned it to \(\pi _1\) for convenience. Observe that it is assumed that \(q_1P_1(X)+q_2P_2(X)\ne 0\), \(q_1>0\), \(q_2>0\) and \(q_1+q_2=1\) in (12.3.2). The conditional statement made in (12.3.2), which can also be written as

holds for some weight functions \(\eta _i,\ i=1,2\).
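The conditional probability that drives this rule is simple to evaluate numerically. The following Python sketch computes \(q_1P_1(X)/[q_1P_1(X)+q_2P_2(X)]\) for a scalar observation; the helper names (`posterior_pi1`, `normal_pdf`) and the two univariate normal densities used as \(P_1\) and \(P_2\) are illustrative assumptions only.

```python
import math


def normal_pdf(mu, sigma):
    """Return the density function of a univariate N(mu, sigma^2)."""
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf


def posterior_pi1(x, q1, q2, p1, p2):
    """Conditional probability that x comes from pi_1, given pi_1 or pi_2."""
    numerator = q1 * p1(x)
    denominator = numerator + q2 * p2(x)
    if denominator == 0.0:
        raise ValueError("q1*P1(x) + q2*P2(x) must be nonzero")
    return numerator / denominator


# Hypothetical populations: N(5, 1) for pi_1 and N(3, 1) for pi_2.
p1 = normal_pdf(5.0, 1.0)
p2 = normal_pdf(3.0, 1.0)

post_mid = posterior_pi1(4.0, 0.5, 0.5, p1, p2)   # halfway between the means
post_high = posterior_pi1(4.2, 0.5, 0.5, p1, p2)  # closer to the mean of pi_1
print(post_mid, post_high)
```

Assigning *x* to \(\pi _1\) whenever this value is at least one half reproduces the comparison \(q_1P_1(X)\ge q_2P_2(X)\).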

If *X* is assigned to \(\pi _2\), the expected cost of misclassification is \(q_1P_1(X)C(2|1)+q_2P_2(X)C(2|2)=q_1P_1(X)C(2|1)\) since \(C(i|i)=0,\ i=1,2\); this is the cost incurred when the observation is actually from \(\pi _1:P_1(X)\). Similarly, the expected cost of misclassifying an observation *X* from \(\pi _2:P_2(X)\) as coming from \(\pi _1\) is \(q_2P_2(X)C(1|2)\). If \(P_1(X)\) is our preferred distribution, then we would like the expected cost associated with assigning *X* to \(\pi _1\) to be the lesser one, that is,

\[A_1:\ q_2\,C(1|2)\,P_2(X)\le q_1\,C(2|1)\,P_1(X)\ \Rightarrow \ \frac {P_1(X)}{P_2(X)}\ge \frac {q_2\,C(1|2)}{q_1\,C(2|1)}, \tag{12.3.4}\]

which is the same rule as in (12.3.3) where \(q_1\) is replaced by \(q_1C(2|1)\) and \(q_2\), by \(q_2C(1|2)\).
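A minimal sketch of the cost-weighted rule follows, assuming the comparison \(q_1C(2|1)P_1(X)\ge q_2C(1|2)P_2(X)\) for the region \(A_1\). The function names and the exponential densities with means 10 and 5 are hypothetical choices made for illustration.

```python
import math


def exp_pdf(theta):
    """Density of an exponential variable with mean theta (hypothetical helper)."""
    def pdf(x):
        return math.exp(-x / theta) / theta if x >= 0 else 0.0
    return pdf


def classify(x, q1, q2, c21, c12, p1, p2):
    """Return 1 if x falls in A_1, else 2.

    A_1 is taken as {x : q1*C(2|1)*P1(x) >= q2*C(1|2)*P2(x)};
    ties are assigned to pi_1, as in the text.
    """
    return 1 if q1 * c21 * p1(x) >= q2 * c12 * p2(x) else 2


p1 = exp_pdf(10.0)  # assumed density of pi_1
p2 = exp_pdf(5.0)   # assumed density of pi_2

label_equal_costs = classify(6.0, 0.5, 0.5, 1.0, 1.0, p1, p2)
label_costly_miss = classify(6.0, 0.5, 0.5, 2.0, 1.0, p1, p2)
print(label_equal_costs, label_costly_miss)
```

With equal costs the observation 6 is assigned to \(\pi _2\); doubling \(C(2|1)\) makes a wrongful assignment to \(\pi _2\) more expensive and moves the same observation into \(A_1\).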

### 12.3.1. Best procedure

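It can be established that the procedure \(A=(A_1,A_2)\) in (12.3.3) is the best one for minimizing the probability of misclassification. To this end, consider any other procedure \(A^{(j)}=(A_1^{(j)},A_2^{(j)}),\ j=1,2,\ldots .\) The probability of misclassification under the procedure \(A^{(j)}\) is the following:

\[q_1\int _{A_2^{(j)}}P_1(X)\,\mathrm {d}X+q_2\int _{A_1^{(j)}}P_2(X)\,\mathrm {d}X=\int _{A_2^{(j)}}[q_1P_1(X)-q_2P_2(X)]\,\mathrm {d}X+q_2\int _{A_1^{(j)}\cup A_2^{(j)}}P_2(X)\,\mathrm {d}X. \tag{12.3.5}\]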

If \(A_1^{(j)}\cup A_2^{(j)}=R_p\), then \(\int _{A_1^{(j)}\cup A_2^{(j)}}P_2(X)\mathrm {d}X=1\); it is otherwise a given positive constant. However, \(q_1P_1(X)-q_2P_2(X)\) can be negative, zero or positive, whereas the left-hand side of (12.3.5) is a positive probability. Accordingly, the left-hand side is minimum if \(A_2^{(j)}\) consists of the points at which the integrand is negative, that is, if

\[q_1P_1(X)-q_2P_2(X)<0\ \Rightarrow \ \frac {P_1(X)}{P_2(X)}<\frac {q_2}{q_1}, \qquad (i)\]

which actually is the rejection region \(A_2\) of the procedure \(A=(A_1,A_2)\). Hence, the procedure \(A=(A_1,A_2)\) minimizes the probabilities of misclassification; in other words, it is the best procedure. If cost functions are also involved, then *(i)* becomes the following:

\[q_1\,C(2|1)\,P_1(X)-q_2\,C(1|2)\,P_2(X)<0\ \Rightarrow \ \frac {P_1(X)}{P_2(X)}<\frac {q_2\,C(1|2)}{q_1\,C(2|1)}.\]

The region where \(q_1P_1(X)-q_2P_2(X)=0\) or \(q_1C(2|1)P_1(X)-q_2C(1|2)P_2(X)=0\) need not be empty and the probability over this set need not be zero. If

\[Pr\{q_1\,C(2|1)\,P_1(X)-q_2\,C(1|2)\,P_2(X)=0\ |\ \pi _i\}=0,\ i=1,2,\]

it can also be shown that the above Bayes procedure \(A=(A_1,A_2)\) is unique. This is stated as a theorem:

### Theorem 12.3.1

*Let* \(q_1\) *be the prior probability of drawing an observation X from the population* \(\pi _1\) *with probability/density function* \(P_1(X)\) *and let* \(q_2\) *be the prior probability of selecting an observation X from the population* \(\pi _2\) *with probability/density function* \(P_2(X)\)*. Let the cost or loss associated with misclassifying an observation from* \(\pi _1\) *as coming from* \(\pi _2\) *be C*(2|1) *and the cost of misclassifying an observation from* \(\pi _2\) *as originating from* \(\pi _1\) *be C*(1|2)*. Letting*

\[Pr\{q_1\,C(2|1)\,P_1(X)-q_2\,C(1|2)\,P_2(X)=0\ |\ \pi _i\}=0,\ i=1,2,\]

*the classification rule given by* \(A=(A_1,A_2)\) *of* (12.3.4) *is unique and best in the sense that it minimizes the probabilities of misclassification.*

### Example 12.3.1

Let \(\pi _1\) and \(\pi _2\) be two univariate exponential populations whose parameters are \(\theta _1\) and \(\theta _2\) with \(\theta _1\ne \theta _2\). Let the prior probability of drawing an observation from \(\pi _1\) be \(q_1=\frac {1}{2}\) and that of selecting an observation from \(\pi _2\) be \(q_2=\frac {1}{2}\). Let the costs or losses associated with misclassifications be *C*(2|1) = *C*(1|2). Compute the regions and probabilities of misclassification if (1): a single observation *x* is drawn; (2): iid observations \(x_1,\ldots ,x_n\) are drawn.

### Solution 12.3.1

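(1). In this case, one observation is drawn and the populations are

\[P_i(x)=\frac {1}{\theta _i}{\mathrm{e}}^{-\frac {x}{\theta _i}},\ 0\le x<\infty ,\ \theta _i>0,\ i=1,2.\]

Consider the following inequality on the support of the density:

\[\frac {P_1(x)}{P_2(x)}\ge \frac {q_2\,C(1|2)}{q_1\,C(2|1)}=1,\]

or equivalently,

\[\frac {\theta _2}{\theta _1}{\mathrm{e}}^{-x\big (\frac {1}{\theta _1}-\frac {1}{\theta _2}\big )}\ge 1.\]

On taking logarithms, we have

\[x\Big (\frac {1}{\theta _2}-\frac {1}{\theta _1}\Big )\ge \ln \frac {\theta _1}{\theta _2}.\]

Letting \(\theta _1>\theta _2\), the steps in the case \(\theta _1<\theta _2\) being parallel, we have

\[x\ge k,\quad k=\frac {\theta _1\theta _2}{\theta _1-\theta _2}\ln \frac {\theta _1}{\theta _2}.\]

Accordingly,

\[A_1:\ x\ge k\quad \text{and}\quad A_2:\ x<k.\]

The probabilities of misclassification are:

\[Pr\{2|1,A\}=\int _0^k\frac {1}{\theta _1}{\mathrm{e}}^{-\frac {x}{\theta _1}}\mathrm {d}x=1-{\mathrm{e}}^{-\frac {k}{\theta _1}}\quad \text{and}\quad Pr\{1|2,A\}=\int _k^{\infty }\frac {1}{\theta _2}{\mathrm{e}}^{-\frac {x}{\theta _2}}\mathrm {d}x={\mathrm{e}}^{-\frac {k}{\theta _2}}.\]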

### Solution 12.3.1

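(2). In this case, \(X'=(x_1,\ldots ,x_n)\) and

\[P_i(X)=\prod _{j=1}^n\frac {1}{\theta _i}{\mathrm{e}}^{-\frac {x_j}{\theta _i}}=\frac {1}{\theta _i^n}{\mathrm{e}}^{-\frac {u}{\theta _i}},\ i=1,2,\]

where \(u=\sum _{j=1}^nx_j\) is gamma distributed with the parameters \((n,\theta _i),\ i=1,2\). The density of *u* is then given by

\[\frac {u^{n-1}{\mathrm{e}}^{-\frac {u}{\theta _i}}}{\theta _i^n\,\varGamma (n)},\ 0\le u<\infty ,\ i=1,2.\]

Proceeding as above, for \(\theta _1>\theta _2\), \(A_1: u\ge k_1\) and \( A_2: u<k_1,\ k_1=\frac {\theta _1\theta _2}{\theta _1-\theta _2}\ln [\frac {\theta _1}{\theta _2}]^n=nk\) where *k* is as given in Solution 12.3.1(1). Consequently, the probabilities of misclassification are as follows:

\[Pr\{2|1,A\}=\int _0^{k_1}\frac {u^{n-1}{\mathrm{e}}^{-\frac {u}{\theta _1}}}{\theta _1^n\,\varGamma (n)}\mathrm {d}u\quad \text{and}\quad Pr\{1|2,A\}=\int _{k_1}^{\infty }\frac {u^{n-1}{\mathrm{e}}^{-\frac {u}{\theta _2}}}{\theta _2^n\,\varGamma (n)}\mathrm {d}u,\]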

where the integrals can be expressed in terms of incomplete gamma functions or determined by using integration by parts.
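The threshold and the two misclassification probabilities for the exponential case can be checked numerically. The sketch below assumes \(\theta _1>\theta _2\), equal priors and equal costs, and uses the closed form of the gamma distribution function for an integer shape parameter *n*. The function names and the values \(\theta _1=10\), \(\theta _2=5\) are illustrative assumptions.

```python
import math


def exp_threshold(theta1, theta2, n=1):
    """k_1 = n * theta1*theta2/(theta1 - theta2) * ln(theta1/theta2), theta1 > theta2."""
    if not theta1 > theta2 > 0:
        raise ValueError("this sketch assumes theta1 > theta2 > 0")
    return n * theta1 * theta2 / (theta1 - theta2) * math.log(theta1 / theta2)


def gamma_cdf_integer_shape(u, n, theta):
    """Distribution function of a gamma(n, theta) variable, n a positive integer."""
    v = u / theta
    partial_sum = sum(v ** j / math.factorial(j) for j in range(n))
    return 1.0 - math.exp(-v) * partial_sum


def misclassification_probs(theta1, theta2, n=1):
    """Return (k_1, Pr{2|1,A}, Pr{1|2,A}) for A_1: u >= k_1 and A_2: u < k_1."""
    k1 = exp_threshold(theta1, theta2, n)
    pr21 = gamma_cdf_integer_shape(k1, n, theta1)        # u < k_1 under pi_1
    pr12 = 1.0 - gamma_cdf_integer_shape(k1, n, theta2)  # u >= k_1 under pi_2
    return k1, pr21, pr12


# Illustrative parameter values (assumed here): theta1 = 10, theta2 = 5.
k, pr21, pr12 = misclassification_probs(10.0, 5.0)           # one observation
k3, pr21_3, pr12_3 = misclassification_probs(10.0, 5.0, 3)   # three observations
print(k, pr21, pr12)
print(k3, pr21_3, pr12_3)
```

For a single observation the gamma distribution function reduces to \(1-{\mathrm{e}}^{-u/\theta }\), so the same function covers both parts of the example.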

### Example 12.3.2

Assume that no prior probabilities or costs are involved. Suppose that in a certain clinic, the waiting time before a customer is attended to depends upon the manager on duty. If manager \(M_1\) is on duty, the expected waiting time is 10 minutes, and if manager \(M_2\) is on duty, the expected waiting time is 5 minutes. Assume that the waiting times are exponentially distributed with expected waiting time equal to \(\theta _i,\ i=1,2\). On a particular day (1): a customer had to wait 6 minutes before she was attended to; (2): three customers had to wait 6, 6 and 8 minutes, respectively. Which of \(M_1\) and \(M_2\) was more likely to be on duty on that day?

### Solution 12.3.2

(1). In this case, \(\theta _1=10\), \(\theta _2=5\) and the populations are exponential with parameters \(\theta _1\) and \(\theta _2\), respectively. Thus, \(k=\frac {\theta _1\theta _2}{\theta _1-\theta _2}\ln \frac {\theta _1}{\theta _2}=\frac {(10)(5)}{10-5}\ln \frac {10}{5}=10\ln 2\), \(\frac {k}{\theta _1}=\frac {10\ln 2}{10}=\ln 2, \ \frac {k}{\theta _2}=2\ln 2=\ln 4\), \({\mathrm{e}}^{-\frac {k}{\theta _1}}={\mathrm{e}}^{-\ln 2}=\frac {1}{2}=0.5\), and \({\mathrm{e}}^{-\frac {k}{\theta _2}}={\mathrm{e}}^{-\ln 4}=\frac {1}{4}=0.25\). The observed value is \(x=6<10(\ln 2)=10(0.69314718056)\approx 6.9315\). Accordingly, we classify *x* to \(M_2\), that is, the manager \(M_2\) was likely to be on duty. Thus,

\[Pr\{2|1,A\}=1-{\mathrm{e}}^{-\frac {k}{\theta _1}}=0.5\quad \text{and}\quad Pr\{1|2,A\}={\mathrm{e}}^{-\frac {k}{\theta _2}}=0.25.\]

### Solution 12.3.2

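(2). Here, *u* = 6 + 6 + 8 = 20, *n* = 3 and \(k_1=\frac {\theta _1\theta _2}{\theta _1-\theta _2}\,n\,\ln \frac {\theta _1}{\theta _2}=\frac {(10)(5)}{10-5}3\ln \frac {10}{5}=30\ln 2\). Since \(30\ln 2\approx 20.795\) and the observed value of *u* is 20, \(u<k_1\), and we assign the sample to \(\pi _2\) or to \(P_2(X)\) or \(M_2\), with \(\frac {k_1}{\theta _2}=\frac {30\ln 2}{5}=6\ln 2\) and \(\frac {k_1}{\theta _1}=\frac {30\ln 2}{10}=3\ln 2\). Thus,

\[Pr\{2|1,A\}=\int _0^{k_1}\frac {u^{2}{\mathrm{e}}^{-\frac {u}{10}}}{10^3\,\varGamma (3)}\mathrm {d}u=\frac {1}{2}\int _0^{3\ln 2}v^2{\mathrm{e}}^{-v}\mathrm {d}v\quad \text{and}\quad Pr\{1|2,A\}=\int _{k_1}^{\infty }\frac {u^{2}{\mathrm{e}}^{-\frac {u}{5}}}{5^3\,\varGamma (3)}\mathrm {d}u=\frac {1}{2}\int _{6\ln 2}^{\infty }v^2{\mathrm{e}}^{-v}\mathrm {d}v.\]

Integrating by parts,

\[\int v^2{\mathrm{e}}^{-v}\mathrm {d}v=-(v^2+2v+2){\mathrm{e}}^{-v}.\]

Then,

\[Pr\{2|1,A\}=1-\frac {1}{8}\Big [1+3\ln 2+\frac {(3\ln 2)^2}{2}\Big ]\approx 0.3448\quad \text{and}\quad Pr\{1|2,A\}=\frac {1}{64}\Big [1+6\ln 2+\frac {(6\ln 2)^2}{2}\Big ]\approx 0.2157.\]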

### Example 12.3.3

Let the two populations \(\pi _1\) and \(\pi _2\) be univariate normal with mean values \(\mu _1\) and \(\mu _2\), respectively, and the same variance \(\sigma ^2\), that is, \(P_1(x): N_1(\mu _1,\sigma ^2)\) and \(P_2(x): N_1(\mu _2,\sigma ^2)\). Let the prior probabilities of drawing an observation from these populations be \(q_1=\frac {1}{2}\) and \(q_2=\frac {1}{2}\), respectively, and the costs or losses involved with misclassification be *C*(1|2) = *C*(2|1). Determine the regions of misclassification and the corresponding probabilities of misclassification if (1): a single observation *x* is available; (2): iid observations \(x_1,\ldots ,x_n\) are available, from \(\pi _1\) or \(\pi _2\).

### Solution 12.3.3

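(1). If one observation is available,

\[P_i(x)=\frac {1}{\sigma \sqrt {2\pi }}{\mathrm{e}}^{-\frac {1}{2\sigma ^2}(x-\mu _i)^2},\ -\infty <x<\infty ,\ i=1,2.\]

Consider regions

\[A_1:\ \frac {P_1(x)}{P_2(x)}\ge \frac {q_2\,C(1|2)}{q_1\,C(2|1)}=1\quad \text{and}\quad A_2:\ \frac {P_1(x)}{P_2(x)}<1.\]

Now, note that

\[\ln \frac {P_1(x)}{P_2(x)}=-\frac {1}{2\sigma ^2}[(x-\mu _1)^2-(x-\mu _2)^2]=\frac {\mu _1-\mu _2}{\sigma ^2}\Big [x-\frac {1}{2}(\mu _1+\mu _2)\Big ]\ge 0\ \Rightarrow \ x\ge \frac {1}{2}(\mu _1+\mu _2)\ \text{for}\ \mu _1>\mu _2.\]

The probabilities of misclassification are the following for \(k=\frac {1}{2}(\mu _1+\mu _2)\) and \(\mu _1>\mu _2\):

\[Pr\{2|1,A\}=\int _{-\infty }^kP_1(x)\,\mathrm {d}x=\varPhi \Big (\frac {k-\mu _1}{\sigma }\Big )\quad \text{and}\quad Pr\{1|2,A\}=\int _k^{\infty }P_2(x)\,\mathrm {d}x=1-\varPhi \Big (\frac {k-\mu _2}{\sigma }\Big ),\]

where *Φ*(⋅) is the distribution function of a univariate standard normal random variable.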

### Solution 12.3.3

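(2). In this case, \(x_1,\ldots ,x_n\) are iid and \(X'=(x_1,\ldots ,x_n)\). The multivariate densities are

\[P_i(X)=\frac {1}{(\sigma \sqrt {2\pi })^n}{\mathrm{e}}^{-\frac {1}{2\sigma ^2}\sum _{j=1}^n(x_j-\mu _i)^2}=\frac {1}{(\sigma \sqrt {2\pi })^n}{\mathrm{e}}^{-\frac {1}{2\sigma ^2}[\sum _{j=1}^n(x_j-\bar {x})^2+n(\bar {x}-\mu _i)^2]},\ i=1,2,\]

where \(\bar {x}=\frac {1}{n}\sum _{j=1}^nx_j\). Hence for \(\mu _1>\mu _2\),

\[\frac {P_1(X)}{P_2(X)}={\mathrm{e}}^{-\frac {n}{2\sigma ^2}[(\bar {x}-\mu _1)^2-(\bar {x}-\mu _2)^2]}\ge 1.\]

Taking logarithms and simplifying, we have

\[A_1:\ \bar {x}\ge \frac {1}{2}(\mu _1+\mu _2)\quad \text{and}\quad A_2:\ \bar {x}<\frac {1}{2}(\mu _1+\mu _2),\]

where

\[\bar {x}\sim N_1\Big (\mu _i,\frac {\sigma ^2}{n}\Big )\ \text{under}\ \pi _i,\ i=1,2.\]

Therefore the probabilities of misclassification are the following:

\[Pr\{2|1,A\}=\varPhi \Big (\frac {\sqrt {n}(k-\mu _1)}{\sigma }\Big )\quad \text{and}\quad Pr\{1|2,A\}=1-\varPhi \Big (\frac {\sqrt {n}(k-\mu _2)}{\sigma }\Big ),\]

where \(k=\frac {1}{2}(\mu _1+\mu _2)\) and *Φ*(⋅) is the distribution function of a univariate standard normal random variable.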
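For the normal case, the misclassification probabilities reduce to evaluations of *Φ*. The Python sketch below assumes equal priors, equal costs and \(\mu _1>\mu _2\), and treats the sample mean of *n* iid observations as normal with standard deviation \(\sigma /\sqrt {n}\). The function names and the values \(\mu _1=5\), \(\mu _2=3\), \(\sigma =1\) are illustrative assumptions.

```python
import math


def std_normal_cdf(z):
    """Distribution function of a standard normal variable, via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_rule(mu1, mu2, sigma, n=1):
    """Return (k, Pr{2|1,A}, Pr{1|2,A}) for A_1: xbar >= k, A_2: xbar < k.

    Assumes mu1 > mu2, equal priors and equal costs, so k = (mu1 + mu2)/2.
    """
    if not mu1 > mu2:
        raise ValueError("this sketch assumes mu1 > mu2")
    k = 0.5 * (mu1 + mu2)
    se = sigma / math.sqrt(n)                     # standard deviation of xbar
    pr21 = std_normal_cdf((k - mu1) / se)         # xbar < k under pi_1
    pr12 = 1.0 - std_normal_cdf((k - mu2) / se)   # xbar >= k under pi_2
    return k, pr21, pr12


# Hypothetical parameter values: mu1 = 5, mu2 = 3, sigma = 1.
k, pr21_single, pr12_single = normal_rule(5.0, 3.0, 1.0)
_, pr21_four, pr12_four = normal_rule(5.0, 3.0, 1.0, n=4)
print(k, pr21_single, pr12_single)
print(pr21_four, pr12_four)
```

Increasing *n* shrinks the standard deviation of the sample mean, so both error probabilities decrease while the threshold *k* stays at the midpoint of the two means.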

### Example 12.3.4

Assume that no prior probabilities or costs are involved. A tuber crop called tapioca is planted by farmers. While farmer \(F_1\) applies a standard fertilizer to the soil to enhance the growth of the tapioca plants, farmer \(F_2\) does not apply any fertilizer and lets the plants grow naturally. At harvest time, a tapioca plant is pulled up with all its tubers attached to the bottom of the stem. The upper part of the stem is cut off and the lower part with its tubers is put out for sale. Tuber yield per plant, *x*, is measured by weighing the lower part of the stem with the tubers attached. It is known from past experience that *x* is normally distributed with mean value \(\mu _1=5\) and variance \(\sigma ^2=1\) for \(F_1\) type farms, that is, \(x\sim N_1(\mu _1=5,\sigma ^2=1)|F_1\), and that for \(F_2\) type farms, \(x\sim N_1(\mu _2=3,\sigma ^2=1)|F_2\), the weights being measured in kilograms. A road-side vendor is selling tapioca and his collection is either from \(F_1\) type farms or \(F_2\) type farms, but not both. A customer picked (1): one stem with its tubers attached weighing 4.2 kg; (2): a random sample of four stems respectively weighing 6, 4, 3 and 5 kg. To which type of farms will you classify the observations in (1) and (2)?

### Solution 12.3.4

(1). The decision is based on \(k=\frac {1}{2}(\mu _1+\mu _2)=\frac {1}{2}(5+3)=4\). In this case, the decision rule \(A=(A_1,A_2)\) is such that \(A_1: x\ge k\) and \(A_2: x<k\) for \(\mu _1>\mu _2\). Note that \(\frac {k-\mu _1}{\sigma }=k-\mu _1=4-5=-1\) and \(\frac {k-\mu _2}{\sigma }=(4-3)=1\). As the observed *x* is 4.2 > 4 = *k*, we classify *x* into \(P_1(X): N_1(\mu _1,1)\). Moreover,

\[Pr\{2|1,A\}=\varPhi \Big (\frac {k-\mu _1}{\sigma }\Big )=\varPhi (-1)\approx 0.1587\]

and

\[Pr\{1|2,A\}=1-\varPhi \Big (\frac {k-\mu _2}{\sigma }\Big )=1-\varPhi (1)\approx 0.1587.\]

### Solution 12.3.4

(2). In this case, \(\bar {x}=\frac {1}{4}(6+4+3+5)=4.5,\ n=4\), \(\bar {x}\sim N(\mu _i,\frac {1}{n}),\ i=1,2\), \(\frac {(k-\mu _1)}{\sigma /\sqrt {n}}=2(4-5)=-2 \) and \( \frac {(k-\mu _2)}{\sigma /\sqrt {n}}=2(4-3)=2\). Since the observed \(\bar {x}\) is 4.5 > 4 = *k*, we assign the sample to \(P_1(X): N(\mu _1,1)\), the criterion being \(A_1:\bar {x}\ge k\) and \(A_2:\bar {x}<k\). Additionally,

\[Pr\{2|1,A\}=\varPhi (-2)\approx 0.0228\]

and

\[Pr\{1|2,A\}=1-\varPhi (2)\approx 0.0228.\]

### Example 12.3.5

Let \(\pi _1\) and \(\pi _2\) be two *p*-variate real nonsingular normal populations sharing the same covariance matrix, \(\pi _1: N_p(\mu ^{(1)},\varSigma ),\ \varSigma >O\), and \(\pi _2: N_p(\mu ^{(2)},\varSigma ),\ \varSigma >O\), whose mean values are such that \(\mu ^{(1)}\ne \mu ^{(2)}\). Let the prior probabilities be \(q_1=q_2\) and the cost functions be *C*(1|2) = *C*(2|1). Consider a single *p*-vector *X* to be classified into \(\pi _1\) or \(\pi _2\). Determine the regions of misclassification and the corresponding probabilities.

### Solution 12.3.5

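The *p*-variate real normal densities are the following:

\[P_i(X)=\frac {1}{(2\pi )^{\frac {p}{2}}|\varSigma |^{\frac {1}{2}}}{\mathrm{e}}^{-\frac {1}{2}(X-\mu ^{(i)})'\varSigma ^{-1}(X-\mu ^{(i)})}\]

for \(i=1,2\), \(\varSigma >O\), \(\mu ^{(1)}\ne \mu ^{(2)}\). Consider the inequality

\[\frac {P_1(X)}{P_2(X)}\ge \frac {q_2\,C(1|2)}{q_1\,C(2|1)}=1.\]

Taking logarithms, we have

\[-\frac {1}{2}(X-\mu ^{(1)})'\varSigma ^{-1}(X-\mu ^{(1)})+\frac {1}{2}(X-\mu ^{(2)})'\varSigma ^{-1}(X-\mu ^{(2)})\ge 0\ \Rightarrow \ (\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}X-\frac {1}{2}(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}+\mu ^{(2)})\ge 0.\]

Let

\[u=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}X-\frac {1}{2}(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}+\mu ^{(2)}).\]

Then, *u* has a univariate normal distribution since it is a linear function of the components of *X*, which is a *p*-variate normal. Thus,

\[\mathrm {Var}(u)=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}\,\mathrm {Cov}(X)\,\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})=\varDelta ^2\]

where \(\varDelta ^2\) is Mahalanobis’ distance. The mean values of *u* under \(\pi _1\) and \(\pi _2\) are respectively,

\[E(u|\pi _1)=\frac {1}{2}\varDelta ^2\quad \text{and}\quad E(u|\pi _2)=-\frac {1}{2}\varDelta ^2,\]

so that

\[u\sim N_1\Big (\frac {1}{2}\varDelta ^2,\varDelta ^2\Big )\ \text{under}\ \pi _1\quad \text{and}\quad u\sim N_1\Big (-\frac {1}{2}\varDelta ^2,\varDelta ^2\Big )\ \text{under}\ \pi _2.\]

Accordingly, the regions of misclassification are

\[A_1:\ u\ge 0\quad \text{and}\quad A_2:\ u<0,\]

and the probabilities of misclassification are as follows:

\[Pr\{2|1,A\}=\varPhi \Big (\frac {0-\frac {1}{2}\varDelta ^2}{\varDelta }\Big )=\varPhi \Big (-\frac {1}{2}\varDelta \Big )\quad \text{and}\quad Pr\{1|2,A\}=1-\varPhi \Big (\frac {0+\frac {1}{2}\varDelta ^2}{\varDelta }\Big )=1-\varPhi \Big (\frac {1}{2}\varDelta \Big ),\]

where *Φ*(⋅) denotes the distribution function of a univariate standard normal variable.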

### Note 12.3.1

If no conditions are imposed on the prior probabilities, \(q_1\) and \(q_2\), or on the costs of misclassification, *C*(2|1) and *C*(1|2), then the regions are determined as \(A_1: u\ge k, \ k=\ln \frac {C(1|2)\, q_2}{C(2|1)\, q_1},\) and \(A_2: u<k\). In this case, the probabilities of misclassification will be \(\varPhi \big (\frac {k-\frac {1}{2}\varDelta ^2}{\varDelta }\big )\) and \(1-\varPhi \big (\frac {k+\frac {1}{2}\varDelta ^2}{\varDelta }\big ),\) respectively.
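A short numerical sketch may help tie the pieces together: the statistic *u*, the squared distance \(\varDelta ^2\) and the two misclassification probabilities for a general threshold *k*. The sketch below uses plain Python lists and a small Gaussian elimination routine so that it stays self-contained. All names and numerical values are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for j in range(c, p + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, p))
        x[i] = (m[i][p] - tail) / m[i][i]
    return x


def discriminant(x, mu1, mu2, sigma):
    """Return (u, Delta^2) for the observation x."""
    d = [a - b for a, b in zip(mu1, mu2)]
    b = solve(sigma, d)  # Sigma^{-1}(mu1 - mu2), Sigma being symmetric
    delta2 = sum(bi * di for bi, di in zip(b, d))
    u = sum(bi * (xi - 0.5 * (m1 + m2))
            for bi, xi, m1, m2 in zip(b, x, mu1, mu2))
    return u, delta2


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probs(delta2, k=0.0):
    """Pr{2|1,A} and Pr{1|2,A} for A_1: u >= k, A_2: u < k."""
    delta = math.sqrt(delta2)
    pr21 = std_normal_cdf((k - 0.5 * delta2) / delta)
    pr12 = 1.0 - std_normal_cdf((k + 0.5 * delta2) / delta)
    return pr21, pr12


# Hypothetical bivariate setting (not taken from the text).
mu1, mu2 = [2.0, 0.0], [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
u, delta2 = discriminant([1.5, 0.2], mu1, mu2, sigma)
pr21, pr12 = misclassification_probs(delta2)  # k = 0: equal priors and costs
print(u, delta2, pr21, pr12)
```

Passing a nonzero `k`, for instance \(\ln \frac {C(1|2)\,q_2}{C(2|1)\,q_1}\), shifts the boundary and makes the two error probabilities unequal.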

### Note 12.3.2

If the prior probabilities \(q_1\) and \(q_2\) are not known, we may assume that the two populations \(\pi _1\) and \(\pi _2\) are equally likely to be chosen or equivalently that \(q_1=q_2=\frac {1}{2}\), in which instance \(k=\ln \frac {C(1|2)}{C(2|1)}\). In general, the decisions are to assign the vector *X* at hand to \(\pi _1\) in the region \(A_1\) and to \(\pi _2\) in the region \(A_2\), where \(A_1: u\ge k\) and \( A_2: u<k, \ k=\ln \frac {q_2\ C(1|2)}{q_1\,C(2|1)}\) with \(q_1\), \(q_2\), *C*(2|1) and *C*(1|2) assumed to be known and

\[u=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}X-\frac {1}{2}(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}+\mu ^{(2)}),\]

whose first term, namely \((\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}X\), is known as the linear discriminant function, which is utilized to discriminate or to separate two *p*-variate populations, not necessarily normally distributed, having mean value vectors \(\mu ^{(1)}\) and \(\mu ^{(2)}\) and sharing the same covariance matrix \(\varSigma >O\).

### Example 12.3.6

Assume that no prior probabilities or costs are involved. Applicants to a certain training program are given tests to evaluate their aptitude for languages and aptitude for science. Let the test scores be denoted by \(x_1\) and \(x_2\), respectively, and let *X* be the bivariate vector with \(X'=[x_1,x_2]\). After completing the training program, their aptitudes are tested again. Let \({X^{(1)}}'=[x_1^{(1)},x_2^{(1)}]\) be the score vector in the group of successful trainees and let \({X^{(2)}}'=[x_1^{(2)},x_2^{(2)}]\) be the score vector in the group of unsuccessful trainees. From previous experience of conducting such tests over the years, it is known that \(X^{(1)}\sim N_2(\mu ^{(1)},\varSigma ),\ \varSigma >O\), and \(X^{(2)}\sim N_2(\mu ^{(2)},\varSigma ),\ \varSigma >O\), where

Then (1): one applicant taken at random before the training program started obtained the test scores \(X_0'=[4,1]\); (2): three applicants chosen at random before the training program started had the following scores:

In (1), classify \(X_0\) to \(\pi _1\) or \(\pi _2\) and in (2), classify the entire sample of three vectors into \(\pi _1\) or \(\pi _2\).

### Solution 12.3.6

Let us compute certain quantities which are needed to answer the questions:

Hence,

Since, in (1), the observed \(X_0'=[4,1]\), the observed *u* is \(u=2x_1-2x_2-4=8-2-4=2>0\) and we classify the observed \(X_0\) into \(\pi _1\), under which \(u\sim N_1(\frac {1}{2}\varDelta ^2,\varDelta ^2)\), the criterion being \(A_1: u\ge 0\) and \(A_2: u<0\). Thus,

When solving (2), the entire sample is to be classified. Proceeding as in the derivation of the criterion *u* in case (1), it is seen that for the problem at hand, \(X_0\) will be replaced by \(\bar {X}\), the average of the sample vectors or the sample mean value vector, and then *u* will become \(u_1=2\bar {x}_1-2\bar {x}_2-4\) where \(\bar {X}'=[\bar {x}_1,\bar {x}_2]\). Thus, we require the sample average:

This means that \(\bar {x}_1=\frac {12}{3}=4,\ \bar {x}_2=\frac {4}{3}\), and the observed \(u_1=2\bar {x}_1-2\bar {x}_2-4=8-\frac {8}{3}-4=\frac {4}{3}>0\). Hence, we classify the whole sample to \(\pi _1\) as the criterion is \(A_1: u_1\ge 0\) and \(A_2: u_1<0\). Since \(\bar {X}\) is normally distributed with \(E[\bar {X}]=\mu ^{(i)}\) and \( \mathrm {Cov}(\bar {X})=\frac {1}{n}\varSigma ,\ i=1,2,\) where *n* is the sample size, the densities of \(u_1\) under \(\pi _1\) and \(\pi _2\) are the following:

Moreover,

and

## 12.4. Linear Discriminant Function

Let *X* be a *p* × 1 vector and *B* a *p* × 1 arbitrary constant vector, \(B'=(b_1,\ldots ,b_p)\). Consider the arbitrary linear function *w* = *B′X*. Then, the mean value and variance of *w* are the following: *E*(*w*) = *B′E*(*X*) and Var(*w*) = Var(*B′X*) = *B′*Cov(*X*)*B* = *B′ΣB* where *Σ* > *O* is the covariance matrix of *X*. Suppose that *X* could be from a *p*-variate real population \(\pi _1\) with mean value vector \(\mu ^{(1)}\) or from the *p*-variate real population \(\pi _2\) with mean value vector \(\mu ^{(2)}\). Suppose that both the populations \(\pi _1\) and \(\pi _2\) have the same covariance matrix *Σ* > *O*. Then, a measure of discrimination or separation between \(\pi _1\) and \(\pi _2\) is \(|B'\mu ^{(1)}-B'\mu ^{(2)}|\) as measured in terms of the standard deviation \(\sqrt {\mathrm {Var}(w)}\) for determining the best choice of *B*. Taking the squared distance, let

\[\delta =\frac {[B'\mu ^{(1)}-B'\mu ^{(2)}]^2}{\mathrm {Var}(w)}=\frac {B'(\mu ^{(1)}-\mu ^{(2)})(\mu ^{(1)}-\mu ^{(2)})'B}{B'\varSigma B} \tag{12.4.1}\]

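since the square of a scalar quantity is the scalar quantity times its transpose, \(B'(\mu ^{(1)}-\mu ^{(2)})\) being a scalar quantity. Accordingly, we will maximize *δ* as specified in (12.4.1). This will be achieved by selecting a particular *B* in such a way that *δ* attains a maximum which corresponds to the maximum distance between \(\pi _1\) and \(\pi _2\). Without any loss of generality, we may assume that *B′ΣB* = 1, so that only the numerator in (12.4.1) need be maximized, subject to the condition *B′ΣB* = 1. Let *λ* denote a Lagrangian multiplier and

\[\eta =B'(\mu ^{(1)}-\mu ^{(2)})(\mu ^{(1)}-\mu ^{(2)})'B-\lambda (B'\varSigma B-1).\]

Let us take the partial derivative of *η* with respect to the vector *B* and equate the result to a null vector (the reader may refer to Chap. 1 for the derivative of a scalar variable with respect to a vector variable):

\[\frac {\partial \eta }{\partial B}=O\ \Rightarrow \ 2(\mu ^{(1)}-\mu ^{(2)})(\mu ^{(1)}-\mu ^{(2)})'B-2\lambda \varSigma B=O. \qquad (i)\]

Note that \((\mu ^{(1)}-\mu ^{(2)})'B\equiv \alpha \) is a scalar quantity and *B* is a specific vector coming from *(i)* and hence we may write *(i)* as

\[B=\frac {\alpha }{\lambda }\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})\equiv c\,\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)}) \qquad (ii)\]

where *c* is a real scalar quantity. Observe that *δ* as given in (12.4.1) will remain the same if *B* is multiplied by any nonzero scalar quantity. Thus, we may take *c* = 1 in *(ii)* without any loss of generality. The linear discriminant function then becomes

\[u=B'X=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}X \tag{12.4.2}\]

and when *B′X* is as given in (12.4.2), *δ* as defined in (12.4.1) can be expressed as follows:

\[\delta =\frac {[(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})]^2}{(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}\varSigma \varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})}=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})=\varDelta ^2.\]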

This *δ* is also the generalized squared distance between the vectors \(\mu ^{(1)}\) and \(\mu ^{(2)}\) or the squared distance between the vectors \(\varSigma ^{-\frac {1}{2}}\mu ^{(1)}\) and \(\varSigma ^{-\frac {1}{2}}\mu ^{(2)}\) in the mathematical sense (Euclidean distance). Hence Mahalanobis’ distance between two *p*-variate populations with different mean value vectors and the same covariance matrix is a measure of discrimination or separation between the populations, and the linear discriminant function is given in (12.4.2). For an observed value *X*, when \(\mu ^{(1)}\), \(\mu ^{(2)}\) and *Σ* are known, the discriminant function is compared with its value at the midpoint \(\frac {1}{2}(\mu ^{(1)}+\mu ^{(2)})\) of the two mean value vectors, as in Solution 12.3.5: if \(u=(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}X-\frac {1}{2}(\mu ^{(1)}-\mu ^{(2)})'\varSigma ^{-1}(\mu ^{(1)}+\mu ^{(2)})>0\), then we choose population \(\pi _1\) with mean value \(\mu ^{(1)}\), and if *u* < 0, then we select population \(\pi _2\) with mean value \(\mu ^{(2)}\). When *u* = 0, both \(\pi _1\) and \(\pi _2\) are equally favored.
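The maximizing property of \(B=\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})\) can be checked numerically on a small case. The sketch below works with 2 × 2 matrices stored as nested lists. The function names and the numbers are hypothetical and serve only to illustrate that the ratio *δ* is largest in the direction \(\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})\) and is unchanged by rescaling *B*.

```python
def inverse_2x2(s):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s[1][1] / det, -s[0][1] / det],
            [-s[1][0] / det, s[0][0] / det]]


def separation(b, mu1, mu2, sigma):
    """delta = [B'(mu1 - mu2)]^2 / (B' Sigma B)."""
    d = [m1 - m2 for m1, m2 in zip(mu1, mu2)]
    numerator = sum(bi * di for bi, di in zip(b, d)) ** 2
    size = len(b)
    denominator = sum(b[i] * sigma[i][j] * b[j]
                      for i in range(size) for j in range(size))
    return numerator / denominator


# Hypothetical mean vectors and common covariance matrix.
mu1, mu2 = [2.0, 0.0], [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]

inv = inverse_2x2(sigma)
d = [m1 - m2 for m1, m2 in zip(mu1, mu2)]
b_opt = [sum(inv[i][j] * d[j] for j in range(2)) for i in range(2)]

delta_opt = separation(b_opt, mu1, mu2, sigma)                     # equals Delta^2
delta_scaled = separation([3.0 * v for v in b_opt], mu1, mu2, sigma)
delta_other = separation([1.0, 0.0], mu1, mu2, sigma)              # another direction
print(delta_opt, delta_scaled, delta_other)
```

Any other direction, such as the first coordinate axis used above, gives a smaller ratio, in line with the Lagrangian argument.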

### Example 12.4.1

In a small township, there is only one grocery store. The town is laid out on the East and West sides of the sole main road. We will refer to the villagers as East-enders and West-enders. These townspeople shop only once a week for groceries. The grocery store owner found that the East-enders and West-enders have somewhat different buying habits. Consider the following items: \(x_1\) = grain items in kilograms, \(x_2\) = vegetable items in kilograms, \(x_3\) = dairy products in kilograms, and let \([x_1,x_2,x_3]=X'\) where *X* is the vector of weekly purchases. Then, the expected quantities bought by the East-enders and West-enders are \(E(X)=\mu ^{(1)}\) and \(E(X)=\mu ^{(2)}\), respectively, with the common covariance matrix *Σ* > *O*. From past history, the grocery store owner determined that

Consider the following situations: (1): A customer walked in and bought \(x_1=1\) kg of grain items, \(x_2=2\) kg of vegetable items, and \(x_3=1\) kg of dairy products. Is she likely to be an East-ender or West-ender? (2): Another customer bought the three types of items in the quantities (10, 1, 1), respectively. Is she more likely to be an East-ender than a West-ender?

### Solution 12.4.1

The inverse of the covariance matrix, \(\mu ^{(1)}-\mu ^{(2)}\), as well as other relevant quantities are the following:

In (1), *X′* = (1, 2, 1) and since

we classify this customer as a West-ender from her buying pattern. In (2),

so that, given her purchases, this customer is classified as an East-ender.

## 12.5. Classification When the Population Parameters are Unknown

We now consider the classification problem involving two populations *π*
_{1} and *π*
_{2} for which the parameters of the corresponding densities are unknown. Since the structure of the parameters in these general densities *P*
_{1}(*X*) and *P*
_{2}(*X*) is not known, we will present a specific example: Consider the two *p*-variate normal populations of Example 12.3.3. Let *π*
_{1} : *N*
_{p}(*μ*
^{(1)}, *Σ*) and *π*
_{2} : *N*
_{p}(*μ*
^{(2)}, *Σ*), which share the same positive definite covariance matrix *Σ*. Suppose that we have a single observation vector *X* to be classified into *π*
_{1} or *π*
_{2}. When the parameters *μ*
^{(1)}, *μ*
^{(2)} and *Σ* are unknown, we will have to estimate them from some training samples. But, for a problem such as classifying skeletal remains, one does not have samples from the respective ancestral groups. Nevertheless, one can obtain training samples from living racial groups, and so, secure estimates of the parameters involved. Assume that we have simple random samples of sizes *n*
_{1} and *n*
_{2} from *N*
_{p}(*μ*
^{(1)}, *Σ*) and *N*
_{p}(*μ*
^{(2)}, *Σ*), respectively. Denote the sample values by \(X_1^{(1)},\ldots ,X_{n_1}^{(1)},\) and \(X_1^{(2)},\ldots ,X_{n_2}^{(2)}\), and let \(\bar {X}^{(1)}\) and \(\bar {X}^{(2)}\) be the sample averages. That is,

Let the sample matrices be denoted by bold-faced letters where the *p* × *n*
_{1} matrix **X**
^{(1)} and the *p* × *n*
_{2} matrix **X**
^{(2)} are the sample matrices and let \(\bar {\mathbf {X}}^{\mathbf {(1)}}\) and \(\bar {\mathbf {X}}^{\mathbf {(2)}}\) be the matrices of sample means. Thus, we have

Then, the sample sum of products matrices are

The unbiased estimators of *μ*
^{(1)}, *μ*
^{(2)} and *Σ* are respectively \(\bar {X}^{(1)},\bar {X}^{(2)}\) and \(\frac {S}{n_{(2)}}=\frac {S_1+S_2}{n_{(2)}}, \ n_{(2)}=n_1+n_2-2\). The criteria for classification, the regions, the statistic, and so on, are available from Example 12.3.3. That is,

where

Note that *q*
_{1} and *q*
_{2} are the prior probabilities of selecting the populations *π*
_{1} and *π*
_{2} and *C*(1|2) and *C*(2|1) are the costs or loss associated with misclassification. We will assume that *q*
_{1}, *q*
_{2}, *C*(1|2) and *C*(2|1) are all known but the parameters *μ*
^{(1)}, *μ*
^{(2)} and *Σ* are estimated by their unbiased estimators. Denoting the estimator of *u* as *v*, we obtain the following criterion, assuming that we have one *p*-vector *X* to be classified into *π*
_{1} or *π*
_{2}:

As it turns out, it already proves quite challenging to obtain the exact distribution of *v* as given in (12.5.4) where *X* is a single *p*-vector either from *π*
_{1} or from *π*
_{2}.
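The plug-in criterion can be sketched as follows, assuming that *v* has the same form as the population criterion of Example 12.3.3 with the parameters replaced by their unbiased estimators, and that the cut-off is the logarithm of *C*(1|2)*q*<sub>2</sub>/(*C*(2|1)*q*<sub>1</sub>). The function names and the two small training samples are hypothetical.

```python
from fractions import Fraction
from math import log

def solve(a, b):
    """Solve a y = b by Gauss-Jordan elimination (a square, nonsingular)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for c in range(n):
        piv = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(n):
            if r != c:
                m[r] = [vr - m[r][c] * vc for vr, vc in zip(m[r], m[c])]
    return [row[n] for row in m]

def mean(sample):
    n = len(sample)
    return [sum(Fraction(x[i]) for x in sample) / n for i in range(len(sample[0]))]

def sum_of_products(sample):
    xbar = mean(sample)
    p = len(xbar)
    return [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in sample)
             for j in range(p)] for i in range(p)]

def criterion_v(x, sample1, sample2):
    """v = [x - (xbar1 + xbar2)/2]' (S/n_(2))^{-1} (xbar1 - xbar2)."""
    m1, m2 = mean(sample1), mean(sample2)
    n2 = len(sample1) + len(sample2) - 2
    s1, s2 = sum_of_products(sample1), sum_of_products(sample2)
    sigma_hat = [[(a + b) / n2 for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    y = solve(sigma_hat, [a - b for a, b in zip(m1, m2)])
    return sum((Fraction(xi) - (a + b) / 2) * yi
               for xi, a, b, yi in zip(x, m1, m2, y))

def classify(x, sample1, sample2, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    """Return 1 if v >= k and 2 otherwise, with k = ln(C(1|2) q2 / (C(2|1) q1))."""
    k = log((c12 * q2) / (c21 * q1))
    return 1 if criterion_v(x, sample1, sample2) >= k else 2

# Hypothetical training samples (not data from the chapter)
SAMPLE1 = [(4, 1), (6, 3), (5, 1), (5, 3)]
SAMPLE2 = [(0, 1), (2, 1), (1, 0), (1, 2)]
```

With equal priors and equal costs the cut-off is zero, so an observation is assigned to the population whose sample mean lies on the same side of the estimated discriminant hyperplane.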

### 12.5.1. Some asymptotic results

Before considering asymptotic properties of *u* and *v* as defined in Sect. 12.4, let us recall certain results obtained in earlier chapters. Let the *p* × 1 vectors *Y*
_{j}, *j* = 1, …, *n*, be iid vectors from some population for which *E*[*Y*
_{j}] = *μ* and Cov(*Y*
_{j}) = *Σ* > *O*, *j* = 1, …, *n*. Let the sample matrix, the matrix of sample means, wherein the sample mean is \(\bar {Y}=\frac {1}{n}\sum _{j=1}^nY_j\), and the sample sum of products matrix *S* be as follows:

where *J* is an *n* × 1 vector of unities. Since a matrix of the form \(\mathbf {Y}-\bar {\mathbf {Y}}\) is present, we may let *μ* = *O* without any loss of generality in the following computations since \(Y_j-\bar {Y}=(Y_j-\mu )-(\bar {Y}-\mu )\). Note that \(B=B'=I_n-\frac {1}{n}JJ'=B^2\) and hence, *B* is idempotent and of rank *n* − 1. Since *B* = *B′*, there exists an orthonormal matrix *Q* such that *Q′BQ* = diag(1, …, 1, 0) = *D*, *QQ′* = *I*, *Q′Q* = *I*, the diagonal elements being 1’s and 0 since *B* = *B*
^{2} and *B* is of rank *n* − 1. Then,

Consider \(\varSigma ^{-\frac {1}{2}}S\varSigma ^{-\frac {1}{2}}\). Let \(U_j=\varSigma ^{-\frac {1}{2}}Y_j,\ j=1,\ldots ,n,\) where *Y*
_{j} is the *j*-th column of **Y** and it is assumed that *μ* = *O*. Observe that *E*[*U*
_{j}] = *O*, Cov(*U*
_{j}) = *I*
_{p}, *j* = 1, …, *n*, and the *U*
_{j}’s are uncorrelated. Letting **U** = [*U*
_{1}, …, *U*
_{n}], *(ii)* implies that

Denoting by *U*
_{(j)} the *j*-th row of **U**, it follows that the elements of *U*
_{(j)} are iid uncorrelated real scalar variables with mean value zero and variance 1. Consider the transformation *V*
_{(j)} = *U*
_{(j)}
*Q*; then *E*[*V*
_{(j)}] = *O* and Cov[*V*
_{(j)}] = *I*
_{n}, *j* = 1, …, *p*, the *V*
_{(j)}’s being uncorrelated. Let **V** be the *p* × *n* matrix whose rows are *V*
_{(j)}, *j* = 1, …, *p*. Let the columns of **V** be *V*
_{j}, *j* = 1, …, *n*, that is, **V** = [*V*
_{1}, …, *V*
_{n}]. Then, *(iii)* implies the following:

Additionally,

when *Σ* is finite with respect to any norm of *Σ*, namely ∥*Σ*∥ < *∞*. Appealing to the extended Chebyshev inequality, this shows that the unbiased estimator of *μ*, namely \(\bar {Y}\), converges to *μ* in probability, that is,

An unbiased estimator of *Σ* is \(\hat {\varSigma }=\frac {S}{n-1}\) with \(E[\hat {\varSigma }]=\varSigma \). Will \(\hat {\varSigma }\) also converge to *Σ* in probability when *n* →*∞*? In order to establish this, we require the covariance structure of the elements in *S*. For arbitrary populations, it is somewhat difficult to verify this result; however, it is rather straightforward for normal populations. We will examine this aspect next.
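The role of the idempotent matrix *B* can be checked numerically. The sketch below assumes that the sample sum of products matrix can be written as *S* = **Y***B***Y***′* with *B* = *I* − *JJ′*/*n*, and uses a small hypothetical 2 × 4 sample matrix; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

def centering_matrix(n):
    """B = I_n - (1/n) J J', J being the n x 1 vector of unities."""
    return [[Fraction(int(i == j)) - Fraction(1, n) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Hypothetical p x n sample matrix with p = 2 and n = 4
Y = [[1, 2, 3, 6],
     [0, 1, 1, 2]]
n = len(Y[0])
B = centering_matrix(n)

# Sample sum of products matrix and the unbiased estimate S/(n - 1)
S = matmul(matmul(Y, B), transpose(Y))
sigma_hat = [[v / (n - 1) for v in row] for row in S]

# For an idempotent matrix, the rank equals the trace
trace_B = sum(B[i][i] for i in range(n))
```

Here the trace of *B* equals *n* − 1 = 3, in agreement with the stated rank.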

### 12.5.2. Another method

Let the *p* × 1 vectors *X*
_{j}, *j* = 1, …, *n*, be a simple random sample of size *n* from a population having a real *N*
_{p}(*μ*, *Σ*), *Σ* > *O*, distribution. Letting *S* denote the sample sum of products matrix, *S* will be distributed as a Wishart matrix with *m* = *n* − 1 degrees of freedom and *Σ* > *O* as its parameter matrix, whose density is

the reader may also refer to real matrix-variate gamma density discussed in Chap. 5. This is usually written as *S* ∼ *W*
_{p}(*m*, *Σ*), *Σ* > *O*. Letting \(S_{(*)}=\varSigma ^{-\frac {1}{2}}S\varSigma ^{-\frac {1}{2}}\), *S*
_{(∗)} ∼ *W*
_{p}(*m*, *I*). Consider the transformation *S*
_{(∗)} = *TT′* where *T* = (*t*
_{ij}) is a lower triangular matrix whose diagonal elements are positive, that is, *t*
_{ij} = 0, *i* < *j*, and *t*
_{jj} > 0, *j* = 1, …, *p*. It was explained in Chaps. 1 and 3 that the *t*
_{ij}’s are mutually independently distributed with the *t*
_{ij}’s such that *i* > *j* distributed as standard normal variables and \(t_{jj}^2,\) as a chisquare variable having *m* − (*j* − 1) degrees of freedom. The *j*-th diagonal element of *TT′* is of the form \(t_{j1}^2+\cdots +t_{jj-1}^2+t_{jj}^2\) where \(t_{jk}^2\sim \chi ^2_1\), for *k* = 1, …, *j* − 1, that is, the square of a real standard normal variable. Thus, the *j*-th diagonal element is distributed as \(\chi ^2_1+\cdots +\chi ^2_1+\chi ^2_{m-(j-1)}\sim \chi ^2_m\) since all the individual chisquare variables are independently distributed, in which case the resulting number of degrees of freedom is the sum of the degrees of freedom of the chisquares. Now, noting that for a \(\chi ^2_{\nu }\),

the expected value of each of the diagonal elements in *TT′*, which are the diagonal elements in *S*
_{(∗)}, will be *m* = *n* − 1. The non-diagonal elements in *TT′* result from a sum of terms of the form *t*
_{ik}
*t*
_{ii}, *k* < *i*, whose expected value is *E*[*t*
_{ik}
*t*
_{ii}] = *E*[*t*
_{ik}]*E*[*t*
_{ii}]; but since *E*[*t*
_{ik}] = 0, *i* > *k*, all the non-diagonal elements will have zero as their expected values. Accordingly,

and the estimator \(\hat {\varSigma }=\frac {S}{m}\) is unbiased for *Σ*, *m* being equal to *n* − 1. Now, let us examine the covariance structure of *S*
_{(∗)}. Let *W* denote a single vector comprising all the distinct elements of *S*
_{(∗)} = *TT′* and consider its covariance structure. In this vector of order \(\frac {p(p+1)}{2}\times 1\), convert all the original *t*
_{ij}’s and *t*
_{jj}’s in terms of standard normal and chisquare variables. Let \(z_1,\ldots ,z_{\frac {p(p-1)}{2}}\) be the standard normal variables and *y*
_{1}, …, *y*
_{p} denote the chisquare variables. Then, each element of Cov(*W*) = *E*{[*W* − *E*(*W*)][*W* − *E*(*W*)]*′*} will be a sum of terms of the type

which happens to be a linear function of *m*. Our estimator being \(\hat {\varSigma }=\frac {S}{m}=\varSigma ^{\frac {1}{2}}\frac {S_{(*)}}{m}\varSigma ^{\frac {1}{2}}\), the covariance structure of \(\frac {S_{(*)}}{m}\) which is \(\frac {1}{m^2}\mathrm {Cov}(W)\) tends to *O* when *m* →*∞*, since each element of Cov(*W*) is of the form *a* *m* + *b* where *a* and *b* are real scalars, so that \(\frac {a\,m+b}{m^2}\to 0\) as *m* →*∞*, or equivalently, as *n* →*∞* since *m* = *n* − 1. Thus, it follows from an extended version of Chebyshev’s inequality that

These last two results are stated next as a theorem.
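The degrees-of-freedom bookkeeping for the diagonal elements of *TT′* can be sketched as follows. The code relies on the standard chisquare moments, mean *ν* and variance 2*ν* for *ν* degrees of freedom, and on the independence of the components; the function names are hypothetical.

```python
from fractions import Fraction

def chisq_moments(df):
    """Mean and variance of a chisquare variable with df degrees of freedom."""
    return df, 2 * df

def diagonal_moments(m, j):
    """Moments of the j-th diagonal element of TT': (j - 1) independent
    chisquares with 1 degree of freedom plus one with m - (j - 1)."""
    parts = [chisq_moments(1)] * (j - 1) + [chisq_moments(m - (j - 1))]
    return sum(mean for mean, _ in parts), sum(var for _, var in parts)

def scaled_variance(m, j):
    """Variance of the j-th diagonal element of S_(*)/m."""
    return Fraction(diagonal_moments(m, j)[1], m * m)
```

Whatever the value of *j*, the mean is *m* and the variance is 2*m*, so that the variance of the scaled element is 2/*m*, which tends to zero as *m* increases.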

### Theorem 12.5.1

*Let the p *× 1 *vectors X*
_{j}, *j *= 1, …, *n*, *be iid with E*[*X*
_{j}] =* μ and* Cov(*X*
_{j}) =* Σ*, *j *= 1, …, *n. Assume that Σ is finite in the sense that* ∥*Σ*∥ <* ∞. Then, letting* \(\bar {X}=\frac {1}{n}\sum _{j=1}^nX_j\) *denote the sample mean,*

*Further, letting X*
_{j} ∼* N*
_{p}(*μ*, *Σ*), *Σ *>* O,*

Let us now examine the criterion in (12.5.4). In this case, we can obtain an asymptotic distribution of the criterion *v* for large *n*
_{(2)} or when *n*
_{(2)} →*∞* in the sense that *n*
_{1} →*∞* and *n*
_{2} →*∞*. When *n*
_{(2)} →*∞*, we have \(\bar {X}^{(1)}\to \mu ^{(1)}, \ \bar {X}^{(2)}\to \mu ^{(2)}\) and \(\frac {S}{n_{(2)}}\to \varSigma \), so that the criterion *v* in (12.5.4) becomes

which is nothing but *u* as specified in (12.3.7) with the densities \(N_1(\frac {1}{2}\varDelta ^2,\varDelta ^2)\) in *π*
_{1} and \(N_1(-\frac {1}{2}\varDelta ^2,\varDelta ^2)\) in *π*
_{2}. Hence, the following result:

### Theorem 12.5.2

*When n*
_{1} →*∞ and n*
_{2} →*∞*, *the criterion v provided in* (12.5.4) *becomes u as specified in* (12.5.7) *with the univariate normal densities* \(N_1(\frac {1}{2}\varDelta ^2,\varDelta ^2)\) *in π*
_{1} *and* \(N_1(-\frac {1}{2}\varDelta ^2,\varDelta ^2)\) *in π*
_{2}, *where Δ*
^{2} *is Mahalanobis’ distance given in* (12.3.8)*. We classify X, the observation vector at hand, to π*
_{1} *when X *∈* A*
_{1} *and, to π*
_{2} *when X *∈* A*
_{2} *where A*
_{1} :* u *≥* k and A*
_{2} :* u *<* k with* \(k=\ln \frac {C(1|2)\,q_2}{C(2|1)\,q_1}\)
*, q*
_{1} *and q*
_{2} *being the prior probabilities of selecting the populations π*
_{1} *and π*
_{2}
*, respectively, and C*(2|1) *and C*(1|2) *denoting the costs or loss associated with misclassification.*

In a practical situation, when *n*
_{1} and *n*
_{2} are large, we may replace *Δ*
^{2} in Theorem 12.5.2 by the corresponding sample value \(n_{(2)}(\bar {X}^{(1)}-\bar {X}^{(2)})'S^{-1}(\bar {X}^{(1)}-\bar {X}^{(2)})\) where *S* = *S*
_{1} + *S*
_{2} and *n*
_{(2)} = *n*
_{1} + *n*
_{2} − 2 and utilize the criterion *u* as specified in (12.5.7) to classify the given vector *X* into *π*
_{1} or *π*
_{2}. It is assumed that *q*
_{1}, *q*
_{2}, *C*(2|1) and *C*(1|2) are available.
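Given the limiting normal distributions of Theorem 12.5.2, approximate misclassification probabilities follow from the standard normal distribution function. The sketch below assumes the regions *u* ≥ *k* and *u* < *k* stated in the theorem; the function name and the default arguments are hypothetical, and the value supplied for *Δ*<sup>2</sup> would in practice be the sample analogue.

```python
from math import log, sqrt
from statistics import NormalDist

def misclassification_probs(delta_sq, q1=0.5, q2=0.5, c12=1.0, c21=1.0):
    """Return (P(2|1), P(1|2)) when u ~ N(+delta_sq/2, delta_sq) in pi_1
    and u ~ N(-delta_sq/2, delta_sq) in pi_2, with cut-off k."""
    k = log((c12 * q2) / (c21 * q1))
    delta = sqrt(delta_sq)  # standard deviation of u
    p21 = NormalDist(0.5 * delta_sq, delta).cdf(k)         # u < k although X is from pi_1
    p12 = 1.0 - NormalDist(-0.5 * delta_sq, delta).cdf(k)  # u >= k although X is from pi_2
    return p21, p12

# Illustrative value of delta_sq (hypothetical)
p21, p12 = misclassification_probs(4.0)
```

With equal priors and equal costs, *k* = 0 and the two probabilities coincide, both being equal to the standard normal distribution function evaluated at −*Δ*/2.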

### 12.5.3. A new sample from *π*
_{1} or *π*
_{2}

As in Examples 12.3.1 and 12.3.2, suppose that a simple random sample of size *n*
_{3} is available either from *π*
_{1} : *N*
_{p}(*μ*
^{(1)}, *Σ*) or from *π*
_{2} : *N*
_{p}(*μ*
^{(2)}, *Σ*), *Σ* > *O*. Letting the new sample be \(X_1^{(3)},\ldots ,X_{n_3}^{(3)}\), the *p* × *n*
_{3} sample matrix, the sample mean \(\bar {X}^{(3)}=\frac {1}{n_3}\sum _{j=1}^{n_3}X_j^{(3)}\), the *p* × *n*
_{3} matrix of sample means and the sample sum of products matrix are the following:

An unbiased estimate from this third sample is \(\hat {\varSigma }=\frac {S_3}{n_3-1},\) as \(E[\hat {\varSigma }]=\varSigma \). A pooled estimate of *Σ* obtained from the three samples is

Then, the criterion corresponding to (12.3.4) changes to:

where

with *S* = *S*
_{1} + *S*
_{2} + *S*
_{3}, *n*
_{(3)} = *n*
_{1} + *n*
_{2} + *n*
_{3} − 3 and \(\bar {X}^{(3)}\) being the sample average from the third sample, which either comes from *π*
_{1} : *N*
_{p}(*μ*
^{(1)}, *Σ*) or *π*
_{2} : *N*
_{p}(*μ*
^{(2)}, *Σ*), *Σ* > *O*. Thus, the classification rule is the following:

*w* being as defined in (12.5.11). That is, classify the new sample into *π*
_{1} if *w* ≥ *k* and, into *π*
_{2} if *w* < *k*.

As was explained in Sect. 12.5.2, as *n*
_{j} →*∞*, *j* = 1, 2, \(\bar {X}^{(i)}\to \mu ^{(i)},\ i=1,2,\) and although *n*
_{3} usually remains finite, as *n*
_{1} →*∞* and *n*
_{2} →*∞*, we have *n*
_{(3)} →*∞* and \(\frac {S}{n_{(3)}}\to \varSigma \). Accordingly, the criterion *w* as given in (12.5.11) converges to *w*
_{1} for large values of *n*
_{1} and *n*
_{2}, where

Compared to *u* as specified in (12.3.7), the only difference is that *X* associated with *u* is replaced by \(\bar {X}^{(3)}\) in *w*
_{1}. Hence, the variance in *u* will be multiplied by \(\frac {1}{n_3}\), and the asymptotic distributions will be as follows:

as *n*
_{1} →*∞* and *n*
_{2} →*∞*.

### Theorem 12.5.3

*Consider two populations π*
_{1} :* N*
_{p}(*μ*
^{(1)}, *Σ*) *and π*
_{2} :* N*
_{p}(*μ*
^{(2)}, *Σ*), *Σ *>* O, and simple random samples of respective sizes n*
_{1} *and n*
_{2} *from these two populations. Suppose that a simple random sample of size n*
_{3} *is available, either from π*
_{1} *or π*
_{2}
*. For classifying the third sample into π*
_{1} *or π*
_{2}
*, the criterion to be utilized is w as given in* (12.5.11)*. Then, the asymptotic distribution of w, when n*
_{i} →*∞*, *i *= 1, 2, *is that of w*
_{1} *specified in* (12.5.13) *and the regions of classification are as given in* (12.5.12).

In a practical situation, when the sample sizes *n*
_{1} and *n*
_{2} are large, one may replace *Δ*
^{2} by its sample analogue, and then use (12.5.14) to reach a decision. As it turns out, it proves quite difficult to derive the exact density of *w*.

### Example 12.5.1

A certain milk collection and distribution center collects and sells the milk supplied by local farmers to the community, the balance, if any, being dispatched to a nearby city. In that locality, there are two types of cows. Some farmers only keep Jersey cows and others, only Holstein cows. Samples of the same quantities of milk are taken and the following characteristics are evaluated: *x*
_{1}, the fat content, *x*
_{2}, the glucose content, and *x*
_{3}, the protein content. It is known that *X′* = (*x*
_{1}, *x*
_{2}, *x*
_{3}) is normally distributed as *X* ∼ *N*
_{3}(*μ*
^{(1)}, *Σ*), *Σ* > *O*, for Jersey cows, and *X* ∼ *N*
_{3}(*μ*
^{(2)}, *Σ*), *Σ* > *O*, for Holstein cows, with *μ*
^{(1)}≠*μ*
^{(2)}, the covariance matrices *Σ* being assumed identical. These parameters, which are not known, are estimated on the basis of 100 milk samples from Jersey cows and 102 samples from Holstein cows, all the samples being of equal volume. The following are the summarized data with our standard notations, where *S*
_{1} and *S*
_{2} are the sample sums of products matrices:

Three farmers just brought in their supply of milk and (1): a sample denoted by *X*
_{1} is collected from the first farmer’s supply and evaluated; (2) a sample, *X*
_{2}, is taken from a second farmer’s supply and evaluated; (3) a set of 5 random samples are collected from a third farmer’s supply, the sample average being \(\bar {X}\). The data is

Classify *X*
_{1}, *X*
_{2} and the sample of size 5 as coming from either Jersey or Holstein cows.

### Solution 12.5.1

The following preliminary calculations are needed:

Then,

where *w* is given in (12.5.11). To answer (1), we substitute *X*
_{1} for *X* in *w*. That is, *w* at *X*
_{1} is 3(2) + (1) − (1) − 4 = 2 > 0. Hence, we assign *X*
_{1} to Jersey cows. For answering (2), we replace *X* in *w* by *X*
_{2}, that is, 3(1) + (1) − (2) − 4 = −2 < 0. Thus, we assign *X*
_{2} to Holstein cows. For answering (3), we replace *X* in *w* by \(\bar {X}\). That is, 3(2) + (2) − (1) − 4 = 3 > 0. Accordingly, we classify this sample as coming from Jersey cows.
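The three evaluations can be reproduced with a few lines of code. The coefficients (3, 1, −1), the constant −4 and the three observation vectors are read off from the arithmetic displayed in this solution, so they should be regarded as inferred values; the function names are hypothetical.

```python
def w(x):
    """Linear criterion as it appears in the worked arithmetic."""
    x1, x2, x3 = x
    return 3 * x1 + x2 - x3 - 4

def breed(x):
    """Jersey when w >= 0, Holstein otherwise (zero cut-off, as in the solution)."""
    return "Jersey" if w(x) >= 0 else "Holstein"

# Observation vectors inferred from the displayed computations
X1, X2, XBAR = (2, 1, 1), (1, 1, 2), (2, 2, 1)
results = [(w(x), breed(x)) for x in (X1, X2, XBAR)]
```

The resulting criterion values are 2, −2 and 3, which lead to the decisions stated in the solution.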

## 12.6. Maximum Likelihood Method of Classification

As before, let *π*
_{1} be the *p*-variate real normal population *N*
_{p}(*μ*
^{(1)}, *Σ*), *Σ* > *O*, with the simple random sample \(X_1^{(1)},\ldots ,X_{n_1}^{(1)}\) of size *n*
_{1} drawn from that population, and *π*
_{2} : *N*
_{p}(*μ*
^{(2)}, *Σ*), *Σ* > *O*, with the simple random sample \(X_1^{(2)},\ldots ,X_{n_2}^{(2)}\) of size *n*
_{2} so distributed. A *p*-vector *X* at hand is to be classified into *π*
_{1} or *π*
_{2}. Let the sample means and the sample sums of products matrices be \(\bar {X}^{(1)},\ \bar {X}^{(2)},\ S_1\) and *S*
_{2}. Then, the problem of classification of *X* into *π*
_{1} or *π*
_{2} can be stated in terms of testing a hypothesis of the following type: *X* and \(X_1^{(1)},\ldots ,X^{(1)}_{n_1}\) are from *N*
_{p}(*μ*
^{(1)}, *Σ*) and \(X_1^{(2)},\ldots ,X_{n_2}^{(2)}\) are from *π*
_{2} constitutes the null hypothesis, versus, the alternative *X* and \(X_1^{(2)},\ldots ,X_{n_2}^{(2)}\) are from *N*
_{p}(*μ*
^{(2)}, *Σ*) and \(X_1^{(1)},\ldots ,X_{n_1}^{(1)}\) are from *N*
_{p}(*μ*
^{(1)}, *Σ*). Let the likelihood functions under the null and alternative hypotheses be denoted as *L*
_{0} and *L*
_{1}, respectively, where

where

and *S*
_{1} and *S*
_{2} are the sample sums of products matrices from the samples \(X_1^{(1)},\ldots ,\) \(X_{n_1}^{(1)}\) and \(X_1^{(2)},\ldots ,X_{n_2}^{(2)}\), respectively. Referring to Chaps. 1 and 3 for vector/matrix derivatives and the maximum likelihood estimators (MLE’s) of the parameters of normal populations, the MLE’s are obtained from *(i)*, the estimators/estimates being denoted with a hat. The MLE’s under *L*
_{0} are the following:

observing that the scalar quantity

By substituting the MLE’s in *L*
_{0}, we obtain the maximum of *L*
_{0}:

Under *L*
_{1}, the MLE’s are

Thus,

Hence,

If *z*
_{1} ≥ 1, then \(\max L_0\ge \max L_1\), which means that the likelihood of *X* coming from *π*
_{1} is greater than or equal to the likelihood of *X* originating from *π*
_{2}. Hence, we may classify *X* into *π*
_{1} if *z*
_{1} ≥ 1 and classify *X* into *π*
_{2} if *z*
_{1} < 1. In other words,

If we let *S* = *S*
_{1} + *S*
_{2}, then *z*
_{1} ≥ 1 ⇒

We can re-express this last inequality in a more convenient form. Expanding the following partitioned determinant in two different ways, we have the following, where *S* is *p* × *p* and *Y* is *p* × 1:

observing that 1 + *Y* *′S*
^{−1}
*Y* is a scalar quantity. Accordingly, *z*
_{1} ≥ 1 means that

That is,

Hence, the regions of classification are the following:

Thus, classify *X* into *π*
_{1} when *z*
_{3} ≥ 0 and, *X* into *π*
_{2} when *z*
_{3} < 0. For large *n*
_{1} and *n*
_{2}, some interesting results ensue. When *n*
_{1} →*∞* and *n*
_{2} →*∞*, we have \(\frac {n_i}{n_i+1}\to 1,\ i=1,2,\ \bar {X}^{(i)}\to \mu ^{(i)},\ i=1,2,\) and \(\frac {S}{n_1+n_2-2}\to \varSigma \). Then, *z*
_{3} converges to *z*
_{4} where

where *u* is the same criterion *u* as that specified in (12.5.7). Hence, we have the following result:

### Theorem 12.6.1

*Let* \(X_1^{(1)},\ldots ,X_{n_1}^{(1)}\) *be a simple random sample of size n*
_{1} *from π*
_{1} :* N*
_{p}(*μ*
^{(1)}, *Σ*), *Σ *>* O and* \(X_1^{(2)},\ldots ,X_{n_2}^{(2)}\) *be a simple random sample of size n*
_{2} *from the population π*
_{2} :* N*
_{p}(*μ*
^{(2)}, *Σ*), *Σ *>* O. Letting X be a vector at hand to be classified into π*
_{1} *or π*
_{2}
*, when n*
_{1} →*∞ and n*
_{2} →*∞, the likelihood ratio criterion for classification is the following: Classify X into π*
_{1} *if u *≥ 0 *and, X into π*
_{2} *if u *< 0 *or equivalently, A*
_{1} :* u *≥ 0 *and A*
_{2} :* u *< 0 *where* \(u=[X-\frac {1}{2}(\mu ^{(1)}+\mu ^{(2)})]'\varSigma ^{-1}(\mu ^{(1)}-\mu ^{(2)})\) *whose density is* \(u\sim N_1(\frac {1}{2}\varDelta ^2,\varDelta ^2)\) *when X is assigned to π*
_{1} *and* \(u\sim N_1(-\frac {1}{2}\varDelta ^2,\varDelta ^2)\) *when X is assigned to π*
_{2}
*, with Δ*
^{2} = (*μ*
^{(1)} −* μ*
^{(2)})*′Σ*
^{−1}(*μ*
^{(1)} −* μ*
^{(2)}) *denoting Mahalanobis’ distance.*

The likelihood ratio criterion for classification specified in (12.6.5) can also be given the following interpretation: For large values of *n*
_{1} and *n*
_{2}, the criterion reduces to the following: (*X* − *μ*
^{(2)})*′Σ*
^{−1}(*X* − *μ*
^{(2)}) − (*X* − *μ*
^{(1)})*′Σ*
^{−1}(*X* − *μ*
^{(1)}) ≥ 0 where (*X* − *μ*
^{(2)})*′Σ*
^{−1}(*X* − *μ*
^{(2)}) is the generalized distance between *X* and *μ*
^{(2)}, and (*X* − *μ*
^{(1)})*′Σ*
^{−1}(*X* − *μ*
^{(1)}) is the generalized distance between *X* and *μ*
^{(1)}, which means that the generalized distance between *X* and *μ*
^{(2)} is larger than the generalized distance between *X* and *μ*
^{(1)} when *u* > 0. That is, *X* is closer to *μ*
^{(1)} than *μ*
^{(2)} and accordingly, we classify *X* into *π*
_{1}, which is the case *u* > 0. Similarly, if *X* is closer to *μ*
^{(2)} when compared to the distance to *μ*
^{(1)}, we assign *X* to *π*
_{2}, which is the case *u* < 0. The case *u* = 0 is also included in the first inequality, but only for convenience. However, when *Pr*{*u* = 0|*π*
_{i}, *i* = 1, 2} = 0, replacing *u* > 0 by *u* ≥ 0 is fully justified.
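The equivalence between the distance comparison and the sign of *u* can be verified directly: the difference of the two generalized distances equals 2*u*, with *u* as defined in Theorem 12.6.1. The following sketch uses exact rational arithmetic; the function names and the parameter values are hypothetical.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a y = b by Gauss-Jordan elimination (a square, nonsingular)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for c in range(n):
        piv = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(n):
            if r != c:
                m[r] = [vr - m[r][c] * vc for vr, vc in zip(m[r], m[c])]
    return [row[n] for row in m]

def generalized_distance(x, mu, sigma):
    """(x - mu)' Sigma^{-1} (x - mu)."""
    d = [Fraction(a) - Fraction(b) for a, b in zip(x, mu)]
    return sum(di * yi for di, yi in zip(d, solve(sigma, d)))

def u(x, mu1, mu2, sigma):
    """[x - (mu1 + mu2)/2]' Sigma^{-1} (mu1 - mu2)."""
    y = solve(sigma, [Fraction(a) - Fraction(b) for a, b in zip(mu1, mu2)])
    return sum((Fraction(xi) - (Fraction(a) + Fraction(b)) / 2) * yi
               for xi, a, b, yi in zip(x, mu1, mu2, y))

# Hypothetical parameters and observation
MU1, MU2, SIGMA, X = [3, 1], [1, 1], [[2, 1], [1, 2]], [2, 3]
gap = generalized_distance(X, MU2, SIGMA) - generalized_distance(X, MU1, SIGMA)
```

In this illustration the gap is negative, so *X* is closer to the second mean value vector and would be assigned to the second population.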

### Note 12.6.1

The reader may refer to Example 12.3.3 for an illustration of the computations involved in connection with the probabilities of misclassification. For large values of *n*
_{1} and *n*
_{2}, one has the *z*
_{4} of *(viii)* as an approximation to the *u* appearing in the same equation as well as the *u* of (12.5.7) or that of Example 12.3.3. In order to apply Theorem 12.6.1, one needs to know the parameters *μ*
^{(1)}, *μ*
^{(2)} and *Σ*. When they are not available, one may substitute for them the corresponding estimates \(\bar {X}^{(1)},\ \bar {X}^{(2)}\) and \(\hat {\varSigma }=\frac {S_1+S_2}{n_1+n_2-2}\) when *n*
_{1} and *n*
_{2} are large. Then, the approximate probabilities of misclassification can be determined.

### Example 12.6.1

Redo the problem considered in Example 12.5.1 by making use of the maximum likelihood procedure.

### Solution 12.6.1

In order to answer the questions, we need to compute

In this case, \(\frac {n_1}{n_1+1}=\frac {100}{101}\approx 1\) and \(\frac {n_2}{n_2+1}=\frac {102}{103}\approx 1\) and hence, the criterion *z*
_{4} is the same as *w* of (12.5.4) and the decisions arrived at in Example 12.5.1 will remain unchanged in this example. Since *n*
_{1} and *n*
_{2} are large, we have reasonably accurate approximations of the parameters as

so that the probabilities of misclassification can be evaluated by using their estimates. The approximate distributions are then given by

where \(\hat {\varDelta }^2=(\bar {X}^{(1)}-\bar {X}^{(2)})'(\frac {S}{n_1+n_2-2})^{-1}(\bar {X}^{(1)}-\bar {X}^{(2)})\). From the computations done in Example 12.5.1, we have

As well, *A*
_{1} : *w* ≥ 0 and *A*
_{2} : *w* < 0. For the data pertaining to (1) of Example 12.5.1, we have *w* > 0 and *X*
_{1} is assigned to *π*
_{1}. Observing that *w* → *u* of (12.5.7),

In Example 12.5.1, the observed vector provided for (2) is classified into *π*
_{2} since *w* < 0. Thus, the probability of making the right decision is *P*(2|2, *A*) = *Pr*{*u* < 0|*π*
_{2}}≈ 0.76 and the probability of misclassification is *P*(2|1, *A*) = *Pr*{*u* < 0|*π*
_{1}}≈ 0.24. Given the data related to (3) of Example 12.5.1, the only difference is that the distributions in *π*
_{1} and *π*
_{2} will be slightly different, the mean values remaining the same but the variance \(\hat {\varDelta }^2\) being replaced by \(\hat {\varDelta }^2/n\) where *n* = 5. The computations are similar to those provided for (1), the sample mean being assigned to *π*
_{1} in this case.
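The effect of replacing the variance by its *n*-th part can be sketched numerically. The value 2 used for the estimated squared distance is an assumption made for illustration only: it was chosen because it reproduces the probability of about 0.76 quoted in this solution. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_correct(delta_sq, n=1):
    """Pr{criterion >= 0 | pi_1} when the criterion is N(delta_sq/2, delta_sq/n)."""
    sd = sqrt(delta_sq / n)
    return 1.0 - NormalDist(0.5 * delta_sq, sd).cdf(0.0)

DELTA_SQ = 2.0                          # assumed value, for illustration only
p_single = prob_correct(DELTA_SQ)       # one observation vector
p_mean5 = prob_correct(DELTA_SQ, n=5)   # average of a sample of size 5
```

Under this assumption, basing the decision on the average of five observations raises the probability of a correct classification from roughly 0.76 to roughly 0.94.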

## 12.7. Classification Involving *k* Populations

Consider the *p*-variate populations *π*
_{1}, …, *π*
_{k} and let *X* be a *p*-vector at hand to be classified into one of these *k* populations. Let *q*
_{1}, …, *q*
_{k} be the prior probabilities of selecting these populations, *q*
_{j} > 0, *j* = 1, …, *k*, with *q*
_{1} + ⋯ + *q*
_{k} = 1. Let the cost of misclassification of a *p*-vector belonging to *π*
_{i} being improperly classified into *π*
_{j} be *C*(*j*|*i*) for *i*≠*j* so that *C*(*i*|*i*) = 0, *i* = 1, …, *k*. A decision rule *A* = (*A*
_{1}, …, *A*
_{k}) determines subspaces *A*
_{j} ⊂ *R*
_{p}, *j* = 1, …, *k*, with *A*
_{i} ∩ *A*
_{j} = *ϕ* (the empty set) for all *i*≠*j*. Let the probability/density functions associated with the *k* populations be *P*
_{j}(*X*), *j* = 1, …, *k*, respectively. Let *P*(*j*|*i*, *A*) = *Pr*{*X* ∈ *A*
_{j}|*π*
_{i} : *P*
_{i}(*X*), *A*} = probability of an observation coming from or belonging to the population *π*
_{i} or originating from the probability/density function *P*
_{i}(*X*), being improperly assigned to *π*
_{j} or misclassified as coming from *P*
_{j}(*X*), and the cost associated with this misclassification be denoted by *C*(*j*|*i*). Under the rule *A* = (*A*
_{1}, …, *A*
_{k}), the probabilities of correctly classifying and misclassifying an observed vector are the following, assuming that the *P*
_{j}(*X*)*′*s, *j* = 1, …, *k*, are densities:

where *P*(*i*|*i*, *A*) is a probability of achieving a correct classification, that is, of assigning an observation *X* to *π*
_{i} when the population is actually *π*
_{i}, and *P*(*j*|*i*, *A*) is the probability of an observation *X* coming from *π*
_{i} being misclassified as originating from *π*
_{j}. Consider a *p*-vector *X* at hand. What is then the probability that this *X* came from *P*
_{i}(*X*), given that *X* is an observation vector from one of the populations *π*
_{1}, …, *π*
_{k}? This is in fact a conditional statement involving

Suppose that for specific *i* and *j*, the conditional probability

This is tantamount to presuming that the likeliness of *X* originating from *P*
_{i}(*X*) is greater than or equal to that of *X* coming from *P*
_{j}(*X*). In this case, we would like to assign *X* to *π*
_{i} rather than *π*
_{j}. If *(ii)* holds for all *j* = 1, …, *k*, *j*≠*i*, then we classify *X* into *π*
_{i}. Equation *(ii)* for *j* = 1, …, *k*, *j*≠*i*, implies that

Accordingly, we adopt (12.7.1) as a decision rule *A* = (*A*
_{1}, …, *A*
_{k}). This decision rule corresponds to the following: When *X* ∈ *A*
_{1} ⊂ *R*
_{p} or *X* falls in *A*
_{1}, then *X* is classified into *π*
_{1}, when *X* ∈ *A*
_{2}, then *X* is assigned to *π*
_{2}, and so on. What is the expected cost of an *X* belonging to *π*
_{i} being misclassified into *π*
_{j} under some decision rule *B* = (*B*
_{1}, …, *B*
_{k}), *B*
_{j} ⊂ *R*
_{p}, *j* = 1, …, *k*, *B*
_{i} ∩ *B*
_{j} = *ϕ*, *i*≠*j*, for all *i* and *j*? This is *q*
_{i}
*P*
_{i}(*X*)*C*(*j*|*i*) ≡ *E*
_{i}(*B*). The expected cost of an *X* belonging to *π*
_{j} being misclassified into *π*
_{i} under the same decision rule *B* is *E*
_{j}(*B*) = *q*
_{j}
*P*
_{j}(*X*)*C*(*i*|*j*). If *E*
_{i}(*B*) < *E*
_{j}(*B*), then we favor *P*
_{i}(*X*) over *P*
_{j}(*X*) as it is always desirable to minimize the expected cost in any procedure or decision. If *E*
_{i}(*B*) < *E*
_{j}(*B*) for all *j* = 1, …, *k*, *j*≠*i*, then *P*
_{i}(*X*) or *π*
_{i} is preferred over all other populations to which *X* could be assigned. Note that

for *j* = 1, …, *k*, *j*≠*i*, so that *(iii)* is the situation resulting from the following misclassification rule: if

we classify *X* into *π*
_{i} or equivalently, *X* ∈ *A*
_{i}, which is the decision rule *A* = (*A*
_{1}, …, *A*
_{k}). Thus, the decision rule *B* in *(iii)* is identical to *A*. Observe that when *C*(*i*|*j*) = *C*(*j*|*i*), (12.7.2) reduces to (12.7.1). The decision rule *A* = (*A*
_{1}, …, *A*
_{k}) in (12.7.1) is seen to yield the maximum probability of assigning an observation *X* at hand to *π*
_{i} compared to the probability of assigning *X* to any other *π*
_{j}, *j* = 1, …, *k*, *j*≠*i*, when the costs of misclassification are equal. As well, it follows from (12.7.2) that the decision rule *A* = (*A*
_{1}, …, *A*
_{k}) gives the minimum expected cost associated with assigning the observation *X* at hand to *π*
_{i} compared to assigning *X* to any other population *π*
_{j}, *j* = 1, …, *k* , *j*≠*i*.
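For equal misclassification costs, the rule can be sketched as follows, assuming that it amounts to assigning *X* to the population with the largest product of prior probability and density, that is, the largest conditional probability given *X*. The univariate normal densities, the prior probabilities and the function names are hypothetical choices made for illustration.

```python
from statistics import NormalDist

def posteriors(x, priors, densities):
    """Conditional probabilities q_i P_i(x) / sum_j q_j P_j(x)."""
    weights = [q * f(x) for q, f in zip(priors, densities)]
    total = sum(weights)
    return [w / total for w in weights]

def classify(x, priors, densities):
    """Index (starting at 1) of the population with the largest q_i P_i(x)."""
    post = posteriors(x, priors, densities)
    return max(range(len(post)), key=post.__getitem__) + 1

# Hypothetical setting: three univariate normal populations with unit variance
DENSITIES = [NormalDist(0, 1).pdf, NormalDist(2, 1).pdf, NormalDist(5, 1).pdf]
PRIORS = [1 / 3, 1 / 3, 1 / 3]
```

For instance, `classify(0.4, PRIORS, DENSITIES)` returns 1 and `classify(3.9, PRIORS, DENSITIES)` returns 3, the observation being past the midpoint 3.5 between the second and third means.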

### 12.7.1. Classification when the populations are real Gaussian

Let the populations be *p*-variate real normal, that is, *π*
_{j} ∼ *N*
_{p}(*μ*
^{(j)}, *Σ*), *Σ* > *O*, *j* = 1, …, *k*, with different mean value vectors but the same covariance matrix *Σ* > *O*. Let the density of *π*
_{j} be denoted by *P*
_{j}(*X*) ≃ *N*
_{p}(*μ*
^{(j)}, *Σ*), *Σ* > *O*. A vector *X* at hand is to be assigned to one of the *π*
_{i}’s, *i* = 1, …, *k*. In Sect. 12.3 or Example 12.3.3, the decision rule involves two populations. Letting the two populations be *π*
_{i} : *P*
_{i}(*X*) and *π*
_{j} : *P*
_{j}(*X*) for specific *i* and *j*, it was determined that the decision rule consists of classifying *X* into *π*
_{i} if \(\ln \frac {P_i(X)}{P_j(X)}\ge \ln \rho , \) \(\rho =\frac {q_jC(i|j)}{q_iC(j|i)},\) with *ρ* = 1 so that \(\ln \rho =0\) whenever *C*(*i*|*j*) = *C*(*j*|*i*) and *q*
_{i} = *q*
_{j}. When \(\ln \rho =0\), we have seen that the decision rule is to classify the *p*-vector *X* into *π*
_{i} or *P*
_{i}(*X*) if *u*
_{ij}(*X*) ≥ 0 and to assign *X* to *P*
_{j}(*X*) or *π*
_{j} if *u*
_{ij}(*X*) < 0, where

Now, on applying the result obtained in *(iv)* to (12.7.1) and (12.7.2), one arrives at the following decision rule:

with \(\ln \rho =0\) occurring when *q*
_{i} = *q*
_{j} and *C*(*i*|*j*) = *C*(*j*|*i*).
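A sketch of this rule for *k* Gaussian populations with a common covariance matrix follows. It assumes that *u*<sub>ij</sub>(*X*) is the two-population criterion applied to the pair (*i*, *j*) and that *X* is assigned to the first population for which all the inequalities hold; the function names, the cost layout `cost[i][j]` standing for *C*(*i*|*j*), and the parameter values are hypothetical.

```python
from fractions import Fraction
from math import log

def solve(a, b):
    """Solve a y = b by Gauss-Jordan elimination (a square, nonsingular)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for c in range(n):
        piv = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(n):
            if r != c:
                m[r] = [vr - m[r][c] * vc for vr, vc in zip(m[r], m[c])]
    return [row[n] for row in m]

def u(i, j, x, mus, sigma):
    """u_ij(x) = [x - (mu_i + mu_j)/2]' Sigma^{-1} (mu_i - mu_j), 0-based i, j."""
    y = solve(sigma, [Fraction(a) - Fraction(b) for a, b in zip(mus[i], mus[j])])
    return sum((Fraction(xv) - (Fraction(a) + Fraction(b)) / 2) * yv
               for xv, a, b, yv in zip(x, mus[i], mus[j], y))

def classify(x, mus, sigma, priors=None, cost=None):
    """Return the 1-based index i such that u_ij(x) >= ln(rho_ij) for all j != i."""
    k = len(mus)
    priors = priors or [1.0 / k] * k
    for i in range(k):
        def threshold(j):
            c_ij = 1.0 if cost is None else cost[i][j]  # C(i|j)
            c_ji = 1.0 if cost is None else cost[j][i]  # C(j|i)
            return log((priors[j] * c_ij) / (priors[i] * c_ji))
        if all(u(i, j, x, mus, sigma) >= threshold(j) for j in range(k) if j != i):
            return i + 1
    return None

# Hypothetical mean value vectors and common covariance matrix
MUS = [[0, 0], [4, 0], [0, 4]]
SIGMA = [[1, 0], [0, 1]]
```

With equal priors and costs all the thresholds vanish, and the rule assigns *X* to the population whose mean value vector is nearest in generalized distance.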

### Note 12.7.1

What will interchanging *i* and *j* in *u*
_{ij}(*X*) entail? Note that, as defined, *u*
_{ij}(*X*) involves the terms (*μ*
^{(i)} − *μ*
^{(j)}) = −(*μ*
^{(j)} − *μ*
^{(i)}) and (*μ*
^{(i)} + *μ*
^{(j)}), the latter being unaffected by the interchange of *μ*
^{(i)} and *μ*
^{(j)}. Hence, for all *i* and *j*,

When the underlying population is *X* ∼ *N*
_{p}(*μ*
^{(i)}, *Σ*), \(E[u_{ij}(X)|\pi _i]=\frac {1}{2}\varDelta _{ij}^2\), which implies that \(E[u_{ji}|\pi _i]=-\frac {1}{2}\varDelta _{ij}^2=-E[u_{ij}(X)|\pi _i]\) where \(\varDelta _{ij}^2=(\mu ^{(i)}-\mu ^{(j)})'\varSigma ^{-1}(\mu ^{(i)}-\mu ^{(j)})\).

### Note 12.7.2

For computing the probabilities of correctly classifying and misclassifying an observed vector, certain assumptions regarding the distributions associated with the populations *π*
_{j}, *j* = 1, …, *k*, are needed, the normality assumption being the most convenient one.

### Example 12.7.1

A certain milk collection and distribution center collects and sells the milk supplied by local farmers to the community, the balance, if any, being dispatched to a nearby city. In that locality, there are three dairy cattle breeds, namely, Jersey, Holstein and Guernsey, and each farmer only keeps one type of cows. Samples are taken and the following characteristics are evaluated in grams per liter: *x*
_{1}, the fat content, *x*
_{2}, the glucose content, and *x*
_{3}, the protein content. It has been determined that *X′* = (*x*
_{1}, *x*
_{2}, *x*
_{3}) is normally distributed as *X* ∼ *N*
_{3}(*μ*
^{(1)}, *Σ*) for Jersey cows, *X* ∼ *N*
_{3}(*μ*
^{(2)}, *Σ*) for Holstein cows and *X* ∼ *N*
_{3}(*μ*
^{(3)}, *Σ*) for Guernsey cows, with a common covariance matrix *Σ* > *O*, where

(1): A farmer brought in his supply of milk from which one liter was collected. The three variables were evaluated, the result being \(X_0^{\prime }=(2,3,4)\). (2): Another one liter sample was taken from a second farmer’s supply and it was determined that the vector of the resulting measurements was \(X_1^{\prime }=(2,2,2)\). No prior probabilities or costs are involved. Which breed of dairy cattle is each of these farmers likely to own?

### Solution 12.7.1

Our criterion is based on *u*
_{ij}(*X*) where

Let us evaluate the various quantities of interest:

Hence,

In order to answer (1), we substitute *X*
_{0} for *X* and first evaluate *u*
_{12}(*X*
_{0}) and *u*
_{13}(*X*
_{0}) to determine whether they are ≥ 0. Since \(u_{12}(X_0)=\tfrac {1}{3}(2)-(3)-2(4)+\tfrac {11}{2}<0\), the condition is violated and hence we need not check for *u*
_{13}(*X*
_{0}) ≥ 0. Thus, *X*
_{0} is not in *A*
_{1}. Now, consider \(u_{21}(X_0)=-\tfrac {1}{3}(2)+3+2(4)-\tfrac {11}{2}>0\) and \(u_{23}(X_0)=-\tfrac {1}{3}(2)-(3)-2(4)-\tfrac {17}{2}<0\); again the condition is violated and we deduce that *X*
_{0} is not in *A*
_{2}. Finally, we verify *A*
_{3}: *u*
_{31}(*X*
_{0}) = 2(3) + 2(4) − 14 = 0 and \(u_{32}(X_0)=\tfrac {1}{3}(2)+(3)+2(4)-\tfrac {17}{2}>0\). Thus, *X*
_{0} ∈ *A*
_{3}, that is, we conclude that the sample milk came from Guernsey cows.

For answering (2), we substitute \(X_1\) for \(X\) in \(u_{ij}(X)\). Noting that \(u_{12}(X_1)=\tfrac{1}{3}(2)-(2)-2(2)+\tfrac{11}{2}>0\) and \(u_{13}(X_1)=-2(2)-4(2)+14>0\), we can surmise that \(X_1\in A_1\), that is, the sample milk came from Jersey cows. Let us verify \(A_2\) and \(A_3\) to ascertain that no mistake has been made in the calculations. Since \(u_{21}(X_1)<0\), \(X_1\) is not in \(A_2\), and since \(u_{31}(X_1)<0\), \(X_1\) is not in \(A_3\). This completes the computations.
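The sign checks carried out in this solution can be sketched in a few lines of Python. The coefficients below are read off from the evaluations of \(u_{12}\), \(u_{13}\) and \(u_{32}\) quoted in the solution; the function names `u` and `classify` and the dictionary layout are illustrative choices, not part of the original text.

```python
from fractions import Fraction as F

# Coefficients (a1, a2, a3, c) of u_ij(X) = a1*x1 + a2*x2 + a3*x3 + c,
# as read off from the evaluations in Solution 12.7.1.
U = {
    (1, 2): (F(1, 3), F(-1), F(-2), F(11, 2)),
    (1, 3): (F(0), F(-2), F(-4), F(14)),
    (2, 3): (F(-1, 3), F(-1), F(-2), F(17, 2)),
}
BREEDS = {1: "Jersey", 2: "Holstein", 3: "Guernsey"}


def u(i, j, x):
    """Evaluate u_ij(x), using u_ji = -u_ij for the pairs that are not stored."""
    if (i, j) in U:
        *a, c = U[(i, j)]
        return sum(ak * xk for ak, xk in zip(a, x)) + c
    return -u(j, i, x)


def classify(x, k=3):
    """Return the first i such that u_ij(x) >= 0 for every j != i."""
    for i in range(1, k + 1):
        if all(u(i, j, x) >= 0 for j in range(1, k + 1) if j != i):
            return i
    return None


for x in [(2, 3, 4), (2, 2, 2)]:
    print(x, "->", BREEDS[classify(x)])
```

Exact rational arithmetic is used so that a value lying on a boundary, \(u_{ij}(X)=0\), is not misjudged because of rounding.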

### 12.7.2. Some distributional aspects

For computing the probabilities of correctly classifying and misclassifying an observation, we require the distributions of our criterion \(u_{ij}(X)\). Let the populations be normally distributed, that is, \(\pi_j\sim N_p(\mu^{(j)},\varSigma),\ \varSigma>O\), with the same covariance matrix \(\varSigma\) for all \(k\) populations, \(j=1,\ldots,k\). Then, under the decision rule \(A=(A_1,\ldots,A_k)\), the probability of achieving a correct classification when \(X\) is assigned to \(\pi_i\) is the following:

\[P(i|i,A)=\int_{A_i}P_i(X)\,\mathrm{d}X \tag{12.7.5}\]

where \(\mathrm{d}X=\mathrm{d}x_1\wedge\ldots\wedge\mathrm{d}x_p\) and the integral is actually a multiple integral. But \(A_i\) is defined by the inequalities \(u_{i1}(X)\ge 0,\ u_{i2}(X)\ge 0,\ldots,u_{ik}(X)\ge 0\), where \(u_{ii}(X)\) is excluded. This is the case when no prior probabilities and costs are involved or when the prior probabilities are equal and the cost functions are identical. Otherwise, the region is \(A_i=\{X: u_{ij}(X)\ge \ln k_{ij},\ k_{ij}=\tfrac{q_jC(i|j)}{q_iC(j|i)},\ j=1,\ldots,k,\ j\ne i\}\). Integrating (12.7.5) is challenging as the region is determined by \(k-1\) inequalities.

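When the parameters \(\mu^{(j)},\ j=1,\ldots,k,\) and \(\varSigma\) are known, we can evaluate the joint distribution of \(u_{ij}(X),\ j=1,\ldots,k,\ j\ne i\), under the normality assumption for \(\pi_j,\ j=1,\ldots,k\). Let us examine the distributions of \(u_{ij}(X)\) when \(X\) originates from the normally distributed population \(\pi_i:P_i(X),\ i=1,\ldots,k\). In this instance, \(E[X|\pi_i]=\mu^{(i)}\), and under \(\pi_i\),

\[E[u_{ij}(X)|\pi_i]=\tfrac{1}{2}\varDelta_{ij}^2\quad\text{and}\quad \mathrm{Var}(u_{ij}(X)|\pi_i)=(\mu^{(i)}-\mu^{(j)})'\varSigma^{-1}\varSigma\,\varSigma^{-1}(\mu^{(i)}-\mu^{(j)})=\varDelta_{ij}^2.\]

Since \(u_{ij}(X)\) is a linear function of the vector normal variable \(X\), it is normal and the distribution of \(u_{ij}(X)|\pi_i\) is

\[u_{ij}(X)|\pi_i\sim N_1(\tfrac{1}{2}\varDelta_{ij}^2,\ \varDelta_{ij}^2).\]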

This normality holds for each \(j\), \(j=1,\ldots,k,\ j\ne i\), and for a fixed \(i\). Moreover, for \(j=1,\ldots,k,\ j\ne i\), the \(u_{ij}(X)\)’s are linear functions of the same vector normal variable \(X\) and hence, they have a joint normal distribution. We can thus evaluate the joint density of \(u_{i1}(X),u_{i2}(X),\ldots,u_{ik}(X)\), excluding \(u_{ii}(X)\), and then determine \(P(i|i,A)\) from this joint density. The mean value vector is a \((k-1)\)-vector, denoted by \(\mu_{(ii)}\), whose elements are \(\tfrac{1}{2}\varDelta_{ij}^2,\ j=1,\ldots,k,\ j\ne i,\) for a fixed \(i\), or equivalently,

\[E[U_{ii}]=\mu_{(ii)}=\tfrac{1}{2}(\varDelta_{i1}^2,\ldots,\varDelta_{ik}^2)',\]

excluding the elements \(u_{ii}(X)\) and \(\varDelta_{ii}^2=0\). The subscript \(ii\) in \(U_{ii}\) indicates the region \(A_i\) and the original population \(P_i(X)\). The covariance matrix of \(U_{ii}\), denoted by \(\varSigma_{ii}\), is a \((k-1)\times(k-1)\) matrix of the form \(\varSigma_{ii}=[\mathrm{Cov}(u_{ir},u_{it})]=(c_{rt})\), \(c_{rt}=\mathrm{Cov}(u_{ir}(X),u_{it}(X))\), the subscript \(ii\) having the same interpretation as in \(U_{ii}\). Observe that for two linear functions \(t_1=C'X=c_1x_1+\cdots+c_px_p\) and \(t_2=B'X=b_1x_1+\cdots+b_px_p\) of a vector \(X\) whose covariance matrix is \(\mathrm{Cov}(X)=\varSigma\), we have \(\mathrm{Var}(t_1)=C'\varSigma C\), \(\mathrm{Var}(t_2)=B'\varSigma B\) and \(\mathrm{Cov}(t_1,t_2)=C'\varSigma B=B'\varSigma C\). Therefore,

\[c_{rt}=(\mu^{(i)}-\mu^{(r)})'\varSigma^{-1}(\mu^{(i)}-\mu^{(t)}),\quad c_{rr}=\varDelta_{ir}^2,\quad r,t=1,\ldots,k,\ r\ne i,\ t\ne i.\]

Let the vector \(U_{ii}\) be such that \(U_{ii}^{\prime}=(u_{i1}(X),\ldots,u_{ik}(X))\), excluding \(u_{ii}(X)\). Thus, for a specific \(i\),

\[U_{ii}\sim N_{k-1}(\mu_{(ii)},\varSigma_{ii}),\]

and its density function, denoted by \(g_{ii}(U_{ii})\), is, provided \(\varSigma_{ii}\) is nonsingular,

\[g_{ii}(U_{ii})=\frac{1}{(2\pi)^{\frac{k-1}{2}}|\varSigma_{ii}|^{\frac{1}{2}}}\,\mathrm{e}^{-\frac{1}{2}(U_{ii}-\mu_{(ii)})'\varSigma_{ii}^{-1}(U_{ii}-\mu_{(ii)})}.\]

Then,

\[P(i|i,A)=\int_{u_{i1}(X)\ge 0,\ldots,u_{ik}(X)\ge 0}g_{ii}(U_{ii})\,\mathrm{d}U_{ii}, \tag{12.7.7}\]

the differential \(\mathrm{d}u_{ii}\) being absent from \(\mathrm{d}U_{ii}\), which is also the case for \(u_{ii}(X)\ge 0\) in the integral. If prior probabilities and cost functions are involved, then replace \(u_{ij}(X)\ge 0\) in the integral (12.7.7) by \(u_{ij}(X)\ge \ln k_{ij},\ k_{ij}=\frac{q_jC(i|j)}{q_iC(j|i)}\). Thus, the problem reduces to determining the joint density \(g_{ii}(U_{ii})\) and then evaluating the multiple integral appearing in (12.7.7). In order to compute the probability specified in (12.7.7), we standardize the normal density by letting \(V_{ii}=\varSigma_{ii}^{-\frac{1}{2}}(U_{ii}-\mu_{(ii)})\) so that \(V_{ii}\sim N_{k-1}(O,I)\), and with the help of this standard normal, we may compute this probability through \(V_{ii}\). Note that (12.7.7) holds for each \(i\), \(i=1,\ldots,k\), and thus, the probabilities of achieving a correct classification, \(P(i|i,A)\) for \(i=1,\ldots,k\), are available from (12.7.7).
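As a small illustration of how the mean value vector and covariance matrix of \(U_{ii}\) follow from the population parameters, the sketch below computes \(\tfrac{1}{2}\varDelta_{ij}^2\) and \((\mu^{(i)}-\mu^{(r)})'\varSigma^{-1}(\mu^{(i)}-\mu^{(t)})\) directly. The parameter values (three bivariate populations with \(\varSigma=I\)) are hypothetical and chosen only to keep the arithmetic transparent; the helper names `quad` and `moments_U_ii` are likewise assumptions of this sketch.

```python
def quad(a, M, b):
    """Return a' M b for vectors a, b and a square matrix M (list of rows)."""
    return sum(a[r] * M[r][c] * b[c]
               for r in range(len(a)) for c in range(len(b)))


def moments_U_ii(i, mus, sigma_inv):
    """Mean vector and covariance matrix of U_ii under pi_i (0-based index i)."""
    others = [j for j in range(len(mus)) if j != i]
    # diff[j] = mu^(i) - mu^(j)
    diff = {j: [a - b for a, b in zip(mus[i], mus[j])] for j in others}
    mean = [0.5 * quad(diff[j], sigma_inv, diff[j]) for j in others]
    cov = [[quad(diff[r], sigma_inv, diff[t]) for t in others] for r in others]
    return mean, cov


# Hypothetical parameters (not taken from the text): p = 2, k = 3, Sigma = I.
mus = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
sigma_inv = [[1.0, 0.0], [0.0, 1.0]]

mean, cov = moments_U_ii(0, mus, sigma_inv)
print(mean)  # halves of the squared distances Delta_12^2 and Delta_13^2
print(cov)   # diagonal entries equal Delta_12^2 and Delta_13^2
```

The inverse \(\varSigma^{-1}\) is passed in directly so that the sketch needs no matrix inversion routine.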

For computing probabilities of misclassification of the type \(P(i|j,A)\), we can proceed as follows: In this context, the observation originates from the population \(\pi_j:P_j(X)\), under which \(u_{ij}(X)\sim N_1(-\frac{1}{2}\varDelta_{ij}^2,\varDelta_{ij}^2)\), the region of integration being \(A_i:\{u_{i1}(X)\ge 0,\ldots,u_{ik}(X)\ge 0\}\), excluding the element \(u_{ii}(X)\ge 0\). Consider the vector \(U_{ij}\) corresponding to the vector \(U_{ii}\). In \(U_{ij}\), \(i\) stands for the region \(A_i\) and \(j\), for the original population \(P_j(X)\). The elements of \(U_{ij}\) are the same as those of \(U_{ii}\), that is, \(U_{ij}^{\prime}=(u_{i1}(X),\ldots,u_{ik}(X))\), excluding \(u_{ii}(X)\). We then proceed as before and compute the mean value vector \(\mu_{(ij)}\) and the covariance matrix \(\varSigma_{ij}\) of \(U_{ij}\) in the original population \(P_j(X)\). Since the \(u_{im}(X)\)’s are linear functions of \(X\) and \(\mathrm{Cov}(X)=\varSigma\) in every population, the variances and covariances of \(u_{im}(X),\ m=1,\ldots,k,\ m\ne i,\) remain the same, so that \(\varSigma_{ij}=\varSigma_{ii}\); it is the mean values that differ, the elements of \(\mu_{(ij)}\) being \(E[u_{im}(X)|\pi_j]=(\mu^{(i)}-\mu^{(m)})'\varSigma^{-1}[\mu^{(j)}-\tfrac{1}{2}(\mu^{(i)}+\mu^{(m)})]\). Thus, \(U_{ij}\sim N_{k-1}(\mu_{(ij)},\varSigma_{ij})\), and on standardizing, one has \(V_{ij}\sim N_{k-1}(O,I)\), so that the required probability \(P(i|j,A)\) can be computed from the elements of \(V_{ij}\). Note that when the prior probabilities and costs are equal,

\[P(i|j,A)=\int_{u_{i1}(X)\ge 0,\ldots,u_{ik}(X)\ge 0}g_{ij}(U_{ij})\,\mathrm{d}U_{ij},\]

where \(g_{ij}(U_{ij})\) denotes the density of \(U_{ij}\), excluding \(u_{ii}(X)\) in the integral as well as the differential \(\mathrm{d}u_{ii}(X)\). Thus, \(\mathrm{d}U_{ij}=\mathrm{d}u_{i1}(X)\wedge\ldots\wedge\mathrm{d}u_{ik}(X)\), excluding \(\mathrm{d}u_{ii}(X)\).
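A probability of the type \(P(i|j,A)\) can also be approximated by simulation: draw \(X\) from \(\pi_j\) and record how often it falls in \(A_i\). The following is a minimal sketch under hypothetical parameters (three bivariate populations with \(\varSigma=I\), none of which come from the text), a setting chosen because the exact value of \(P(1|2,A)\) is then a product of two standard normal probabilities against which the simulation can be compared. The names `MU`, `u` and `estimate` are illustrative.

```python
import math
import random


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical mean value vectors (not from the text); Sigma = I throughout.
MU = {1: (0.0, 0.0), 2: (2.0, 0.0), 3: (0.0, 2.0)}


def u(i, j, x):
    """u_ij(x) when Sigma = I: (mu_i - mu_j)'(x - (mu_i + mu_j)/2)."""
    return sum((a - b) * (xk - 0.5 * (a + b))
               for a, b, xk in zip(MU[i], MU[j], x))


def estimate(i, j, n=100_000, seed=12345):
    """Monte Carlo estimate of P(i|j, A): sample X from pi_j, count X in A_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = [rng.gauss(m, 1.0) for m in MU[j]]
        if all(u(i, m, x) >= 0 for m in MU if m != i):
            hits += 1
    return hits / n


# In this setting A_1 = {x1 <= 1, x2 <= 1}; under pi_2, x1 ~ N(2, 1) and
# x2 ~ N(0, 1) independently, so P(1|2, A) = Phi(-1) * Phi(1).
exact = Phi(-1.0) * Phi(1.0)
```

Calling `estimate(1, 2)` and comparing the result with `exact` gives a quick sanity check of the simulation before it is applied to settings where no closed form is available.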

### Example 12.7.2

Given the data provided in Example 12.7.1, what is the probability of correctly assigning \(X\) to \(\pi_1\)? That is, compute the probability \(P(1|1,A)\).

### Solution 12.7.2

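Observe that the joint density of \(u_{12}(X)\) and \(u_{13}(X)\) is that of a bivariate normal distribution since \(u_{12}(X)\) and \(u_{13}(X)\) are linear functions of the same vector \(X\) where \(X\) has a multivariate normal distribution. In order to compute the joint bivariate normal density, we need \(E[u_{1j}(X)]\), \(\mathrm{Var}(u_{1j}(X)),\ j=2,3,\) and \(\mathrm{Cov}(u_{12}(X),u_{13}(X))\). The following quantities are evaluated from the data given in Example 12.7.1:

\[E[u_{12}(X)|\pi_1]=\tfrac{1}{2}\varDelta_{12}^2=\tfrac{7}{6},\quad \mathrm{Var}(u_{12}(X))=\varDelta_{12}^2=\tfrac{7}{3},\quad E[u_{13}(X)|\pi_1]=\tfrac{1}{2}\varDelta_{13}^2=4,\quad \mathrm{Var}(u_{13}(X))=\varDelta_{13}^2=8,\quad \mathrm{Cov}(u_{12}(X),u_{13}(X))=4.\]

Hence, the covariance matrix of \(U_{11}\), where \(U_{11}^{\prime}=(u_{12}(X),u_{13}(X))\), denoted by \(\varSigma_{11}\), is the following:

\[\varSigma_{11}=\begin{bmatrix}\tfrac{7}{3}&4\\ 4&8\end{bmatrix}\ \Rightarrow\ \varSigma_{11}^{-1}=\begin{bmatrix}3&-\tfrac{3}{2}\\ -\tfrac{3}{2}&\tfrac{7}{8}\end{bmatrix}=B'B\]

where

\[B=\begin{bmatrix}\sqrt{3}&-\tfrac{\sqrt{3}}{2}\\ 0&\tfrac{1}{2\sqrt{2}}\end{bmatrix}.\]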

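The bivariate normal density of \(U_{11}\) is the following:

\[g_{11}(U_{11})=\frac{1}{2\pi|\varSigma_{11}|^{\frac{1}{2}}}\,\mathrm{e}^{-\frac{1}{2}(U_{11}-E[U_{11}])'\varSigma_{11}^{-1}(U_{11}-E[U_{11}])}\]

with \(\varSigma_{11}\) and \(\varSigma_{11}^{-1}=B'B\) as previously specified. Letting \(Y=B(U_{11}-E[U_{11}])\), \(Y\sim N_2(O,I)\). Note that

\[B^{-1}=\begin{bmatrix}\tfrac{1}{\sqrt{3}}&\sqrt{2}\\ 0&2\sqrt{2}\end{bmatrix}.\]

Then,

\[U_{11}=E[U_{11}]+B^{-1}Y,\]

and we have

\[\begin{bmatrix}u_{12}(X)\\ u_{13}(X)\end{bmatrix}=\begin{bmatrix}\tfrac{7}{6}\\ 4\end{bmatrix}+\begin{bmatrix}\tfrac{1}{\sqrt{3}}&\sqrt{2}\\ 0&2\sqrt{2}\end{bmatrix}\begin{bmatrix}y_1\\ y_2\end{bmatrix},\]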

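which yields \(u_{12}(X)=\frac{7}{6}+\frac{1}{\sqrt{3}}\,y_1+\sqrt{2}\,y_2\) and \(u_{13}(X)=4+2\sqrt{2}\,y_2\). The intersection of the two lines corresponding to \(u_{12}(X)=0\) and \(u_{13}(X)=0\) is the point \((y_1,y_2)=(\tfrac{5\sqrt{3}}{6},-\sqrt{2})\). Thus, \(u_{12}(X)\ge 0\) and \(u_{13}(X)\ge 0\) give \(y_2\ge -\frac{4}{2\sqrt{2}}=-\sqrt{2}\) and \(\frac{7}{6}+\frac{1}{\sqrt{3}}\,y_1+\sqrt{2}\,y_2\ge 0\). We can express the resulting probability as \(\rho_1-\rho_2\) where

\[\rho_1=1-[1-\varPhi(\tfrac{5\sqrt{3}}{6})][1-\varPhi(\sqrt{2})]\approx 0.9941,\]

which is explicitly available, where \(\varPhi(\cdot)\) denotes the distribution function of a standard normal variable, and

\[\rho_2=\int_{y_1=-\infty}^{\frac{5\sqrt{3}}{6}}\int_{y_2=-\infty}^{-\frac{1}{\sqrt{2}}(\frac{7}{6}+\frac{1}{\sqrt{3}}y_1)}\frac{1}{2\pi}\,\mathrm{e}^{-\frac{1}{2}(y_1^2+y_2^2)}\,\mathrm{d}y_2\wedge\mathrm{d}y_1.\]

Here, \(1-\rho_1\) is the probability of the quadrant \(\{y_1\ge \tfrac{5\sqrt{3}}{6},\ y_2<-\sqrt{2}\}\) and \(\rho_2\) is the probability of the region lying below the line \(u_{12}(X)=0\) for \(y_1<\tfrac{5\sqrt{3}}{6}\); these two disjoint regions make up the complement of \(A_1\). Therefore, the required probability is

\[P(1|1,A)=\rho_1-\rho_2. \tag{12.7.9}\]

Note that all quantities, except the integral, are explicitly available from standard normal tables. The integral can be read from a bivariate normal table, in which case the required probability can be approximated from (12.7.9). Alternatively, once evaluated numerically, the integral is found to be equal to 0.2182 which, subtracted from 0.9941, yields a probability of 0.7759 for \(P(1|1,A)\).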
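The numerical evaluation mentioned at the end of this solution can be sketched with the standard library alone. The code relies only on the representation \(u_{12}(X)=\frac{7}{6}+\frac{1}{\sqrt{3}}y_1+\sqrt{2}y_2\), \(u_{13}(X)=4+2\sqrt{2}y_2\) with \(y_1,y_2\) independent standard normal. The particular split into `rho1` and `rho2` below is one way of writing the probability as a difference that agrees with the two values quoted above; it is an assumption of this sketch, as are the function names, the truncation point `TAIL` and the choice of Simpson's rule.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simpson(f, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


SQ2, SQ3 = math.sqrt(2.0), math.sqrt(3.0)
C = 5.0 * SQ3 / 6.0   # y1-coordinate of the intersection point
TAIL = 10.0           # stands in for infinity; the neglected mass is negligible


def prob_direct():
    """P(u12 >= 0, u13 >= 0): for each y2 >= -sqrt(2), weight the density of
    y2 by P(y1 >= -sqrt(3) * (7/6 + sqrt(2) * y2))."""
    f = lambda y2: phi(y2) * Phi(SQ3 * (7.0 / 6.0 + SQ2 * y2))
    return simpson(f, -SQ2, TAIL)


def rho1():
    """One minus the probability of the quadrant y1 >= C, y2 < -sqrt(2)."""
    return 1.0 - (1.0 - Phi(C)) * (1.0 - Phi(SQ2))


def rho2():
    """Probability of y1 < C together with u12 < 0."""
    f = lambda y1: phi(y1) * Phi(-(7.0 / 6.0 + y1 / SQ3) / SQ2)
    return simpson(f, -TAIL, C)


print(round(prob_direct(), 4), round(rho1(), 4), round(rho2(), 4))
```

Both routes, the direct integral and the difference `rho1() - rho2()`, should land close to the value 0.7759 reported in the solution.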

### 12.7.3. Classification when the population parameters are unknown

When training samples are available from the populations \(\pi_i,\ i=1,\ldots,k\), we can estimate the parameters and proceed with the classification. Let \(X_j^{(i)},\ j=1,\ldots,n_i,\) be a simple random sample of size \(n_i\) from the \(i\)-th population \(\pi_i\). Then, the sample average is \(\bar{X}^{(i)}=\tfrac{1}{n_i}\sum_{j=1}^{n_i}X_j^{(i)}\), and with our usual notations, the sample matrix, the matrix of sample means and the sample sum of products matrix are the following:

\[\mathbf{X}^{(i)}=[X_1^{(i)},\ldots,X_{n_i}^{(i)}],\quad \bar{\mathbf{X}}^{(i)}=[\bar{X}^{(i)},\ldots,\bar{X}^{(i)}],\quad S_i=[\mathbf{X}^{(i)}-\bar{\mathbf{X}}^{(i)}][\mathbf{X}^{(i)}-\bar{\mathbf{X}}^{(i)}]'=(s_{rt}^{(i)}),\]

where

\[s_{rt}^{(i)}=\sum_{m=1}^{n_i}(x_{rm}^{(i)}-\bar{x}_r^{(i)})(x_{tm}^{(i)}-\bar{x}_t^{(i)}),\]

\(x_{rm}^{(i)}\) denoting the \(r\)-th component of \(X_m^{(i)}\) and \(\bar{x}_r^{(i)}\), the \(r\)-th component of \(\bar{X}^{(i)}\).

Note that \(\mathbf{X}^{(i)}\) and \(\bar{\mathbf{X}}^{(i)}\) are \(p\times n_i\) matrices and \(X_j^{(i)}\) is a \(p\times 1\) vector for each \(j=1,\ldots,n_i,\) and \(i=1,\ldots,k\). Let the population mean value vectors and the common covariance matrix be \(\mu^{(1)},\ldots,\mu^{(k)}\), and \(\varSigma>O\), respectively. Then, the unbiased estimators for these parameters are the following, identifying the estimators/estimates by a hat: \(\hat{\mu}^{(i)}=\bar{X}^{(i)},\ i=1,\ldots,k,\) and \(\hat{\varSigma}=\frac{S}{n_1+\cdots+n_k-k},\ S=S_1+\cdots+S_k\). On replacing the population parameters by their unbiased estimators, the classification rule based on \(u_{ij}(X),\ j=1,\ldots,k,\ j\ne i,\) becomes the following: Classify an observation vector \(X\) into \(\pi_i\) if \(\hat{u}_{ij}(X)\ge \ln k_{ij},\ k_{ij}=\frac{q_jC(i|j)}{q_iC(j|i)},\ j=1,\ldots,k,\ j\ne i,\) or if \(\hat{u}_{ij}(X)\ge 0,\ j=1,\ldots,k,\ j\ne i\), when \(q_1=\cdots=q_k\) and the costs \(C(i|j)\) are all equal, where

\[\hat{u}_{ij}(X)=(\bar{X}^{(i)}-\bar{X}^{(j)})'\hat{\varSigma}^{-1}X-\tfrac{1}{2}(\bar{X}^{(i)}-\bar{X}^{(j)})'\hat{\varSigma}^{-1}(\bar{X}^{(i)}+\bar{X}^{(j)})\]

for \(j=1,\ldots,k,\ j\ne i\). Unfortunately, the exact distribution of \(\hat{u}_{ij}(X)\) is difficult to obtain even when the populations \(\pi_i\) have \(p\)-variate normal distributions. However, as \(n_j\to\infty,\ j=1,\ldots,k\), \(\bar{X}^{(j)}\to\mu^{(j)}\) and \(\hat{\varSigma}\to\varSigma\) in probability, so that \(\hat{u}_{ij}(X)\to u_{ij}(X)\) and the theory discussed in the previous sections is applicable asymptotically. As well, the classification probabilities can then be evaluated as illustrated in Example 12.7.2.
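The plug-in rule described above can be sketched as follows, assuming equal prior probabilities and costs so that the thresholds are zero. The training samples are hypothetical and chosen so that the pooled estimate is a multiple of the identity matrix; the helper names (`fit`, `u_hat`, `classify`, `solve`) are illustrative, and a small Gaussian elimination routine is used in place of an explicit inverse.

```python
def mean(sample):
    """Sample average of a list of p-vectors."""
    n = len(sample)
    return [sum(x[r] for x in sample) / n for r in range(len(sample[0]))]


def sum_of_products(sample):
    """Sample sum of products matrix about the sample average."""
    m = mean(sample)
    p = len(m)
    return [[sum((x[r] - m[r]) * (x[t] - m[t]) for x in sample)
             for t in range(p)] for r in range(p)]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for t in range(c, n + 1):
                M[r][t] -= f * M[c][t]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][t] * w[t] for t in range(r + 1, n))) / M[r][r]
    return w


def fit(samples):
    """Sample means and the pooled estimate S / (n_1 + ... + n_k - k)."""
    means = [mean(s) for s in samples]
    p = len(means[0])
    dof = sum(len(s) for s in samples) - len(samples)
    S = [[0.0] * p for _ in range(p)]
    for s in samples:
        Si = sum_of_products(s)
        for r in range(p):
            for t in range(p):
                S[r][t] += Si[r][t]
    return means, [[S[r][t] / dof for t in range(p)] for r in range(p)]


def u_hat(x, mi, mj, sigma_hat):
    """Plug-in criterion (mi - mj)' Sigma_hat^{-1} (x - (mi + mj) / 2)."""
    w = solve(sigma_hat, [a - b for a, b in zip(mi, mj)])
    return sum(wk * (xk - 0.5 * (a + b)) for wk, xk, a, b in zip(w, x, mi, mj))


def classify(x, means, sigma_hat):
    """Return the 1-based index i with u_hat_ij(x) >= 0 for all j != i."""
    for i, mi in enumerate(means):
        if all(u_hat(x, mi, mj, sigma_hat) >= 0
               for j, mj in enumerate(means) if j != i):
            return i + 1
    return None


# Hypothetical training samples (p = 2, k = 3, n_i = 4), not from the text.
samples = [
    [(0, 0), (1, 0), (0, 1), (1, 1)],
    [(4, 0), (5, 0), (4, 1), (5, 1)],
    [(0, 4), (1, 4), (0, 5), (1, 5)],
]
means, sigma_hat = fit(samples)
print(classify((0.6, 0.4), means, sigma_hat))
```

Since \(\hat{u}_{ij}(X)\) depends on \(\hat{\varSigma}\) only through \(\hat{\varSigma}^{-1}(\bar{X}^{(i)}-\bar{X}^{(j)})\), solving a linear system is sufficient and avoids forming the inverse.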

## 12.8. The Maximum Likelihood Method when the Population Covariances Are Equal

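Consider \(k\) real normal populations \(\pi_i:P_i(X)\sim N_p(\mu^{(i)},\varSigma),\ \varSigma>O,\ i=1,\ldots,k,\) having the same covariance matrix but different mean value vectors \(\mu^{(i)},\ i=1,\ldots,k\). A \(p\)-vector \(X\) at hand is to be classified into one of these populations \(\pi_j,\ j=1,\ldots,k\). Consider a simple random sample \(X_1^{(i)},X_2^{(i)},\ldots,X_{n_i}^{(i)}\) of size \(n_i\) from \(\pi_i\) for \(i=1,\ldots,k\). Employing our usual notations, the sample means, sample matrices, matrices of sample means and the sample sum of products matrices are as follows:

\[\bar{X}^{(i)}=\frac{1}{n_i}\sum_{j=1}^{n_i}X_j^{(i)},\quad \mathbf{X}^{(i)}=[X_1^{(i)},\ldots,X_{n_i}^{(i)}],\quad \bar{\mathbf{X}}^{(i)}=[\bar{X}^{(i)},\ldots,\bar{X}^{(i)}],\quad S^{(i)}=[\mathbf{X}^{(i)}-\bar{\mathbf{X}}^{(i)}][\mathbf{X}^{(i)}-\bar{\mathbf{X}}^{(i)}]',\quad i=1,\ldots,k.\]

Then, the unbiased estimators of the population parameters, denoted with a hat, are

\[\hat{\mu}^{(i)}=\bar{X}^{(i)},\ i=1,\ldots,k,\quad\text{and}\quad \hat{\varSigma}=\frac{S^{(1)}+\cdots+S^{(k)}}{n_1+\cdots+n_k-k}.\]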

The null hypothesis can be taken as: \(X\) and \(X_1^{(i)},\ldots,X_{n_i}^{(i)}\) originate from \(\pi_i\), and \(X_1^{(j)},\ldots,X_{n_j}^{(j)}\) come from \(\pi_j,\ j=1,\ldots,k,\ j\ne i\); the alternative hypothesis being: \(X\) and \(X_1^{(j)},\ldots,X_{n_j}^{(j)}\) come from \(\pi_j\) for \(j=1,\ldots,k,\ j\ne i\), and \(X_1^{(i)},\ldots,X_{n_i}^{(i)}\) originate from \(\pi_i\). On proceeding as in Sect. 12.6, when the prior probabilities are equal and the cost functions are identical, the criterion for classification of the observed vector \(X\) to \(\pi_i\) for a specific \(i\) is

for \(j=1,\ldots,k,\ j\ne i\), where the decision rule is \(A=(A_1,\ldots,A_k)\), \(S=S^{(1)}+\cdots+S^{(k)}\) and \(n_{(k)}=n_1+n_2+\cdots+n_k-k\). Note that (12.8.3) holds for each \(i\), \(i=1,\ldots,k\), and hence, \(A_1,\ldots,A_k\) are available from (12.8.3). Thus, the vector \(X\) at hand is classified into \(A_i\), that is, assigned to the population \(\pi_i\), if the inequalities in (12.8.3) are satisfied. This statement holds for each \(i\), \(i=1,\ldots,k\). The exact distribution of the criterion in (12.8.3) is difficult to establish but the probabilities of classification can be computed from the asymptotic theory discussed in Sect. 12.7 by observing the following:

As \(n_i\to\infty,\ i=1,\ldots,k\), \(\bar{X}^{(i)}\to\mu^{(i)}\) and \(\hat{\varSigma}\to\varSigma\). Thus, asymptotically, the criterion specified in (12.8.3) reduces to the criterion (12.7.3) of Sect. 12.7. Accordingly, when \(n_i\to\infty\) or for very large \(n_i\)’s, \(i=1,\ldots,k\), one may utilize (12.7.3) for computing the probabilities of classification, as was illustrated in Examples 12.7.1 and 12.7.2.

## 12.9. Maximum Likelihood Method and Unequal Covariance Matrices

The likelihood procedure can also provide a classification rule when the normal population covariance matrices are different. For example, let \(\pi_1:P_1(X)\sim N_p(\mu^{(1)},\varSigma_1),\ \varSigma_1>O\), and \(\pi_2:P_2(X)\sim N_p(\mu^{(2)},\varSigma_2),\ \varSigma_2>O\), where \(\mu^{(1)}\ne\mu^{(2)}\) and \(\varSigma_1\ne\varSigma_2\). Let a simple random sample \(X_1^{(1)},\ldots,X_{n_1}^{(1)}\) of size \(n_1\) from \(\pi_1\) and a simple random sample \(X_1^{(2)},\ldots,X_{n_2}^{(2)}\) of size \(n_2\) from \(\pi_2\) be available. Let \(\bar{X}^{(1)}\) and \(\bar{X}^{(2)}\) be the sample averages and \(S_1\) and \(S_2\) be the sample sum of products matrices, respectively. In classification problems, there is an additional vector \(X\) which comes from \(\pi_1\) under the null hypothesis and from \(\pi_2\) under the alternative. Then, the maximum likelihood estimators, denoted by a hat, will be the following:

\[\hat{\mu}^{(1)}=\bar{X}^{(1)},\quad \hat{\mu}^{(2)}=\bar{X}^{(2)},\quad \hat{\varSigma}_1=\frac{S_1}{n_1},\quad \hat{\varSigma}_2=\frac{S_2}{n_2}, \tag{i}\]

respectively, when no additional vector is involved. However, these estimators will change in the presence of the additional vector \(X\), where \(X\) is the vector at hand to be assigned to \(\pi_1\) or \(\pi_2\). When \(X\) originates from \(\pi_1\) or \(\pi_2\), \(\mu^{(1)}\) and \(\mu^{(2)}\) are respectively estimated as follows:

\[\hat{\mu}^{(1)}_{*}=\frac{n_1\bar{X}^{(1)}+X}{n_1+1},\quad \hat{\mu}^{(2)}_{*}=\frac{n_2\bar{X}^{(2)}+X}{n_2+1}, \tag{ii}\]

and when \(X\) comes from \(\pi_1\) or \(\pi_2\), \(\varSigma_1\) and \(\varSigma_2\) are estimated by

\[\hat{\varSigma}_{1*}=\frac{S_1^{*}}{n_1+1},\quad \hat{\varSigma}_{2*}=\frac{S_2^{*}}{n_2+1}, \tag{iii}\]

where

\[S_i^{*}=S_i+\frac{n_i}{n_i+1}(X-\bar{X}^{(i)})(X-\bar{X}^{(i)})',\quad i=1,2,\]

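referring to the derivations provided in Sect. 12.6 when discussing maximum likelihood procedures. Thus, the null hypothesis can be stated as: \(X\) and \(X_1^{(1)},\ldots,X_{n_1}^{(1)}\) are from \(\pi_1\) and \(X_1^{(2)},\ldots,X_{n_2}^{(2)}\) are from \(\pi_2\), versus the alternative: \(X\) and \(X_1^{(2)},\ldots,X_{n_2}^{(2)}\) are from \(\pi_2\) and \(X_1^{(1)},\ldots,X_{n_1}^{(1)}\) are from \(\pi_1\). Let \(L_0\) and \(L_1\) denote the likelihood functions under the null and alternative hypotheses, respectively. Observe that under the null hypothesis, \(\varSigma_1\) is estimated by \(\hat{\varSigma}_{1*}\) of *(iii)* and \(\varSigma_2\) is estimated by \(\hat{\varSigma}_2\) of *(i)*, whereas under the alternative, \(\varSigma_1\) is estimated by \(\hat{\varSigma}_1\) and \(\varSigma_2\) by \(\hat{\varSigma}_{2*}\), so that the likelihood ratio criterion \(\lambda\) is given by

\[\lambda=\frac{\max L_0}{\max L_1}=\frac{|\hat{\varSigma}_1|^{\frac{n_1}{2}}\,|\hat{\varSigma}_{2*}|^{\frac{n_2+1}{2}}}{|\hat{\varSigma}_{1*}|^{\frac{n_1+1}{2}}\,|\hat{\varSigma}_2|^{\frac{n_2}{2}}}. \tag{12.9.1}\]

The determinants in (12.9.1) can be represented as follows, referring to the simplifications discussed in Sect. 12.6:

\[|\hat{\varSigma}_i|=\frac{|S_i|}{n_i^{\,p}},\quad |\hat{\varSigma}_{i*}|=\frac{|S_i|}{(n_i+1)^p}\Big[1+\frac{n_i}{n_i+1}(X-\bar{X}^{(i)})'S_i^{-1}(X-\bar{X}^{(i)})\Big],\quad i=1,2. \tag{12.9.2}\]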

The classification rule then consists of assigning the observed vector \(X\) to \(\pi_1\) if \(\lambda\ge 1\) and to \(\pi_2\) if \(\lambda<1\). We could have expressed the criterion in terms of \(\lambda_1=\lambda^{\frac{2}{n}}\) if \(n_1=n_2=n\), which would have simplified the expressions appearing in (12.9.2).
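The rule \(\lambda\ge 1\) can be sketched numerically by working with \(\ln\lambda\). The code below assumes the standard normal-theory maximized likelihoods, so that \(\lambda\) is a ratio of powers of the determinants of the maximum likelihood estimates of \(\varSigma_1\) and \(\varSigma_2\) computed with and without the additional vector \(X\), together with the rank-one identity \(|S+cVV'|=|S|(1+cV'S^{-1}V)\). It is restricted to \(p=2\) so that determinants and inverses have closed forms. The summary statistics are hypothetical, and the function names are illustrative.

```python
import math


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def quad_inv2(M, v):
    """Return v' M^{-1} v for a 2 x 2 matrix M."""
    return (M[1][1] * v[0] ** 2 - (M[0][1] + M[1][0]) * v[0] * v[1]
            + M[0][0] * v[1] ** 2) / det2(M)


def log_dets(n, xbar, S, x, p=2):
    """ln|S / n| (sample alone) and ln|S* / (n + 1)| (sample plus x), where
    S* = S + n / (n + 1) * (x - xbar)(x - xbar)'."""
    v = [a - b for a, b in zip(x, xbar)]
    q = quad_inv2(S, v)
    log_plain = math.log(det2(S)) - p * math.log(n)
    log_star = (math.log(det2(S)) + math.log(1.0 + n * q / (n + 1))
                - p * math.log(n + 1))
    return log_plain, log_star


def log_lambda(x, n1, xbar1, S1, n2, xbar2, S2):
    """ln(lambda), lambda being the ratio of the likelihood maximized under
    'x from pi_1' to the likelihood maximized under 'x from pi_2'."""
    d1, d1s = log_dets(n1, xbar1, S1, x)
    d2, d2s = log_dets(n2, xbar2, S2, x)
    return 0.5 * (n1 * d1 + (n2 + 1) * d2s - (n1 + 1) * d1s - n2 * d2)


def assign(x, *summary):
    """Assign x to pi_1 if lambda >= 1, that is, ln(lambda) >= 0; else pi_2."""
    return 1 if log_lambda(x, *summary) >= 0 else 2


# Hypothetical summary statistics (not from the text).
summary = (5, (0.0, 0.0), [[4.0, 0.0], [0.0, 4.0]],
           5, (3.0, 3.0), [[8.0, 0.0], [0.0, 2.0]])
print(assign((0.2, -0.1), *summary), assign((3.0, 3.0), *summary))
```

Working on the logarithmic scale avoids overflow or underflow when the sample sizes, and hence the exponents, are large.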

## Rights and permissions

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## Copyright information

© 2022 The Author(s)

## About this chapter

### Cite this chapter

Mathai, A., Provost, S., Haubold, H. (2022). Chapter 12: Classification Problems. In: Multivariate Statistical Analysis in the Real and Complex Domains. Springer, Cham. https://doi.org/10.1007/978-3-030-95864-0_12

DOI: https://doi.org/10.1007/978-3-030-95864-0_12

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-95863-3

Online ISBN: 978-3-030-95864-0

eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)