Application: Recognizing Speech
Topics: Hidden Markov chain, Viterbi decoding, EM Algorithms

11.1 Learning: Concepts and Examples

In artificial intelligence, “learning” refers to the process of discovering the relationship between related items, for instance between spoken words and sounds heard (Fig. 11.1).

Fig. 11.1 Can you hear me?

As a simple example, consider the binary symmetric channel example of Problem 7.5 in Chap. 7. The inputs \(X_n\) are i.i.d. B(p) and, given the inputs, the output \(Y_n\) is equal to \(X_n\) with probability 1 − 𝜖, for n ≥ 0. In this example, there is a probabilistic relationship between the inputs and the outputs described by 𝜖. Learning here refers to estimating 𝜖.

There are two basic situations. In supervised learning, one observes both the inputs \(\{X_n, n = 0, \ldots, N\}\) and the outputs \(\{Y_n, n = 0, \ldots, N\}\). One can think of this form of learning as a training phase for the system: one observes the channel with a set of known input values. Once one has “learned” the channel, i.e., estimated 𝜖, one can design the best receiver and use it on unknown inputs. In unsupervised learning, one observes only the outputs. The benefit of this form of learning is that it takes place while the system is operational, so one does not “waste” time on a training phase. Also, the system can adapt automatically to slow changes in 𝜖 without having to be re-trained.
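
As an illustration (a Python sketch, not from the text), the following contrasts the two modes of learning on the binary symmetric channel: the supervised estimate counts disagreements between the known inputs and the outputs, while the unsupervised estimate uses only the output frequencies, assuming p is known and p ≠ 1∕2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the binary symmetric channel of Problem 7.5:
# X_n i.i.d. B(p), and Y_n = X_n with probability 1 - eps.
p, eps, N = 0.3, 0.1, 100_000
X = rng.random(N) < p
Y = np.where(rng.random(N) < eps, ~X, X)

# Supervised learning: both X and Y are observed during training,
# so eps is estimated by the empirical fraction of disagreements.
eps_supervised = np.mean(X != Y)

# Unsupervised learning: only Y is observed.  Since
# P(Y = 1) = p(1 - eps) + (1 - p)eps = p + eps(1 - 2p),
# the relation can be inverted when p is known and p != 1/2.
eps_unsupervised = (np.mean(Y) - p) / (1 - 2 * p)

print(f"supervised estimate:   {eps_supervised:.4f}")
print(f"unsupervised estimate: {eps_unsupervised:.4f}")
```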

As you might expect, there is a trade-off between supervised and unsupervised learning. A training phase takes time, but the learning is faster than in unsupervised learning. The best method depends on characteristics of the practical situation, such as how quickly the system parameters are likely to change.

11.2 Hidden Markov Chain

A hidden Markov chain is a Markov chain together with a state observation model. The Markov chain is \(\{X(n), n \geq 0\}\); it has transition matrix P on the state space \(\mathcal {X}\) and initial distribution \(\pi_0\). The state observation model specifies that when the state of the Markov chain is x, one observes a value y with probability Q(x, y), for \(y \in \mathcal {Y}\). More precisely, here is the definition (Fig. 11.2).

Fig. 11.2 The hidden Markov chain

Definition 11.1 (Hidden Markov Chain)

A hidden Markov chain is a random sequence \(\{(X(n), Y(n)), n \geq 0\}\) such that \(X(n) \in \mathcal {X} = \{1, \ldots , N\}\) and \(Y(n) \in \mathcal {Y} = \{1, \ldots , M\}\) and

$$\displaystyle \begin{aligned} P(X(0) &= x_0, Y(0) = y_0, \ldots, X(n) = x_n, Y(n) = y_n) \\ & = \pi_0(x_0)Q(x_0, y_0)P(x_0, x_1)Q(x_1, y_1) \times \cdots \times P(x_{n-1}, x_n)Q(x_n, y_n), \\ & \mbox{ for all } n \geq 0, x_m \in \mathcal{X}, y_m \in \mathcal{Y}. {} \end{aligned} $$
(11.1)
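
For concreteness, here is a short Python sketch (an illustration, not part of the text) that samples a trajectory {(X(n), Y(n))} according to (11.1); the matrices \(\pi_0\), P, and Q below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary two-state, two-observation hidden Markov chain.
pi0 = np.array([0.5, 0.5])          # initial distribution of X(0)
P = np.array([[0.9, 0.1],           # transition matrix of the hidden chain
              [0.2, 0.8]])
Q = np.array([[0.8, 0.2],           # Q(x, y) = probability of observing y in state x
              [0.3, 0.7]])

def sample_hmc(n, pi0, P, Q, rng):
    """Sample (X(0), Y(0)), ..., (X(n), Y(n)) according to (11.1)."""
    x = rng.choice(len(pi0), p=pi0)
    xs, ys = [x], [rng.choice(Q.shape[1], p=Q[x])]
    for _ in range(n):
        x = rng.choice(P.shape[1], p=P[x])             # next hidden state
        xs.append(x)
        ys.append(rng.choice(Q.shape[1], p=Q[x]))      # observation given that state
    return np.array(xs), np.array(ys)

xs, ys = sample_hmc(20, pi0, P, Q, rng)
print("hidden states:", xs)
print("observations: ", ys)
```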

In the speech recognition application, the \(X_n\) are “parts of speech,” i.e., segments of sentences, and the \(Y_n\) are sounds. The structure of the language determines relationships between the \(X_n\) that can be approximated by a Markov chain. The relationship between \(X_n\) and \(Y_n\) is speaker-dependent.

The recognition problem is the following. Assume that you have observed that \({\mathbf{Y}}^n := (Y_0, \ldots, Y_n) = {\mathbf{y}}^n := (y_0, \ldots, y_n)\). What is the most likely sequence \({\mathbf{X}}^n := (X_0, \ldots, X_n)\)? That is, in the terminology of Chap. 7, we want to compute

$$\displaystyle \begin{aligned} MAP[{\mathbf{X}}^n \mid {\mathbf{Y}}^n = {\mathbf{y}}^n]. \end{aligned}$$

Thus, we want to find the sequence \({\mathbf {x}}^n \in \mathcal {X}^{n+1}\) that maximizes

$$\displaystyle \begin{aligned} P[ {\mathbf{X}}^n = {\mathbf{x}}^n \mid {\mathbf{Y}}^n = {\mathbf{y}}^n]. \end{aligned}$$

Note that

$$\displaystyle \begin{aligned} P[ {\mathbf{X}}^n = {\mathbf{x}}^n \mid {\mathbf{Y}}^n = {\mathbf{y}}^n] = \frac{P({\mathbf{X}}^n = {\mathbf{x}}^n, {\mathbf{Y}}^n = {\mathbf{y}}^n)} {P({\mathbf{Y}}^n = {\mathbf{y}}^n)}. \end{aligned}$$

The MAP is the value of \({\mathbf{x}}^n\) that maximizes the numerator. Now, by (11.1), the logarithm of the numerator is equal to

$$\displaystyle \begin{aligned} \log(\pi_0(x_0)Q(x_0, y_0)) + \sum_{m=1}^{n} \log(P(x_{m-1}, x_m)Q(x_m, y_m)). \end{aligned}$$

Define

$$\displaystyle \begin{aligned} d(x_0) = - \log(\pi_0(x_0)Q(x_0, y_0)) \end{aligned}$$

and

$$\displaystyle \begin{aligned} d_m(x_{m-1}, x_m) = - \log(P(x_{m-1}, x_m)Q(x_m, y_m)). \end{aligned}$$

Then, the MAP is the sequence \({\mathbf{x}}^n\) that minimizes

$$\displaystyle \begin{aligned} d(x_0) + \sum_{m=1}^n d_m(x_{m-1}, x_m). \end{aligned} $$
(11.2)

The expression (11.2) can be viewed as the length of a path in the graph shown in Fig. 11.3. Finding the MAP is then equivalent to solving a shortest path problem. There are a few standard algorithms for solving such problems. We describe the Bellman–Ford algorithm, due to Bellman (Fig. 11.4) and Ford.

Fig. 11.3 The MAP as a shortest path

Fig. 11.4 Richard Bellman, 1920–1984

For \(m = 0, \ldots, n\) and \(x \in \mathcal {X}\), let \(V_m(x)\) be the length of the shortest path from \(X(m) = x\) to the column \(X(n)\) in the graph. Also, let \(V_n(x) = 0\) for all \(x \in \mathcal {X}\). Then, one has

$$\displaystyle \begin{aligned} V_m(x) = \min_{x' \in \mathcal{X}} \left\{d_{m + 1}(x, x') + V_{m+1}(x')\right\}, x \in \mathcal{X}, m = 0, \ldots, n - 1. \end{aligned} $$
(11.3)

Finally, let

$$\displaystyle \begin{aligned} V = \min_{x \in \mathcal{X}} \{d_0(x) + V_0(x)\}. \end{aligned} $$
(11.4)

Then, V  is the minimum value of expression (11.2).

The algorithm is then as follows:

  • Step (1): Calculate \(\{V_m(x), x \in \mathcal {X}\}\) recursively for \(m = n - 1, n - 2, \ldots, 0\), using (11.3). At each step, note the arc out of each x that achieves the minimum. Say that the arc out of \(x_m = x\) goes to \(x_{m+1} = s(m, x)\) for \(x \in \mathcal {X}\).

  • Step (2): Find the value \(x_0\) that achieves the minimum in (11.4).

  • Step (3): The MAP is then the sequence

    $$\displaystyle \begin{aligned} x_0, x_1 = s(0, x_0), x_2 = s(1, x_1), \ldots, x_n = s(n-1, x_{n-1}). \end{aligned}$$

Equations (11.3) are the Bellman–Ford Equations. They are a particular version of Dynamic Programming Equations (DPE) for the shortest path problem.

Note that the essential idea was to define the length of the shortest remaining path starting from every node in the graph and to write recursive expressions for those quantities. Thus, one solves the DPE backwards and then one finds the shortest path forward. This application of the shortest path algorithm to finding a MAP is called the Viterbi Algorithm, due to Andrew Viterbi (Fig. 11.5).
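
The following Python sketch (an illustrative implementation, not taken from the text) carries out Steps (1)–(3): it forms the costs d, solves the Bellman–Ford equations (11.3) backwards while recording the minimizing arcs s(m, x), and then reads the MAP sequence forward. The parameters at the end are arbitrary placeholders.

```python
import numpy as np

def viterbi(y, pi0, P, Q):
    """MAP[X^n | Y^n = y] via the Bellman-Ford recursion (11.3)-(11.4).

    Costs: d(x0) = -log(pi0(x0) Q(x0, y0)) and
           d_m(x, x') = -log(P(x, x') Q(x', y_m)).
    """
    n = len(y) - 1
    N = len(pi0)
    with np.errstate(divide="ignore"):               # tolerate log(0) = -inf
        log_pi0, log_P, log_Q = np.log(pi0), np.log(P), np.log(Q)

    V = np.zeros((n + 1, N))                         # V_m(x), with V_n(x) = 0
    s = np.zeros((n, N), dtype=int)                  # minimizing arc s(m, x)

    # Step (1): solve (11.3) backwards for m = n-1, ..., 0.
    for m in range(n - 1, -1, -1):
        # cost[x, x'] = d_{m+1}(x, x') + V_{m+1}(x')
        cost = -(log_P + log_Q[:, y[m + 1]][None, :]) + V[m + 1][None, :]
        s[m] = np.argmin(cost, axis=1)
        V[m] = np.min(cost, axis=1)

    # Step (2): add the initial cost d(x0) and pick the best start, as in (11.4).
    x0 = int(np.argmin(-(log_pi0 + log_Q[:, y[0]]) + V[0]))

    # Step (3): follow the recorded arcs forward.
    path = [x0]
    for m in range(n):
        path.append(int(s[m, path[-1]]))
    return path

# Arbitrary placeholder parameters and observations.
pi0 = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi([0, 0, 1, 1, 0], pi0, P, Q))
```

The backward pass costs on the order of nN² operations, which is what makes computing the MAP tractable compared with enumerating all N^{n+1} candidate sequences.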

Fig. 11.5 Andrew Viterbi, b. 1934

11.3 Expectation Maximization and Clustering

Expectation maximization is a class of algorithms to estimate parameters of distributions. We first explain these algorithms on a simple clustering problem. We apply expectation maximization to the HMC model in the next section.

The clustering problem consists in grouping sample points into clusters of “similar” values. We explain a simple instance of this problem and we discuss the expectation maximization algorithm.

11.3.1 A Simple Clustering Problem

You look at a set of N exam results {X(1), …, X(N)} in your probability course and you must decide who the A students and the B students are. To study this problem, we assume that the results of A students are i.i.d. \(\mathcal {N}(a, \sigma ^2)\) and those of B students are i.i.d. \(\mathcal {N}(b, \sigma ^2)\), where a > b.

For simplicity, assume that we know \(\sigma^2\) and that each student has probability 0.5 of being an A student. However, we do not know the parameters (a, b).

(The same method applies when one does not know the variances of the scores of A and B students, nor the prior probability that a student is of type A.)

One heuristic is as follows (see Fig. 11.6). Start with a guess \((a_1, b_1)\) for (a, b). Student n with score X(n) is more likely to be of type A if \(X(n) > (a_1 + b_1)/2\). Let us declare that such students are of type A and the others are of type B. Then let \(a_2\) be the average score of the students declared to be of type A and \(b_2\) that of the other students. We repeat the procedure after replacing \((a_1, b_1)\) by \((a_2, b_2)\), and we keep doing this until the values seem to converge. This heuristic is called the hard expectation maximization algorithm.
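
A minimal Python sketch of this hard EM heuristic follows (the synthetic scores, the true parameters, and the initial guess are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic scores: each student is of type A or B with probability 0.5;
# A scores are N(a, sigma^2) and B scores are N(b, sigma^2).
a_true, b_true, sigma, N = 85.0, 65.0, 8.0, 500
is_A_true = rng.random(N) < 0.5
X = np.where(is_A_true, rng.normal(a_true, sigma, N), rng.normal(b_true, sigma, N))

# Hard EM: classify every student with the current guess, then average.
a, b = 90.0, 50.0                       # initial guess (a_1, b_1)
for _ in range(20):
    declared_A = X > (a + b) / 2        # declare type A above the midpoint
    a = X[declared_A].mean()            # next guess for a
    b = X[~declared_A].mean()           # next guess for b

print(f"hard EM estimates: a = {a:.2f}, b = {b:.2f}")
```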

Fig. 11.6 Clustering with hard EM. The initial guess is \((a_1, b_1)\), which leads to the MAP of the types and the next guess \((a_2, b_2)\), and so on

A slightly different heuristic is as follows (see Fig. 11.7). Again, we start with a guess \((a_1, b_1)\).

Fig. 11.7 Clustering with soft EM. The initial guess is \((a_1, b_1)\), which leads to the probabilities of the types and the next guess \((a_2, b_2)\), and so on

Using Bayes’ rule, we calculate the probability p(n) that student n with score X(n) is of type A. We then calculate

$$\displaystyle \begin{aligned} a_2 = \frac{\sum_n X(n)p(n)}{\sum_n p(n)} \mbox{ and } b_2 = \frac{\sum_n X(n)(1 - p(n))}{\sum_n (1 - p(n))}. \end{aligned}$$

We then repeat after replacing \((a_1, b_1)\) by \((a_2, b_2)\). Thus, the calculation of \(a_2\) weights the scores of the students by the likelihood that they are of type A, and similarly for the calculation of \(b_2\).

This heuristic is called the soft expectation maximization algorithm.
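
The soft EM iteration, again as an illustrative Python sketch on arbitrary synthetic data; here p(n) is computed by Bayes' rule with equal priors and a common variance σ², so the Gaussian normalizing constants cancel:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same kind of synthetic scores as in the hard EM sketch.
a_true, b_true, sigma, N = 85.0, 65.0, 8.0, 500
is_A_true = rng.random(N) < 0.5
X = np.where(is_A_true, rng.normal(a_true, sigma, N), rng.normal(b_true, sigma, N))

a, b = 90.0, 50.0                       # initial guess (a_1, b_1)
for _ in range(50):
    # Bayes' rule with equal priors and common variance sigma^2:
    # p(n) is proportional to exp(-(X(n) - a)^2 / (2 sigma^2)).
    like_A = np.exp(-(X - a) ** 2 / (2 * sigma ** 2))
    like_B = np.exp(-(X - b) ** 2 / (2 * sigma ** 2))
    p = like_A / (like_A + like_B)
    # Weighted averages, as in the update for (a_2, b_2).
    a = np.sum(X * p) / np.sum(p)
    b = np.sum(X * (1 - p)) / np.sum(1 - p)

print(f"soft EM estimates: a = {a:.2f}, b = {b:.2f}")
```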

11.3.2 A Second Look

In the previous example, one attempts to estimate some parameter θ = (a, b) based on some observations \(\mathbf{X} = (X_1, \ldots, X_N)\). Let \(\mathbf{Z} = (Z_1, \ldots, Z_N)\) where \(Z_n = A\) if student n is of type A and \(Z_n = B\) otherwise.

We would like to maximize f[x|θ] over θ, to find MLE[θ|X = x]. One has

$$\displaystyle \begin{aligned} f[\mathbf{x}| \theta] = \sum_{\mathbf{z}} f[\mathbf{x}| \mathbf{z}, \theta]P[\mathbf{z} | \theta], \end{aligned}$$

where the sum is over the \(2^N\) possible values of \(\mathbf{Z}\). This is computationally too difficult.

Hard EM (Fig. 11.8) replaces the sum over z by

$$\displaystyle \begin{aligned} f[\mathbf{x}| {\mathbf{z}}^*, \theta]P[{\mathbf{z}}^* | \theta], \end{aligned}$$

where \({\mathbf{z}}^*\) is the most likely value of \(\mathbf{Z}\) given the observations and the current guess for θ. That is, if the current guess is \(\theta_k\), then

$$\displaystyle \begin{aligned} {\mathbf{z}}^* = MAP[ \mathbf{Z} | \mathbf{X} = \mathbf{x}, \theta_k] = \arg \max_{\mathbf{z}} P[\mathbf{Z} = \mathbf{z} | \mathbf{X} = \mathbf{x}, \theta_k]. \end{aligned}$$

The next guess is then

$$\displaystyle \begin{aligned} \theta_{k+1} = \arg \max_{\theta} f[\mathbf{x}| {\mathbf{z}}^*, \theta]P[{\mathbf{z}}^* | \theta]. \end{aligned}$$

Fig. 11.8 Hard and soft EM?

Soft EM makes a different approximation. First, it replaces

$$\displaystyle \begin{aligned} \log( f[\mathbf{x}| \theta]) = \log\left(\sum_{\mathbf{z}} f[\mathbf{x}| \mathbf{z}, \theta]P[\mathbf{z} | \theta]\right) \end{aligned}$$

by

$$\displaystyle \begin{aligned} \sum_{\mathbf{z}} \log(f[\mathbf{x}| \mathbf{z}, \theta])P[\mathbf{z} | \theta]. \end{aligned}$$

That is, it replaces the logarithm of an expectation by the expectation of the logarithm.
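
One standard way (not spelled out in the text) to relate the two expressions uses Jensen's inequality: since the logarithm is concave, the expectation of the logarithm is a lower bound on the logarithm of the expectation,

$$\displaystyle \begin{aligned} \log\left(\sum_{\mathbf{z}} f[\mathbf{x}| \mathbf{z}, \theta]P[\mathbf{z} | \theta]\right) \geq \sum_{\mathbf{z}} \log(f[\mathbf{x}| \mathbf{z}, \theta])P[\mathbf{z} | \theta], \end{aligned}$$

so this first replacement maximizes a tractable lower bound on the log-likelihood rather than the log-likelihood itself.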

Second, it replaces the expression above by

$$\displaystyle \begin{aligned} \sum_{\mathbf{z}} \log(f[\mathbf{x}| \mathbf{z}, \theta]) P[\mathbf{z} | \mathbf{x}, \theta_k] \end{aligned}$$

and the new guess \(\theta_{k+1}\) is the maximizer of that expression over θ. Thus, it replaces the distribution of \(\mathbf{Z}\) by the conditional distribution given the current guess and the observations.

If this heuristic did not work in practice, nobody would mention it. Surprisingly, it seems to work for some classes of problems. There is some theoretical justification for the heuristic. One can show that it converges to a local maximum of f[x|θ]. Generally, this is little comfort because most problems have many local maxima. See Roche (2012).

11.4 Learning: Hidden Markov Chain

Consider once again a hidden Markov chain model, but assume that (π, P, Q) are functions of some parameter θ that we wish to estimate. We write this explicitly as \((\pi_\theta, P_\theta, Q_\theta)\). We are interested in the value of θ that makes the observed sequence \({\mathbf{y}}^n\) most likely.

Recall that the MLE of θ given that \({\mathbf{Y}}^n = {\mathbf{y}}^n\) is defined as

$$\displaystyle \begin{aligned} MLE[\theta | {\mathbf{Y}}^n = {\mathbf{y}}^n] = \arg \max_\theta P[ {\mathbf{Y}}^n = {\mathbf{y}}^n \mid \theta ]. \end{aligned}$$

As in the discussion of clustering, we have

$$\displaystyle \begin{aligned} P[ {\mathbf{Y}}^n = {\mathbf{y}}^n \mid \theta ] = \sum_{{\mathbf{x}}^n} P[ {\mathbf{Y}}^n = {\mathbf{y}}^n \mid {\mathbf{X}}^n = {\mathbf{x}}^n, \theta ]P[{\mathbf{X}}^n = {\mathbf{x}}^n | \theta]. \end{aligned} $$
(11.5)

11.4.1 HEM

The HEM algorithm replaces the sum over \({\mathbf{x}}^n\) in (11.5) by

$$\displaystyle \begin{aligned} P[ {\mathbf{Y}}^n = {\mathbf{y}}^n \mid {\mathbf{X}}^n = {\mathbf{x}}_*^n, \theta ]P[{\mathbf{X}}^n = {\mathbf{x}}_*^n | \theta] \end{aligned}$$

and then \(P[{\mathbf {X}}^n = {\mathbf {x}}_*^n | \theta ]\) by

$$\displaystyle \begin{aligned} P[{\mathbf{X}}^n = {\mathbf{x}}_*^n | {\mathbf{Y}}^n = {\mathbf{y}}^n, \theta_0], \end{aligned}$$

where

$$\displaystyle \begin{aligned} {\mathbf{x}}_*^n = MAP[{\mathbf{X}}^n | {\mathbf{Y}}^n = {\mathbf{y}}^n, \theta_0]. \end{aligned}$$

Recall that one can find \({\mathbf {x}}_*^n\) by using Viterbi’s algorithm. Also,

$$\displaystyle \begin{aligned} & P[ {\mathbf{Y}}^n = {\mathbf{y}}^n \mid {\mathbf{X}}^n = {\mathbf{x}}^n, \theta ]P[{\mathbf{X}}^n = {\mathbf{x}}^n | \theta] \\ &~~~ = \pi_\theta (x_0) Q_\theta (x_0, y_0) P_\theta (x_0, x_1) Q_\theta (x_1, y_1) \times \cdots \times P_\theta (x_{n-1}, x_n)Q_\theta (x_n, y_n). \end{aligned} $$
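
To make the alternation concrete, here is a hedged Python sketch of one possible HEM iteration for the special case where θ is taken to be the entries of P and Q themselves, with \(\pi_0\) assumed known; it reuses the sample_hmc and viterbi functions from the earlier sketches, and the “true” parameters used to generate the data are arbitrary.

```python
import numpy as np
# Assumes sample_hmc and viterbi from the earlier sketches are defined.

rng = np.random.default_rng(3)

# Generate observations from a "true" model (arbitrary values), then discard X^n.
pi0 = np.array([0.5, 0.5])
P_true = np.array([[0.9, 0.1], [0.2, 0.8]])
Q_true = np.array([[0.8, 0.2], [0.3, 0.7]])
_, y = sample_hmc(2000, pi0, P_true, Q_true, rng)

# Initial guess theta_0 = (P, Q).
P = np.array([[0.7, 0.3], [0.3, 0.7]])
Q = np.array([[0.7, 0.3], [0.4, 0.6]])

for _ in range(10):
    # Most likely state sequence under the current guess (Viterbi).
    x = np.array(viterbi(y, pi0, P, Q))
    # Re-estimate P and Q by empirical frequencies along the decoded path.
    for a in range(2):
        succ = x[1:][x[:-1] == a]               # states that follow visits to a
        if len(succ) > 0:
            P[a] = np.bincount(succ, minlength=2) / len(succ)
        obs = y[x == a]                         # observations emitted in state a
        if len(obs) > 0:
            Q[a] = np.bincount(obs, minlength=2) / len(obs)

print("estimated P:\n", P)
print("estimated Q:\n", Q)
```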

11.4.2 Training the Viterbi Algorithm

The Viterbi algorithm requires knowing P and Q. In practice, Q depends on the speaker and P may depend on the local dialect. (Valley speakers use more “likes” than Berkeley speakers.) We explained that if a parametric model is available, then one can use HEM.

Without a parametric model, a simple supervised training approach, where one knows both \({\mathbf{x}}^n\) and \({\mathbf{y}}^n\), is to estimate P and Q by using empirical frequencies. For instance, the number of pairs \((x_m, x_{m+1})\) that are equal to (a, b) in \({\mathbf{x}}^n\), divided by the number of times that \(x_m = a\), provides an estimate of P(a, b). The estimation of Q is similar.
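
For example, these empirical-frequency estimates can be computed as in the following Python sketch (the training pair (x^n, y^n) below is an arbitrary placeholder):

```python
import numpy as np

def estimate_P_Q(x, y, num_states, num_obs):
    """Empirical-frequency estimates of P and Q from an observed pair (x^n, y^n)."""
    x, y = np.asarray(x), np.asarray(y)
    P_hat = np.zeros((num_states, num_states))
    Q_hat = np.zeros((num_states, num_obs))
    for a in range(num_states):
        # P_hat(a, b) = #{m : (x_m, x_{m+1}) = (a, b)} / #{m < n : x_m = a}
        succ = x[1:][x[:-1] == a]
        if len(succ) > 0:
            P_hat[a] = np.bincount(succ, minlength=num_states) / len(succ)
        # Q_hat(a, y) = #{m : x_m = a, y_m = y} / #{m : x_m = a}
        obs = y[x == a]
        if len(obs) > 0:
            Q_hat[a] = np.bincount(obs, minlength=num_obs) / len(obs)
    return P_hat, Q_hat

# Arbitrary placeholder training data.
x = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
y = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
P_hat, Q_hat = estimate_P_Q(x, y, num_states=2, num_obs=2)
print("P_hat:\n", P_hat)
print("Q_hat:\n", Q_hat)
```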

11.5 Summary

  • Hidden Markov Chain;

  • Viterbi Algorithm for MAP[X|Y];

  • Clustering and Expectation Maximization;

  • EM for HMC.

11.5.1 Key Equations and Formulas

Definition of HMC: \(X(n)\) is a Markov chain, together with the observation probabilities \(P[Y_n | X_n]\) (D.11.1)

Bellman–Ford Equations: \(V_m(x) = \min_{x'}\{d_{m+1}(x, x') + V_{m+1}(x')\}\) (11.3)

EM, Soft and Hard: \(\theta \rightarrow \mathbf{z} \rightarrow \mathbf{x}\); heuristics to compute MAP[θ | x] (S.11.3)

11.6 References

The text by Wainwright and Jordan (2008) is a great presentation of graphical models. It covers expectation maximization and many other useful techniques.

11.7 Problems

Problem 11.1

Let \((X_n, Y_n)\) be a hidden Markov chain. Let \({\mathbf{Y}}^n = (Y_0, \ldots, Y_n)\) and \({\mathbf{X}}^n = (X_0, \ldots, X_n)\). The Viterbi algorithm computes:

  • \(MLE[{\mathbf{Y}}^n | {\mathbf{X}}^n]\);

  • \(MLE[{\mathbf{X}}^n | {\mathbf{Y}}^n]\);

  • \(MAP[{\mathbf{Y}}^n | {\mathbf{X}}^n]\);

  • \(MAP[{\mathbf{X}}^n | {\mathbf{Y}}^n]\).

Problem 11.2

Assume that the Markov chain \(X_n\) is such that \(\mathcal {X} = \{a, b\}\), \(\pi_0(a) = \pi_0(b) = 0.5\), and P(x, x′) = α for x ≠ x′ and P(x, x) = 1 − α. Assume also that \(X_n\) is observed through a BSC with error probability 𝜖, as shown in Fig. 11.9. Implement the Viterbi algorithm and evaluate its performance.

Fig. 11.9 A simple hidden Markov chain

Problem 11.3

Suppose that the grades of students in a class are distributed as a mixture of two Gaussian distributions, \(N(\mu _1,\sigma ^2_1)\) with probability p and \(N(\mu _2,\sigma ^2_2)\) with probability 1 − p. All the parameters \(\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, p)\) are unknown.

  (a) You observe n i.i.d. samples \(y_1, \ldots, y_n\) drawn from the mixture distribution. Find \(f(y_1, \ldots, y_n | \theta)\).

  (b) Let the type random variable \(X_i\) be 0 if \(Y_i \sim N(\mu _1,\sigma ^2_1)\) and 1 if \(Y_i \sim N(\mu _2,\sigma ^2_2)\). Find \(MAP[X_i | Y_i, \theta]\).

  (c) Implement the hard EM algorithm to approximately find \(MLE[\theta | Y_1, \ldots, Y_n]\). To this end, use MATLAB to generate 1000 data points \((y_1, \ldots, y_{1000})\) according to θ = (10, 4, 30, 6, 0.4). Use your data to estimate θ. How well does your algorithm work?