Abstract
Speech recognition can be formulated as the problem of guessing a sequence of words that produces a sequence of sounds. The human brain is remarkably good at solving this problem, even though the same words can correspond to many different sounds, because of accents or characteristics of the voice. Moreover, the environment is always noisy, so that listeners hear a corrupted version of the speech.
Computers are getting much better at speech recognition and voice command systems are now common for smartphones (Siri), automobiles (GPS, music, and climate control), call centers, and dictation systems. In this chapter, we explain the main ideas behind the algorithms for speech recognition and for related applications.
The starting point is a model of the random sequence (e.g., words) to be recognized and of how this sequence is related to the observation (e.g., voice). The main model is called a hidden Markov chain. The idea is that the successive parts of speech form a Markov chain and that each word maps randomly to some sounds. The same model is used to decode strings of symbols in communication systems.
Section 11.1 is a general discussion of learning. The hidden Markov chain model used in speech recognition and in error decoding is introduced in Sect. 11.2. That section explains the Viterbi algorithm. Section 11.3 discusses expectation maximization and clustering algorithms. Section 11.4 covers learning for hidden Markov chains.
Application: Recognizing Speech. Topics: Hidden Markov chain, Viterbi decoding, EM algorithms.
11.1 Learning: Concepts and Examples
In artificial intelligence, “learning” refers to the process of discovering the relationship between related items, for instance between spoken words and sounds heard (Fig. 11.1).
As a simple example, consider the binary symmetric channel example of Problem 7.5 in Chap. 7. The inputs X n are i.i.d. B(p) and, given the inputs, the output Y n is equal to X n with probability 1 − 𝜖, for n ≥ 0. In this example, there is a probabilistic relationship between the inputs and the outputs described by 𝜖. Learning here refers to estimating 𝜖.
There are two basic situations. In supervised learning, one observes the inputs {X n, n = 0, …, N} and the outputs {Y n, n = 0, …, N}. One can think of this form of learning as a training phase for the system. Thus, one observes the channel with a set of known input values. Once one has “learned” the channel, i.e., estimated 𝜖, one can then design the best receiver and use it on unknown inputs. In unsupervised learning, one observes only the outputs. The benefit of this form of learning is that it takes place while the system is operational and one does not “waste” time with a training phase. Also, the system can adapt automatically to slow changes of 𝜖 without having to re-train it with a new training phase.
As you can expect, there is a trade-off when choosing supervised versus unsupervised learning. A training phase takes time but the learning is faster than in unsupervised learning. The best method to use depends on characteristics of the practical situation, such as the likely rate of change of the system parameters.
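To make the two forms of learning concrete, here is a short Python sketch for the binary symmetric channel example; the parameter values (p = 0.3, 𝜖 = 0.1, and the sample size) are invented for illustration. In the supervised case we count disagreements between inputs and outputs; in the unsupervised case we observe only the outputs and, assuming p is known, solve P(Y = 1) = p(1 − 𝜖) + (1 − p)𝜖 for 𝜖.

```python
import random

random.seed(0)
p, eps = 0.3, 0.1          # true parameters (eps is what we want to learn)
N = 100_000

X = [1 if random.random() < p else 0 for _ in range(N)]
Y = [x ^ (random.random() < eps) for x in X]   # flip each bit w.p. eps

# Supervised: we see both inputs and outputs, so count disagreements.
eps_sup = sum(x != y for x, y in zip(X, Y)) / N

# Unsupervised: we see only Y.  Since P(Y=1) = p(1-eps) + (1-p)eps,
# knowing p lets us solve for eps from the empirical frequency of 1s.
q = sum(Y) / N                       # estimate of P(Y = 1)
eps_unsup = (q - p) / (1 - 2 * p)    # valid when p != 1/2

print(round(eps_sup, 3), round(eps_unsup, 3))
```

Both estimates approach 0.1 as N grows; the unsupervised one is noisier, which illustrates the trade-off discussed above.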
11.2 Hidden Markov Chain
A hidden Markov chain is a Markov chain together with a state observation model. The Markov chain is {X(n), n ≥ 0} and it has its transition matrix P on the state space \(\mathcal {X}\) and its initial distribution π 0. The state observation model specifies that when the state of the Markov chain is x, one observes a value y with probability Q(x, y), for \(y \in \mathcal {Y}\). More precisely, here is the definition (Fig. 11.2).
Definition 11.1 (Hidden Markov Chain)
A hidden Markov chain is a random sequence {(X(n), Y (n)), n ≥ 0} such that \(X(n) \in \mathcal {X} = \{1, \ldots , N\}\) and \(Y(n) \in \mathcal {Y} = \{1, \ldots , M\}\) and

$$\displaystyle \begin{aligned} P[X(0) = x_0, Y(0) = y_0, \ldots , X(n) = x_n, Y(n) = y_n] = \pi _0(x_0) Q(x_0, y_0) \prod _{m=1}^n P(x_{m-1}, x_m) Q(x_m, y_m) \end{aligned}$$ (11.1)

for all n ≥ 0, \(x_m \in \mathcal {X}\), and \(y_m \in \mathcal {Y}\).
◇
In the speech recognition application, the X n are “parts of speech,” i.e., segments of sentences, and the Y n are sounds. The structure of the language determines relationships between the X n that can be approximated by a Markov chain. The relationship between X n and Y n is speaker-dependent.
The recognition problem is the following. Assume that you have observed that Y n := (Y 0, …, Y n) = y n := (y 0, …, y n). What is the most likely sequence X n := (X 0, …, X n)? That is, in the terminology of Chap. 7, we want to compute MAP[X n|Y n = y n].

Thus, we want to find the sequence \({\mathbf {x}}^n \in \mathcal {X}^{n+1}\) that maximizes \(P[{\mathbf {X}}^n = {\mathbf {x}}^n | {\mathbf {Y}}^n = {\mathbf {y}}^n]\). Note that

$$\displaystyle \begin{aligned} P[{\mathbf {X}}^n = {\mathbf {x}}^n | {\mathbf {Y}}^n = {\mathbf {y}}^n] = \frac{P[{\mathbf {X}}^n = {\mathbf {x}}^n, {\mathbf {Y}}^n = {\mathbf {y}}^n]}{P[{\mathbf {Y}}^n = {\mathbf {y}}^n]}. \end{aligned}$$

The MAP is the value of x n that maximizes the numerator. Now, by (11.1), the logarithm of the numerator is equal to

$$\displaystyle \begin{aligned} \log \pi _0(x_0) + \log Q(x_0, y_0) + \sum _{m=1}^n \left[ \log P(x_{m-1}, x_m) + \log Q(x_m, y_m) \right]. \end{aligned}$$

Define

$$\displaystyle \begin{aligned} d_0(x_0) = - \log \left[ \pi _0(x_0) Q(x_0, y_0) \right] \end{aligned}$$

and

$$\displaystyle \begin{aligned} d_m(x_{m-1}, x_m) = - \log \left[ P(x_{m-1}, x_m) Q(x_m, y_m) \right], \quad m = 1, \ldots , n. \end{aligned}$$

Then, the MAP is the sequence x n that minimizes

$$\displaystyle \begin{aligned} d_0(x_0) + \sum _{m=1}^n d_m(x_{m-1}, x_m). \end{aligned}$$ (11.2)
The expression (11.2) can be viewed as the length for a path in the graph shown in Fig. 11.3. Finding the MAP is then equivalent to solving a shortest path problem. There are a few standard algorithms for solving such problems. We describe the Bellman–Ford Algorithm due to Bellman (Fig. 11.4) and Ford.
For m = 0, …, n and \(x \in \mathcal {X}\), let V m(x) be the length of the shortest path from X(m) = x to the column X(n) in the graph. Also, let V n(x) = 0 for all \(x \in \mathcal {X}\). Then, one has

$$\displaystyle \begin{aligned} V_m(x) = \min _{x' \in \mathcal {X}} \left\{ d_{m+1}(x, x') + V_{m+1}(x') \right\}, \quad m = n-1, \ldots , 0. \end{aligned}$$ (11.3)

Finally, let

$$\displaystyle \begin{aligned} V = \min _{x_0 \in \mathcal {X}} \left\{ d_0(x_0) + V_0(x_0) \right\}. \end{aligned}$$ (11.4)

Then, V is the minimum value of expression (11.2).
The algorithm is then as follows:
- Step (1): Calculate \(\{V_m(x), x \in \mathcal {X}\}\) recursively for m = n − 1, n − 2, …, 0, using (11.3). At each step, note the arc out of each x that achieves the minimum. Say that the arc out of x m = x goes to x m+1 = s(m, x) for \(x \in \mathcal {X}\).
- Step (2): Find the value x 0 that achieves the minimum in (11.4).
- Step (3): The MAP is then the sequence
$$\displaystyle \begin{aligned} x_0, x_1 = s(0, x_0), x_2 = s(1, x_1), \ldots, x_n = s(n-1, x_{n-1}). \end{aligned}$$
Equations (11.3) are the Bellman–Ford Equations. They are a particular version of Dynamic Programming Equations (DPE) for the shortest path problem.
Note that the essential idea was to define the length of the shortest remaining path starting from every node in the graph and to write recursive expressions for those quantities. Thus, one solves the DPE backwards and then one finds the shortest path forward. This application of the shortest path algorithm for finding a MAP is called the Viterbi Algorithm due to Andrew Viterbi (Fig. 11.5).
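The backward recursion and forward path recovery above can be sketched in Python. This is an illustrative implementation, not from the text: it computes the costs d defined earlier from π 0, P, and Q, solves (11.3) backwards, and then follows the noted arcs forward. The two-state chain and observation matrix at the end are made-up numbers.

```python
import math

def viterbi(pi0, P, Q, y):
    """MAP state sequence for a hidden Markov chain:
    minimize d_0(x_0) + sum_m d_m(x_{m-1}, x_m) with d = -log(prob)."""
    n = len(y) - 1
    S = range(len(pi0))

    def d0(x):        # -log( pi0(x) Q(x, y0) )
        return -math.log(pi0[x] * Q[x][y[0]])

    def d(m, xp, x):  # -log( P(xp, x) Q(x, ym) )
        return -math.log(P[xp][x] * Q[x][y[m]])

    # Backward pass: V[m][x] = length of shortest path from X(m)=x to column n.
    V = [[0.0] * len(pi0) for _ in range(n + 1)]
    succ = [[0] * len(pi0) for _ in range(n + 1)]
    for m in range(n - 1, -1, -1):
        for x in S:
            best = min(S, key=lambda xn: d(m + 1, x, xn) + V[m + 1][xn])
            succ[m][x] = best
            V[m][x] = d(m + 1, x, best) + V[m + 1][best]

    # Forward pass: pick x0 minimizing d0(x0) + V0(x0), then follow the arcs.
    x0 = min(S, key=lambda x: d0(x) + V[0][x])
    path = [x0]
    for m in range(n):
        path.append(succ[m][path[-1]])
    return path

# Example: a sticky two-state chain observed through noise.
pi0 = [0.5, 0.5]
P = [[0.9, 0.1], [0.1, 0.9]]     # the state rarely changes
Q = [[0.8, 0.2], [0.2, 0.8]]     # observe the state correctly w.p. 0.8
print(viterbi(pi0, P, Q, [0, 0, 1, 0, 0, 1, 1, 1]))
# -> [0, 0, 0, 0, 0, 1, 1, 1]: the isolated observation 1 is explained
#    as noise, while the final run of 1s is explained by a state change.
```

The nested loops make the cost O(n N²) for N states and n + 1 observations, which is what makes the MAP computation tractable compared with enumerating all N^{n+1} sequences.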
11.3 Expectation Maximization and Clustering
Expectation maximization is a class of algorithms to estimate parameters of distributions. We first explain these algorithms on a simple clustering problem. We apply expectation maximization to the HMC model in the next section.
The clustering problem consists in grouping sample points into clusters of “similar” values. We explain a simple instance of this problem and we discuss the expectation maximization algorithm.
11.3.1 A Simple Clustering Problem
You look at a set of N exam results {X(1), …, X(N)} in your probability course and you must decide who are the A students and who are the B students. To study this problem, we assume that the results of A students are i.i.d. \(\mathcal {N}(a, \sigma ^2)\) and those of B students are \(\mathcal {N}(b, \sigma ^2)\), where a > b.
For simplicity, assume that we know σ 2 and that each student has probability 0.5 of being an A student. However, we do not know the parameters (a, b).
(The same method applies when one does not know the variances of the scores of A and B students, nor the prior probability that a student is of type A.)
One heuristic is as follows (see Fig. 11.6). Start with a guess (a 1, b 1) for (a, b). Student n with score X(n) is more likely to be of type A if X(n) > (a 1 + b 1)∕2. Let us declare that such students are of type A and the others are of type B. Let then a 2 be the average score of the students declared to be of type A and b 2 that of the other students. We repeat the procedure after replacing (a 1, b 1) by (a 2, b 2) and we keep doing this until the values seem to converge. This heuristic is called the hard expectation maximization algorithm.
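A minimal Python sketch of this hard heuristic follows; the class sizes, true means (85 and 65), variance, and initial guess are invented for illustration.

```python
import random

random.seed(1)
# Synthetic class: 200 A students ~ N(85, 5^2), 200 B students ~ N(65, 5^2).
scores = [random.gauss(85, 5) for _ in range(200)] + \
         [random.gauss(65, 5) for _ in range(200)]

a, b = 90.0, 50.0            # initial guess (a1, b1)
for _ in range(20):
    # Hard assignment: declare type A iff the score exceeds the midpoint.
    A = [x for x in scores if x > (a + b) / 2]
    B = [x for x in scores if x <= (a + b) / 2]
    # Update: new guesses are the group averages.
    a, b = sum(A) / len(A), sum(B) / len(B)

print(round(a, 1), round(b, 1))   # close to (85, 65)
```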
A slightly different heuristic is as follows (see Fig. 11.7). Again, we start with a guess (a 1, b 1).
Using Bayes' rule, we calculate the probability p(n) that student n with score X(n) is of type A:

$$\displaystyle \begin{aligned} p(n) = \frac{\exp \{ - (X(n) - a_1)^2 / (2 \sigma ^2) \}}{\exp \{ - (X(n) - a_1)^2 / (2 \sigma ^2) \} + \exp \{ - (X(n) - b_1)^2 / (2 \sigma ^2) \}}. \end{aligned}$$

We then calculate

$$\displaystyle \begin{aligned} a_2 = \frac{\sum _n p(n) X(n)}{\sum _n p(n)} \quad \mbox{and} \quad b_2 = \frac{\sum _n (1 - p(n)) X(n)}{\sum _n (1 - p(n))}. \end{aligned}$$

We then repeat after replacing (a 1, b 1) by (a 2, b 2). Thus, the calculation of a 2 weighs the scores of the students by the likelihood that they are of type A, and similarly for the calculation of b 2.
This heuristic is called the soft expectation maximization algorithm.
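The soft variant differs from the hard one only in the assignment step, where 0/1 decisions are replaced by the posterior probabilities p(n). A sketch under invented parameters (true means 85 and 65, σ = 5):

```python
import math, random

random.seed(2)
sigma = 5.0
scores = [random.gauss(85, sigma) for _ in range(200)] + \
         [random.gauss(65, sigma) for _ in range(200)]

a, b = 90.0, 50.0                      # initial guess (a1, b1)
for _ in range(30):
    # p(n) = P(type A | score), by Bayes' rule with prior 1/2 for each type.
    def pA(x):
        la = math.exp(-(x - a) ** 2 / (2 * sigma ** 2))
        lb = math.exp(-(x - b) ** 2 / (2 * sigma ** 2))
        return la / (la + lb)
    w = [pA(x) for x in scores]
    # Weighted averages replace the hard group averages.
    a = sum(p * x for p, x in zip(w, scores)) / sum(w)
    b = sum((1 - p) * x for p, x in zip(w, scores)) / sum(1 - p for p in w)

print(round(a, 1), round(b, 1))
```

Students near the midpoint contribute to both averages in proportion to their posterior probabilities, instead of being forced into one group.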
11.3.2 A Second Look
In the previous example, one attempts to estimate some parameter θ = (a, b) based on some observations X = (X 1, …, X N). Let Z = (Z 1, …, Z N) where Z n = A if student n is of type A and Z n = B otherwise.
We would like to maximize f[x|θ] over θ, to find MLE[θ|X = x]. One has

$$\displaystyle \begin{aligned} f[\mathbf {x} | \theta ] = \sum _{\mathbf {z}} P[\mathbf {Z} = \mathbf {z} | \theta ] f[\mathbf {x} | \mathbf {z}, \theta ], \end{aligned}$$

where the sum is over the \(2^N\) possible values of Z. This is computationally too difficult.
Hard EM (Fig. 11.8) replaces the sum over z by

$$\displaystyle \begin{aligned} P[\mathbf {Z} = \mathbf {z}^* | \theta ] f[\mathbf {x} | \mathbf {z}^*, \theta ] = f[\mathbf {x}, \mathbf {z}^* | \theta ], \end{aligned}$$

where z ∗ is the most likely value of Z given the observations and a current guess for θ. That is, if the current guess is θ k, then

$$\displaystyle \begin{aligned} \mathbf {z}^* = \arg \max _{\mathbf {z}} P[\mathbf {Z} = \mathbf {z} | \mathbf {X} = \mathbf {x}, \theta _k]. \end{aligned}$$

The next guess is then

$$\displaystyle \begin{aligned} \theta _{k+1} = \arg \max _{\theta } f[\mathbf {x}, \mathbf {z}^* | \theta ]. \end{aligned}$$
Soft EM makes a different approximation. First, it replaces

$$\displaystyle \begin{aligned} \log f[\mathbf {x} | \theta ] = \log \Big( \sum _{\mathbf {z}} P[\mathbf {Z} = \mathbf {z} | \theta ] f[\mathbf {x} | \mathbf {z}, \theta ] \Big) \end{aligned}$$

by

$$\displaystyle \begin{aligned} \sum _{\mathbf {z}} P[\mathbf {Z} = \mathbf {z} | \theta ] \log f[\mathbf {x} | \mathbf {z}, \theta ]. \end{aligned}$$

That is, it replaces the logarithm of an expectation by the expectation of the logarithm. Second, it replaces the expression above by

$$\displaystyle \begin{aligned} \sum _{\mathbf {z}} P[\mathbf {Z} = \mathbf {z} | \mathbf {x}, \theta _k] \log f[\mathbf {x} | \mathbf {z}, \theta ], \end{aligned}$$

and the new guess θ k+1 is the maximizer of that expression over θ. Thus, it replaces the distribution of Z by the conditional distribution given the current guess and the observations.
If this heuristic did not work in practice, nobody would mention it. Surprisingly, it seems to work for some classes of problems. There is some theoretical justification for the heuristic. One can show that it converges to a local maximum of f[x|θ]. Generally, this is little comfort because most problems have many local maxima. See Roche (2012).
11.4 Learning: Hidden Markov Chain
Consider once again a hidden Markov chain model but assume that (π, P, Q) are functions of some parameter θ that we wish to estimate. We write this explicitly as (π θ, P θ, Q θ). We are interested in the value of θ that makes the observed sequence y n most likely.
Recall that the MLE of θ given that Y n = y n is defined as

$$\displaystyle \begin{aligned} MLE[\theta | {\mathbf {Y}}^n = {\mathbf {y}}^n] = \arg \max _{\theta } P[{\mathbf {Y}}^n = {\mathbf {y}}^n | \theta ]. \end{aligned}$$

As in the discussion of clustering, we have

$$\displaystyle \begin{aligned} P[{\mathbf {Y}}^n = {\mathbf {y}}^n | \theta ] = \sum _{{\mathbf {x}}^n} P[{\mathbf {X}}^n = {\mathbf {x}}^n, {\mathbf {Y}}^n = {\mathbf {y}}^n | \theta ]. \end{aligned}$$
11.4.1 HEM
The HEM algorithm replaces the sum over x n by

$$\displaystyle \begin{aligned} P[{\mathbf {X}}^n = {\mathbf {x}}_*^n, {\mathbf {Y}}^n = {\mathbf {y}}^n | \theta ] = P[{\mathbf {X}}^n = {\mathbf {x}}_*^n | \theta ] P[{\mathbf {Y}}^n = {\mathbf {y}}^n | {\mathbf {X}}^n = {\mathbf {x}}_*^n, \theta ] \end{aligned}$$

and then \(P[{\mathbf {X}}^n = {\mathbf {x}}_*^n | \theta ]\) by

$$\displaystyle \begin{aligned} \pi _{\theta }(x_{*,0}) \prod _{m=1}^n P_{\theta }(x_{*,m-1}, x_{*,m}), \end{aligned}$$

where

$$\displaystyle \begin{aligned} {\mathbf {x}}_*^n = MAP[{\mathbf {X}}^n | {\mathbf {Y}}^n = {\mathbf {y}}^n, \theta _k]. \end{aligned}$$

Recall that one can find \({\mathbf {x}}_*^n\) by using Viterbi's algorithm. Also,

$$\displaystyle \begin{aligned} P[{\mathbf {Y}}^n = {\mathbf {y}}^n | {\mathbf {X}}^n = {\mathbf {x}}_*^n, \theta ] = \prod _{m=0}^n Q_{\theta }(x_{*,m}, y_m). \end{aligned}$$
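The idea can be sketched in Python by alternating a Viterbi decode with a re-estimation step (this variant, sometimes called Viterbi training, takes θ to be the matrices P and Q themselves rather than a lower-dimensional parameter; the chain below, its matrices, and the forward form of the dynamic program are illustrative choices, not from the text).

```python
import math, random

random.seed(3)

# Generate observations from a "true" two-state hidden Markov chain.
# The learner sees only ys; P_true and Q_true are invented for illustration.
P_true = [[0.9, 0.1], [0.2, 0.8]]
Q_true = [[0.8, 0.2], [0.3, 0.7]]
x, ys = 0, []
for _ in range(5000):
    x = 0 if random.random() < P_true[x][0] else 1
    ys.append(0 if random.random() < Q_true[x][0] else 1)

def viterbi(P, Q, y):
    """MAP state sequence; a forward version of the dynamic program."""
    S = (0, 1)
    cost = [-math.log(0.5 * Q[s][y[0]]) for s in S]   # uniform pi_0
    back = []
    for m in range(1, len(y)):
        arg = [min(S, key=lambda p: cost[p] - math.log(P[p][s])) for s in S]
        cost = [cost[arg[s]] - math.log(P[arg[s]][s] * Q[s][y[m]]) for s in S]
        back.append(arg)
    path = [min(S, key=lambda s: cost[s])]
    for arg in reversed(back):
        path.append(arg[path[-1]])
    return path[::-1]

def reestimate(xs, y):
    """Empirical transition/emission frequencies along the decoded path."""
    P = [[1.0, 1.0], [1.0, 1.0]]   # add-one smoothing avoids log(0)
    Q = [[1.0, 1.0], [1.0, 1.0]]
    for a, b in zip(xs, xs[1:]):
        P[a][b] += 1
    for a, c in zip(xs, y):
        Q[a][c] += 1
    norm = lambda M: [[u / (u + v), v / (u + v)] for u, v in M]
    return norm(P), norm(Q)

# HEM iteration: decode with the current guess, then update the guess.
P = [[0.6, 0.4], [0.4, 0.6]]
Q = [[0.6, 0.4], [0.4, 0.6]]
for _ in range(10):
    xs = viterbi(P, Q, ys)
    P, Q = reestimate(xs, ys)
print([[round(v, 2) for v in r] for r in P])
```

The estimates typically recover the sticky structure of the chain, although Viterbi training is known to bias the parameters toward the decoded path.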
11.4.2 Training the Viterbi Algorithm
The Viterbi algorithm requires knowing P and Q. In practice, Q depends on the speaker and P may depend on the local dialect. (Valley speakers use more "likes" than Berkeley speakers.) We explained that if a parametric model is available, then one can use HEM.
Without a parametric model, a simple supervised training approach where one knows both x n and y n is to estimate P and Q by using empirical frequencies. For instance, the number of pairs (x m, x m+1) that are equal to (a, b) in x n divided by the number of times that x m = a provides an estimate of P(a, b). The estimation of Q is similar.
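For instance, in Python (the short state and observation sequences below are made up to keep the counts easy to check):

```python
from collections import Counter

# Supervised training data: states x and observations y are both known.
x = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
y = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]

trans = Counter(zip(x, x[1:]))          # counts of pairs (x_m, x_{m+1})
visits = Counter(x[:-1])                # times x_m = a as a transition source
P_hat = {(a, b): trans[(a, b)] / visits[a] for (a, b) in trans}

emit = Counter(zip(x, y))               # counts of pairs (x_m, y_m)
occ = Counter(x)                        # times the state equals a
Q_hat = {(a, c): emit[(a, c)] / occ[a] for (a, c) in emit}

# E.g., (0,0) occurs in 3 of the 5 transitions out of state 0, so
# P_hat[(0,0)] = 0.6; state 1 emits 1 in 3 of its 4 visits, so
# Q_hat[(1,1)] = 0.75.
print(P_hat[(0, 0)], Q_hat[(1, 1)])
```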
11.5 Summary
- Hidden Markov Chain;
- Viterbi Algorithm for MAP[X|Y];
- Clustering and Expectation Maximization;
- EM for HMC.
11.5.1 Key Equations and Formulas
11.6 References
The text Wainwright and Jordan (2008) is a great presentation of graphical models. It covers expectation maximization and many other useful techniques.
11.7 Problems
Problem 11.1
Let (X n, Y n) be a hidden Markov chain. Let Y n = (Y 0, …, Y n) and X n = (X 0, …, X n). The Viterbi algorithm computes
- MLE[Y n|X n];
- MLE[X n|Y n];
- MAP[Y n|X n];
- MAP[X n|Y n].
Problem 11.2
Assume that the Markov chain X n is such that \(\mathcal {X} = \{a, b\}\), π 0(a) = π 0(b) = 0.5 and P(x, x′) = α for x ≠ x′ and P(x, x) = 1 − α. Assume also that X n is observed through a BSC with error probability 𝜖, as shown in Fig. 11.8. Implement the Viterbi algorithm and evaluate its performance.
Problem 11.3
Suppose that the grades of students in a class are distributed as a mixture of two Gaussian distributions, \(N(\mu _1,\sigma ^2_1)\) with probability p and \(N(\mu _2,\sigma ^2_2)\) with probability 1 − p. All the parameters θ = (μ 1, σ 1, μ 2, σ 2, p) are unknown.
- (a) You observe n i.i.d. samples, y 1, …, y n, drawn from the mixture distribution. Find f(y 1, …, y n|θ).
- (b) Let the type random variable X i be 0 if \(Y_i \sim N(\mu _1,\sigma ^2_1)\) and 1 if \(Y_i \sim N(\mu _2,\sigma ^2_2)\). Find MAP[X i|Y i, θ].
- (c) Implement the Hard EM algorithm to approximately find MLE[θ|Y 1, …, Y n]. To this end, use MATLAB to generate 1000 data points (y 1, …, y 1000) according to θ = (10, 4, 30, 6, 0.4). Use your data to estimate θ. How well does your algorithm work?
References
E. Roche, EM algorithm and variants: an informal tutorial (2012). arXiv:1105.1476v2 [stat.CO]
M.J. Wainwright, M. Jordan, Graphical Models, Exponential Families, and Variational Inference (Now Publishers, Boston, 2008)
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
© 2021 The Author(s)
Walrand, J. (2021). Speech Recognition: A. In: Probability in Electrical Engineering and Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-49995-2_11
Print ISBN: 978-3-030-49994-5
Online ISBN: 978-3-030-49995-2