The human body is an intricate network of multiple functioning sub-systems. Many unobserved processes quietly keep running within the body even while we remain largely unconscious of them. For decades, scientists have sought to understand how different physiological systems work and how they can be mathematically modeled. Mathematical models of biological systems provide key scientific insights and also help guide the development of technologies for treating disorders when proper functioning no longer occurs. One of the challenges encountered with physiological systems is that, in a number of instances, the quantities we are interested in are difficult to observe directly or remain completely inaccessible. This could be either because they are located deep within the body or simply because they are more abstract (e.g., emotion). Consider the heart, for instance. The left ventricle pumps out blood through the aorta to the rest of the body. Blood pressure inside the aorta (known as central aortic pressure) has been considered a useful predictor of the future risk of developing cardiovascular disease, perhaps even more useful than the conventional blood pressure measurements taken from the upper arm [1]. However, measuring blood pressure inside the aorta is difficult. Consequently, researchers have had to rely on developing mathematical models with which to estimate central aortic pressure using other peripheral measurements (e.g., [2]). The same could be said regarding the recovery of CRH (corticotropin-releasing hormone) secretion timings within the hypothalamus—a largely inaccessible structure deep within the brain—using cortisol measurements in the blood based on mathematical relationships [3]. Emotions could also be placed in this same category. They are difficult to measure because of their inherently abstract nature. Emotions, however, do cause changes in heart rate, sweating, and blood pressure that can be measured and with which someone’s feelings can be estimated. What we have described so far, in a sense, captures the big picture underlying this book. We have physiological quantities that are difficult to observe directly, we have measurements that are easier to acquire, and we have the ability to build mathematical models to estimate those inaccessible quantities.

Let us now consider some examples where the quantities we are interested in are rather abstract. Consider a situation where new employees at an organization are being taught a new task to be performed at a computer. Let us assume that each employee has a cognitive “task learning” state. Suppose also that the training sessions are accompanied by short quizzes at the end of each section. If we were to record how the employees performed (e.g., how many answers they got correct and how much time they took), could we somehow determine this cognitive learning state and see how it gradually changes over time? The answer is indeed yes: with the help of a mathematical model, we can estimate such a state and track an employee’s progress over time. We will, however, first need to build a model that relates learning to quiz performance. As you can see, the basic idea of building models that relate difficult-to-access quantities to more easily acquired measurements, and then using those models to estimate the inaccessible quantities, is a powerful one. In this book, we will see how state-space models can be used to relate physiological/behavioral variables to experimental measurements.

State-space modeling is a mature field within controls engineering. In this book, we will address a specific subset of state-space models. Namely, we will consider a class of models where all or part of the observations are binary. You may wonder why binary observations are so important. In reality, a number of phenomena within the human body are binary in nature. For instance, the billions of neurons within our bodies function in a binary-like manner. When these neurons receive inputs, they either fire or they do not. The pumping action of the heart can also be seen as a binary mechanism. The heart is either in contraction and pumping out blood or it is not. The secretion of a number of pulsatile hormones can also be viewed in a similar manner. The glands responsible for pulsatile secretion are either secreting the hormone or not. Many other binary phenomena exist and are often encountered in biomedical applications. Consequently, physiological state-space models involving binary-valued observations have found extensive applications across a number of fields including behavioral learning [4,5,6,7,8,9], position and movement decoding based on neural spiking observations [10,11,12,13,14,15,16,17], anesthesia and comatose state regulation [18,19,20], sleep studies [21], heart rate analysis [22, 23], and cognitive flexibility [9, 24]. In this book, we will see how some of these models can be built and how they can be used to estimate unobserved states of interest.

1.1 Physiology, State-Space Models, and Estimation

As we have just stated, many things happen inside the human body, even while we are largely unaware that they are occurring. Energy continues to be produced through the actions of hormones and biochemicals, changes in emotion occur within the brain, and mental concentration varies throughout the day depending on the task at hand. Although these internal processes cannot be observed directly, they do give rise to changes in different physiological phenomena that can indeed be measured. For instance, while energy production cannot be observed directly, we can measure the hormone concentrations in the blood that affect the production mechanisms. Similarly, we can also measure physiological changes that emotions cause (e.g., changes in heart rate). Concentration or cognitive load also cannot be observed, but we can measure how quickly someone is getting their work done and how accurately they are performing. Let us now consider how state-space models relate such unobserved quantities to observed measurements.

Think of any control system such as a spring–mass–damper system or an RLC circuit (Fig. 1.1). Typically, in such a system, we have several internal state variables and some sensor measurements. Not all the states can be observed directly. However, sensor readings can and do provide some information about them. By deriving mathematical relationships between the sensor readings and the internal states, we can develop tools that enable us to estimate the unobserved states over time. For instance, we may not be able to directly measure all the voltages and currents in a circuit, but we can use Kirchhoff’s laws to derive relationships between what we cannot observe and what we do measure. Similarly, we may not be able to measure all the positions, velocities, or accelerations within a mechanical system, but we can derive similar relationships using Newton’s laws. Thus, a typical engineering system can be characterized via a state-space formulation as shown below (for the time being, we will ignore noise terms and nonlinearities).

$$\displaystyle \begin{aligned} {\mathbf{x}}_{k + 1} &= A{\mathbf{x}}_{k} + B{\mathbf{u}}_{k} {} \end{aligned} $$
(1.1)
$$\displaystyle \begin{aligned} {\mathbf{y}}_{k} &= C{\mathbf{x}}_{k}.{} \end{aligned} $$
(1.2)

Here, \({\mathbf {x}}_{k}\) is a vector representing the internal states of the system, \({\mathbf {y}}_{k}\) is a vector representing the sensor measurements, \({\mathbf {u}}_{k}\) is an external input, and A, B, and C are matrices. The state evolves with time following the mathematical relationship in (1.1). While we may be unable to observe \({\mathbf {x}}_{k}\) directly, we do have the sensor readings \({\mathbf {y}}_{k}\) that are related to it. The question is, can we now apply this formulation to the human body? In this case, \({\mathbf {x}}_{k}\) could be any of the unobserved quantities we just mentioned (e.g., energy production, emotion, or concentration) and \({\mathbf {y}}_{k}\) could be any related physiological measurement(s).
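As a concrete illustration, the short sketch below simulates how such a system evolves and how the sensor readings \({\mathbf {y}}_{k}\) are generated while the states \({\mathbf {x}}_{k}\) remain internal. The values chosen for A, B, and C here are purely hypothetical.

```python
import numpy as np

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])        # state transition matrix (hypothetical values)
B = np.array([[0.0],
              [1.0]])             # input matrix (hypothetical values)
C = np.array([[1.0, 0.0]])        # observation matrix (hypothetical values)

x = np.zeros((2, 1))              # internal state x_k (unobserved)
u = np.array([[1.0]])             # constant external input u_k

measurements = []
for k in range(50):
    x = A @ x + B @ u             # state evolution, Eq. (1.1)
    y = C @ x                     # sensor reading, Eq. (1.2)
    measurements.append(y.item()) # we only ever "see" y_k, not x_k

print(measurements[:5])
```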

Fig. 1.1

Some examples of engineering systems that can be modeled using state-space representations. The left sub-figure depicts a spring–mass–damper system, and the right sub-figure depicts an RLC circuit. We may not be able to directly observe all the states within each system, but we can build state-space models and use whatever measurements we have to estimate them

In this book, we will make use of an approach known as expectation–maximization (EM) for estimating unobserved quantities using state-space models. In a very simple way, here is what the EM algorithm does when applied to state estimation. Look back at (1.1) and (1.2). Now assume that this formulation governs how emotional states (\({\mathbf {x}}_{k}\)) vary within the brain and how they give rise to changes in heart rate and sweat secretions (\({\mathbf {y}}_{k}\)) that can be measured. We do not know \({\mathbf {x}}_{k}\) for \(k = 1, 2, \ldots , K\), and neither do we know A, B, or C. We only have the recorded sensor measurements (features) \({\mathbf {y}}_{k}\). First, we will assume some values for A, B, and C, i.e., we will begin by assuming that we know them. We will use this knowledge of A, B, and C to estimate \({\mathbf {x}}_{k}\) for \(k = 1, 2, \ldots , K\). We now have an estimate of \({\mathbf {x}}_{k}\) at every point in time. We will then use these \({\mathbf {x}}_{k}\)’s to come up with a better estimate for A, B, and C. We will then use those new values of A, B, and C to calculate an even better estimate for \({\mathbf {x}}_{k}\). The newest \({\mathbf {x}}_{k}\) will again be used to determine an even better A, B, and C. We will repeat these steps in turn until there is hardly any change in \({\mathbf {x}}_{k}\), A, B, or C. Our EM algorithm is said to have converged at this point. The step where \({\mathbf {x}}_{k}\) is estimated is known as the expectation-step or E-step, and the step where A, B, and C are calculated is known as the maximization-step or M-step. For the purpose of this book, we will label the E-step as the state estimation step and the M-step as the parameter estimation step. What follows next is a basic description of what we do at these steps in slightly more detail.
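Before diving into the details, the alternation itself can be sketched in a few lines of code. This is only a schematic skeleton: the functions state_estimation_step and parameter_estimation_step are hypothetical placeholders for the E-step and M-step equations derived in the chapters that follow.

```python
import numpy as np

def run_em(y, theta_init, state_estimation_step, parameter_estimation_step,
           tol=1e-6, max_iter=500):
    """Alternate between the E-step and M-step until the parameters converge."""
    theta = np.asarray(theta_init, dtype=float)
    x_est = None
    for iteration in range(max_iter):
        # E-step (state estimation): estimate x_k for k = 1, ..., K given theta
        x_est = state_estimation_step(y, theta)
        # M-step (parameter estimation): re-estimate theta given the x_k's
        theta_new = parameter_estimation_step(y, x_est)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    return x_est, theta
```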

1.1.1 State Estimation Step

As we have just stated, our EM algorithm consists of two steps: the state estimation step and the parameter estimation step. At the state estimation step, we assume that we know A, B, and C and try to estimate \({\mathbf {x}}_{k}\) for \(k = 1, 2, \ldots , K\). We do this sequentially. Again, look back at (1.1) and (1.2). Suppose you are at time index k and you know what A, B, C, and \({\mathbf {x}}_{k - 1}\) are; could you come up with a guess for \({\mathbf {x}}_{k}\)? You can also assume that you know what the external input \({\mathbf {u}}_{k}\) is for \(k = 1, 2, \ldots , K\). How would you determine \({\mathbf {x}}_{k}\)? First, note that we can re-write the equations as

$$\displaystyle \begin{aligned} {\mathbf{x}}_{k} &= A{\mathbf{x}}_{k - 1} + B{\mathbf{u}}_{k - 1} {} \end{aligned} $$
(1.3)
$$\displaystyle \begin{aligned} {\mathbf{y}}_{k} &= C{\mathbf{x}}_{k}{}. \end{aligned} $$
(1.4)

If you knew A, B, C, \({\mathbf {x}}_{k - 1}\), and \({\mathbf {u}}_{k - 1}\), and had to determine \({\mathbf {x}}_{k}\) just at time index k, you would encounter a small problem here. Do you see that \({\mathbf {x}}_{k}\) appears in both equations? You could simply plug in the values of \({\mathbf {x}}_{k - 1}\) and \({\mathbf {u}}_{k - 1}\) into (1.3) and get a value for \({\mathbf {x}}_{k}\). Since you are using the past values up to time index \((k - 1)\) to determine \({\mathbf {x}}_{k}\), this could be called the predict step. You are done, right? Not quite. If you determine \({\mathbf {x}}_{k}\) solely based on (1.3), you would always be discounting the sensor measurement \({\mathbf {y}}_{k}\) in (1.4). This sensor measurement is also an important source of information about \({\mathbf {x}}_{k}\). Therefore, at each time index k, we will first have the predict step, where we make use of (1.3) to guess what \({\mathbf {x}}_{k}\) is, and then apply an update step, where we will make use of \({\mathbf {y}}_{k}\) to improve the \({\mathbf {x}}_{k}\) value that we just predicted. The full state estimation step will therefore consist of a series of repeated predict, update, predict, update, \(\ldots \) steps for \(k = 1, 2, \ldots , K\). At the end of the state estimation step, we will have a complete set of values for \({\mathbf {x}}_{k}\).

Dealing with uncertainty is a reality with any engineering system model. These uncertainties arise due to noise in our sensor measurements, models that do not fully capture the actual physical system, and so on. We need to deal with this notion of uncertainty when designing state estimators. To do so, we will need some basic concepts in probability and statistics. What we have said so far regarding estimating \({\mathbf {x}}_{k}\) can be mathematically formulated in terms of two fundamental ideas in statistics: mean and variance. In reality, (1.3) and (1.4) should be

$$\displaystyle \begin{aligned} {\mathbf{x}}_{k} &= A{\mathbf{x}}_{k - 1} + B{\mathbf{u}}_{k - 1} + {\mathbf{e}}_{k} \end{aligned} $$
(1.5)
$$\displaystyle \begin{aligned} {\mathbf{y}}_{k} &= C{\mathbf{x}}_{k} + {\mathbf{v}}_{k}, \end{aligned} $$
(1.6)

where \({\mathbf {e}}_{k}\) is what we refer to as process noise and \({\mathbf {v}}_{k}\) is sensor noise. Therefore, when we “guess” what \({\mathbf {x}}_{k}\) is at the predict step, what we are really doing is determining the mean value of \({\mathbf {x}}_{k}\) given that we have observed all the data up to time index \((k - 1)\). There will also be a certain amount of uncertainty regarding this prediction for \({\mathbf {x}}_{k}\). We quantify this uncertainty in terms of variance. Thus we need to determine the mean and variance of \({\mathbf {x}}_{k}\) at our predict step. But what happens after we observe \({\mathbf {y}}_{k}\)? Again, the idea is the same. Now that we have two sources of information regarding \({\mathbf {x}}_{k}\) (one based on the prediction from \({\mathbf {x}}_{k - 1}\) and \({\mathbf {u}}_{k - 1}\), and the other based on the sensor reading \({\mathbf {y}}_{k}\)), we will still be determining the mean and variance of \({\mathbf {x}}_{k}\). So we need to calculate one mean and variance of \({\mathbf {x}}_{k}\) at the predict step, and another mean and variance of \({\mathbf {x}}_{k}\) at the update step.
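To see how the predict and update steps compute these means and variances, here is a minimal sketch for the simplest linear-Gaussian case: a scalar random walk observed with additive noise (i.e., A = C = 1 with no input). The noise variances and initial values below are assumed purely for illustration.

```python
import numpy as np

def predict_update_filter(y, sigma_e2=0.05, sigma_v2=0.5, x0=0.0, p0=1.0):
    """Scalar filter for x_k = x_{k-1} + e_k, y_k = x_k + v_k."""
    x, p = x0, p0
    means, variances = [], []
    for yk in y:
        # Predict: mean and variance of x_k given data up to (k - 1)
        x_pred = x
        p_pred = p + sigma_e2
        # Update: fold in the new sensor reading y_k
        gain = p_pred / (p_pred + sigma_v2)
        x = x_pred + gain * (yk - x_pred)
        p = (1 - gain) * p_pred
        means.append(x)
        variances.append(p)
    return np.array(means), np.array(variances)

# Usage: filter noisy observations of a slowly drifting state
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(0.0, np.sqrt(0.05), 100))
y = truth + rng.normal(0.0, np.sqrt(0.5), 100)
means, variances = predict_update_filter(y)
```

Note how the update step pulls the predicted mean toward the measurement by an amount that depends on the relative uncertainties of the prediction and the sensor.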

1.1.2 Parameter Estimation Step

Recall that our EM algorithm iterates between the state estimation step and the parameter estimation step until convergence. Assume that we sequentially progressed through repeated predict, update, predict, update, \(\ldots \) steps for \(k = 1, 2, \ldots , K\) and determined a set of mean and variance (uncertainty) values for \({\mathbf {x}}_{k}\). How could we use all of these mean and variance values to determine what A, B, and C are? Here is how we proceed. We first calculate the joint probability for all the \({\mathbf {x}}_{k}\) and \({\mathbf {y}}_{k}\) values. The best estimates for A, B, and C are the values that maximize this probability (or the log of this probability). Therefore, we need to maximize this probability with respect to A, B, and C. One simple way to determine the value at which a function is maximized is to take its derivative and solve for the location where it is 0. This is basically what we do to determine A, B, and C (in reality, we actually maximize the expected value or mean of the joint log probability of all the \({\mathbf {x}}_{k}\) and \({\mathbf {y}}_{k}\) values to determine A, B, and C).
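As a small worked example (a sketch, not one of the models derived later), consider a scalar state equation \(x_{k} = a x_{k - 1} + e_{k}\) with \(e_{k} \sim \mathcal {N}(0, \sigma ^{2}_{e})\), and suppose for the moment that the \(x_{k}\) values are treated as known. The joint log probability then contains the term \(-\sum _{k} (x_{k} - a x_{k - 1})^{2} / (2\sigma ^{2}_{e})\), and setting its derivative with respect to a to 0 gives

$$\displaystyle \begin{aligned} \frac{1}{\sigma^{2}_{e}}\sum_{k = 2}^{K} x_{k - 1}(x_{k} - a x_{k - 1}) = 0 \implies a = \frac{\sum_{k = 2}^{K} x_{k}x_{k - 1}}{\sum_{k = 2}^{K} x_{k - 1}^{2}}. \end{aligned} $$

In the actual M-step, the \(x_{k}\)’s are not known exactly, so terms such as \(x_{k}x_{k - 1}\) and \(x_{k - 1}^{2}\) are replaced by their expected values from the state estimation step.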

1.1.3 Algorithm Summary

In summary, we have to calculate means and variances at the state estimation step and derivatives at the parameter estimation step. We will show how these equations are derived in a number of examples in the chapters that follow. The EM approach enables us to build powerful state estimators that can determine internal physiological quantities that are only accessible through a set of sensor measurements.

What we have described so far is a very simple introduction to the EM algorithm as applied to state estimation. Moreover, for someone already familiar with state-space models, the predict and update steps we have just described should also sound familiar. These are concepts that are found in Kalman filtering. The derivation of the Kalman filter equations is generally approached from the point of view of solving a set of simultaneous equations as new sensor measurements keep coming in. In this book, we will not approach the design of the filters through traditional recursive least squares minimization approaches involving matrix computations. Instead, we will proceed from a statistical viewpoint, building up from the basics of mean and variance. Nevertheless, we will use the terminology of a filter when deriving the state estimation step equations. For reasons that will become clearer as we proceed, we can refer to these state estimators as Bayesian filters.

1.2 Book Outline

State-space models have been very useful in a number of physiological applications. In this book, we consider state-space models that give rise, fully or partially, to binary observations. We will begin our discussion of how to build Bayesian filters for physiological state estimation with the simplest cases, starting with a scalar-valued state \(x_{k}\) that follows the simple random walk

$$\displaystyle \begin{aligned} x_{k} &= x_{k - 1} + \varepsilon_{k}, \end{aligned} $$
(1.7)

where \(\varepsilon _{k} \sim \mathcal {N}(0, \sigma ^{2}_{\varepsilon })\) is process noise. We will consider how to derive the state and parameter estimation step equations when \(x_{k}\) gives rise to a single binary observation \(n_{k}\). We will next proceed to more complicated cases. For instance, one of the cases will be where we have a forgetting factor \(\rho \) such that

$$\displaystyle \begin{aligned} x_{k} &= \rho x_{k - 1} + \varepsilon_{k}{}, \end{aligned} $$
(1.8)

and \(x_{k}\) gives rise to both a binary observation \(n_{k}\) and a continuous observation \(r_{k}\). An even more complicated case will involve an external input so that

$$\displaystyle \begin{aligned} x_{k} &= \rho x_{k - 1} + \alpha I_{k} + \varepsilon_{k}{}, \end{aligned} $$
(1.9)

where \(\alpha I_{k}\) is similar to the \(B {\mathbf {u}}_{k}\) in (1.1), and \(x_{k}\) gives rise to a binary observation \(n_{k}\) and two continuous observations \(r_{k}\) and \(s_{k}\). As we shall see, changes in the state equation primarily affect the predict step within the state estimation step. In contrast, changes in the observations mainly affect the update step.
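As an illustration of how such a model generates data, the following sketch simulates (1.9) with hypothetical parameter values. For the binary observation we assume, purely for illustration, a logistic link between \(x_{k}\) and the probability that \(n_{k} = 1\); a continuous observation \(r_{k}\) is generated in the form used later in (1.11) (a second continuous observation \(s_{k}\) could be produced in the same way).

```python
import numpy as np

rng = np.random.default_rng(1)
K, rho, alpha, sigma_e = 200, 0.95, 0.4, 0.2     # hypothetical parameter values
I = (rng.random(K) < 0.1).astype(float)          # sparse external input I_k

x = np.zeros(K)
for k in range(1, K):
    x[k] = rho * x[k - 1] + alpha * I[k] + rng.normal(0.0, sigma_e)  # Eq. (1.9)

p = 1.0 / (1.0 + np.exp(-x))                     # assumed logistic link
n = (rng.random(K) < p).astype(int)              # binary observation n_k
r = 0.5 + 1.2 * x + rng.normal(0.0, 0.3, K)      # continuous observation r_k
```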

Note that we have now mentioned both binary and continuous observations (features). When introducing the concept of physiological state estimation for the first time, we used the formulation

$$\displaystyle \begin{aligned} {\mathbf{y}}_{k} &= C{\mathbf{x}}_{k} + {\mathbf{v}}_{k} \end{aligned} $$
(1.10)

for the sensor measurements. In reality, this represents a very simple case, and the equations turn out to be similar to that of a Kalman filter. Sensor measurements in biomedical experiments can take many forms. They can take the form of binary-valued observations, continuous-valued observations, and spiking-type observations, to name a few. For instance, we may need to estimate the learning state of a macaque monkey in a behavioral experiment based on whether the monkey gets the answers correct or incorrect in different trials (a binary observation), how quickly the monkey responds in each trial (a continuous observation), and how electrical activity from a specific neuron varies over the trials (a spiking-type observation). These types of measurements result in filter equations that are more complicated than in the case of a Kalman filter. We will rely heavily on Bayes’ rule to derive the mean and variance of \(x_{k}\) at the update step in each case.

While the state estimation step relies primarily on mean and variance calculations, the parameter estimation step relies mainly on derivatives. At the parameter estimation step, we take the derivatives of the probability terms (or equivalently, of the log-likelihood terms) to determine the model parameters. For instance, if we use the state equation in (1.8), we will need to determine \(\rho \) at the parameter estimation step. Moreover, we also need to determine the model parameters related to our observations. For instance, we may choose to model a continuous observation \(r_{k}\) as

$$\displaystyle \begin{aligned} r_{k} &= \gamma_{0} + \gamma_{1}x_{k} + v_{k}, \end{aligned} $$
(1.11)

where \(\gamma _{0}\) and \(\gamma _{1}\) are constant coefficients and \(v_{k} \sim \mathcal {N}(0, \sigma ^{2}_{v})\) is sensor noise. The three parameters \(\gamma _{0}\), \(\gamma _{1}\), and \(\sigma ^{2}_{v}\) all need to be determined at the parameter estimation step. We could thus divide the parameter estimation step derivations into two parts. First, there will be the derivations for model parameters in the state equation (e.g., \(\rho \), \(\alpha \), and \(\sigma ^{2}_{\varepsilon }\)). And second, there will be the derivations corresponding to each of the observations (features). Choosing to include a continuous-valued observation in a state-space model will necessitate the determination of a certain set of model parameters. Adding a spiking-type observation necessitates a further set of model parameters. We will see examples of these in due course.
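To give a flavor of these derivations, here is a simplified sketch for the continuous observation model (1.11): if the state estimates from the state estimation step are treated as fixed and known, setting the relevant derivatives to 0 reduces to an ordinary least squares fit for \(\gamma _{0}\) and \(\gamma _{1}\) and a residual-variance estimate for \(\sigma ^{2}_{v}\). (The full M-step derived later also accounts for the uncertainty in the state estimates.)

```python
import numpy as np

def continuous_obs_m_step(x_est, r):
    """Estimate gamma_0, gamma_1, and sigma_v^2 in r_k = g0 + g1*x_k + v_k,
    treating the state estimates x_est as known (a simplification)."""
    X = np.column_stack([np.ones_like(x_est), x_est])
    gamma = np.linalg.lstsq(X, r, rcond=None)[0]   # [gamma_0, gamma_1]
    residuals = r - X @ gamma
    sigma_v2 = np.mean(residuals ** 2)             # residual variance
    return gamma[0], gamma[1], sigma_v2
```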

Having laid some of the basic groundwork, we will next proceed with our tutorial discussion of how to derive the state and parameter estimation step equations for several different physiological state-space models. Shown below is a list of the state-space models we will look at along with examples of where they have been applied:

  • State-space model with one binary observation:

    • Behavioral learning [4]

    • Sympathetic arousal estimation using skin conductance signals [25, 26]

  • State-space model with one binary and one continuous observation:

    • Behavioral learning [5]

    • Emotional valence estimation using electromyography (EMG) signals [27]

    • Seizure state estimation using scalp electroencephalography (EEG) signals [28]

  • State-space model with one binary and two continuous observations:

    • Sympathetic arousal estimation using skin conductance signals [29]

    • Energy state estimation using blood cortisol concentrations [30]

  • State-space model with one binary, two continuous, and a spiking-type observation:

    • Sympathetic arousal estimation using skin conductance and electrocardiography (EKG) signals [31]

  • State-space model with one marked point process (MPP) observation:

    • Sympathetic arousal estimation using skin conductance signals [32]

  • State-space model with one MPP and one continuous observation:

    • Energy state estimation using blood cortisol concentrations [33]

    • Sympathetic arousal estimation using skin conductance signals [33]

Wearable and smart healthcare technologies are likely to play a key role in the future [34, 35]. A number of the state-space models listed above have applicability to healthcare. For instance, patients suffering from emotional disorders, hormone dysregulation, or epileptic seizures could be fitted with wearable devices that implement some of the state-space models (and corresponding EM-based estimators) listed above for long-term care and monitoring. One of the advantages of the state-space framework is that it readily lends itself to the design of the closed-loop control necessary to correct deviations from healthy functioning. Consequently, state-space controllers can be designed to treat some of these disorders [36, 37]. Looking at the human body and brain from a control-theoretic perspective could also help in designing bio-inspired controllers that mimic the body’s already built-in feedback control loops [38, 39]. The applications, however, are not limited to healthcare monitoring; determining hidden psychological and cognitive states also has applications in fields such as neuromarketing [40], smart homes [41], and smart workplaces [42].

Excursus—A Brief Sketch of How the Kalman Filter Equations Can Be Derived

Here we provide a brief sketch of how the Kalman filter equations can be derived. We will utilize an approach known as recursive least squares. The notation used within this excursus is self-contained and should not be confused with the standard terminology used throughout the rest of this book.

Suppose we have a column vector of unknowns \(\mathbf {x}\) and a column vector of measurements \({\mathbf {y}}_{1}\) that are related to each other through

$$\displaystyle \begin{aligned} {\mathbf{y}}_{1} &= A_{1}\mathbf{x} + {\mathbf{e}}_{1}, \end{aligned} $$
(1.12)

where \(A_{1}\) is a matrix and \({\mathbf {e}}_{1} \sim \mathcal {N}(0, \Sigma _{1})\) is noise (\(\Sigma _{1}\) is the noise covariance matrix). In general, we may have more measurements than we have unknowns. Therefore, a solution to this system of equations is given by

$$\displaystyle \begin{aligned} {\mathbf{x}}_{1} &= (A_{1}^{\intercal}\Sigma_{1}^{-1}A_{1})^{-1}A_{1}^{\intercal}\Sigma_{1}^{-1}{\mathbf{y}}_{1}, {} \end{aligned} $$
(1.13)

where we have used \({\mathbf {x}}_{1}\) to denote that this solution is only based on the first set of measurements. Now suppose that we have another set of measurements \({\mathbf {y}}_{2}\) such that

$$\displaystyle \begin{aligned} {\mathbf{y}}_{2} &= A_{2}\mathbf{x} + {\mathbf{e}}_{2}, \end{aligned} $$
(1.14)

where \(A_{2}\) is a matrix and \({\mathbf {e}}_{2} \sim \mathcal {N}(0, \Sigma _{2})\). In theory, we could just concatenate all the values to form a single set of equations and solve for \(\mathbf {x}\). However, this would result in a larger matrix inversion each time we get more data. Is there a better way? It turns out that we can use our previous solution \({\mathbf {x}}_{1}\) to obtain a better estimate \({\mathbf {x}}_{2}\) without having to solve everything again. If we assume that \({\mathbf {e}}_{1}\) and \({\mathbf {e}}_{2}\) are uncorrelated with each other, the least squares solution is given by

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= \Bigg[\Bigg(\begin{array}{c} A_{1} \\ A_{2} \end{array}\Bigg)^{\intercal} \Bigg(\begin{array}{cc} \Sigma_{1} & 0 \\ 0 & \Sigma_{2} \end{array}\Bigg)^{-1} \Bigg(\begin{array}{c} A_{1} \\ A_{2} \end{array}\Bigg)\Bigg]^{-1} \Bigg(\begin{array}{c} A_{1} \\ A_{2} \end{array}\Bigg)^{\intercal} \Bigg(\begin{array}{cc} \Sigma_{1} & 0 \\ 0 & \Sigma_{2} \end{array}\Bigg)^{-1} \Bigg(\begin{array}{c} {\mathbf{y}}_{1} \\ {\mathbf{y}}_{2} \end{array}\Bigg) \end{aligned} $$
(1.15)
$$\displaystyle \begin{aligned} &= \Bigg[\Big(\begin{array}{cc} A_{1}^{\intercal} & A_{2}^{\intercal} \end{array}\Big) \Bigg(\begin{array}{cc} \Sigma_{1}^{-1} & 0 \\ 0 & \Sigma_{2}^{-1} \end{array}\Bigg) \Bigg(\begin{array}{c} A_{1} \\ A_{2} \end{array}\Bigg)\Bigg]^{-1} \Big(\begin{array}{cc} A_{1}^{\intercal} & A_{2}^{\intercal} \end{array}\Big) \Bigg(\begin{array}{cc} \Sigma_{1}^{-1} & 0 \\ 0 & \Sigma_{2}^{-1} \end{array}\Bigg) \Bigg(\begin{array}{c} {\mathbf{y}}_{1} \\ {\mathbf{y}}_{2} \end{array}\Bigg) \end{aligned} $$
(1.16)
$$\displaystyle \begin{aligned} &= \Big(A_{1}^{\intercal}\Sigma_{1}^{-1}A_{1} + A_{2}^{\intercal}\Sigma_{2}^{-1}A_{2}\Big)^{-1} \Big(A_{1}^{\intercal}\Sigma_{1}^{-1}{\mathbf{y}}_{1} + A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2}\Big){}. \end{aligned} $$
(1.17)

Let us see how this simplifies. We will begin by defining the term \(P_{1} = (A_{1}^{\intercal }\Sigma _{1}^{-1}A_{1})^{-1}\). Now,

$$\displaystyle \begin{aligned} {\mathbf{x}}_{1} &= (A_{1}^{\intercal}\Sigma_{1}^{-1}A_{1})^{-1}A_{1}^{\intercal}\Sigma_{1}^{-1}{\mathbf{y}}_{1} \end{aligned} $$
(1.18)
$$\displaystyle \begin{aligned} &= P_{1}A_{1}^{\intercal}\Sigma_{1}^{-1}{\mathbf{y}}_{1} \end{aligned} $$
(1.19)
$$\displaystyle \begin{aligned} \implies P_{1}^{-1}{\mathbf{x}}_{1} &= A_{1}^{\intercal}\Sigma_{1}^{-1}{\mathbf{y}}_{1} \end{aligned} $$
(1.20)

based on (1.13). Substituting \(P_{1}^{-1}\) for \(A_{1}^{\intercal }\Sigma _{1}^{-1}A_{1}\) and \(P_{1}^{-1}{\mathbf {x}}_{1}\) for \(A_{1}^{\intercal }\Sigma _{1}^{-1}{\mathbf {y}}_{1}\) in (1.17), we obtain

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} = \Big(P_{1}^{-1} + A_{2}^{\intercal}\Sigma_{2}^{-1}A_{2}\Big)^{-1} \Big(P_{1}^{-1}{\mathbf{x}}_{1} + A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2}\Big). \end{aligned} $$
(1.21)

We use the matrix inversion lemma to simplify this to

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= \Big[P_{1} - P_{1}A_{2}^{\intercal}(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal})^{-1}A_{2}P_{1}\Big] \Big(P_{1}^{-1}{\mathbf{x}}_{1} + A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2}\Big). \end{aligned} $$
(1.22)

We then perform the multiplication.

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &=\Big[P_{1} - P_{1}A_{2}^{\intercal}(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal})^{-1}A_{2}P_{1}\Big] P_{1}^{-1}{\mathbf{x}}_{1} \\[3pt]&\quad + \Big[P_{1} - P_{1}A_{2}^{\intercal}(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal})^{-1}A_{2}P_{1}\Big]A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2}. \end{aligned} $$
(1.23)

For the time being, we will ignore the terms on the right and make the substitution \(K = P_{1}A_{2}^{\intercal }(\Sigma _{2} + A_{2}P_{1}A_{2}^{\intercal })^{-1}\) for the term on the left. Therefore,

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= \Big(P_{1} - KA_{2}P_{1}\Big) P_{1}^{-1}{\mathbf{x}}_{1}\\ &\quad + \Big[P_{1} - P_{1}A_{2}^{\intercal}(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal})^{-1}A_{2}P_{1}\Big]A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2} \end{aligned} $$
(1.24)
$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= P_{1}P_{1}^{-1}{\mathbf{x}}_{1} - KA_{2}P_{1}P_{1}^{-1}{\mathbf{x}}_{1} \\&\quad + \Big[P_{1} - P_{1}A_{2}^{\intercal}(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal})^{-1}A_{2}P_{1}\Big] A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2} \end{aligned} $$
(1.25)
$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + \Big[P_{1} - P_{1}A_{2}^{\intercal}(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal})^{-1}A_{2}P_{1}\Big] A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2}. \end{aligned} $$
(1.26)

When multiplying the terms on the right, we will define the term \(Q = (\Sigma _{2} + A_{2}P_{1}A_{2}^{\intercal })^{-1}\). Making this substitution, we obtain

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + \Big(P_{1} - P_{1}A_{2}^{\intercal}QA_{2}P_{1}\Big) A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2} \end{aligned} $$
(1.27)
$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + P_{1}A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2} - P_{1}A_{2}^{\intercal}QA_{2}P_{1} A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2}. \end{aligned} $$
(1.28)

Here is where we will use a small trick. We will insert \(QQ^{-1}\) into the third term and then simplify.

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + P_{1}A_{2}^{\intercal}QQ^{-1}\Sigma_{2}^{-1}{\mathbf{y}}_{2} - P_{1}A_{2}^{\intercal}QA_{2}P_{1} A_{2}^{\intercal}\Sigma_{2}^{-1}{\mathbf{y}}_{2} \end{aligned} $$
(1.29)
$$\displaystyle \begin{aligned} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + P_{1}A_{2}^{\intercal}Q(Q^{-1} - A_{2}P_{1}A_{2}^{\intercal})\Sigma_{2}^{-1}{\mathbf{y}}_{2}. {} \end{aligned} $$
(1.30)

Since \(Q = (\Sigma _{2} + A_{2}P_{1}A_{2}^{\intercal })^{-1}\), \(Q^{-1} = \Sigma _{2} + A_{2}P_{1}A_{2}^{\intercal }\). We will substitute this into (1.30) to obtain

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + P_{1}A_{2}^{\intercal}Q(\Sigma_{2} + A_{2}P_{1}A_{2}^{\intercal} - A_{2}P_{1}A_{2}^{\intercal})\Sigma_{2}^{-1}{\mathbf{y}}_{2} \end{aligned} $$
(1.31)
$$\displaystyle \begin{aligned} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + P_{1}A_{2}^{\intercal}Q\Sigma_{2} \Sigma_{2}^{-1}{\mathbf{y}}_{2} \end{aligned} $$
(1.32)
$$\displaystyle \begin{aligned} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + P_{1}A_{2}^{\intercal}Q {\mathbf{y}}_{2}. \end{aligned} $$
(1.33)

Note that \(P_{1}A_{2}^{\intercal }Q = P_{1}A_{2}^{\intercal }(\Sigma _{2} + A_{2}P_{1}A_{2}^{\intercal })^{-1} = K\). Therefore,

$$\displaystyle \begin{aligned} {\mathbf{x}}_{2} &= {\mathbf{x}}_{1} - KA_{2}{\mathbf{x}}_{1} + K{\mathbf{y}}_{2} \end{aligned} $$
(1.34)
$$\displaystyle \begin{aligned} &= {\mathbf{x}}_{1} + K({\mathbf{y}}_{2} - A_{2}{\mathbf{x}}_{1}). \end{aligned} $$
(1.35)

What does the final equation mean? We simply take our previous solution \({\mathbf {x}}_{1}\), predict what \({\mathbf {y}}_{2}\) will be by multiplying \({\mathbf {x}}_{1}\) with \(A_{2}\), calculate the prediction error \({\mathbf {y}}_{2} - A_{2}{\mathbf {x}}_{1}\), and apply this correction to \({\mathbf {x}}_{1}\) scaled by the gain K. These equations, therefore, provide a convenient way to continually update \(\mathbf {x}\) as we keep receiving more and more data.
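The recursion can also be checked numerically. The sketch below (with randomly generated data) verifies that the recursive update in (1.35) gives the same answer as solving the full least squares problem (1.17) from scratch.

```python
import numpy as np

rng = np.random.default_rng(2)
x_true = rng.normal(size=3)
A1, A2 = rng.normal(size=(6, 3)), rng.normal(size=(4, 3))
S1, S2 = 0.1 * np.eye(6), 0.2 * np.eye(4)
y1 = A1 @ x_true + rng.multivariate_normal(np.zeros(6), S1)
y2 = A2 @ x_true + rng.multivariate_normal(np.zeros(4), S2)

# First-block solution x1 and its P1, Eq. (1.13)
P1 = np.linalg.inv(A1.T @ np.linalg.inv(S1) @ A1)
x1 = P1 @ A1.T @ np.linalg.inv(S1) @ y1

# Recursive update, Eq. (1.35)
K = P1 @ A2.T @ np.linalg.inv(S2 + A2 @ P1 @ A2.T)
x2_recursive = x1 + K @ (y2 - A2 @ x1)

# Batch solution over both blocks, Eq. (1.17)
x2_batch = np.linalg.inv(A1.T @ np.linalg.inv(S1) @ A1
                         + A2.T @ np.linalg.inv(S2) @ A2) @ (
    A1.T @ np.linalg.inv(S1) @ y1 + A2.T @ np.linalg.inv(S2) @ y2)

print(np.allclose(x2_recursive, x2_batch))  # True
```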

Excursus—A Brief Sketch of How the EM Algorithm Works

Here we will provide a brief overview of how the EM algorithm works in the kind of state estimation problems that we shall see. Assume that we have a set of sensor measurements \(\mathcal {Y} = \{y_{1}, y_{2}, \ldots , y_{K}\}\) and a set of unobserved states \(\mathcal {X} = \{x_{1}, x_{2}, \ldots , x_{K}\}\) that we need to estimate. We also have the model parameters \(\Theta \) that need to be determined.

Let us begin by asking how we can determine \(\Theta \). In general, we select \(\Theta \) such that it maximizes the probability \(p(\Theta |\mathcal {Y})\). Assuming that we do not have a particular preference for any of the \(\Theta \) values, we can use Bayes’ rule to instead select the \(\Theta \) that maximizes \(p(\mathcal {Y}|\Theta )\). Now,

$$\displaystyle \begin{aligned} p(\mathcal{Y}|\Theta) &= \int_{\mathcal{X}} p(\mathcal{X} \cap \mathcal{Y}|\Theta) d\mathcal{X}. {} \end{aligned} $$
(1.36)

We do not know what the true \(\Theta \) is, but let us make a guess that it is \(\hat {\Theta }\). Let us now introduce the term \(p(\mathcal {X}|\mathcal {Y} \cap \hat {\Theta })\) into (1.36).

$$\displaystyle \begin{aligned} p(\mathcal{Y}|\Theta) &= \int_{\mathcal{X}} \frac{p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})}{p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})} p(\mathcal{X} \cap \mathcal{Y}|\Theta) d\mathcal{X} \end{aligned} $$
(1.37)
$$\displaystyle \begin{aligned} &= \int_{\mathcal{X}} p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta}) \frac{p(\mathcal{X} \cap \mathcal{Y}|\Theta)}{p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})} d\mathcal{X}. \end{aligned} $$
(1.38)

Take a moment to look carefully at what the integral is doing. It is actually calculating the expected value of the fraction term with respect to \(p(\mathcal {X}|\mathcal {Y} \cap \hat {\Theta })\). Taking the log on both sides, we have

$$\displaystyle \begin{aligned} \log\big[p(\mathcal{Y}|\Theta)\big] &= \log\Bigg[\int_{\mathcal{X}} p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta}) \frac{p(\mathcal{X} \cap \mathcal{Y}|\Theta)}{p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})} d\mathcal{X}\Bigg]. \end{aligned} $$
(1.39)

Since \(\log (\cdot )\) is a concave function, the following inequality (Jensen’s inequality) holds true.

$$\displaystyle \begin{aligned} \log\big[p(\mathcal{Y}|\Theta)\big] &\geq \int_{\mathcal{X}} p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta}) \log\Bigg[\frac{p(\mathcal{X} \cap \mathcal{Y}|\Theta)}{p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})}\Bigg] d\mathcal{X} \end{aligned} $$
(1.40)
$$\displaystyle \begin{aligned} \log\big[p(\mathcal{Y}|\Theta)\big] &\geq \int_{\mathcal{X}} p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta}) \log\big[p(\mathcal{X} \cap \mathcal{Y}|\Theta)\big] d\mathcal{X}\\&\quad - \int_{\mathcal{X}} p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})\log\big[p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta})\big] d\mathcal{X}. {} \end{aligned} $$
(1.41)

Recall that we set out to choose the \(\Theta \) that maximized \(p(\mathcal {Y}|\Theta )\), or that equivalently maximized \(\log \big [p(\mathcal {Y}|\Theta )\big ]\). Typically, we would approach this maximization by calculating the derivative of the probability term with respect to \(\Theta \), setting it to \(\mathbf {0}\), and then solving. For instance, if we had a continuous-valued observation \(r_{k}\) in our state-space model, we would have to take the derivatives with respect to \(\gamma _{0}\), \(\gamma _{1}\), and \(\sigma ^{2}_{v}\), set them each to 0, and solve. Look back at (1.41). Assume we were to calculate the derivative of the term on the right-hand side of the inequality with respect to \(\Theta \). Do you see that the second term does not contain \(\Theta \)? In other words, the derivative would just treat the second term as a constant. If we had to determine \(\gamma _{0}\), \(\gamma _{1}\), and \(\sigma ^{2}_{v}\), for instance, they would only be present in the first term when taking derivatives. We can, therefore, safely ignore the second term. This leads to an important conclusion. If we need to determine the model parameters \(\Theta \) by maximizing \(\log \big [p(\mathcal {Y}|\Theta )\big ]\), we only need to concentrate on maximizing

$$\displaystyle \begin{aligned} \int_{\mathcal{X}} p(\mathcal{X}|\mathcal{Y} \cap \hat{\Theta}) \log\big[p(\mathcal{X} \cap \mathcal{Y}|\Theta)\big] d\mathcal{X}. {} \end{aligned} $$
(1.42)

We could equivalently write (1.42) as

$$\displaystyle \begin{aligned} \mathbb{E}_{\mathcal{X}|\mathcal{Y} \cap \hat{\Theta}} \Big[\log\big[p(\mathcal{X} \cap \mathcal{Y}|\Theta)\big]\Big] \end{aligned} $$
(1.43)

since this is indeed an expected value. Do you now see the connection between what we have been discussing so far and the EM algorithm? In reality, what we are doing at the state estimation step is calculating \(\mathbb {E}[\mathcal {X}|\mathcal {Y} \cap \hat {\Theta }]\). At the parameter estimation step, we calculate the partial derivatives of the expected value of \(\log \big [p(\mathcal {X} \cap \mathcal {Y}|\Theta )\big ]\) with respect to all of the model parameters. During the actual implementation of the EM algorithm, we keep alternating between the two steps until the model parameters converge. At this point, we have reached one of the local maxima of \(\mathbb {E}_{\mathcal {X}|\mathcal {Y} \cap \hat {\Theta }} \Big [\log \big [p(\mathcal {X} \cap \mathcal {Y}|\Theta )\big ]\Big ]\).