In this chapter, we will consider a state-space model where a single state variable \(x_{k}\) gives rise to binary observations. We will see how the state and parameter estimation equations are derived for this case. However, prior to deriving any of the equations, we will first look at two example scenarios where the need for such a model arises.

We human beings learn. We begin learning from the moment we are born, and learning continues thereafter as a life-long process. How exactly do we learn? And how do animals learn? These are interesting problems that scientists have investigated for years. One of the problems that arises in learning experiments with animal models is determining when an animal is considered to have learned something. For instance, suppose that a macaque monkey needs to learn how to correctly identify a particular visual target shown on a computer screen. The monkey may receive a reward for every correct answer. Similarly, a rat may have to learn how to recognize an audio cue to receive a reward in a maze (Fig. 3.1). How could we know that the animal has actually learned? This is an interesting question. We could, for instance, come up with heuristic rules such as stating that the animal has indeed learned when five consecutive correct answers (or some other number) are recorded. But could something more systematic be developed? This problem is what motivated the work in [4]. Here, learning was characterized using a state-space model. Since correct and incorrect are the only possible trial outcomes, the observations are binary-valued. Moreover, rather than just deciding whether the animal has learned or not yet learned, the objective was to estimate a continuous learning state \(x_{k}\) based on the sequence of binary responses \(n_{k}\). When learning has not yet occurred, more incorrect responses occur in the trials and \(x_{k}\) remains low. However, as the animal begins to learn, more correct responses occur and \(x_{k}\) increases. Thus it is possible to see how learning continuously progresses over successive trials.

Fig. 3.1

A rat in a T-maze experiment with binary-valued correct/incorrect responses. These responses can be used to estimate the rat's cognitive learning state over successive trials. The model was used in [4] for this purpose, where the rat had to learn to recognize which direction to proceed in based on an audio cue

The second example relates to emotions and the nervous system. We primarily sweat to maintain internal body temperature. However, tiny bursts of sweat are also released in response to psychologically arousing stimuli. These variations in sweat secretions cause changes in the conductivity of the skin and can be picked up easily by skin conductance sensors. Since the sweat glands are innervated by nerve fibers belonging to the sympathetic branch of the autonomic nervous system [43], a skin conductance signal becomes a sensitive index of sympathetic arousal [44]. Now a skin conductance signal comprises a slow-varying tonic component on top of which a faster-varying phasic component is superimposed [45, 46]. The phasic component consists of what are known as skin conductance responses (SCRs). These SCRs have characteristic bi-exponential shapes. Each of these SCRs can be thought of as being produced by a single burst of neuroelectric activity to the sweat glands [47]. It is these phasic SCRs that give a skin conductance signal its “spikey” appearance (Fig. 3.2). A deconvolution algorithm can be used to recover the bursts of neural activity underlying a skin conductance signal [47,48,49,50]. Importantly, the occurrence of these neural impulses is related to a person’s arousal level. In particular, the higher the underlying sympathetic arousal, the higher the rate at which neural impulses to the sweat glands (or SCRs) occur [51]. Thus the same state-space model with binary observations based on neural impulses to the sweat glands was used in [26] to estimate sympathetic arousal. By tracking the occurrence of the impulses \(n_{k}\), a person’s arousal state could be estimated over time.

Fig. 3.2

A deconvolved skin conductance signal. A skin conductance signal comprises both a tonic and phasic component. The neural impulses underlying phasic variations can be extracted via deconvolution. The figure depicts a skin conductance signal (blue) and the sequence of neural impulses that underlie its phasic variations (red). From [32], used under Creative Commons CC-BY license

3.1 Deriving the Predict Equations in the State Estimation Step

Let us now consider the state-space model itself. For simplicity, we will not use upper case letters for the unknowns, although they are indeed random variables; instead, we will follow the more familiar notation for state-space control systems with lower case letters. Let us begin by assuming that \(x_{k}\) evolves with time following a random walk.

$$\displaystyle \begin{aligned} x_{k} &= x_{k - 1} + \varepsilon_{k}{}, \end{aligned} $$
(3.1)

where the process noise term \(\varepsilon _{k} \sim \mathcal {N}(0, \sigma ^{2}_{\varepsilon })\) is independent of any of the \(x_{k}\) values.

For now, let us not think of (3.1) as being the state equation in a control system. Instead, let us just consider (3.1) purely as a relationship between three random variables. Supposing we only had this equation and had to determine \(x_{k}\), what would be the best guess that we could come up with and how uncertain would we be about it? Our best estimate for \(x_{k}\) would be its mean, and the uncertainty associated with it would be its variance. We will use the basic formulas in (2.1)–(2.6) to determine the mean and variance of \(x_{k}\). We will first derive the mean.

$$\displaystyle \begin{aligned} \mathbb{E}[x_{k}] &= \mathbb{E}[x_{k - 1} + \varepsilon_{k}] \end{aligned} $$
(3.2)
$$\displaystyle \begin{aligned} &= \mathbb{E}[x_{k - 1}] + \mathbb{E}[\varepsilon_{k}] \enspace \text{using (2.1)} \end{aligned} $$
(3.3)
$$\displaystyle \begin{aligned} &= \mathbb{E}[x_{k - 1}] \enspace \text{since }\mathbb{E}[\varepsilon_{k}] = 0 \end{aligned} $$
(3.4)
$$\displaystyle \begin{aligned} \therefore \mathbb{E}[x_{k}] &= x_{k - 1|k - 1}, \end{aligned} $$
(3.5)

where we have used the notation \(x_{k - 1|k - 1}\) to denote the expected value \(\mathbb {E}[x_{k - 1}]\). In a typical state-space control system, \(x_{k - 1|k - 1}\) represents the best estimate for \(x_{k - 1}\) given that we have observed all the sensor measurements up to time index \((k - 1)\). We will also use the notation \(\mathbb {E}[x_{k}] = x_{k|k - 1}\) to denote the mean state estimate at time index k, given that we have only observed the sensor readings until time index \((k - 1)\).

Next we will derive the uncertainty or variance of \(x_{k}\) using the same basic formulas.

$$\displaystyle \begin{aligned} V(x_{k}) &= V(x_{k - 1} + \varepsilon_{k}) \end{aligned} $$
(3.6)
$$\displaystyle \begin{aligned} &= V(x_{k - 1}) + V(\varepsilon_{k}) + 2 Cov(x_{k - 1}, \varepsilon_{k}) \enspace \text{using (2.4)} \end{aligned} $$
(3.7)
$$\displaystyle \begin{aligned} &= V(x_{k - 1}) + V(\varepsilon_{k}) \enspace \text{since }\varepsilon_{k}\text{ is uncorrelated with any of the }x_{k}\text{ terms} \end{aligned} $$
(3.8)
$$\displaystyle \begin{aligned} \therefore V(x_{k}) &= \sigma^{2}_{k - 1|k - 1} + \sigma^{2}_{\varepsilon}, \end{aligned} $$
(3.9)

where we have used the notation \(\sigma ^{2}_{k - 1|k - 1}\) to denote the variance \(V(x_{k - 1})\). Again, in a typical state-space control system, \(\sigma ^{2}_{k - 1|k - 1}\) represents the uncertainty or variance of \(x_{k - 1}\) given that we have observed all the sensor readings up to time index \((k - 1)\). Just like in the case of the mean, we will use the notation \(V(x_{k}) = \sigma ^{2}_{k|k - 1}\) to denote that this is the variance estimate at time index k, given that we have only observed the sensor readings until time index \((k - 1)\). Therefore, our predict equations in the state estimation step are

$$\displaystyle \begin{aligned} x_{k|k - 1} &= x_{k - 1|k - 1} \end{aligned} $$
(3.10)
$$\displaystyle \begin{aligned} \sigma^{2}_{k|k - 1} &= \sigma^{2}_{k - 1|k - 1} + \sigma^{2}_{\varepsilon}. \end{aligned} $$
(3.11)

From our knowledge of Gaussian distributions in (2.23), we also know that \(x_{k}\) is Gaussian distributed since \(x_{k - 1}\) and \(\varepsilon _{k}\) are Gaussian distributed and independent of each other. Since we have just derived the mean and variance of \(x_{k}\), we can state that

$$\displaystyle \begin{aligned} p(x_{k}|n_{1:k - 1}) &= \frac{1}{\sqrt{2 \pi \sigma^{2}_{k|k - 1}}}e^{\frac{-(x_{k} - x_{k|k - 1})^{2}}{2 \sigma^{2}_{k|k - 1}}}{}, \end{aligned} $$
(3.12)

where the conditioning on \(n_{1:k - 1}\) indicates that we have observed the sensor measurements up to time index \((k - 1)\). What happens when we observe measurement \(n_{k}\) at time index k? We will see how our estimates \(x_{k|k - 1}\) and \(\sigma ^{2}_{k|k - 1}\) can be improved/updated once we observe \(n_{k}\) in the next section.

When \(x_{k}\) evolves with time following \(x_{k} = x_{k - 1} + \varepsilon _{k}\), the predict equations in the state estimation step are

$$\displaystyle \begin{aligned} x_{k|k - 1} &= x_{k - 1|k - 1} {} \end{aligned} $$
(3.13)
$$\displaystyle \begin{aligned} \sigma^{2}_{k|k - 1} &= \sigma^{2}_{k - 1|k - 1} + \sigma^{2}_{\varepsilon}. \end{aligned} $$
(3.14)
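Concretely, the predict step amounts to just two assignments per time step. The following is a minimal Python sketch (the accompanying code for this book is in MATLAB; the function name here is illustrative):

```python
def predict(x_prev, v_prev, ve):
    """Predict step (3.13)-(3.14) for the random-walk state model.
    x_prev, v_prev: posterior mean/variance at time k-1; ve: process noise variance."""
    x_pred = x_prev       # (3.13): a random walk leaves the mean unchanged
    v_pred = v_prev + ve  # (3.14): uncertainty grows by the process noise variance
    return x_pred, v_pred
```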

3.2 Deriving the Update Equations in the State Estimation Step

The binary observations \(n_{k}\) that we consider here could be in the form of correct/incorrect responses in a behavioral experiment, neural impulses in a skin conductance signal, hormone pulses, etc. Let us assume that \(x_{k}\) is related to the probability \(p_{k}\) with which the binary events occur through

$$\displaystyle \begin{aligned} p_{k} &= \frac{1}{1 + e^{-(\beta_{0} + x_{k})}}{}, \end{aligned} $$
(3.15)

where \(\beta _{0}\) is a constant. Here, \(p_{k} = P(n_{k} = 1)\) and \((1 - p_{k}) = P(n_{k} = 0)\). Equation (3.15) depicts what is known as a sigmoid relationship. Accordingly, the higher \(x_{k}\) is, the higher will be \(p_{k}\). In other words, the higher \(x_{k}\) is, the higher the probability of 1’s occurring in the observations.

At this point, we need to note an important result concerning the derivative of the sigmoid function.

$$\displaystyle \begin{aligned} \frac{dp_{k}}{dx_{k}} &= \frac{(-1)}{[1 + e^{-(\beta_{0} + x_{k})}]^{2}} \times e^{-(\beta_{0} + x_{k})} \times (-1) \end{aligned} $$
(3.16)
$$\displaystyle \begin{aligned} &=\frac{1}{1 + e^{-(\beta_{0} + x_{k})}} \times \Bigg[\frac{e^{-(\beta_{0} + x_{k})}}{1 + e^{-(\beta_{0} + x_{k})}}\Bigg] \end{aligned} $$
(3.17)
$$\displaystyle \begin{aligned} &= \frac{1}{1 + e^{-(\beta_{0} + x_{k})}} \times \Bigg[\frac{1 + e^{-(\beta_{0} + x_{k})} - 1}{1 + e^{-(\beta_{0} + x_{k})}}\Bigg] \end{aligned} $$
(3.18)
$$\displaystyle \begin{aligned} &= \frac{1}{1 + e^{-(\beta_{0} + x_{k})}} \times \Bigg[1 - \frac{1}{1 + e^{-(\beta_{0} + x_{k})}}\Bigg] \end{aligned} $$
(3.19)
$$\displaystyle \begin{aligned} &= p_{k}(1 - p_{k}). {} \end{aligned} $$
(3.20)
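This identity is easy to verify numerically. The short Python sketch below compares \(p_{k}(1 - p_{k})\) against a central-difference approximation of the derivative (the value \(\beta_{0} = -1\) and the evaluation point are arbitrary choices for illustration):

```python
import math

def sigmoid(x, beta0=-1.0):
    """p_k as in (3.15); the beta0 value here is arbitrary."""
    return 1.0 / (1.0 + math.exp(-(beta0 + x)))

# Check the identity dp/dx = p(1 - p) in (3.20) with a central difference.
x, h = 0.7, 1e-6
p = sigmoid(x)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = p * (1 - p)
# the two agree to well within the truncation error of the difference quotient
assert abs(numeric - analytic) < 1e-8
```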

Now the occurrence of \(n_{k} = 0\) or \(n_{k} = 1\) follows a Bernoulli distribution. Therefore,

$$\displaystyle \begin{aligned} p(n_{k}|x_{k}) &= p_{k}^{n_{k}}(1 - p_{k})^{1 - n_{k}} {} \end{aligned} $$
(3.21)
$$\displaystyle \begin{aligned} &= \Bigg[\frac{1}{1 + e^{-(\beta_{0} + x_{k})}}\Bigg]^{n_{k}}\Bigg[1 - \frac{1}{1 + e^{-(\beta_{0} + x_{k})}}\Bigg]^{1 - n_{k}}. \end{aligned} $$
(3.22)

We will also utilize another useful result here. For a positive number a, \(a = e^{\log (a)}\) (this can be verified by taking the log of both sides). We can use this to express \(p(n_{k}|x_{k})\) as shown below.

$$\displaystyle \begin{aligned} p(n_{k}|x_{k}) &= p_{k}^{n_{k}}(1 - p_{k})^{1 - n_{k}} \end{aligned} $$
(3.23)
$$\displaystyle \begin{aligned} &= e^{\log\big[p_{k}^{n_{k}}(1 - p_{k})^{1 - n_{k}}\big]} \end{aligned} $$
(3.24)
$$\displaystyle \begin{aligned} &= e^{\log\big[p_{k}^{n_{k}}\big] + \log\big[(1 - p_{k})^{1 - n_{k}}\big]} {} \end{aligned} $$
(3.25)
$$\displaystyle \begin{aligned} &= e^{n_{k}\log(p_{k}) + (1 - n_{k})\log(1 - p_{k})} {} \end{aligned} $$
(3.26)
$$\displaystyle \begin{aligned} &= e^{n_{k}\log\Big(\frac{p_{k}}{1 - p_{k}}\Big) + \log(1 - p_{k})}.{} \end{aligned} $$
(3.27)

Now assume that we just observed \(n_{k}\). What would be our best estimate of \(x_{k}\) given that we have observed \(n_{1:k}\)? In other words, what is \(p(x_{k}|n_{1:k})\), and how can we derive its mean and variance? We can use the result in (2.16) to determine what \(p(x_{k}|n_{1:k})\) is.

$$\displaystyle \begin{aligned} p(x_{k}|n_{1:k}) &= p(x_{k}|n_{k}, n_{1:k-1}) = \frac{p(n_{k}|x_{k}, n_{1:k-1})p(x_{k}|n_{1:k-1})}{p(n_{k}|n_{1:k-1})}.{} \end{aligned} $$
(3.28)

Let us consider the terms in the numerator. Now \(p(n_{k}|x_{k}, n_{1:k - 1}) = p(n_{k}|x_{k})\) since we have an explicit relationship between \(n_{k}\) and \(x_{k}\) as shown in (3.15), which makes the additional conditioning on the history \(n_{1:k - 1}\) irrelevant. We know what \(p(n_{k}|x_{k})\) is based on (3.26). We also know what \(p(x_{k}|n_{1:k - 1})\) is based on (3.12).

We now need to determine the mean and variance of \(p(x_{k}|n_{1:k})\). To do so, we will assume that it is approximately Gaussian distributed. Recall from the earlier results in (2.19) and (2.21) that the mean and variance of a Gaussian distribution can be derived from its exponent term alone. Therefore, we only need to consider the exponent of \(p(x_{k}|n_{1:k})\) and can ignore the other terms. We will therefore substitute the terms for \(p(n_{k}|x_{k})\) and \(p(x_{k}|n_{1:k - 1})\) in (3.26) and (3.12), respectively, into (3.28).

$$\displaystyle \begin{aligned} p(x_{k}|n_{1:k}) \propto p(n_{k}|x_{k})p(x_{k}|n_{1:k - 1}) \propto e^{n_{k}\log(p_{k}) + (1 - n_{k})\log(1 - p_{k})} \times e^{\frac{-(x_{k} - x_{k|k - 1})^{2}}{2 \sigma^{2}_{k|k - 1}}}. \end{aligned} $$
(3.29)

Taking the log on both sides and labeling it as q, we have

$$\displaystyle \begin{aligned} q &= \log[p(x_{k}|n_{1:k})] = n_{k}\log(p_{k}) + (1 - n_{k})\log(1 - p_{k}) \\&\quad - \frac{(x_{k} - x_{k|k - 1})^{2}}{2 \sigma^{2}_{k|k - 1}} + \enspace \text{constant}. \end{aligned} $$
(3.30)

This equation provides us with the exponent of \(p(x_{k}|n_{1:k})\), which we will use to derive the mean and variance. We can obtain the mean by taking the first derivative of the exponent, setting it to 0, and solving for the maximizing location (for a Gaussian, the mode coincides with the mean). Likewise, the variance is given by the negative inverse of the second derivative.

Let us first proceed with calculating the mean. We will make use of the formula for the derivative of \(p_{k}\) in (3.20).

$$\displaystyle \begin{aligned} \frac{dq}{dx_{k}} = n_{k}\frac{1}{p_{k}}\frac{dp_{k}}{dx_{k}} + (1 - n_{k})\frac{1}{(1 - p_{k})}\frac{d}{dx_{k}}(1 - p_{k}) - \frac{2(x_{k} - x_{k|k - 1})}{2 \sigma^{2}_{k|k - 1}} &= 0 \end{aligned} $$
(3.31)
$$\displaystyle \begin{aligned} n_{k}\frac{1}{p_{k}}p_{k}(1 - p_{k}) - (1 - n_{k})\frac{1}{(1 - p_{k})}p_{k}(1 - p_{k}) - \frac{(x_{k} - x_{k|k - 1})}{\sigma^{2}_{k|k - 1}} &= 0 \end{aligned} $$
(3.32)
$$\displaystyle \begin{aligned} n_{k}(1 - p_{k}) - (1 - n_{k})p_{k} - \frac{(x_{k} - x_{k|k - 1})}{\sigma^{2}_{k|k - 1}} &= 0 \end{aligned} $$
(3.33)
$$\displaystyle \begin{aligned} n_{k} - p_{k} - \frac{(x_{k} - x_{k|k - 1})}{\sigma^{2}_{k|k - 1}} = 0& \end{aligned} $$
(3.34)
$$\displaystyle \begin{aligned} n_{k} - p_{k} = \frac{(x_{k} - x_{k|k - 1})}{\sigma^{2}_{k|k - 1}}& \end{aligned} $$
(3.35)
$$\displaystyle \begin{aligned} x_{k} = x_{k|k - 1} + \sigma^{2}_{k|k - 1}(n_{k} - p_{k})&. {} \end{aligned} $$
(3.36)

This equation gives us the mean of \(x_{k}\), which is now our new best estimate given that we have observed all the data up to time index k. We will call this new mean \(x_{k|k}\). It is an improvement over \(x_{k|k - 1}\), which did not include information from the latest observation. Since

$$\displaystyle \begin{aligned} p_{k} &= \frac{1}{1 + e^{-(\beta_{0} + x_{k})}}, \end{aligned} $$
(3.37)

the \(x_{k}\) term appears on both sides of (3.36). Therefore, the equation has to be solved numerically (e.g., using Newton’s method). To make this dependency explicit, we will use the notation \(p_{k|k}\) and express the mean as

$$\displaystyle \begin{aligned} x_{k|k} &= x_{k|k - 1} + \sigma^{2}_{k|k - 1}(n_{k} - p_{k|k}){}. \end{aligned} $$
(3.38)

We will next derive the variance. Now the first derivative of the exponent simplified to

$$\displaystyle \begin{aligned} \frac{dq}{dx_{k}} &= n_{k} - p_{k} - \frac{(x_{k} - x_{k|k - 1})}{\sigma^{2}_{k|k - 1}}{}. \end{aligned} $$
(3.39)

The second derivative yields

$$\displaystyle \begin{aligned} \frac{d^{2}q}{dx_{k}^{2}} &= -p_{k}(1 - p_{k}) - \frac{1}{\sigma^{2}_{k|k -1}}{}. \end{aligned} $$
(3.40)

Based on our knowledge of how variance can be derived from the exponent term in a Gaussian distribution, the uncertainty or variance associated with our new state estimate is

$$\displaystyle \begin{aligned} \sigma^{2}_{k|k} &= -\Bigg(\frac{d^{2}q}{dx_{k}^{2}}\Bigg)^{-1} = \Bigg[\frac{1}{\sigma^{2}_{k|k - 1}} + p_{k}(1 - p_{k})\Bigg]^{-1}. \end{aligned} $$
(3.41)

Again, we will make the dependence of \(p_{k}\) on \(x_{k|k}\) explicit and state

$$\displaystyle \begin{aligned} \sigma^{2}_{k|k} &= \Bigg[\frac{1}{\sigma^{2}_{k|k - 1}} + p_{k|k}(1 - p_{k|k})\Bigg]^{-1}. \end{aligned} $$
(3.42)

When \(x_{k}\) gives rise to a single binary observation \(n_{k}\), the update equations in the state estimation step are

$$\displaystyle \begin{aligned} x_{k|k} &= x_{k|k - 1} + \sigma^{2}_{k|k - 1}(n_{k} - p_{k|k}) {} \end{aligned} $$
(3.43)
$$\displaystyle \begin{aligned} \sigma^{2}_{k|k} &= \Bigg[\frac{1}{\sigma^{2}_{k|k - 1}} + p_{k|k}(1 - p_{k|k})\Bigg]^{-1}{}. \end{aligned} $$
(3.44)

3.3 Smoothing in the State Estimation Step

Although we previously stated that the state estimation step primarily consisted of the predict and update steps, in reality, there is a third step that we follow. The equations for this third step, however, do not vary much depending on the state-space model and consequently do not require re-derivations every time we have a new model. In fact, as we shall see, there is only one case where we need to make changes to this third step. Now we first perform the predict, update, predict, update, \(\ldots \) steps in turn for \(k = 1, 2, \ldots , K\) to determine \(x_{k}\) at each point in time. After coming to the end, we reverse direction and obtain a set of smoothed mean and variance estimates. The equations for this backward smoother are

$$\displaystyle \begin{aligned} A_{k} &\triangleq \frac{\sigma^{2}_{k|k}}{\sigma^{2}_{k + 1|k}} \end{aligned} $$
(3.45)
$$\displaystyle \begin{aligned} x_{k|K} &= x_{k|k} + A_{k}(x_{k + 1|K} - x_{k + 1|k}) \end{aligned} $$
(3.46)
$$\displaystyle \begin{aligned} \sigma^{2}_{k|K} &= \sigma^{2}_{k|k} + A^{2}_{k}(\sigma^{2}_{k + 1|K} - \sigma^{2}_{k + 1|k}). \end{aligned} $$
(3.47)

The only change that occurs in these equations is if there is a forgetting factor \(\rho \) in the state equation (e.g., \(x_{k} = \rho x_{k - 1} + \varepsilon _{k}\)). In this case, we would have

$$\displaystyle \begin{aligned} A_{k} &\triangleq \rho\frac{\sigma^{2}_{k|k}}{\sigma^{2}_{k + 1|k}}{}. \end{aligned} $$
(3.48)

Since we reverse direction making use of all the data through \(k = 1, 2, \ldots , K\) to obtain the smoothed mean and variance estimates, we use the notation \(x_{k|K}\) and \(\sigma ^{2}_{k|K}\) to denote their values. These new estimates turn out to be smoother since we now determine \(x_{k}\) based not just on the observations up to time index k (what we have observed up to that point), but on all the observations up to time index K (everything we have observed).

We will also make a further observation. We need to note that \(x_{k|K}\) and \(\sigma ^{2}_{k|K}\) can be formally expressed as

$$\displaystyle \begin{aligned} x_{k|K} &= \mathbb{E}[x_{k}|n_{1:K}, \Theta] {} \end{aligned} $$
(3.49)
$$\displaystyle \begin{aligned} \sigma^{2}_{k|K} &= V(x_{k}|n_{1:K}, \Theta){}, \end{aligned} $$
(3.50)

where \(\Theta \) represents all the model parameters. In the case of our current state-space model, the only unknown model parameter is \(\sigma ^{2}_{\varepsilon }\) (and \(\beta _{0}\), but we will assume that this is calculated differently). Why is the expected value conditioned on \(\Theta \)? Recall that the EM algorithm consists of the state and parameter estimation steps. At the state estimation step, we assume that we know all the model parameters and proceed with calculating \(x_{k}\). Mathematically, we could express this knowledge of the model parameters in terms of conditioning on \(\Theta \). In reality, we could also have expressed \(x_{k|k - 1}\) and \(x_{k|k}\) (and the variances) in a similar manner, i.e.,

$$\displaystyle \begin{aligned} x_{k|k - 1} = \mathbb{E}[x_{k}|n_{1:k - 1}, \Theta] \end{aligned} $$
(3.51)
$$\displaystyle \begin{aligned} x_{k|k} = \mathbb{E}[x_{k}|n_{1:k}, \Theta]. \end{aligned} $$
(3.52)

Finally, we also need to note that we often require not only \(\mathbb {E}[x_{k}|n_{1:K}, \Theta ]\), but also \(\mathbb {E}[x^{2}_{k}|n_{1:K}, \Theta ]\) and \(\mathbb {E}[x_{k}x_{k + 1}|n_{1:K}, \Theta ]\) when we move on to the parameter estimation step. Making use of the state-space covariance algorithm [52], these values turn out to be

$$\displaystyle \begin{aligned} \mathbb{E}[x^{2}_{k}|n_{1:K}, \Theta] &= U_{k} = x^{2}_{k|K} + \sigma^{2}_{k|K} {} \end{aligned} $$
(3.53)
$$\displaystyle \begin{aligned} \mathbb{E}[x_{k}x_{k + 1}|n_{1:K}, \Theta] &= U_{k, k + 1} = x_{k|K}x_{k + 1|K} + A_{k}\sigma^{2}_{k + 1|K}{}, \end{aligned} $$
(3.54)

where we have defined the two new terms \(U_{k}\) and \(U_{k, k + 1}\).

3.4 Deriving the Parameter Estimation Step Equations

Recall our earliest discussion of the EM algorithm. To describe how it functioned, we assumed the simple state-space model

$$\displaystyle \begin{aligned} {\mathbf{x}}_{k + 1} &= A {\mathbf{x}}_{k} + B {\mathbf{u}}_{k} \end{aligned} $$
(3.55)
$$\displaystyle \begin{aligned} {\mathbf{y}}_{k} &= C {\mathbf{x}}_{k}. \end{aligned} $$
(3.56)

We stated that, at our state estimation step, we would assume that we knew A, B, and C and then determine the best estimates for \({\mathbf {x}}_{k}\). The state estimation step consists of the predict step, the update step, and the smoothing step that we perform at the end. At the predict step, we make a prediction for \({\mathbf {x}}_{k}\) using the state equation based on the past history of values. At the update step, we improve this prediction by making use of the sensor reading \({\mathbf {y}}_{k}\) that we just observed. After proceeding through the predict, update, predict, update\(\ldots \) steps, we finally reverse direction and perform smoothing. We primarily make use of the ideas of mean and variance at the state estimation step. It is after performing the state estimation step that we proceed to the parameter estimation step where we make use of the \({\mathbf {x}}_{k}\) estimates and determine A, B, and C. We select A, B, and C to maximize a particular probability. This probability is the joint density of all our \({\mathbf {x}}_{k}\) and \({\mathbf {y}}_{k}\) values. We also stated that, in reality, it was not strictly the probability that we maximize, but rather the expected value or mean of its log. Do you now see why the state estimation step involved calculating the expected values of \(x_{k}\)?

Let us now consider the joint probability term whose expected value of the log we need to maximize. It is

$$\displaystyle \begin{aligned} p(x_{1:K} \cap y_{1:K}|\Theta) &= p(y_{1:K}|x_{1:K}, \Theta)p(x_{1:K}|\Theta){}. \end{aligned} $$
(3.57)

Since we only observe a single binary variable, we have \(y_{k} = n_{k}\). Therefore,

$$\displaystyle \begin{aligned} p(x_{1:K} \cap n_{1:K}|\Theta) &= p(n_{1:K}|x_{1:K}, \Theta)p(x_{1:K}|\Theta). \end{aligned} $$
(3.58)

We will first consider \(p(x_{1:K}|\Theta )\). What would be the total probability of all the \(x_{k}\) values if we only knew the model parameters \(\Theta \)? In other words, if we had no sensor readings \(n_{k}\), what would be the probability of our \(x_{k}\) values? To calculate this, we would only be able to make use of the state equation, but not the output equation. This probability is

$$\displaystyle \begin{aligned} p(x_{1:K}|\Theta) &= p(x_{1}|\Theta) \times p(x_{2}|x_{1}, \Theta)\\ &\quad \times p(x_{3}|x_{1}, x_{2}, \Theta) \times \ldots \times p(x_{K}|x_{1}, x_{2}, \ldots, x_{K - 1}, \Theta) \end{aligned} $$
(3.59)
$$\displaystyle \begin{aligned} &= \prod_{k = 1}^{K}\frac{1}{\sqrt{2\pi \sigma^{2}_{\varepsilon}}} e^{\frac{-(x_{k} - x_{k - 1})^{2}}{2\sigma^{2}_{\varepsilon}}}. \end{aligned} $$
(3.60)

Note that for each term \(x_{k}\), conditioning on \(x_{k - 1}\) alone suffices, since \(x_{k - 1}\) captures all of the past history needed to get to it. Let us take the log of this value and label it \(\tilde {Q}\).

$$\displaystyle \begin{aligned} \tilde{Q} &= \frac{-K}{2}\log\big(2\pi \sigma^{2}_{\varepsilon}\big) - \sum_{k = 1}^{K}\frac{(x_{k} - x_{k - 1})^{2}}{2\sigma^{2}_{\varepsilon}}. \end{aligned} $$
(3.61)

Now the only model parameter we need to determine is \(\sigma ^{2}_{\varepsilon }\) (ignoring \(\beta _{0}\)). It turns out that \(\sigma ^{2}_{\varepsilon }\) only shows up in this term involving \(p(x_{1:K}|\Theta )\) and not in the term involving \(p(n_{1:K}|x_{1:K}, \Theta )\). Let us now take the expected value of \(\tilde {Q}\) and label it Q.

$$\displaystyle \begin{aligned} Q &= \frac{-K}{2}\log\big(2\pi \sigma^{2}_{\varepsilon}\big) - \sum_{k = 1}^{K}\frac{\mathbb{E}\Big[(x_{k} - x_{k - 1} )^{2}\Big]}{2\sigma^{2}_{\varepsilon}}.{} \end{aligned} $$
(3.62)

What do we need to do at the parameter estimation step to determine \(\sigma ^{2}_{\varepsilon }\)? We simply need to take the derivative of Q with respect to \(\sigma ^{2}_{\varepsilon }\), set it to 0, and solve. But the expected value we need should be calculated conditioned on knowing \(\Theta \) and having observed \(n_{1:K}\) (i.e., we need \(\mathbb {E}[x_{k}|n_{1:K}, \Theta ]\)). Do you now see why we expressed \(x_{k|K}\) and \(\sigma ^{2}_{k|K}\) in the way that we did in (3.49) and (3.50)?

3.4.1 Deriving the Process Noise Variance

While it is possible to determine the starting state \(x_{0}\) as a separate parameter, we follow one of the options in [4, 5] and set \(x_{0} = x_{1}\). This permits some bias at the beginning. Therefore,

$$\displaystyle \begin{aligned} Q &= \frac{-K}{2}\log\big(2\pi \sigma^{2}_{\varepsilon}\big) - \sum_{k = 2}^{K}\frac{\mathbb{E}\Big[(x_{k} - x_{k - 1})^{2}\Big]}{2\sigma^{2}_{\varepsilon}}. \end{aligned} $$
(3.63)

We will follow this method of setting \(x_{0} = x_{1}\) in all of our parameter estimation step derivations. We take the partial derivative of Q with respect to \(\sigma ^{2}_{\varepsilon }\) and set it to 0 to solve for the parameter estimation step update.

$$\displaystyle \begin{aligned} \frac{\partial Q}{\partial \sigma^{2}_{\varepsilon}} = &\frac{-K}{2\sigma^{2}_{\varepsilon}} + \frac{1}{2\sigma^{4}_{\varepsilon}}\sum_{k = 2}^{K}\mathbb{E}\Big[(x_{k} - x_{k - 1})^{2}\Big] = 0 \end{aligned} $$
(3.64)
$$\displaystyle \begin{aligned} \implies \sigma^{2}_{\varepsilon} = &\frac{1}{K}\sum_{k = 2}^{K}\Big\{\mathbb{E}\big[x_{k}^{2}\big] - 2\mathbb{E}\big[x_{k}x_{k - 1}\big] + \mathbb{E}\big[x_{k - 1}^{2}\big]\Big\} \end{aligned} $$
(3.65)
$$\displaystyle \begin{aligned} = &\frac{1}{K}\Bigg\{\sum_{k = 2}^{K}U_{k} - 2\sum_{k = 1}^{K - 1}U_{k, k + 1} + \sum_{k = 1}^{K - 1}U_{k}\Bigg\}. \end{aligned} $$
(3.66)

The parameter estimation step update for \(\sigma ^{2}_{\varepsilon }\) when \(x_{k}\) evolves with time following \(x_{k} = x_{k - 1} + \varepsilon _{k}\) is

$$\displaystyle \begin{aligned} \sigma^{2}_{\varepsilon} &= \frac{1}{K}\Bigg\{\sum_{k = 2}^{K}U_{k} - 2 \sum_{k = 1}^{K - 1}U_{k, k + 1} + \sum_{k = 1}^{K - 1}U_{k}\Bigg\}. \end{aligned} $$
(3.67)

3.5 MATLAB Examples

In this book, we also provide a set of MATLAB code examples that implement the EM algorithms described in each chapter. The code examples are organized into the folder structure shown below:

  • one_bin\

    • sim\

      • data_one_bin.mat

      • filter_one_bin.m

    • expm\

      • expm_data_one_bin.mat

      • expm_filter_one_bin.m

  • one_mpp\

    • sim\

      • data_one_mpp.mat

      • filter_one_mpp.m

    • expm\

      • expm_data_one_mpp.mat

      • expm_filter_one_mpp.m

  • one_bin_two_cont\

    • \(\ldots \)

  • one_mpp_one_cont\

    • \(\ldots \)

  • \(\ldots \)

In the case of each state-space model, the corresponding “.m” file with the code is self-contained and no additional path variables have to be set up in MATLAB. The code is written in such a manner that the “.m” file can be run directly (it loads the necessary data from the corresponding “.mat” file). The code in the “sim\” and “expm\” folders corresponds to examples running on simulated and experimental data, respectively.

Estimating an unobserved state \(x_{k}\) from a single binary observation \(n_{k}\) gives rise to the simplest state-space model and EM algorithm equations. The state-space model with only \(n_{k}\) was originally developed in [4]. The code for running the examples for this model is in the “one_bin\sim” and “one_bin\expm” folders. The “one_bin\sim” folder contains the “filter_one_bin.m” and “data_one_bin.mat” files. The “.m” file contains the code and the “.mat” file contains the data. We will use a similar naming style for all the code examples accompanying this book.

The state-space model we considered in this chapter contained the term \(\beta _{0}\) in \(p_{k}\). However, we have not yet explained how it is calculated. In several studies involving behavioral learning experiments (e.g., [4]), \(\beta _{0}\) was determined empirically instead of being estimated as a separate term at the parameter estimation step. Now

$$\displaystyle \begin{aligned} p_{k} &= \frac{1}{1 + e^{-(\beta_{0} + x_{k})}} \implies \log\Bigg(\frac{p_{k}}{1 - p_{k}}\Bigg) = \beta_{0} + x_{k}, \end{aligned} $$
(3.68)

and if we assume that \(x_{k} \approx 0\) at the very beginning, we have

$$\displaystyle \begin{aligned} \beta_{0} &\approx \log\Bigg(\frac{p_{0}}{1 - p_{0}}\Bigg). \end{aligned} $$
(3.69)

We can use this to calculate \(\beta _{0}\) [4]. But what is \(p_{0}\)? In a typical learning experiment involving correct/incorrect responses, \(p_{0}\) can be taken to be the probability of getting an answer correct prior to any learning taking place. For instance, if there are only two possible answers in each trial, then \(p_{0} = 0.5\). If there are four possible answers from which to choose, \(p_{0} = 0.25\). Similarly, in experiments involving the estimation of sympathetic arousal from skin conductance, \(p_{0}\) can be taken to be the person’s baseline probability of neural impulse occurrence. If the experiment involves both relaxation and stress periods, this baseline can be approximated by the average probability of an impulse occurring in the whole data.
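As a small worked example, the following Python sketch computes \(\beta_{0}\) from a baseline probability via (3.69) (the function name is illustrative; the book's own code is in MATLAB):

```python
import math

def beta0_from_baseline(p0):
    """beta0 ~ log(p0 / (1 - p0)) as in (3.69), assuming x_k is near 0 initially."""
    return math.log(p0 / (1.0 - p0))

# Two-choice task: chance level 0.5 gives beta0 = 0.
# Four-choice task: p0 = 0.25 gives beta0 = log(1/3), about -1.0986.
print(beta0_from_baseline(0.5))
print(beta0_from_baseline(0.25))
```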

Let us first consider a basic outline of the code itself. The code takes the binary inputs \(n_{k}\), for which we use the variable n. Only a few parameters need to be set in this particular code. One of them is the baseline probability \(p_{0}\), for which we use the variable base_prob. In general, we will set base_prob to the average probability of \(n_{k} = 1\) occurring in the data. Recall that in the EM algorithm, we repeat the state estimation step and the parameter estimation step until the model parameters converge. In this code example, we use the variable tol to set the tolerance level. Here we have set it to \(10^{-6}\) (i.e., the EM algorithm continues to execute until there is no change in the model parameters to a precision on the order of \(10^{-6}\)). The variable ve denotes the process noise variance. We also use x_pred, x_updt, and x_smth to denote \(x_{k|k - 1}\), \(x_{k|k}\), and \(x_{k|K}\), respectively, and v_pred, v_updt, and v_smth to denote the corresponding variances \(\sigma ^{2}_{k|k - 1}\), \(\sigma ^{2}_{k|k}\), and \(\sigma ^{2}_{k|K}\). Prior to performing the computations, the model parameters need to be initialized to some values. Here we have initialized the process noise variance to \(0.005\) and set the initial value of \(x_{k}\) to 0.

At a given iteration of the EM algorithm, the code first proceeds in the forward direction from \(k = 1, 2, \ldots , K\) calculating both \(x_{k|k - 1}\) and \(x_{k|k}\).
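Under the random-walk state model \(x_{k} = x_{k-1} + \varepsilon_{k}\) with \(\varepsilon_{k} \sim \mathcal{N}(0, \sigma^{2}_{\varepsilon})\), the prediction step reduces to the following (a sketch; the function name is ours):

```python
def predict(x_updt_prev, v_updt_prev, ve):
    """One-step prediction under random-walk state dynamics."""
    x_pred_k = x_updt_prev       # x_{k|k-1} = x_{k-1|k-1}
    v_pred_k = v_updt_prev + ve  # sigma^2_{k|k-1} = sigma^2_{k-1|k-1} + sigma_e^2
    return x_pred_k, v_pred_k
```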

Here the mean state update \(x_{k|k}\) is calculated using the function shown below (Newton–Raphson method).
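A sketch of such a solver is given below. This is our reconstruction rather than the book's listing; it assumes the posterior-mode equation \(x_{k|k} = x_{k|k-1} + \sigma^{2}_{k|k-1}(n_{k} - p_{k|k})\) with \(p_{k|k} = [1 + e^{-(\beta_0 + x_{k|k})}]^{-1}\), solved by Newton–Raphson iteration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_state_update(x_pred, v_pred, n_k, beta0, max_iter=50, tol=1e-10):
    """Solve x = x_pred + v_pred * (n_k - p(x)) for the posterior mode x_{k|k},
    where p(x) = sigmoid(beta0 + x).  (Sketch; the signature is ours.)"""
    x = x_pred  # start the iteration at the one-step prediction
    for _ in range(max_iter):
        p = sigmoid(beta0 + x)
        f = x - x_pred - v_pred * (n_k - p)   # root-finding target
        fp = 1.0 + v_pred * p * (1.0 - p)     # its derivative
        x_new = x - f / fp
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Newton-Raphson did not converge")
```

The corresponding variance update is then \(\sigma^{2}_{k|k} = [\sigma^{-2}_{k|k-1} + p_{k|k}(1 - p_{k|k})]^{-1}\).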

After proceeding in the forward direction, we reverse direction and proceed through \(k = K, (K - 1), \ldots , 1\) to obtain the smoothed \(x_{k|K}\) and \(\sigma ^{2}_{k|K}\) values. In the code shown below, the variables W and CW denote \(U_{k}\) and \(U_{k, k + 1}\) in (3.53) and (3.54), respectively.
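The backward pass can be sketched as a standard fixed-interval smoother (this is our reconstruction, not the original listing; the indexing convention is an assumption):

```python
import numpy as np

def smooth(x_pred, v_pred, x_updt, v_updt):
    """Backward (fixed-interval) smoothing pass.  Arrays are indexed 0..K,
    with entry 0 holding the initial state; x_pred[0]/v_pred[0] are unused.
    Returns the smoothed means/variances plus the second moments
    W[k] = U_k = E[x_k^2 | data] and CW[k] = U_{k,k+1} = E[x_k x_{k+1} | data]."""
    K = len(x_updt) - 1
    x_smth = x_updt.copy()
    v_smth = v_updt.copy()
    A = np.zeros(K + 1)
    for k in range(K - 1, -1, -1):
        A[k] = v_updt[k] / v_pred[k + 1]
        x_smth[k] = x_updt[k] + A[k] * (x_smth[k + 1] - x_pred[k + 1])
        v_smth[k] = v_updt[k] + A[k] ** 2 * (v_smth[k + 1] - v_pred[k + 1])
    W = v_smth + x_smth ** 2
    CW = np.zeros(K + 1)
    for k in range(K):
        CW[k] = A[k] * v_smth[k + 1] + x_smth[k] * x_smth[k + 1]
    return x_smth, v_smth, W, CW
```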

After performing state estimation at a particular iteration, we then perform parameter estimation. The state estimation and the parameter estimation steps continue to be executed in turn until convergence.
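For the random-walk model, the M-step update of the process noise variance follows from \(\mathbb{E}[(x_{k} - x_{k-1})^{2} \mid \text{data}] = U_{k} + U_{k-1} - 2U_{k-1,k}\). A sketch is given below; the exact boundary terms depend on how \(x_{0}\) is handled, so treat this as illustrative rather than as the book's exact update:

```python
def update_process_noise(W, CW, K):
    """M-step estimate of sigma_e^2 from the smoothed second moments,
    with W[k] = U_k and CW[k] = U_{k,k+1}, arrays indexed 0..K."""
    total = 0.0
    for k in range(1, K + 1):
        # E[(x_k - x_{k-1})^2 | data] = U_k + U_{k-1} - 2 U_{k-1,k}
        total += W[k] + W[k - 1] - 2.0 * CW[k - 1]
    return total / K
```

Convergence is then declared once successive estimates of the parameters change by less than the tolerance tol.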

3.5.1 Application to Skin Conductance and Sympathetic Arousal

Running both the simulated and experimental data examples produces the results shown in Fig. 3.3. The code running on simulated data implements the EM algorithm described in this chapter. The code running on experimental data, on the other hand, runs a slightly modified version closer to what was implemented in [25, 26] for estimating sympathetic arousal based on skin conductance. This version of the code additionally attempts to estimate the starting state \(x_{0}\) as a separate model parameter.

Fig. 3.3

State estimation based on observing one binary variable. The left sub-figure depicts estimation on simulated data, and the right sub-figure depicts the estimation of sympathetic arousal from skin conductance data. The sub-panels on the left, respectively, depict: (a) the binary event occurrences \(n_{k}\); (b) the probability of binary event occurrence \(p_{k}\) (blue) and its estimate (red); (c) the state \(x_{k}\) (blue) and its estimate (red); (d) the quantile–quantile (QQ) plot for the residual error of \(x_{k}\). The sub-panels on the right, respectively, depict: (a) the skin conductance signal; (b) the neural impulses; (c) the arousal state \(x_{k}\) and its 95% confidence limits; (d) the probability of impulse occurrence and its 95% confidence limits; (e) the HAI (the regions above 90% and below 10% are shaded in red and green, respectively). The background colors on the right sub-figure correspond to the instruction period, a counting task, a color–word association task, relaxation, and watching a horror movie clip. From [32], used under Creative Commons CC-BY license

If this code is used to estimate sympathetic arousal based on skin conductance, the only input that is required is the sequence of \(n_{k}\) values (denoted by the variable n) that represents the presence or absence of the neural impulses responsible for SCRs. Ideally, the sequence of neural impulses should be extracted by deconvolving the skin conductance data using a procedure such as the one described in [47]. If, however, deconvolution of the skin conductance data is not possible, a simpler peak detection mechanism could also be used to provide these locations (peak detection was used in [25] and deconvolution was used in [26] for sympathetic arousal estimation). With the experimental data, and in several other examples that follow, we use the term “HAI” to denote “High Arousal Index” since many of our examples involve the estimation of sympathetic arousal from physiological data. The HAI is inspired by the “Ideal Observer Certainty” term in [4] and is an estimate of how much \(p_{k}\) is above a certain baseline. The HAI can also be calculated based on \(x_{k}\) exceeding an equivalent baseline since \(p_{k}\) is related to \(x_{k}\).
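Since \(\beta_0\) is chosen so that \(x_{k} = 0\) corresponds to \(p_{k} = p_{0}\), such an index can be computed as the posterior probability that the state exceeds its baseline under the Gaussian smoother posterior. A minimal sketch (the zero-baseline convention and function name are our assumptions):

```python
import math

def hai(x_smth_k, v_smth_k, x_base=0.0):
    """P(x_k > x_base | data) under a N(x_smth_k, v_smth_k) posterior,
    evaluated via the standard normal CDF."""
    z = (x_smth_k - x_base) / math.sqrt(v_smth_k)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```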

The right sub-figure in Fig. 3.3 provides an example of how sympathetic arousal varied for a particular subject engaged in an experiment involving different stressors. The experiment is described in [53]. The first three shaded backgrounds correspond to a period of instructions followed by two cognitive tasks. Arousal remains high during this period. Arousal drops significantly during the relaxation period that follows and briefly increases at the beginning of the emotional stressor (horror movie) after that. Figure 3.4 provides an additional example of how arousal varied in a driver stress experiment. The data come from the study described in [54]. In the experiment, each subject had to drive a vehicle along a set route comprising city driving, toll roads, and highways. Figure 3.4 shows how sympathetic arousal varied during the different road conditions and the rest periods that preceded and followed the actual drive.

Fig. 3.4

Driver stress estimation. The sub-panels, respectively, depict: (a) the skin conductance signal; (b) the neural impulses; (c) the arousal state \(x_{k}\) and its 95% confidence limits; (d) the probability of impulse occurrence and its 95% confidence limits; (e) the HAI (the regions above 90% and below 10% are shaded in red and green, respectively). The background colors in turn denote rest, city driving, toll road, highway, toll road, city driving, toll road, highway, toll road, city driving, and rest. From [32], used under Creative Commons CC-BY license