1 Introduction

Following the identification of Dinophysis fortii as the causative agent of shellfish poisoning outbreaks in 1976 and 1977 in northeastern Japan, there has been interest in understanding the dinoflagellates of the genus Dinophysis (Yasumoto et al. 1978, 1980; FAO and WHO 2016). Dinophysis has been shown to be toxigenic in a variety of environments (Fux et al. 2011; Mafra et al. 2013; Gao et al. 2017), and a global distribution of these toxins is generally recognized (FAO and WHO 2016), as they occur in many different habitats. To date, 10 species of Dinophysis have been shown to produce one or both of two major types of lipophilic toxins: okadaic acid (OA) and its dinophysistoxin derivatives, and pectenotoxins (Reguera et al. 2012, 2014). We collectively refer to OA, its derivatives, and pectenotoxins as diarrhetic shellfish toxins (DST). While all DST are known to be harmful, OA is particularly important to study because of the knowledge gap about its potential chronic disease associations (IOC et al. 2005) as well as its likely tumor promotion properties (Suganuma et al. 1988; Fujiki et al. 2018; Valdiglesias et al. 2013). Low-dose chronic exposure of human populations to OA (along with pectenotoxins) is environmentally plausible because the toxins may persist in the water for extended periods of time (Blanco et al. 2018; Pizarro et al. 2009) and the toxin has been found in the absence of Dinophysis (Fernández et al. 2019).

In 1994 the autonomous government of Andalucía implemented a phytoplankton monitoring system (Fernández et al. 2019). Since the late 1990s, multiple species of Dinophysis (e.g., D. acuminata complex, D. caudata, D. acuta, and D. fortii) have been detected in most of the sampled areas (Fernández et al. 2019). Toxins have been detected in various species of edible shellfish (e.g., Callista chione and Venus verrucosa), sometimes exceeding the legal limit of 160 \(\mu \)g/kg (e.g., Donax trunculus, Chamelea gallina, Mytilus galloprovincialis, and Cerastoderma edule) (Fernández et al. 2019). The concentration of DST in shellfish at a given time is a function of shellfish-specific features such as the rate of uptake, biotransformation, and elimination of each particular toxin (Reguera et al. 2014), but also reflects the environmental dynamics of toxigenic algae.

Previous efforts to model the dynamics of toxigenic Dinophysis have had limitations. Artificial neural networks have been applied to the coast of Huelva in Andalucía (Velo-Suárez and Estrada 2007); however, that modeling effort used the last 5 weekly D. acuminata concentrations as the only input variables to predict the upcoming week’s concentration. Kulawiak (2016) used GIS and advanced very-high-resolution radiometer data to detect algal blooms in the Baltic Sea, but did not leverage toxin measurements. To our knowledge, accounting for the fact that DST can persist in water for extended periods of time, while also giving interpretable model parameters, has been an unaddressed modeling challenge. Hidden Markov models (HMMs) have proven to be an effective modeling tool (Zucchini and Macdonald 2009). In addition, they have been used to study algal blooms in other contexts and provide a potential scaffold for an improved Dinophysis model. Rousseeuw et al. (2015) used a hybrid HMM to detect and understand the dynamics of phytoplankton blooms in France using data on nutrients and water characteristics, but lacked direct data on algae. In the freshwater harmful algal bloom modeling field, Jiang et al. (2016) employed a continuous HMM alongside principal component analysis of water quality parameters and nutrients to forecast microcystins. Kim et al. (2020) analyzed chlorophyll-a concentrations, a metric used to understand eutrophication, in the Nakdong River of South Korea using a continuous HMM. All of these approaches model a multivariate outcome of water quality and chemical parameters using an HMM with an unknown number of states. Rousseeuw et al. (2015) and Jiang et al. (2016) reduce the dimension of the multivariate outcome by clustering and principal components, respectively. Kim et al. (2020) model the spatial distribution of chlorophyll-a conditional on the latent state.
In our work, we use a 2-state HMM to directly infer whether algae are present in the water column over time. This is important because we want to reconstruct a daily assessment of this variable for health surveillance. Autoregressive HMMs, on the other hand, were initially developed for speech recognition (Juang and Rabiner 1985, 1986). They have since been applied to a variety of problems (Urban et al. 2020; Bartolucci et al. 2014; Shannon and Byrne 2010), but have not been used to model algae.

This paper presents a first-order autoregressive HMM approach to modeling potentially toxigenic Dinophysis in western Andalucía. The purpose is to reconstruct the maximum-likelihood profile of whether algae were absent or present in the water column above a threshold count (e.g., \(\ge \) 500 cells / L) at daily intervals, using incomplete historical time-series data on both DST levels and algal counts. We model the presence/absence of algae in the water column using algae cell counts from water column samples and DST measurements (in \(\mu g\) of OA equivalents per kg) from shellfish, both gathered from the regional government’s website. Then, using the estimated model parameters, we reconstruct indicators of algae in the water column at a daily interval, even though data are often missing. Since OA can remain in the water for extended periods of time, it is important that we allow for serial dependence in the model formulation. Specifically, we introduce autoregressive dependence in the observed DST measurements after accounting for the hidden Markov structure. To implement the EM algorithm for maximum-likelihood estimation in this setting, the forward-backward algorithm used in the E-step had to be adapted (Stanculescu et al. 2014). We assume a first-order autoregressive model because algal blooms erupt quickly, and this assumption provides a useful framework for capturing such fast-moving events over a longer time horizon. Section 2 presents the model in depth, while Sect. 3 explains the estimation procedure, including the adapted forward-backward algorithm that can account for both missing data and dependence on previously observed states. In Sect. 4 we present three simulations that compare estimation with different amounts of missing data. In Sect. 5 we apply our model to the Andalucía data and discuss the estimated parameters, and in Sect. 6 we consider our results.

2 Methods

We consider a binary first-order autoregressive HMM for the true algae state in the water column to model water sample algae counts and DST. We assume that the true water column algae state is binary, as algae can either be absent or present in the water column. Let \(S_{t}\), \(X_{t}\) and \(Y_{t}\) denote the daily true water column algae state, water sample algae count, and DST state, respectively, for day t where \(1 \le t \le T\). The domain for these three variables is defined below. Also let \(\mathbf{S} = (S_{1}, \ldots , S_{T})\), and similarly for \(\mathbf{X} \) and \(\mathbf{Y} \). \(\mathbf{S} , \mathbf{X} \) and \(\mathbf{Y} \) all have the same follow-up period with equally spaced daily observations.

We assume that algae in the water column can be modeled by a Markov chain, where \(S_{t}\) is 0 if algae is absent from the water column and 1 if it is present. We define the notation \(r_{S_t}\) to be the probability of starting in state \(S_t\) at time \(t=1\) and \(p_{S_{t-1}S_t}\) to be the probability of transitioning from state \(S_{t-1}\) to state \(S_t\) at time t. Specifically, \(S_t \sim MC(p_{01},p_{10},r_1)\) where \(p_{01}\) denotes the probability of initiating water column algae over a day, \(p_{10}\) indicates the probability of ending the episode over a day, and \(r_1\) denotes the probability of being in the algae state at time \(t = 1\). For use in calculations, we also define \(p_{00} = 1 - p_{01}\) as the probability of algae remaining out of the water column over a day, \(p_{11} = 1 - p_{10}\) as the probability of algae continuing to remain in the water column over a day, and \(r_0 = 1 - r_1\) as the probability of not being in the algae state at time \(t=1\).
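As an illustration of the Markov chain just defined, the following Python sketch simulates a daily presence/absence path. The function names, the seed argument, and the numerical values are hypothetical choices for illustration and are not taken from the fitted model.

```python
import random

def simulate_chain(p01, p10, r1, T, seed=0):
    """Simulate S_1, ..., S_T for the two-state chain MC(p01, p10, r1)."""
    rng = random.Random(seed)
    # Initial state: algae present with probability r1
    s = [1 if rng.random() < r1 else 0]
    for _ in range(1, T):
        if s[-1] == 0:
            # absent -> present with probability p01
            s.append(1 if rng.random() < p01 else 0)
        else:
            # present -> absent with probability p10
            s.append(0 if rng.random() < p10 else 1)
    return s

def expected_episode_days(p10):
    """Mean length of a presence episode (geometric sojourn time)."""
    return 1.0 / p10

# Hypothetical parameter values, for illustration only
path = simulate_chain(p01=0.08, p10=0.12, r1=0.5, T=30)
```

Because the sojourn time in the presence state is geometric, a daily termination probability of 0.1 would correspond to episodes lasting 10 days on average.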

To model the algae cell counts from the water sample, we chose a negative binomial distribution as it can account for overdispersion in the count data. \(X_{t}\) takes on the integer count of algal cells in a sample of water and is conditional on \(S_{t}\). Thus the possible values of \(X_t\) are \(0 \le X_t < \infty \). Although \(X_t\) is directly observed, it is also subject to measurement error. Due to the spatial heterogeneity of the algae and the water sampling technique used (Fernández et al. 2019), excess zeros are possible depending on the specific latitude and longitude sampled. When \(X_t = 0\) we cannot be sure whether the sample missed the algae or algae is truly absent from the water column; when \(X_t > 0\), however, we know that algae is present in the water column. Reversing this, when \(S_{t} = 0\) then \(X_{t} = 0\), but when \(S_{t} = 1\) it is possible but not necessary that \(X_t\) be greater than 0. We chose to model the absence and presence of algae (quantified as below or above the 50 cells/L detection limit); other thresholds can easily be chosen. Appendix A discusses the implications when two additional thresholds are considered. When algae is present in the water column, we model \(X_{t}\) with a negative binomial distribution with mean \(\mu _a\) and size \(k_a\) where \(E[X_t|S_t = 1] = \mu _a\) and \(V[X_t|S_t = 1] = \mu _a + \frac{\mu _a^2}{k_a}\). This relationship can be summarized as:

$$\begin{aligned} X_{t} = {\left\{ \begin{array}{ll} 0 &{} \text {if } S_{t} = 0\\ NB(\mu _a,k_a) &{} \text {if } S_{t} = 1.\\ \end{array}\right. } \end{aligned}$$
(1)
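The emission model in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration using the mean/size parameterization given above; the function names are hypothetical, and the parameter values used when checking the moments are arbitrary rather than estimates from the data.

```python
import math

def nb_logpmf(x, mu, k):
    """Log pmf of a negative binomial with mean mu and size k."""
    return (math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
            + k * math.log(k / (k + mu)) + x * math.log(mu / (k + mu)))

def count_emission_prob(x, s, mu, k):
    """P(X_t = x | S_t = s) following Eq. (1)."""
    if s == 0:
        # No algae in the water column: the count is zero with certainty
        return 1.0 if x == 0 else 0.0
    return math.exp(nb_logpmf(x, mu, k))

# Numerical check of the stated moments for arbitrary values mu = 5, k = 2
pmf = [count_emission_prob(x, 1, 5.0, 2.0) for x in range(400)]
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
```

For these arbitrary values the variance formula gives 5 + 25/2 = 17.5, which the numerical sum reproduces.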

We discretized the DST measurements into four states because a large proportion of values are below the quantification limit and the measurements are non-normally distributed on all commonly considered transformed scales. Discrete measurements also help with the computational feasibility of the method. Therefore, let \(Y_t^*\) represent the continuous toxin measurements and let \(Y_t\) represent the discretized measurements as follows:

$$\begin{aligned} Y_{t} = {\left\{ \begin{array}{ll} 0 &{} \text {if } Y_t^* \le 40 \mu g \text { of OA equ/kg}\\ 1 &{} \text {if } 40< Y_t^* \le 100 \mu g \text { of OA equ/kg}\\ 2 &{} \text {if } 100 < Y_t^* \le 160 \mu g \text { of OA equ/kg}\\ 3 &{} \text {if } Y_t^* > 160 \mu g \text { of OA equ/kg}\\ \end{array}\right. } \end{aligned}$$
(2)
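The discretization in Eq. (2) amounts to a lookup against three cutoffs. A minimal Python sketch follows; the function name and the use of `None` for a missing measurement are assumptions made for illustration.

```python
from bisect import bisect_left

# Cutoffs in micrograms of OA equ/kg, taken from Eq. (2)
CUTOFFS = (40.0, 100.0, 160.0)

def discretize_dst(y_star):
    """Map a continuous toxin measurement to a DST state in {0, 1, 2, 3}."""
    if y_star is None:
        return None  # missing measurement stays missing
    # bisect_left counts the cutoffs strictly below y_star, so a value
    # equal to a cutoff falls in the lower state, as in Eq. (2)
    return bisect_left(CUTOFFS, y_star)

states = [discretize_dst(v) for v in (0, 40, 40.1, 100, 160, 160.5)]
```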

These specific cutoffs were chosen because the limit of quantification is \(40 \mu g\) and fisheries must close at \(160\mu g\), while \(100\mu g\) lies halfway between these two values. The DST states follow an ordinal logistic regression model with two regression parameters and three intercept parameters. The full model is,

$$\begin{aligned} \text {logit}(P(Y_t \le c)) = \alpha _c + \beta _1 \times Y_{t-1} + \beta _2 \times S_t \end{aligned}$$
(3)

where \(0 \le c \le 2\), \(\alpha _c\) is the intercept parameter for category c, \(\beta _1\) is the regression parameter for the last toxin measurement, and \(\beta _2\) is the regression parameter for the current Markov chain state. The regression coefficients can be interpreted as follows: with each increase by one in \(Y_{t-1}\) there is \(e^{\beta _1}\) times the odds of \(Y_t=c+1\) compared to \(Y_t = c\), and when \(S_t\) increases from 0 to 1 there is \(e^{\beta _2}\) times the odds of \(Y_t=c+1\) compared to \(Y_t = c\). The dependence of \(Y_t\) on \(Y_{t-1}\) creates issues when \(Y_{t-1}\) is missing; these problems are dealt with in Sect. 3.
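To make the ordinal model concrete, the following Python sketch converts the cumulative logits of Eq. (3) into the four state probabilities by differencing. It follows Eq. (3) literally, so the direction in which a coefficient moves probability mass depends on its sign. The function names and the numerical values are hypothetical and are not estimates from the paper.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def dst_state_probs(alphas, beta1, beta2, y_prev, s):
    """Return [P(Y_t = 0), ..., P(Y_t = 3)] given Y_{t-1} = y_prev and S_t = s.

    alphas holds the three intercepts (alpha_0, alpha_1, alpha_2), which
    must be increasing for the probabilities to be non-negative.
    """
    # Cumulative probabilities P(Y_t <= c) for c = 0, 1, 2, then P(Y_t <= 3) = 1
    cum = [expit(a + beta1 * y_prev + beta2 * s) for a in alphas] + [1.0]
    # Differences of consecutive cumulative probabilities
    return [cum[0]] + [cum[c] - cum[c - 1] for c in range(1, 4)]

# Hypothetical coefficients, for illustration only
absent = dst_state_probs((2.0, 3.0, 4.0), -1.0, -1.5, y_prev=0, s=0)
present = dst_state_probs((2.0, 3.0, 4.0), -1.0, -1.5, y_prev=0, s=1)
```

With these illustrative negative coefficients, the probability of the lowest DST state is smaller when algae is present than when it is absent.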

The joint distribution of the latent water column algae state, the observed water sample algae count, and DST measurement can be calculated by multiplying the following three components: (1) the probability of being in a water column algae state, (2) conditional on the water column state, the probability of the algae cell count from the water sample, and (3) conditional on the water column state and the last DST state, the probability of the current DST state. The complete-data joint distribution can be written as:

$$\begin{aligned} \begin{aligned} f(\mathbf{S} ,\mathbf{X} ,\mathbf{Y} )&= r_{S_{1}} \prod _{t=2}^T p_{S_{t-1}S_{t}}\\&\times \prod _{t=1}^T NB(X_{t}| \mu _a,k_a)^{S_{t}}\\&\times \prod _{t=1}^T P(Y_t| S_t, Y_{t-1}). \end{aligned} \end{aligned}$$
(4)

We assume that \(Y_0 = 0\) because most values of \(Y_t\) are zero. As a sensitivity analysis, we also ran our analysis with \(Y_0=1\), \(Y_0=2\), and \(Y_0=3\), and the results did not change. The estimation procedures are described in the next section. Parametric bootstrap standard errors were calculated by simulating data 500 times per site using the estimated parameters and, for each parameter, taking the standard deviation of the 500 resulting estimates. Finally, we apply the Viterbi algorithm to reconstruct the highest-likelihood hidden state path.

3 Estimation

We will now introduce the procedure for estimation assuming no missing data. Define the indicator variable Z such that

$$\begin{aligned} Z_{s_t}(S_t) = {\left\{ \begin{array}{ll} 0 &{} \text {if } S_t \ne s_t \\ 1 &{} \text {if } S_t = s_t.\\ \end{array}\right. } \end{aligned}$$

This indicator function is critical for later computations and can be used with variables other than \(S_t\). We adopt the notation where \(S_{t}\) refers to the random variable, while \(s_{t}\) refers to a possible value of the random variable \(S_{t}\). We use this notation across all random variables. This indicator function is equal to 1 when the random variable of choice is equal to a specific realization of the random variable. We can then rewrite the complete-data joint distribution as:

$$\begin{aligned} \begin{aligned} f(\mathbf{S} ,\mathbf{X} ,\mathbf{Y} )&= \prod _{s_1=0}^1r_{s_1}^{Z_{s_1}(S_1)} \prod _{t=2}^T \prod _{s_{t-1}=0}^1\prod _{s_t = 0}^1 p_{s_{t-1}s_t}^{Z_{s_{t-1}}(S_{t-1})Z_{s_t}(S_t)}\\&\times \prod _{t=1}^T NB(X_t = x_t|\mu _a,k_a)^{S_t}\\&\times \prod _{t=1}^T \prod _{s_t=0}^1 P(Y_t = y_t | S_t=s_t, Y_{t-1} = y_{t-1})^{Z_{s_t}(S_t)}. \end{aligned} \end{aligned}$$
(5)

As \(\mathbf{S} \) is not observable, to maximize this likelihood directly we would have to iterate over every possible value, thus calculating

$$\begin{aligned} \begin{aligned} \sum _{s_{1}=0}^1 \cdots \sum _{s_{T}=0}^1 f((s_{1},\ldots ,s_{T}),\mathbf{X} ,\mathbf{Y} ). \end{aligned} \end{aligned}$$
(6)
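To give a sense of scale for the sum in (6), consider the two-state chain over the 2177 days of follow-up used in the application. The count below is simple arithmetic on the number of possible state paths, not a result reported in the paper:

$$\begin{aligned} 2^{T} = 2^{2177} = 10^{2177 \log _{10} 2} \approx 10^{655.34} \approx 2.2 \times 10^{655}. \end{aligned}$$

Each of these paths would require its own evaluation of the joint distribution.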

Maximizing this directly becomes intractable as T increases; with \(T = 2177\) days, it is not feasible for our application. By using the expectation-maximization (EM) algorithm we can maximize this likelihood in a timely manner by alternating between an expectation step and a maximization step until the parameter estimates converge. The expectation step calculates the following expected complete-data log likelihood:

$$\begin{aligned} \begin{aligned} E[\text {log }f(\mathbf{S} ,\mathbf{X} ,\mathbf{Y} )|\mathbf{X} ,\mathbf{Y} ] =&\sum _{s_1=0}^1E[Z_{s_1}(S_1)|\mathbf{X} ,\mathbf{Y} ]\text {log}(r_{s_1})\\&+ \sum ^T_{t=2}\sum _{s_{t-1} = 0}^1\sum _{s_t=0}^1E[Z_{s_{t-1}}(S_{t-1})Z_{s_t}(S_t)|\mathbf{X} ,\mathbf{Y} ]\text {log}(p_{s_{t-1}s_t})\\&+ \sum ^T_{t=1}E[S_t|\mathbf{X} ,\mathbf{Y} ]\text {log}(NB(x_t|\mu _a,k_a))\\&+ \sum ^T_{t=1}\sum _{s_t = 0}^1 E[Z_{s_t}(S_t)|\mathbf{X} ,\mathbf{Y} ]\text {log}(P(y_t | s_t, y_{t-1})), \end{aligned} \end{aligned}$$
(7)

where the expectations can all be calculated using the forward-backward algorithm (Baum 1972).

Fig. 1 Sample of observations for a three month period in 2016 for area 1. The left panel shows algae count observations while the right shows DST state observations. Dots correspond to observations, while x’s signify a missing observation. The dashed line on the left shows the algae threshold of 50 cells/L

Fig. 2 Decoded water column state (\(S_t\)) path using the Viterbi algorithm for 2016 in area 1. The line is the decoded path while the dots indicate absence and presence of observed algae. When algae counts are above 0 we know that \(S_t\) must equal one, however when algae counts are 0 we cannot say anything about \(S_t\). For the figure, this is why when we observe algae (indicated by a dot at 1) the path (the line) must pass through it, but when we do not observe algae (a dot at 0) the path may or may not pass through it

3.1 Estimation with missing data

Often, some parts of the observable data are missing. As shown in Fig. 1, most observations are missing (other areas are shown in Fig. 4). This time frame was chosen because most algal blooms occur in spring and summer. In our example, when \(X_t\) is missing we simply leave out the corresponding term in the second line of the likelihood calculation; when \(Y_t\) is missing, however, a more complicated method is required. Because the current DST state depends on the last DST state, a missing \(Y_t\) must be accounted for when calculating the probability of \(Y_{t+1}\). By summing over all possible DST states for \(Y_t\), we can calculate the probability of \(Y_{t+1}\). The complete-data joint distribution, accounting for missing data, can be written as

$$\begin{aligned} \begin{aligned} f(\mathbf{S} ,\mathbf{X} ,\mathbf{Y} )&= \prod _{s_1=0}^1r_{s_1}^{Z_{s_1}(S_1)} \prod _{t=2}^T \prod _{s_{t-1}=0}^1\prod _{s_t = 0}^1 p_{s_{t-1}s_t}^{Z_{s_{t-1}}(S_{t-1})Z_{s_t}(S_t)}\\&\times \prod _{t=1}^T NB(X_t = x_t|\mu _a,k_a)^{Z_1(S_t)}\\&\times \prod _{t=1}^T \prod _{s_t=0}^1 \prod _{y_{t-1}=0}^3 \prod _{y_{t}=0}^3 P(Y_t = y_t | S_t=s_t, Y_{t-1} = y_{t-1})^{Z_{s_t}(S_t)Z_{y_{t-1}}(Y_{t-1})Z_{y_{t}}(Y_{t})}, \end{aligned} \end{aligned}$$
(8)

where the expected complete-data log likelihood is now

$$\begin{aligned} \begin{aligned} E[\text {log }f(\mathbf{S} ,\mathbf{X} ,\mathbf{Y} )|\mathbf{X} ,\mathbf{Y} ] =&\sum _{s_1=0}^1E[Z_{s_1}(S_1)|\mathbf{X} ,\mathbf{Y} ]\text {log}(r_{s_1})\\&+ \sum ^T_{t=2}\sum _{s_{t-1} = 0}^1\sum _{s_t=0}^1E[Z_{s_{t-1}}(S_{t-1})Z_{s_t}(S_t)|\mathbf{X} ,\mathbf{Y} ]\text {log}(p_{s_{t-1}s_t})\\&+ \sum ^T_{t=1}E[Z_1(S_t)|\mathbf{X} ,\mathbf{Y} ]\text {log }(NB(x_t|\mu _a,k_a))\\&+ \sum ^T_{t=1}\sum _{s_t = 0}^1 \sum _{y_{t-1} = 0}^3 \sum _{y_{t} = 0}^3\\&\quad E[Z_{s_t}(S_t)Z_{y_{t-1}}(Y_{t-1})Z_{y_{t}}(Y_{t})|\mathbf{X} ,\mathbf{Y} ]\text {log}(P(y_t | s_t, y_{t-1})). \end{aligned} \end{aligned}$$
(9)

To account for the missing data and the dependency in the emissions distribution, we use the adapted forward-backward algorithm from Stanculescu et al. (2014); our application, however, has a bivariate rather than univariate emissions distribution. For the maximization step, we maximize equation (9) given the E-step calculations. Because the E-step expectations cannot be computed by direct enumeration, we use the forward-backward algorithm described in the next section. We consider convergence to occur when the log likelihood increase between iterations is less than 0.01.
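For the Markov chain parameters, maximizing the first two lines of (9) under the sum-to-one constraints gives the usual weighted-frequency updates. The Python sketch below illustrates this; the function name and input layout are hypothetical. The negative binomial and ordinal logistic parameters have no comparably simple closed form, and we assume here that they would be updated by numerical optimization, a detail the text does not specify.

```python
def update_markov_parameters(gamma1, xi):
    """M-step updates for (p01, p10, r1).

    gamma1[s]   : E[Z_s(S_1) | X, Y] for s = 0, 1
    xi[t][i][j] : E[Z_i(S_{t-1}) Z_j(S_t) | X, Y], one 2x2 table per t = 2..T
    """
    # Initial-state probability: posterior weight of starting in state 1
    r1 = gamma1[1] / (gamma1[0] + gamma1[1])
    # Expected number of i -> j transitions over the whole series
    n = [[sum(table[i][j] for table in xi) for j in range(2)] for i in range(2)]
    # Each transition probability is an expected count over its row total
    p01 = n[0][1] / (n[0][0] + n[0][1])
    p10 = n[1][0] / (n[1][0] + n[1][1])
    return p01, p10, r1

# Hypothetical E-step output for a series with T = 3
gamma1 = [0.3, 0.7]
xi = [[[0.5, 0.1], [0.1, 0.3]],
      [[0.4, 0.2], [0.05, 0.35]]]
p01, p10, r1 = update_markov_parameters(gamma1, xi)
```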

3.2 Forward–backward algorithm

To account for missing toxin values, we will keep track of every possible DST state value. Assume that \(Y_t\) is missing. By calculating the probability of all possible DST states at time t, we can then calculate the probability of \(Y_{t+1}\). We redefine the indicator variable Z to account for the scenario of missing data. Let

$$\begin{aligned} Z_{y_t}(Y_t) = {\left\{ \begin{array}{ll} 0 &{} \text {if } Y_t \ne y_t \\ 1 &{} \text {if } Y_t = y_t \text { or if } y_t \text { is missing.}\\ \end{array}\right. } \end{aligned}$$

The forward quantity is:

$$\begin{aligned} \begin{aligned} \alpha _{s_t}(t,\omega ) = P(X_1 = x_1, \ldots , X_t = x_t, Y_1 = y_1, \ldots , Y_{t} =\omega , S_t = s_t) \times Z_{y_t}(\omega ). \end{aligned} \end{aligned}$$
(10)

The indicator function, not present in the standard forward quantity, allows us to incorporate different possible values of \(Y_t\) when \(Y_t\) is missing. If \(Y_t\) is observed the forward quantity is zero except when \(y_t = \omega \). However, when \(Y_t\) is missing, \(\omega \) corresponds to a possible DST state value at time t. This quantity is calculated recursively by

$$\begin{aligned} \alpha _{s_{t}}(t,\omega ) = {\left\{ \begin{array}{ll} r_{s_1}NB(x_1|\mu _a,k_a)^{s_1}P(Y_1 = \omega |s_1, Y_0 = 0) \times Z_{y_1}(\omega )&{} \text {if } t = 1 \\ \sum _{{s_{t-1}} = 0}^1 \sum _{\omega _0 = 0}^3\alpha _{s_{t-1}}(t-1,\omega _0) p_{s_{t-1}s_{t}} NB(x_t|\mu _a,k_a)^{s_t}P(Y_t = \omega |s_t,Y_{t-1}=\omega _0) \times Z_{y_t}(\omega ) &{} \text {if } t > 1. \\ \end{array}\right. } \end{aligned}$$
(11)
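The forward recursion in (11) can be sketched in Python as follows. This is an unscaled illustration under stated assumptions: missing values are encoded as `None`, the negative binomial factor is dropped when \(x_t\) is missing, and all function and argument names are hypothetical. For a series as long as the application's, each layer would need to be rescaled (or the recursion run in log space) to avoid numerical underflow.

```python
import math

def nb_pmf(x, mu, k):
    """Negative binomial pmf with mean mu and size k."""
    return math.exp(math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
                    + k * math.log(k / (k + mu)) + x * math.log(mu / (k + mu)))

def dst_probs(alphas, b1, b2, y_prev, s):
    """P(Y_t = w | S_t = s, Y_{t-1} = y_prev) for w = 0..3, from Eq. (3)."""
    cum = [1 / (1 + math.exp(-(a + b1 * y_prev + b2 * s))) for a in alphas] + [1.0]
    return [cum[0]] + [cum[c] - cum[c - 1] for c in range(1, 4)]

def forward(x, y, p01, p10, r1, mu, k, alphas, b1, b2):
    """Return alpha[t][s][w]; x[t] and y[t] are None when missing."""
    trans = [[1 - p01, p01], [p10, 1 - p10]]
    init = [1 - r1, r1]
    alpha = []
    for t in range(len(x)):
        layer = [[0.0] * 4 for _ in range(2)]
        for s in range(2):
            # NB(x_t)^{s_t}: equals 1 when s = 0, and is dropped when x_t is missing
            fx = 1.0 if (s == 0 or x[t] is None) else nb_pmf(x[t], mu, k)
            for w in range(4):
                if y[t] is not None and y[t] != w:
                    continue  # indicator Z_{y_t}(w) is zero
                if t == 0:
                    # First case of (11), with Y_0 = 0
                    layer[s][w] = init[s] * fx * dst_probs(alphas, b1, b2, 0, s)[w]
                else:
                    # Second case of (11): sum over previous state and DST value
                    layer[s][w] = fx * sum(
                        alpha[t - 1][sp][w0] * trans[sp][s]
                        * dst_probs(alphas, b1, b2, w0, s)[w]
                        for sp in range(2) for w0 in range(4))
        alpha.append(layer)
    return alpha

def likelihood(alpha):
    """P(X, Y): the sum of the final forward layer over states and DST values."""
    return sum(sum(row) for row in alpha[-1])
```

A quick sanity check on such a sketch is that a series with every value missing has likelihood 1, and that appending missing days to an observed series leaves its likelihood unchanged.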

The backward quantity is defined as

$$\begin{aligned} \begin{aligned}&\beta _{s_t}(t,\omega ) = P(X_{t+1} = x_{t+1}, \ldots , X_T = x_T, Y_{t+1} = y_{t+1}, \ldots , Y_T = y_T | S_t = s_t,Y_t = \omega )\\&\quad \times Z_{y_{t}}(\omega ), \end{aligned} \end{aligned}$$
(12)

where, similarly to the forward quantity, if \(Y_t\) is observed the backward quantity is zero except when \(y_t = \omega \) and if \(Y_t\) is missing \(\omega \) corresponds to a possible DST state value at time t. It is also calculated recursively:

$$\begin{aligned} \beta _{s_{t}}(t,\omega ) = {\left\{ \begin{array}{ll} Z_{y_T}(\omega ) &{} \text {if } t = T \\ \sum ^1_{{s_{t+1}}=0}\sum _{\omega _0=0}^3p_{{s_{t}}{s_{t+1}}} NB(x_{t+1}|\mu _a,k_a)^{s_{t+1}}P(Y_{t+1} = \omega _0|s_{t+1},Y_t = \omega )\beta _{s_{t+1}}(t+1,\omega _0)\times Z_{y_t}(\omega ) &{} \text {if } t < T. \\ \end{array}\right. } \end{aligned}$$
(13)
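A matching Python sketch of the backward recursion is given below. It takes the terminal value to be the indicator \(Z_{y_T}(\omega )\), which is what the definition in (12) gives at \(t = T\); this is our reading, and it reduces to 1 for every \(\omega \) when \(Y_T\) is missing. As before, missing values are encoded as `None`, the names are hypothetical, and the recursion is unscaled.

```python
import math

def nb_pmf(x, mu, k):
    """Negative binomial pmf with mean mu and size k."""
    return math.exp(math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
                    + k * math.log(k / (k + mu)) + x * math.log(mu / (k + mu)))

def dst_probs(alphas, b1, b2, y_prev, s):
    """P(Y_t = w | S_t = s, Y_{t-1} = y_prev) for w = 0..3, from Eq. (3)."""
    cum = [1 / (1 + math.exp(-(a + b1 * y_prev + b2 * s))) for a in alphas] + [1.0]
    return [cum[0]] + [cum[c] - cum[c - 1] for c in range(1, 4)]

def backward(x, y, p01, p10, mu, k, alphas, b1, b2):
    """Return beta[t][s][w]; x[t] and y[t] are None when missing."""
    trans = [[1 - p01, p01], [p10, 1 - p10]]
    T = len(x)
    beta = [[[0.0] * 4 for _ in range(2)] for _ in range(T)]
    # Terminal layer: the indicator Z_{y_T}(w)
    for s in range(2):
        for w in range(4):
            if y[T - 1] is None or y[T - 1] == w:
                beta[T - 1][s][w] = 1.0
    for t in range(T - 2, -1, -1):
        for s in range(2):
            for w in range(4):
                if y[t] is not None and y[t] != w:
                    continue  # indicator Z_{y_t}(w) is zero
                total = 0.0
                for sn in range(2):
                    # NB(x_{t+1})^{s_{t+1}}; dropped when x_{t+1} is missing
                    fx = 1.0 if (sn == 0 or x[t + 1] is None) else nb_pmf(x[t + 1], mu, k)
                    probs = dst_probs(alphas, b1, b2, w, sn)
                    for w0 in range(4):
                        total += trans[s][sn] * fx * probs[w0] * beta[t + 1][sn][w0]
                beta[t][s][w] = total
    return beta
```

With every value missing, each backward quantity equals 1, which gives a simple check of the recursion.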

3.3 Calculating expectations

The expectations from the complete-data log likelihood are calculated as follows:

$$\begin{aligned} \begin{aligned} E[Z_{s_t}(S_t)|\mathbf{X} ,\mathbf{Y} ] = P(S_t = s_t|\mathbf{X} ,\mathbf{Y} ) =\frac{\sum _{\omega =0}^3\alpha _{s_t}(t,\omega )\beta _{s_t}(t,\omega )}{P(\mathbf{X} ,\mathbf{Y} )} \end{aligned} \end{aligned}$$
(14)
$$\begin{aligned} \begin{aligned}&E[Z_{s_{t-1}}(S_{t-1})Z_{s_t}(S_t) | \mathbf{X} ,\mathbf{Y} ]\\&\quad =\frac{P(S_{t-1} = s_{t-1},S_t = s_t, \mathbf{X} ,\mathbf{Y} )}{P(\mathbf{X} ,\mathbf{Y} )} \\&\quad =\frac{\sum _{\omega _1=0}^3\sum _{\omega _2=0}^3\alpha _{s_{t-1}}(t-1,\omega _1)p_{{s_{t-1}}{s_t}} g(x_t|\mu _a,k_a)^{s_t}P(Y_t=\omega _2|s_t,Y_{t-1} = \omega _1)\beta _{s_t}(t,\omega _2)}{P(\mathbf{X} ,\mathbf{Y} )} \end{aligned} \end{aligned}$$
(15)
$$\begin{aligned} \begin{aligned}&E[Z_{s_t}(S_t)Z_{y_{t-1}}(Y_{t-1})Z_{y_{t}}(Y_{t})|\mathbf{X} ,\mathbf{Y} ]\\&\quad = \frac{P(S_t = s_t, Y_{t-1} = y_{t-1}, Y_{t} = y_{t}, \mathbf{X} ,\mathbf{Y} )}{P(\mathbf{X} ,\mathbf{Y} )} \\&\quad = \sum _{s_{t-1}=0}^1 \frac{P(S_{t-1} = s_{t-1}, S_t = s_t, Y_{t-1} = y_{t-1}, Y_{t} = y_{t}, \mathbf{X} ,\mathbf{Y} )}{P(\mathbf{X} ,\mathbf{Y} )} \\&\quad = \sum _{s_{t-1}=0}^1 \frac{ \alpha _{s_{t-1}}(t-1,y_{t-1})p_{s_{t-1}s_t}g(x_t|\mu _a,k_a)^{s_t}P(y_t|s_t,y_{t-1})\beta _{s_t}(t,y_t)}{P(\mathbf{X} ,\mathbf{Y} )} \end{aligned} \end{aligned}$$
(16)
$$\begin{aligned} \begin{aligned} P(\mathbf{X} ,\mathbf{Y} ) = \sum _{s_T = 0}^1\sum _{\omega =0}^3\alpha _{s_T}(T,\omega ) \end{aligned} \end{aligned}$$
(17)
Table 1 Comparison of truth and estimated parameter from three different simulations with 0%, 33%, and 85% missing data

In equations (15) and (16), the function \(g(x_t|\mu _a, k_a)\) is the probability mass function of the negative binomial distribution with parameters \(\mu _a\) and \(k_a\), evaluated at \(x_t\).
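Given forward and backward arrays, the posterior state probabilities in (14) reduce to a sum over the DST index followed by normalization with (17). A minimal Python sketch follows; the function name and the array layout `alpha[t][s][w]`, `beta[t][s][w]` are assumptions, and the arrays are taken to be unscaled.

```python
def smoothed_state_probs(alpha, beta):
    """Return gamma[t][s] = P(S_t = s | X, Y) as in Eq. (14)."""
    # P(X, Y) from Eq. (17): sum the final forward layer
    total = sum(sum(row) for row in alpha[-1])
    gamma = []
    for a_t, b_t in zip(alpha, beta):
        gamma.append([
            sum(a_t[s][w] * b_t[s][w] for w in range(4)) / total
            for s in range(2)
        ])
    return gamma

# Toy single-day example with hand-made arrays (not from the data)
alpha = [[[0.1, 0.0, 0.0, 0.0], [0.3, 0.0, 0.0, 0.0]]]
beta = [[[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]]
gamma = smoothed_state_probs(alpha, beta)
```

In the toy example the two posterior probabilities are 0.25 and 0.75, the forward weights renormalized to sum to one.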

4 Simulation

We examine the performance of our proposed method by analyzing three simulations with varying amounts of missing data. We simulated data sets with no missing data, one-third of the data missing, and 85% of the data missing. For each category, 500 data sets were generated with the same follow-up length as the data (2177 days). The simulated data structure corresponds to the application presented in Sect. 5. These three amounts of missing data were chosen as they cover a wide variety of scenarios while also testing this specific application, which has (depending on the site) at most 83% of the data missing. By varying the level of missingness, we can measure how well our method performs at recovering the true parameters with different levels of information available. This is especially important to test for the DST measurements. When there is no missing data the estimation for the DST is straightforward; with missing data, however, the adapted forward-backward algorithm explained in the estimation section is needed. With more missing data there are longer times between observations, meaning more reliance on the proposed adaptation to account for \(Y_{t}\) when calculating the probability of \(Y_{t+1}\).

For the simulations that have no missing data, the estimates are extremely accurate with minimal standard errors. Table 1 contains the simulation estimates for the three different levels of missing data, and shows that our method is extremely accurate at recovering the true parameters regardless of missing data. Additional computation time is required when our method encounters missing data as all possible values of the last DST state are iterated over. Thus, the time needed for our method scales linearly with the amount of missing data.

Table 2 Summary of basic information about the collected algae count and DST state data. DST continuous measurements were discretized into four states: 0 (\(\le 40 \mu g\) of OA equ/kg), 1 (\(> 40\) and \(\le 100 \mu g\) of OA equ/kg), 2 (\(> 100\) and \(\le 160\mu g\) of OA equ/kg), and 3 (\(> 160\mu g\) of OA equ/kg)

Even with most of the data missing, our method accurately estimates the parameters. However, as more data is missing, the standard errors increase. While this increase is modest when one-third of the data is missing, it is much larger when 85% of the data is missing. The standard errors for the 33% missing simulation roughly double when compared to the 0% missing simulation, whereas the standard errors increase by a factor ranging from 9 to 37 when comparing the 85% and the 0% missing simulations. This can most easily be seen in the third row of Table 1 for \(\mu _a\). The estimate itself is accurate across the three levels, but the standard error when 85% of the data is missing is extremely large, even compared to the standard error when 33% of the data is missing. Despite this high variability, there is no relationship between the initial and estimated value in the simulations.

5 Application

5.1 Dataset description

We illustrate our method on data gathered from the regional government of Andalucía’s website (Zonas de producción n.d.). The Andalucían government established a phytoplankton toxin monitoring program for shellfish in 1994 to help deal with the recurrent blooms of Dinophysis that are linked to diarrhetic shellfish poisoning (DSP) outbreaks (Bouza and Aboal 2008). We used data on toxin levels, measured in \(\mu g\) of OA equivalent/kg, sampled from the bivalve Donax trunculus in the time frame from January 2015 to December 2020. The follow-up length is 2,177 days. Toxin levels were calculated as specified by Yasumoto et al. (1984), and liquid chromatography-tandem mass spectrometry was used as the chemical analysis technique (Fernández et al. 2019). Water column samples used to calculate algae cell counts were gathered using a 10-meter-long weighted plastic hose. Water samples of 25 mL from sedimentation chambers were then used to extrapolate the number of cells/L (Velo-Suárez and Estrada 2007). Data from eight geographical sites (areas 1, 2, 3, 4, 5, 6, 7, and 8) were analyzed separately. Areas 7 and 8 were recorded as a single area until May 2018 and were then split into distinct areas; we analyzed each site in a separate model. Table 2 contains summary statistics about the data.

5.2 Results

Our HMM has two states representing the presence or absence of potentially toxigenic Dinophysis algae in the water column at a concentration exceeding a threshold (e.g., \(\ge \) 500 cells/L). Both the initiation (\(p_{01}\)) and termination (\(p_{10}\)) probabilities are low, as can be seen in Table 3, indicating a tendency for algae to stay in or remain out of the water column for a number of days (corresponding to the 1 or 0 state of the HMM). Despite minor differences between the sites, there is broad homogeneity among them, with the initiation probabilities being slightly lower than 10% and most termination probabilities hovering just above 10%. Within each area, initiation probabilities were lower than their corresponding termination probabilities. Although the state path of the hidden Markov model is unobserved, it is an important quantity to recover as it can be useful in determining long-term changes in algae and can have implications for the effects of climate change. We reconstructed the hidden state paths using the Viterbi algorithm, producing the path with the highest likelihood. Figure 2 shows a visualization of the Viterbi path for area 1 in 2016 (other areas are shown in Fig. 5). Using the Viterbi path we can then calculate different summary statistics for algae presence/absence across time. As shown in Fig. 3, distinct trends can be seen within each year and across years. For instance, for areas 1-7 the earlier and later years have a higher proportion of algae in the water column when compared to the middle years. The proportion of days with algae was estimated to be at least 54.52% and 61.1% for 2015 and 2019, respectively, while in 2017 it was estimated to be at most 48.22%.

Fig. 3 Proportion of predicted days with algae presence in the Viterbi path across all areas and years

Fig. 4 Sample of observations for a three month period in 2016 for areas 2–8. Panels on the left show algae count observations and panels on the right show DST state observations, with areas grouped by row. Dots correspond to observations, while x’s signify a missing observation. The dashed line on the left shows the algae threshold of 50 cells/L

Fig. 5 Decoded water column state (\(S_t\)) path using the Viterbi algorithm for 2016 for areas 2–8. The line is the decoded path while the dots indicate absence and presence of observed algae. When algae counts are above the threshold of 50 cells/L, indicated with a black dot at 1, \(S_t\) must equal one

As noted previously, we modeled the algae from the water sample with a negative binomial model when \(S_t = 1\), and assumed that there cannot be any algae in the water sample when \(S_t=0\). When there is algae in the water sample we can assume that \(S_t = 1\), because otherwise it would not be possible for algae to be present. By contrast, we cannot draw any conclusions when there is no algae in the water sample. This is because the algae cell count from the water sample serves as an observable proxy, subject to measurement error, for algae in the water column. Because the algae is not distributed evenly by latitude, longitude, or depth, the water sample may not accurately capture whether algae is in the water column. As noted in the methods section, for this application we consider algae to be present in the water sample when it can be detected (a threshold of 50 cells per liter). We examine the consequences of higher thresholds in Appendix A. Areas 3 and 5 have a mean parameter (\(\mu _a\)) around 230–240, while areas 1, 2, 4, and 6 have a higher mean parameter ranging from 265 to 290. Area 8 has a larger mean parameter of 320, and area 7 has a substantially larger mean of around 440. Barring area 7, the size parameter (\(k_a\) in Eq. 1) is above 1, indicating that the negative binomial model is essential to help account for over-dispersion. As area 7 has the largest mean parameter and smallest size parameter (see expression 1), this leads to larger, more variable predictions for area 7.

The continuous DST measurements were discretized to form four (0 to 3) DST states, which follow an ordinal logistic model. The continuous measurements are highly skewed, as the limit of quantification is 40 \(\mu \)g of OA equ/kg. Binning the continuous measurements below 40 \(\mu \)g of OA equ/kg reduced the number of distributional assumptions. Unlike the algae cell count from the water sample, which depends only on the current Markov chain state, the DST states are assumed to depend on the current Markov chain state as well as the last DST state. This additional dependency is necessary in the emission distribution because major components of DST have been found to be very stable in the water column after a Dinophysis bloom (Blanco et al. 2018; Pizarro et al. 2009). This dependency cannot be estimated using the standard forward-backward algorithm, as the standard HMM assumes that the observed states are all conditionally independent given the current latent state. In our model, the current observed state depends on the previous observed state and the current latent state, violating this assumption. Instead, by using procedures developed for autoregressive HMMs (Stanculescu et al. 2014), we can incorporate this dependency into the estimation procedure. We believe that a first-order autoregressive model is applicable because algae blooms are rapid events. A shorter time dependency allows us to better model these events.
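The dependence structure described in this paragraph can be written as a factorization of the complete-data likelihood. This is a sketch under stated assumptions: \(X_t\) is a symbol introduced here for the water-sample algae count, and the initial term \(P(Y_1 \mid S_1)\) is one possible way to handle the first DST state, not necessarily the one used in our estimation procedure.

\[
P(S_{1:T}, X_{1:T}, Y_{1:T}) = P(S_1)\,P(X_1 \mid S_1)\,P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\,P(X_t \mid S_t)\,P(Y_t \mid Y_{t-1}, S_t)
\]

The factor \(P(Y_t \mid Y_{t-1}, S_t)\) is what distinguishes this structure from a standard HMM, in which it would reduce to \(P(Y_t \mid S_t)\).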

Our ordinal logistic model has five parameters: \(\beta _1\) is the effect of the DST state at time \(t-1\), \(\beta _2\) is the effect of the current Markov state, and \(\alpha _0, \alpha _1,\) and \(\alpha _2\) are the intercept parameters. The effect of the last DST state is additive in relation to the log odds of the probability of the current DST state such that the effect of the last DST state is \(\beta _1\) when \(Y_{t-1}=1\), \(2 \times \beta _1\) when \(Y_{t-1}=2\), and \(3 \times \beta _1\) when \(Y_{t-1}=3\). Table 3 contains all parameter estimates for the eight different areas along with their standard errors.
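To make the role of the five parameters concrete, the following Python sketch computes the four DST state probabilities from a cumulative-logit (proportional-odds) form. The sign convention, in which \(P(Y_t \le j) = \text{logit}^{-1}(\alpha _j - \beta _1 Y_{t-1} - \beta _2 S_t)\), is an assumption made for illustration, and the parameter values are hypothetical rather than taken from Table 3.

```python
import math

def logistic(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def dst_state_probs(y_prev, s, alphas, beta1, beta2):
    """P(Y_t = j | Y_{t-1} = y_prev, S_t = s) for j = 0..3.

    Assumed form: P(Y_t <= j) = logistic(alpha_j - beta1 * y_prev - beta2 * s),
    with increasing intercepts alphas = (alpha_0, alpha_1, alpha_2).
    """
    eta = beta1 * y_prev + beta2 * s                    # additive effect on the log odds
    cum = [logistic(a - eta) for a in alphas] + [1.0]   # cumulative probabilities
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical parameter values, chosen only to show the direction of the effects.
alphas, beta1, beta2 = (1.0, 3.0, 5.0), 2.0, 1.0
for y_prev in range(4):
    row = dst_state_probs(y_prev, 1, alphas, beta1, beta2)
    print(y_prev, [round(p, 3) for p in row])
```

Because the previous DST state enters as \(Y_{t-1} \times \beta _1\), each one-step increase in the last DST state shifts the log odds by the same amount, which matches the additive description above.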

Despite the somewhat high standard errors for the regression coefficients, the probabilities themselves have low standard errors. The probabilities and standard errors for area 1, along with the other areas, are shown in Table 4. Interestingly, we can see the difference in predictive power between \(S_t\) and \(Y_t\) by looking at this table. For each area, the difference between the left and right halves of the table is not nearly as drastic as the difference between the rows, indicating that the autoregressive effect on \(Y_t\) is indispensable to this model.

6 Discussion

In this paper we have focused on the historical reconstruction of an incomplete time-series by developing a model that recreates the most likely pattern of Dinophysis spp. algae occurrence at each of the eight different sites on a daily timescale, using an HMM with extensions to account for challenges inherent to the data. DST measurements were highly autocorrelated, even after accounting for the hidden states of the HMM, violating one of the standard assumptions of HMMs (Rabiner and Juang 1986). However, an autoregressive HMM allows us to model this autocorrelation. The sampling frequency of the monitoring program resulted in large amounts of missing data (at most 83%). Furthermore, the distribution of DST was skewed and left-censored at the assay limit of quantification, 40 \(\mu \)g of OA equivalents per kg (Zonas de producción, n.d.). We addressed these challenges with an advanced HMM that included a bivariate emissions distribution with a negative binomial distribution for algae counts and an ordinal autoregressive model for the serial DST measurements. We showed with simulations that the approach accurately estimated the parameters even with extensive missing measurements.

Table 3 Markov transition probabilities for algae presence/absence in the water column, mean and size negative binomial parameters for water sample algae count, and ordinal logistic regression coefficients for DST state
Table 4 Probability of being in each DST state for every area given the last DST state and the current Markov state

This paper presents a modification of the forward-backward algorithm of Stanculescu et al. (2014), used in an EM context and extended with an additional observed variable, and applies it to data from a phytoplankton toxin monitoring program in western Andalucía. This generalized form allows us to estimate a model with both dependence in the emissions distribution and missing data. In our application, DST states depend on both the current Markov state and the last DST state. The proposed method works by keeping track of all possible DST states (when the DST state is missing) in the forward-backward algorithm. We can then condition on and sum over the most recent DST state to calculate the probability of the current DST state. Although this does lead to additional computational complexity, the time needed scales linearly with the amount of missing data and is still feasible when nearly all of the data is missing.
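The idea of tracking every possible DST state when an observation is missing can be sketched as a forward recursion over the joint pair \((S_t, Y_t)\). The Python code below is an illustrative simplification, not the estimation code used for our results: the function name, the argument layout, and the initial distribution `init_y` are hypothetical, and no scaling is applied, so it would underflow on long series.

```python
def forward_likelihood(y_obs, algae_lik, init_s, init_y, trans, emis_y):
    """Likelihood of the observed data under a first-order autoregressive HMM.

    y_obs[t]            : observed DST state, or None when missing
    algae_lik[t][s]     : likelihood of the algae observation at t given S_t = s
                          (use 1.0 for both states when the count is missing)
    init_s[s]           : P(S_0 = s)
    init_y[s][y]        : P(Y_0 = y | S_0 = s)  (hypothetical initial condition)
    trans[r][s]         : P(S_t = s | S_{t-1} = r)
    emis_y[s][y_prev][y]: P(Y_t = y | Y_{t-1} = y_prev, S_t = s)
    """
    n_s, n_y = len(init_s), len(init_y[0])

    def allowed(t):
        # A missing DST state keeps every candidate; an observed one keeps only itself.
        return range(n_y) if y_obs[t] is None else [y_obs[t]]

    # alpha[s][y] = P(observations up to t, S_t = s, Y_t = y)
    alpha = [[0.0] * n_y for _ in range(n_s)]
    for s in range(n_s):
        for y in allowed(0):
            alpha[s][y] = init_s[s] * algae_lik[0][s] * init_y[s][y]

    for t in range(1, len(y_obs)):
        new = [[0.0] * n_y for _ in range(n_s)]
        for s in range(n_s):
            for y in allowed(t):
                total = 0.0
                # Condition on and sum over the previous latent state and DST state.
                for r in range(n_s):
                    for y_prev in range(n_y):
                        total += alpha[r][y_prev] * trans[r][s] * emis_y[s][y_prev][y]
                new[s][y] = algae_lik[t][s] * total
        alpha = new

    return sum(sum(row) for row in alpha)
```

When the previous DST state was observed, only one entry per latent state of `alpha` is nonzero, so the inner sum effectively collapses to a single term; the extra work arises only around missing DST observations.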

We applied this method to 2,177 days of algae water samples and DST data from eight geographical sites with dates ranging from January 2015 to December 2020. Despite the long stretch of time covered in the study, most days had no recorded data. Although the data available varied by site, it ranged between 377 (17%) and 524 (24%) days with recorded measurements out of the total 2,177 days. Although HMMs have not been applied to this problem in this area before, our application of this method shows that HMMs are capable of modeling complex processes that do not necessarily conform to the standard assumptions in the presence of large amounts of missing data. Running our method on western Andalucía phytoplankton monitoring data, we see that accounting for the last DST state requires the additional complexity of an autoregressive HMM.

One of the major advantages of our method is that we are able to reconstruct paths of the latent variable on a daily interval using historical time-series data in the presence of intermittent measurements and measurement error. Rather than forecast the future, our method focuses on predicting whether algae were absent or present in the water column for every day in our data set. This is useful when trying to identify long-term algae trends for the different areas across time as well as for health surveillance. The Viterbi algorithm is ideal for computing estimates of the entire sequence of latent states. These sequences can later be used in downstream analyses that examine the relationship between toxicity and disease risk. By aggregating these sequences, termed Viterbi paths, we are able to identify long-term trends across years.
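For readers unfamiliar with Viterbi decoding, the following Python sketch decodes the most likely latent path for a two-state chain in log space. It is a deliberately simplified illustration: the per-day likelihoods `lik[t][s]` are assumed to be precomputed, the autoregressive DST dependency and missing DST states are not handled, and all names and numeric values are hypothetical.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Log that maps zero probability to -inf instead of raising an error."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(lik, init, trans):
    """Most likely latent path.

    lik[t][s]  : likelihood of the day-t observations given S_t = s
                 (1.0 for every state on days without observations)
    init[s]    : P(S_0 = s)
    trans[r][s]: P(S_t = s | S_{t-1} = r)
    """
    n = len(init)
    score = [safe_log(init[s]) + safe_log(lik[0][s]) for s in range(n)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(lik)):
        new_score, ptr = [], []
        for s in range(n):
            best_r = max(range(n), key=lambda r: score[r] + safe_log(trans[r][s]))
            new_score.append(score[best_r] + safe_log(trans[best_r][s]) + safe_log(lik[t][s]))
            ptr.append(best_r)
        score = new_score
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Days 0, 2, 3 have no observations; on day 1 algae were detected, so the
# likelihood under S_t = 0 is zero and the decoded path must pass through state 1.
lik = [[1.0, 1.0], [0.0, 0.01], [1.0, 1.0], [1.0, 1.0]]
print(viterbi(lik, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]]))
```

The zero likelihood on the detection day plays the same role as the constraint in Fig. 5 that \(S_t\) must equal one whenever algae counts exceed the threshold.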

The proposed hidden Markov model makes a number of parametric assumptions including that the unobserved states follow a first-order Markov model and that the observed DST data follow a first-order autoregressive process after accounting for the HMM structure. We believe that these are reasonable assumptions since DST remains in the water for extended periods of time and algae blooms are rapid events. Latent state estimation should not be sensitive to small departures from these underlying assumptions. Therefore, we presume that the AR(1)-HMM framework adequately describes the biological process.

In the future, our method can be applied to other types of monitoring program data as well. Because most monitoring program data contains missing values, accounting for the temporal autocorrelation that is often present is not straightforward. Our method can adequately handle both complications simultaneously while also creating historical time-series reconstructions. Using our method, we are also able to relate two separate processes together while we impute the maximum-likelihood profile of the variable of interest.