1 Introduction

The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has already caused several hundred thousand deaths worldwide while bringing the global economy to a standstill. The long incubation period, the large portion of asymptomatic infections, the high contagiousness, testing accuracy problems in the early stages of the spread, and the difficulty of implementing uniform nationwide policies are just some of the challenges in containing the spread of the virus. In the meantime, the relatively long treatment duration and high death rate have overwhelmed public health systems. Capturing the underlying disease dynamics with the ability to make credible statistical predictions is ever so critical for health care providers and decision makers to be better prepared, through resource allocation, to ultimately contain the virus.

Modeling and forecasting of COVID-19 cases have recently received considerable attention. The fundamental epidemic model, proposed in 1927 [1], describes the dynamics of three population groups, namely, susceptible (S), infected (I), and recovered (R). The SIR family of models is among the most popular ones for learning the dynamics of COVID-19. Toda [2] used the SIR model to study the effects of the transmission rate on the growth of new cases; the economic impacts of the pandemic were also investigated. The SIRD model (“D” stands for “death”) is an extension of the SIR model and was used by [3] to demonstrate the differences in infection rates across countries. Sarkar et al. [4] separated the susceptible individuals into unaffected and quarantined ones, and the infected individuals into asymptomatic and symptomatic ones; this modified SIR model was used to study the consequences of public policies. He et al. [5] proposed a modified SEIR model (“E” stands for “exposed”) adapted to the particularities of COVID-19, which considers social and government interventions, quarantine, and treatment. To quantify the effectiveness of public health interventions (which change with time), Linda et al. [6] proposed a dynamic SEIR model with a time-varying reproduction number. Radulescu and Cavanagh suggested using separate SEIR models for different age compartments, since these have different infection, recovery, and fatality parameters; this model was applied to a small “college town” community to study the effects of several social interventions on the spread of the disease. Besides the SIR family of models, other types of models also address the modeling and forecasting of COVID-19, including phenomenological models [7], exponential smoothing models [8], autoregressive moving average and wavelet-based models [9], artificial-intelligence-based models [10], time series models [11], and many others [12, 13].

The focus of the present work is to uncover persistent dynamics within the daily new case data and to make reasonable and reliable predictions of future infections for the whole United States (US) as well as for individual states within the US. We explore the suitability of the switching Kalman filter (SKF) [14] algorithm as a viable tool for this purpose. The Kalman filter (KF) is a widely used method for tracking and navigation, and for filtering and prediction of econometric time series [15]. The KF is efficient and accurate only when the hidden state follows a linear Gaussian model, which is not the case for most practical applications. By using a mixture of several linear Gaussian models, the SKF, however, is able to accurately estimate a hidden state governed by nonlinear and non-Gaussian dynamics [16,17,18]. More specifically, a weighted combination of linear models is used to estimate the true state at each time step. An additional hidden switching variable is introduced to specify which model to choose at any specific time step. A switch between models usually indicates a change in the underlying dynamics of the hidden state and is hence useful for monitoring abnormal behavior in a system. The SKF has been used, for instance, for anomaly detection in dams [19], diagnostics and prognostics of vehicle health [20], and monitoring of bearing systems [21].

The COVID-19 data used in this paper were extracted from [22]. The daily new cases of the US are shown in Fig. 1, from which we can see that the time series can be split into several stages over time, namely, a low-level new infections stage, a rapid increasing stage starting from March 1, a slow decreasing stage from the beginning of April to June 15, and another rapid increasing stage after that. Given this clear separation of the data, we have reason to believe that different dynamics may be driving the evolution of each stage. In addition, as the spread of the virus evolves, communities change their personal attitudes towards handling the disease and decision makers make corresponding adjustments to their public policies. These factors can, in turn, drive changes in the dynamics of new infections. The SKF, with its ability to switch between dynamical systems, is well adapted to these challenges.

The paper is organized as follows. Section 2 introduces the mathematical foundations of the SKF, as well as the ideas behind learning its parameters from the observations. In Sect. 3, the models used in the SKF are introduced, including both the trend and seasonal models, and the methodology is summarized in two algorithms. Numerical analyses of the SKF on the actual daily case data of the US and of several individual states are carried out in Sect. 4. In Sect. 5, we present concluding remarks and discussions.

Fig. 1 Daily new cases of the US up to July 24

2 Switching Kalman filters

The mathematical foundation of SKF is presented in this section. The KF is first recalled, based on which the SKF method is introduced.

2.1 Kalman filter

The KF uses a linear state-space model to estimate the true (hidden) states, including past, present and future states, of a process from a set of observations, such that the mean squared error is minimized [23]. Denote by \({\varvec{x}}_t \in {\mathbb {R}}^n\) the true state and by \({\varvec{y}}_t \in {\mathbb {R}}^m\) the observation; the linear dynamic model that the KF addresses can be specified as follows,

$$\begin{aligned} \begin{aligned} {\varvec{x}}_{t} =&{\varvec{A}} {\varvec{x}}_{t-1} + {\varvec{w}}_t,\\ {\varvec{y}}_t =&{\varvec{Hx}}_t + {\varvec{v}}_t, \end{aligned} \end{aligned}$$
(1)

where \({\varvec{w}}_t \sim {\mathscr {N}}({\varvec{0}}, {\varvec{Q}})\) and \({\varvec{v}}_t \sim {\mathscr {N}}({\varvec{0}}, {\varvec{R}})\) are the state and observation noise vectors, with covariance matrices \({\varvec{Q}}\) and \({\varvec{R}}\), respectively; \({\varvec{A}}\) (\(n\times n\)) and \({\varvec{H}}\) (\(m\times n\)) are the state transition and observation matrices, respectively. Note that \({\varvec{A}}\), \({\varvec{H}}\), \({\varvec{Q}}\), and \({\varvec{R}}\) might change with time, but are assumed constant in this paper. In the KF, \(({\varvec{x}}_{t}|{\varvec{x}}_{t-1})\) and \(({\varvec{y}}_t|{\varvec{x}}_t)\) are both assumed to be Gaussian. The KF consists of two steps, namely, a “prediction” step and an “updating” step [24]. Let \(\hat{{\varvec{x}}}_{t|t-1}\) be the prior estimate at time t given the information available at time \(t-1\), \({\varvec{V}}_{t|t-1}\) the prior estimate of the state error covariance, \(\hat{{\varvec{x}}}_t\) the posterior estimate given the observation \({\varvec{y}}_t\), and \({\varvec{V}}_t\) the posterior estimate of the state error covariance. The KF process can then be expressed as,

KF prediction

$$\begin{aligned} \begin{aligned} \hat{{\varvec{x}}}_{t|t-1}&= {\varvec{A}} \hat{{\varvec{x}}}_{t-1}, \\ {\varvec{V}}_{t|t-1}&= {\varvec{AV}}_{t-1}{\varvec{A}}^T + {\varvec{Q}}. \end{aligned} \end{aligned}$$
(2)

KF updating

$$\begin{aligned} \begin{aligned} \hat{{\varvec{x}}}_t&= \hat{{\varvec{x}}}_{t|t-1} + {\varvec{K}}_t {\varvec{\nu }}_t, \\ {\varvec{V}}_t&= \left( {\varvec{I}} - {\varvec{K}}_t {\varvec{H}} \right) {\varvec{V}}_{t|t-1}, \end{aligned} \end{aligned}$$
(3)

where \({\varvec{K}}_t = {\varvec{V}}_{t|t-1} {\varvec{H}}^T \left( {\varvec{HV}}_{t|t-1}{\varvec{H}}^T + {\varvec{R}} \right) ^{-1}\) is referred to as the Kalman gain and \({\varvec{\nu }}_t = {\varvec{y}}_t -{\varvec{H}}\hat{{\varvec{x}}}_{t|t-1}\) is the so-called innovation process. The likelihood of observation \({\varvec{y}}_t\) given the observations of \({\varvec{y}}_{1:t-1}\) can also be obtained as a by-product of the process as

$$\begin{aligned} L_t = p({\varvec{y}}_t|{\varvec{y}}_{1:t-1}) = {\mathscr {N}} \left( {\varvec{\nu }}_t; {\varvec{0}}, {\varvec{HV}}_{t|t-1} {\varvec{H}}^{\varvec{T}} + {\varvec{R}} \right) . \end{aligned}$$
(4)

For further short-hand reference, the KF process containing the operations from Eqs. (2)–(4) is summarized as a subroutine of the form,

$$\begin{aligned} {[}\hat{{\varvec{x}}}_t, {\varvec{V}}_t, L_t] = \mathbf{Filter} \left( \hat{{\varvec{x}}}_{t-1}, {\varvec{V}}_{t-1}, {\varvec{y}}_t, {\varvec{A}}, {\varvec{Q}}, {\varvec{H}}, {\varvec{R}} \right) . \end{aligned}$$
(5)
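As an illustrative sketch (the function name kf_filter is our choice; no reference implementation is provided in this work), the Filter subroutine of Eqs. (2)–(5) may be written as:

```python
import numpy as np

def kf_filter(x_prev, V_prev, y, A, Q, H, R):
    """One KF step: prediction (Eq. 2), update (Eq. 3), likelihood (Eq. 4)."""
    # Prediction step
    x_pred = A @ x_prev
    V_pred = A @ V_prev @ A.T + Q
    # Innovation nu_t and its covariance S = H V H^T + R
    nu = y - H @ x_pred
    S = H @ V_pred @ H.T + R
    # Kalman gain and updating step
    K = V_pred @ H.T @ np.linalg.inv(S)
    x_post = x_pred + K @ nu
    V_post = (np.eye(len(x_prev)) - K @ H) @ V_pred
    # Likelihood of y_t given y_{1:t-1}: Gaussian density N(nu; 0, S)
    m = len(nu)
    L = np.exp(-0.5 * nu @ np.linalg.solve(S, nu)) / \
        np.sqrt((2 * np.pi) ** m * np.linalg.det(S))
    return x_post, V_post, L
```

For a scalar observation of an n-dimensional state, H is a \(1\times n\) matrix and R a \(1\times 1\) matrix; the returned value L is the density of Eq. (4) evaluated at the innovation.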

The Kalman filter has been demonstrated in a wide range of applications, from the navigation and tracking of vehicles and aircraft [24,25,26] to estimation and forecasting in economics [27]. From the above analysis one can see that the KF uses a linear Gaussian dynamic model to estimate the hidden states of the process. This is exact only when the underlying dynamical process is actually linear and the errors are Gaussian. The estimates given by the KF can therefore be unsatisfactory for processes governed by nonlinear dynamics or where the errors are shaped by non-Gaussian effects.

2.2 Collapse and switching Kalman filters

Instead of using a single linear Gaussian model, the SKF estimates the dynamical process as a mixture of N (with \(N>1\)) linear Gaussian models [14]. By construction, the SKF is better able to estimate the hidden states of processes with nonlinear and non-Gaussian underlying dynamics, which is usually the case in practical applications.

An additional Markov “switching” variable \(S_t\), with a model transition matrix \({\varvec{Z}}\) \((N \times N)\), is introduced in the SKF to determine the weights of the linear models used at time t. Suppose that \(S_t\) is known; the state at t is then estimated by a weighted combination of linear Gaussian models, where the weight of model i is given by \(\text {Pr} (S_t=i|{\varvec{y}}_{1:t})\), for \(i=1, \ldots , N\). Assume that the initial state \(p({\varvec{x}}_1)\) is a mixture of N Gaussians; each Gaussian can be propagated forward by N different models, so that the belief state \(p({\varvec{x}}_2)\) will be a mixture of \(N^2\) Gaussians, and the state at t, \(p({\varvec{x}}_t| {\varvec{y}}_{1:t})\), will be a mixture of \(N^t\) Gaussians. That is, the size of the belief state grows exponentially with time, which makes an SKF based on exact propagation of the state intractable. There are several approaches to deal with this exponential growth; in this paper we focus on the Generalized Pseudo Bayesian (GPB) algorithm [14, 28]. In this algorithm, the state at any time is approximated by a fixed number of N Gaussians, which requires approximating the mixture of \(N^t\) Gaussians at time t by a mixture of N Gaussians. This is achieved by a “collapsing” step, which collapses a mixture of N Gaussians into a single one by matching first and second moments. Suppose that a mixture of Gaussians has mean values \({\varvec{x}}^j\), covariances \({\varvec{V}}^j\), and weights \(W^j\), for \(j=1,2,\ldots \); the collapsed Gaussian is then obtained as

$$\begin{aligned} \begin{aligned} {\varvec{x}}&= \sum _{j} W^j {\varvec{x}}^j, \\ {\varvec{V}}&= \sum _{j} W^j \left[ {\varvec{V}}^j + ({\varvec{x}}^j-{\varvec{x}}) ({\varvec{x}}^j-{\varvec{x}})^T \right] . \end{aligned} \end{aligned}$$
(6)

For further reference, the collapse step that contains the operation of Eq. (6) can be written as a subroutine as

$$\begin{aligned} {[}{\varvec{x}}, {\varvec{V}}] = \mathbf{Collapse} \left( \{ {\varvec{x}}^j, {\varvec{V}}^j, {W}^j \}_j \right) . \end{aligned}$$
(7)
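The Collapse subroutine of Eqs. (6)–(7) amounts to moment matching; a minimal sketch (the function name is our choice):

```python
import numpy as np

def collapse(xs, Vs, ws):
    """Collapse a mixture of Gaussians (means xs, covariances Vs,
    weights ws) into a single Gaussian by moment matching (Eq. 6)."""
    ws = np.asarray(ws, dtype=float)
    ws = ws / ws.sum()  # normalize, in case the weights are unnormalized
    # First moment: weighted mean
    x = sum(w * xj for w, xj in zip(ws, xs))
    # Second moment: within-component plus between-component covariance
    V = sum(w * (Vj + np.outer(xj - x, xj - x))
            for w, xj, Vj in zip(ws, xs, Vs))
    return x, V
```

For example, collapsing two equally weighted unit-variance Gaussians centered at 0 and 2 yields a single Gaussian with mean 1 and variance 2, the between-component spread being absorbed into the covariance.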

Now, for the propagation from time \(t-1\) to t, we can split the SKF process into two steps. Suppose that the posterior distribution \(p({\varvec{x}}_{t-1}|{\varvec{y}}_{1:t-1})\) at time \(t-1\) is a mixture of N Gaussians, that is,

$$\begin{aligned} p({\varvec{x}}_{t-1}|{\varvec{y}}_{1:t-1}, S_{t-1}=i) = {\mathscr {N}}\left( {\varvec{x}}_{t-1}^i, {\varvec{V}}_{t-1}^i \right) , \end{aligned}$$
(8)

where the mean \({\varvec{x}}_{t-1}^i = {\mathbb {E}}[{\varvec{x}}_{t-1}|{\varvec{y}}_{1:t-1}, S_{t-1}=i]\) and covariance \({\varvec{V}}_{t-1}^i = \text {Cov}[{\varvec{x}}_{t-1}|{\varvec{y}}_{1:t-1}, S_{t-1}=i]\), for \(i=1,\ldots , N\). The weight of Gaussian model i is obtained by \(W_{t-1}^i = \text {Pr}(S_{t-1} = i| {\varvec{y}}_{1:t-1})\). Then the propagation from time \(t-1\) to time t and from Gaussian model i to j is a KF, hence

$$\begin{aligned} p({\varvec{x}}_t|{\varvec{y}}_{1:t}, S_t = j, S_{t-1} = i) = {\mathscr {N}} \left( {\varvec{x}}_t^{ij}, {\varvec{V}}_t^{ij} \right) , \end{aligned}$$
(9)

where the mean \({\varvec{x}}_t^{ij} = {\mathbb {E}}[{\varvec{x}}_{t}|{\varvec{y}}_{1:t}, S_{t}=j, S_{t-1}=i]\) and covariance \({\varvec{V}}_{t}^{ij} = \text {Cov}[{\varvec{x}}_{t}|{\varvec{y}}_{1:t}, S_t=j, S_{t-1}=i]\). The first step of the propagation is then to use the Filter subroutine [Eq. (5)] to obtain \({\varvec{x}}_t^{ij}\) and \({\varvec{V}}_{t}^{ij}\) as

$$\begin{aligned} \left[ {\varvec{x}}_t^{ij}, {\varvec{V}}_t^{ij}, L_t^{ij} \right] = \mathbf{Filter} \left( {\varvec{x}}_{t-1}^i, {\varvec{V}}_{t-1}^i, {\varvec{y}}_t, {\varvec{A}}_j, {\varvec{Q}}_j, {\varvec{H}}_j, {\varvec{R}}_j \right) ,\nonumber \\ \end{aligned}$$
(10)

where \(L_t^{ij} = P({\varvec{y}}_t| {\varvec{y}}_{1:t-1}, S_t = j, S_{t-1}=i)\) is the likelihood of observing \({\varvec{y}}_t\); \({\varvec{A}}_j\), \({\varvec{Q}}_j\), \({\varvec{H}}_j\), and \({\varvec{R}}_j\) are the state space matrices of Gaussian model j, for \(j=1,\ldots , N\). The following by-products are also computed

$$\begin{aligned} \begin{aligned} W^{ij}_t&= \text {Pr}(S_t=j, S_{t-1}=i | {\varvec{y}}_{1:t}) = \frac{L_t^{ij} Z_{ij} W_{t-1}^i}{ \sum _{i,j} L_t^{ij} Z_{ij} W_{t-1}^{i} }, \\ W^j_t&= \text {Pr}(S_t=j|{\varvec{y}}_{1:t}) = \sum _{i} W_t^{ij}, \\ M_t^{ij}&= \text {Pr}(S_{t-1}=i | S_t=j, {\varvec{y}}_{1:t}) =\frac{W_t^{ij}}{W_t^j}, \end{aligned} \end{aligned}$$
(11)

where \(Z_{ij} = \text {Pr}\left( S_t = j | S_{t-1}=i \right) \) is the component of the model transition matrix \({\varvec{Z}}\). Then the second step of the propagation is to collapse the mixture of \(N^2\) Gaussians into a mixture of N Gaussians by the Collapse subroutine (7) as

$$\begin{aligned} \left[ {\varvec{x}}_t^j, {\varvec{V}}_t^j \right] = \mathbf{Collapse} \left( \{ {\varvec{x}}_t^{ij}, {\varvec{V}}_t^{ij}, M_t^{ij} \}_i \right) . \end{aligned}$$
(12)

When a single estimate of the state is desired, the Collapse subroutine can be applied once more to the mixture of N Gaussians to obtain a single Gaussian distribution.
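Putting the pieces together, one GPB propagation step from \(t-1\) to t (Eqs. 10–12) runs an \(N \times N\) bank of Kalman filters and then collapses back to N Gaussians. A self-contained sketch, with the Filter step inlined and the models passed as (A, Q, H, R) tuples (an interface of our choosing), might read:

```python
import numpy as np

def skf_step(xs, Vs, W, y, models, Z):
    """One SKF/GPB propagation from t-1 to t (Eqs. 9-12).
    xs, Vs, W: the N posterior means, covariances, and weights at t-1.
    models: list of (A, Q, H, R) tuples; Z: model transition matrix."""
    N = len(models)
    n = len(xs[0])
    x_ij = np.empty((N, N, n)); V_ij = np.empty((N, N, n, n)); L = np.empty((N, N))
    for i in range(N):
        for j, (A, Q, H, R) in enumerate(models):
            # Filter subroutine, Eq. (10): predict, update, likelihood
            xp = A @ xs[i]; Vp = A @ Vs[i] @ A.T + Q
            nu = y - H @ xp; S = H @ Vp @ H.T + R
            K = Vp @ H.T @ np.linalg.inv(S)
            x_ij[i, j] = xp + K @ nu
            V_ij[i, j] = (np.eye(n) - K @ H) @ Vp
            m = len(nu)
            L[i, j] = np.exp(-0.5 * nu @ np.linalg.solve(S, nu)) \
                      / np.sqrt((2 * np.pi) ** m * np.linalg.det(S))
    # Switching weights, Eq. (11)
    W_ij = L * Z * W[:, None]
    W_ij /= W_ij.sum()
    W_j = W_ij.sum(axis=0)
    M = W_ij / W_j[None, :]
    # Collapse the N^2 Gaussians back to N, Eq. (12)
    xs_new = [sum(M[i, j] * x_ij[i, j] for i in range(N)) for j in range(N)]
    Vs_new = [sum(M[i, j] * (V_ij[i, j]
                             + np.outer(x_ij[i, j] - xs_new[j],
                                        x_ij[i, j] - xs_new[j]))
                  for i in range(N)) for j in range(N)]
    return xs_new, Vs_new, W_j
```

The returned weights \(W_t^j\) sum to one and can be monitored to detect switches in the dominant dynamics.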

The model parameters, for instance \({\varvec{A}}\), \({\varvec{H}}\), \({\varvec{Q}}\), and \({\varvec{R}}\), remain to be estimated from the available information. In practice, the hidden states \({\varvec{x}}_{1:t}\) are usually hard to obtain; hence, only the observation data \({\varvec{y}}_{1:t}\) are available for parameter estimation. The method of maximum likelihood estimation (MLE) [29] is utilized for this purpose, in which the log-likelihood of the observations is first obtained as

$$\begin{aligned} \ln p\left( {\varvec{y}}_{1:\tau } \right)&= \ln \prod _{t} p \left( {\varvec{y}}_t | {\varvec{y}}_{1:t-1} \right) \nonumber \\&= \sum _{t} \ln p\left( {\varvec{y}}_t | {\varvec{y}}_{1:t-1} \right) \nonumber \\&= \sum _{t} \ln \left[ \sum _{i, j = 1}^{N} p \left( {\varvec{y}}_t, S_t = j, S_{t-1}=i | {\varvec{y}}_{1:t-1} \right) \right] \nonumber \\&= \sum _{t} \ln \left[ \sum _{i, j = 1}^{N} L_t^{ij} \text {Pr} \left( S_t = j, S_{t-1}=i | {\varvec{y}}_{1:t-1} \right) \right] \nonumber \\&= \sum _{t} \ln \left[ \sum _{i, j = 1}^{N} L_t^{ij} Z_{ij} \text {Pr}\left( S_{t-1}=i | {\varvec{y}}_{1:t-1} \right) \right] \nonumber \\&= \sum _{t} \ln \left[ \sum _{i, j =1} ^{N} L_t^{ij} Z_{ij} W_{t-1}^i \right] . \end{aligned}$$
(13)

In Eq. (13), \(\tau \) is the end time of the data; \(L_t^{ij}\), \(Z_{ij}\), and \(W_{t-1}^i\) are the same as in Eqs. (10) and (11). The set of parameters, denoted \({\mathscr {P}}\), is embedded in the distribution of \({\varvec{y}}_{1:\tau }\) and can be estimated by maximizing the log-likelihood function as

$$\begin{aligned} {\mathscr {P}}^* = {{\,\mathrm{arg\,max}\,}}_{{\mathscr {P}}} \ln p\left( {\varvec{y}}_{1:\tau } | {\mathscr {P}}\right) . \end{aligned}$$
(14)

To avoid local maxima, the global optimization method, Basin-hopping [16], is used to solve Eq. (14).
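A sketch of this estimation loop is given below. Since a full SKF pass over the data is beyond a short example, a smooth surrogate with a known maximizer stands in for the log-likelihood of Eq. (13); the parameters are searched in log-space so that wide positive bounds map to a symmetric interval. All names here are ours, for illustration only.

```python
import numpy as np
from scipy.optimize import basinhopping

def make_objective(skf_loglik):
    """Wrap a log-likelihood (Eq. 13) as a negative log-likelihood over
    log-parameters, suitable for minimization by Basin-hopping."""
    def nll(log_params):
        return -skf_loglik(np.exp(log_params))
    return nll

# Placeholder standing in for a full SKF pass (Algorithm 1) over y_{1:tau};
# a smooth surrogate whose maximizer is log(p) = 1, i.e. p = e.
surrogate_loglik = lambda p: -np.sum((np.log(p) - 1.0) ** 2)

result = basinhopping(make_objective(surrogate_loglik), x0=np.zeros(5),
                      minimizer_kwargs={"method": "L-BFGS-B"},
                      niter=5, seed=0)
best_params = np.exp(result.x)  # maximizer of the surrogate
```

Basin-hopping repeatedly perturbs the current point and re-runs a local minimizer, which is what allows it to escape the local maxima mentioned above.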

2.3 Step-ahead predictor

The SKF follows a one-step prediction and updating algorithm; hence, the observation at time \(t+1\) is required for the hidden state estimation at \(t+1\). However, we are also interested in step-ahead prediction without knowing future observations; more specifically, predicting one step ahead (at time \(t+1\)) or several steps ahead (at time \(t+r\)) given the state at time t. We start with the one-step-ahead predictor, in which \(p ({\varvec{x}}_{t+1} | {\varvec{x}}_t)\) is first computed as

$$\begin{aligned} \begin{aligned} p({\varvec{x}}_{t+1} | {\varvec{x}}_t) =&\sum _{i, j = 1}^{N} p ( {\varvec{x}}_{t+1}, S_{t+1}=j, S_t=i | {\varvec{x}}_t ) \\ =&\sum _{i, j = 1}^{N} p ( {\varvec{x}}_{t+1} | {\varvec{x}}_t, S_{t+1}=j, S_t=i )\cdot \\&\text {Pr}( S_{t+1}=j, S_t=i | {\varvec{x}}_t ) \\ =&\sum _{i, j = 1}^{N} p ( {\varvec{x}}_{t+1} | {\varvec{x}}_t, S_{t+1}=j, S_t=i ) W_{t+1|t}^{ij}, \end{aligned} \end{aligned}$$
(15)

where \(W_{t+1|t}^{ij} = \text {Pr}( S_{t+1}=j, S_t=i | {\varvec{x}}_t ) = W_{t+1|t}^j M_{t+1|t}^{ij}\), with \(W_{t+1|t}^j = \text {Pr}( S_{t+1}=j | {\varvec{x}}_t )\) and \(M_{t+1|t}^{ij} = \text {Pr}( S_t=i | S_{t+1}=j, {\varvec{x}}_t )\). Note the differences in definition between \(W_{t+1|t}^{ij}\) and \(W_{t}^{ij}\), \(W_{t+1|t}^j\) and \(W_{t}^j\), and \(M_{t+1|t}^{ij}\) and \(M_{t}^{ij}\). By the assumptions of the KF, we have

$$\begin{aligned} P ( {\varvec{x}}_{t+1} | {\varvec{x}}_t, S_{t+1}=j, S_t=i ) = {\mathscr {N}} \left( {\varvec{x}}_{t+1|t}^{ij}, {\varvec{V}}_{t+1|t}^{ij} \right) , \end{aligned}$$
(16)

where the mean \({\varvec{x}}_{t+1|t}^{ij} = {\mathbb {E}} \left[ {\varvec{x}}_{t+1} | {\varvec{x}}_t, S_{t+1}=j, S_t=i\right] \) and the covariance \({\varvec{V}}_{t+1|t}^{ij} = \text {Cov} \left[ {\varvec{x}}_{t+1} | {\varvec{x}}_t, S_{t+1}=j, S_t=i \right] \). Then, similar to Eq. (11) we can compute

$$\begin{aligned} \begin{aligned} W^{ij}_{t+1|t}&= \frac{Z_{ij} W_{t|t-1}^i}{ \sum _{i,j} Z_{ij} W_{t|t-1}^{i} }, \\ W^j_{t+1|t}&= \sum _{i} W_{t+1|t}^{ij}, \\ M_{t+1|t}^{ij}&= \frac{W_{t+1|t}^{ij}}{W_{t+1|t}^j}. \end{aligned} \end{aligned}$$
(17)

Note that, unlike Eq. (11), Eq. (17) does not involve the likelihood of the observation, since the observation at time \(t+1\) is not available. Moreover, Eq. (17) is recursive; its initial condition, assuming the step-ahead predictor starts at time t, is \(W^i_{t|t-1} = W^i_{t}\), where \( W^i_{t}\) is obtained from Eq. (11). We can see that the one-step-ahead predictor is a mixture of \(N^2\) Gaussians; a Collapse step can follow to reduce the number of mixture components to N. The step-ahead propagation from time t to \(t+1\) can also be split into two steps, the first being the prediction part of the Filter subroutine, that is

$$\begin{aligned} \begin{aligned} {\varvec{x}}_{t+1|t}^{ij} =&{\varvec{A}}_j {\varvec{x}}_t^i, \\ {\varvec{V}}_{t+1|t}^{ij} =&{\varvec{A}}_j {\varvec{V}}_t^i {\varvec{A}}_j^T + {\varvec{Q}}_j. \\ \end{aligned} \end{aligned}$$
(18)

In the second step, a collapse step is applied

$$\begin{aligned}{}[{\varvec{x}}_{t+1}^j, {\varvec{V}}_{t+1}^j] = \mathbf{Collapse} \left( \{ {\varvec{x}}_{t+1|t}^{ij}, {\varvec{V}}_{t+1|t}^{ij}, M_{t+1|t}^{ij} \}_i \right) . \end{aligned}$$
(19)

From this we can see that the one-step-ahead predictor uses only the estimated state of the current step to predict the state of the next step. In a recursive manner, this predictor can easily be extended to a multi-step-ahead predictor by letting the predicted state of the current step serve as the available information for the next-step prediction.
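Under the same conventions as before (models passed as (A, Q) pairs; all names ours), the r-step-ahead predictor of Eqs. (17)–(19) can be sketched as:

```python
import numpy as np

def predict_ahead(xs, Vs, W, models, Z, r):
    """r-step-ahead SKF predictor (Eqs. 17-19): propagate each of the N
    Gaussians through each model without observation updates, collapse,
    and repeat. Returns the overall mean prediction at each step."""
    N = len(models)
    preds = []
    for _ in range(r):
        # Switching weights without the likelihood term, Eq. (17)
        W_ij = Z * W[:, None]
        W_ij /= W_ij.sum()
        W_j = W_ij.sum(axis=0)
        M = W_ij / W_j[None, :]
        # Predict-only step, Eq. (18), then collapse, Eq. (19)
        xs_new, Vs_new = [], []
        for j, (A, Q) in enumerate(models):
            x_ij = [A @ xs[i] for i in range(N)]
            V_ij = [A @ Vs[i] @ A.T + Q for i in range(N)]
            xj = sum(M[i, j] * x_ij[i] for i in range(N))
            Vj = sum(M[i, j] * (V_ij[i] + np.outer(x_ij[i] - xj, x_ij[i] - xj))
                     for i in range(N))
            xs_new.append(xj); Vs_new.append(Vj)
        xs, Vs, W = xs_new, Vs_new, W_j
        # Single overall estimate: collapse once more over j
        preds.append(sum(W[j] * xs[j] for j in range(N)))
    return preds
```

Each pass through the loop feeds the predicted mixture back in as the starting point for the next step, exactly the recursion described above.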

3 Models and methodology

3.1 Trend component

In the classical additive decomposition [30], the time series \({\varvec{x}}_t\) can be decomposed as

$$\begin{aligned} {\varvec{x}}_t = {\varvec{m}}_t + {\varvec{s}}_t + {\varvec{\epsilon }}_t, \end{aligned}$$
(20)

where \({\varvec{m}}_t\) is the slowly changing trend component, \({\varvec{s}}_t\) is the seasonal component with known period T, and \({\varvec{\epsilon }}_t\) is the random noise component. One can relax the decomposition and obtain the trend-plus-noise model as

$$\begin{aligned} {\varvec{x}}_t = {\varvec{m}}_t + {\varvec{\epsilon }}_t. \end{aligned}$$
(21)

In time series analysis, polynomial models have been widely applied to filtering and prediction since they can efficiently capture the trend component [31]. The 0th- and 1st-order polynomials have proved adequate for short-term predictions. The COVID-19 data are affected by many factors, including population density, mobility of the community, temperature, testing credibility, mask policies, lockdown policies, personal attitudes, personal health predisposition, etc. Given this high complexity, the 1st- and 2nd-order polynomials are used to model the trend components. The 2nd-order polynomials are useful for prediction problems with longer lead times [31], which is the case for the COVID-19 data.

The model constructed from the 1st-order polynomial in state space is also known as the constant velocity model. The state vector is \({\varvec{x}} = [x, {\dot{x}}]^T\), and the velocity is assumed constant over time, that is, \(\partial \dot{{\varvec{x}}}_t / \partial t = 0\). The state transition and state error covariance matrices of this model are [24]

$$\begin{aligned} {\varvec{A}}_{\text {vel}} = \begin{bmatrix} 1 &{} \varDelta t \\ 0 &{} 1 \\ \end{bmatrix}, \quad {\varvec{Q}}_{\text {vel}} = \sigma _q^2 \begin{bmatrix} \frac{1}{3}\varDelta t^3 &{} \frac{1}{2}\varDelta t^2 \\ \frac{1}{2}\varDelta t^2 &{} \varDelta t \\ \end{bmatrix}, \end{aligned}$$
(22)

where \(\varDelta t\) is the sample interval, and \(\sigma _q^2\) is a constant that defines the level of variance in the error.

The model constructed from the 2nd-order polynomial in state space is also known as the constant acceleration model. The state vector is \({\varvec{x}} = [x, {\dot{x}}, \ddot{x}]^T\), and the acceleration is assumed constant over time, that is, \(\partial \ddot{{\varvec{x}}}_t / \partial t = 0\). The state transition and state error covariance matrices of this model are [24]

$$\begin{aligned} {\varvec{A}}_{\text {acc}}= & {} \begin{bmatrix} 1 &{}\quad \varDelta t &{}\quad \varDelta t^2/2\\ 0 &{}\quad 1 &{}\quad \varDelta t\\ 0 &{}\quad 0 &{}\quad 1 \end{bmatrix},\nonumber \\ {\varvec{Q}}_{\text {acc}}= & {} \sigma _q^2 \begin{bmatrix} \varDelta t^5 / 20 &{}\quad \varDelta t^4 / 8 &{}\quad \varDelta t^3 /6\\ \varDelta t^4 / 8 &{}\quad \varDelta t^3 / 3 &{}\quad \varDelta t^2 /2\\ \varDelta t^3 / 6 &{}\quad \varDelta t^2 / 2 &{}\quad \varDelta t\\ \end{bmatrix}. \end{aligned}$$
(23)
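Both trend models follow the standard discretization of polynomial motion models [24] (with a symmetric process noise covariance); a sketch constructing the matrices of Eqs. (22) and (23), with function names of our choosing:

```python
import numpy as np

def constant_velocity(dt, sq2):
    """State transition A and process noise Q of the constant
    velocity model (Eq. 22); sq2 is the noise level sigma_q^2."""
    A = np.array([[1.0, dt],
                  [0.0, 1.0]])
    Q = sq2 * np.array([[dt**3 / 3, dt**2 / 2],
                        [dt**2 / 2, dt]])
    return A, Q

def constant_acceleration(dt, sq2):
    """State transition A and process noise Q of the constant
    acceleration model (Eq. 23)."""
    A = np.array([[1.0, dt, dt**2 / 2],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    Q = sq2 * np.array([[dt**5 / 20, dt**4 / 8, dt**3 / 6],
                        [dt**4 / 8,  dt**3 / 3, dt**2 / 2],
                        [dt**3 / 6,  dt**2 / 2, dt]])
    return A, Q
```

For daily data, \(\varDelta t = 1\), so the matrices are fixed once \(\sigma _q^2\) is learned.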

3.2 Seasonal component

By visual inspection, the daily new cases data of the US in Fig. 1 exhibit a periodic behavior with a period approximately equal to 7 days. In this paper, several seasonal components are used to capture this periodic behavior. Each seasonal component, \({\varvec{s}}_t^j\), is modeled recursively by a harmonic function of sines and cosines with a specified period \(T_j\) as [32]

$$\begin{aligned} \begin{pmatrix} {\varvec{s}}_t^j \\ {\varvec{s}}_t^{j*} \end{pmatrix} = \begin{bmatrix} \cos \omega _j &{} \sin \omega _j \\ -\sin \omega _j &{} \cos \omega _j \\ \end{bmatrix} \begin{pmatrix} {\varvec{s}}_{t-1}^{j} \\ {\varvec{s}}_{t-1}^{j*} \end{pmatrix} = {\varvec{A}}_{{\varvec{s}}^j} \begin{pmatrix} {\varvec{s}}_{t-1}^{j} \\ {\varvec{s}}_{t-1}^{j*} \end{pmatrix}, \end{aligned}$$
(24)

where \(\omega _j = 2\pi /T_j\) is the angular frequency. In Eq. (24), \({\varvec{s}}_t^j\) is the seasonal value at time t, and \({\varvec{s}}_t^{j*}\) is an auxiliary value by construction. The error covariance matrix associated with \({\varvec{s}}_t^j\) and \({\varvec{s}}_t^{j*}\) is given by

$$\begin{aligned} {\varvec{Q}}_{{\varvec{s}}^j} = \begin{bmatrix} \sigma _{{\varvec{s}}^j}^2 &{} 0 \\ 0 &{} \sigma _{{\varvec{s}}^j}^2 \end{bmatrix},\, \end{aligned}$$
(25)

where \(\sigma _{{\varvec{s}}^j}^2\) is a constant.

Incorporating the trend components with the seasonal components, the state space matrices of the constant velocity model become

$$\begin{aligned} \begin{aligned} {\varvec{A}} =&\text {blkdiag} \left( {\varvec{A}}_{\text {vel}}, \, {\varvec{A}}_{{\varvec{s}}^1},\, \ldots , \, {\varvec{A}}_{{\varvec{s}}^{n_s}} \right) , \\ {\varvec{Q}} =&\text {blkdiag} \left( {\varvec{Q}}_{\text {vel}}, \, {\varvec{Q}}_{{\varvec{s}}^1} ,\, \ldots , \, {\varvec{Q}}_{{\varvec{s}}^{n_s}} \right) , \\ {\varvec{H}} =&\begin{bmatrix} 1,&0,&1,&0,&\ldots ,&1,&0, \end{bmatrix}, \\ {\varvec{R}} =&\sigma _r^2, \end{aligned} \end{aligned}$$
(26)

where \(n_s\) is the number of seasonal components, and “blkdiag” denotes a block diagonalization operator. Similarly, the state space matrices of the constant acceleration model become

$$\begin{aligned} \begin{aligned} {\varvec{A}} =&\text {blkdiag} \left( {\varvec{A}}_{\text {acc}}, \, {\varvec{A}}_{{\varvec{s}}^1},\, \ldots , \, {\varvec{A}}_{{\varvec{s}}^{n_s}} \right) , \\ {\varvec{Q}} =&\text {blkdiag} \left( {\varvec{Q}}_{\text {acc}}, \, {\varvec{Q}}_{{\varvec{s}}^1} ,\, \ldots , \, {\varvec{Q}}_{{\varvec{s}}^{n_s}} \right) , \\ {\varvec{H}} =&\begin{bmatrix} 1,&0,&0,&1,&0,&\ldots ,&1,&0, \end{bmatrix}, \\ {\varvec{R}} =&\sigma _r^2. \end{aligned} \end{aligned}$$
(27)

For the COVID-19 data of the US, the cyclic behavior is clearly non-harmonic, with additional fluctuations within each cycle. To better capture these dynamics, two seasonal components (\(n_s = 2\)), with \(T_1 = 7\) and \(T_2 = 3.5\) days, are incorporated.
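The assembly in Eqs. (26)–(27) is a block-diagonal stacking of one trend block and \(n_s\) seasonal blocks; a sketch, with interface names of our own:

```python
import numpy as np
from scipy.linalg import block_diag

def seasonal_block(T, dt=1.0):
    """Rotation matrix of a harmonic seasonal component with period T (Eq. 24)."""
    w = 2 * np.pi / T * dt  # angular frequency omega_j
    return np.array([[np.cos(w), np.sin(w)],
                     [-np.sin(w), np.cos(w)]])

def build_model(trend_A, trend_Q, periods, seasonal_vars, obs_var, dt=1.0):
    """Assemble A, Q, H, R of a trend-plus-seasonal model (Eqs. 26-27)."""
    A = block_diag(trend_A, *[seasonal_block(T, dt) for T in periods])
    Q = block_diag(trend_Q, *[s2 * np.eye(2) for s2 in seasonal_vars])
    # H picks the level of the trend and the s_t^j of each seasonal pair
    H = np.zeros((1, A.shape[0]))
    H[0, 0] = 1.0
    idx = trend_A.shape[0]
    for _ in periods:
        H[0, idx] = 1.0
        idx += 2
    R = np.array([[obs_var]])
    return A, Q, H, R
```

With a two-dimensional (constant velocity) trend and periods of 7 and 3.5 days, this produces a \(6\times 6\) system with \({\varvec{H}} = [1, 0, 1, 0, 1, 0]\), matching Eq. (26); a three-dimensional (constant acceleration) trend yields the pattern of Eq. (27).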

3.3 Methodology

From the previous analysis, the set of unknown parameters that remains to be learned from the COVID-19 data is

$$\begin{aligned} {\mathscr {P}} = \left( \sigma _{q_{\text {acc}}}^2, \sigma _{q_\text {vel}}^2, \sigma _r^2, \sigma _{{\varvec{s}}^1}^2, \sigma _{{\varvec{s}}^2}^2 \right) . \end{aligned}$$
(28)

The lower and upper bounds for all parameters are chosen to be \(1.0\times 10^{-7}\) and \(1.0\times 10^7\), which are wide enough to include the optimal solution. The model transition matrix is assumed to be

$$\begin{aligned} {\varvec{Z}} = \begin{bmatrix} Z_{11} &{} Z_{12} \\ Z_{21} &{} Z_{22} \end{bmatrix} = \begin{bmatrix} 0.99 &{} 0.01 \\ 0.01 &{} 0.99 \end{bmatrix}. \end{aligned}$$
(29)

That is, the switching variable tends to remain in its current state. This choice of transition matrix is based on the observation that the daily new cases data of COVID-19 usually pass through several stages, namely, a steady growth stage, a super-linear growth stage, a flat curve stage, and a decreasing stage, and each stage persists for a period of time before the dynamics switch to the next trend.


The SKF method and its multi-step-ahead predictor are summarized in Algorithms 1 and 2, respectively. For learning the dynamics of the COVID-19 data, the Basin-hopping optimization is first used to find the optimal set of parameters \({\mathscr {P}}^*\) by maximizing the log-likelihood of the observations, \(\ln p({\varvec{y}}_{1:\tau })\), which is obtained from Algorithm 1. In other words, Algorithm 1 serves as the objective function of the optimization problem. The data up to time \(\tau \) constitute the training set and are defined by the user. After obtaining the optimal set of parameters \({\mathscr {P}}^*\), it can be fed into Algorithm 1 again to estimate the hidden states \({\varvec{x}}_{1:\tau }\), where \(\tau \) is the end time up to which the user intends to estimate. Note that the end time of the estimation can differ from the end time of the training set. These estimates can be compared with the data to verify the validity of the learned dynamics. Finally, for forecasting the COVID-19 data, where no observations are known beyond time \(\tau \), Algorithm 2 can be used to perform multi-step-ahead predictions starting from \(\tau \), with a specified number of steps r. The r-step-ahead predictions approximate the future trend of the data. The forecast can also start from any date within the available data, say \(\tau '\) with \(\tau '<\tau -r\); the forecasts are then obtained by assuming that the data after \(\tau '\) are not available. The r-step-ahead predictions in this case can be compared with the data to quantify the accuracy of the forecasts.

4 Dynamics learning and forecasting of the COVID-19 of the US

4.1 Dynamics learning and validation

In this section, the SKF is first used to identify the hidden dynamics behind the daily new cases of COVID-19 in the US, including the whole US and some individual states, namely, California (CA), New York (NY), Florida (FL), Texas (TX), North Carolina (NC), Georgia (GA), and Alabama (AL). The selected states are either previous epicenters or states with rapidly increasing trends based on the data up to July 24.

The incubation period of COVID-19 can be as long as 14 days [33], so infections appear in the data with a delay as asymptomatic cases become symptomatic. In addition, the wait time for test results is usually 3\(\sim \)5 days [34], and this time lengthened as the number of people being tested increased rapidly starting in July. Thus, the severity of the pandemic today is only reflected in the data after a week or so, which in turn delays social reactions. These delays can postpone the effects of policy interventions, including lockdowns, quarantine, mandated masks, mandated social distancing, etc. As a result, the time series of daily new cases has slowly changing dynamics, even though the dynamics evolve with time. In other words, the dynamics of the next several days might not vary greatly from the dynamics of the previous several days. This provides the foundation for step-ahead predictions. Moreover, one might not need to use all the available data to capture the embedded dynamics, as long as the additional data do not alter the current dynamics significantly.

To verify this, the Basin-hopping optimization and Algorithm 1 are used to learn the dynamics from two different training sets, namely, the set with data up to June 30 and the set with data up to July 20. The optimal parameters for the US and several individual states are shown in Tables 1 and 2; the first three parameters in these two tables define the trend component. Note that the behavior of the KF is strongly affected by the ratios of the error covariances [8, 35], \(\sigma _{q_{\text {acc}}}^2/\sigma _r^2\) and \(\sigma _{q_{\text {vel}}}^2/\sigma _r^2\). Comparing Tables 1 and 2 for the same locations, the error covariances of the observation, \(\sigma _r^2\), show only minor differences. Although the differences between \(\sigma _{q_{\text {acc}}}^2\) and \(\sigma _{q_{\text {vel}}}^2\) can be significant for some locations, for instance the US, CA and NC, the ratios \(\sigma _{q_{\text {acc}}}^2/\sigma _r^2\) and \(\sigma _{q_{\text {vel}}}^2/\sigma _r^2\) remain at low levels. This means the dynamics learned from the training data up to June 30 are similar to those learned from the training data up to July 20.

Table 1 Parameters learned from data up to July 20
Table 2 Parameters learned from data up to June 30
Fig. 2 The hidden state estimation of the US with different training sets

Fig. 3 The hidden state estimation of Florida with different training sets

Graphically, the hidden state estimations given by Algorithm 1 with parameters learned from the two different training sets are compared for two representative locations, the US and Florida, in Figs. 2 and 3. In these figures, the top sub-figures present the filtered (or estimated) trend components with 95% confidence intervals, the middle ones show the filtered trend plus seasonal components with 95% confidence intervals, and the bottom ones depict the probabilities of the constant acceleration and constant velocity models over time. The dashed vertical lines in the top and middle sub-figures indicate the cutoff dates of the training sets. In all the top sub-figures, the trend component captures the overall evolution of the daily new cases with the cyclic behaviors filtered out. However, the 95% confidence interval of the trend component is unable to cover the measured data. With the additional seasonal components, the estimations in the middle sub-figures represent the data very well, and some missed peaks are covered by the associated 95% confidence intervals. The model probability figures at the bottom indicate the switching of dominance between the constant acceleration model and the constant velocity model. Due to the nature of these two models, the number of cases grows linearly when the constant velocity model is dominant, quadratically when the constant acceleration model is dominant, and at an intermediate rate when the probabilities of the two models are close. The comparison between Fig. 2a and b shows that the estimation performance is very similar in all three sub-figures for the different training sets. However, there are several differences between Fig. 3a and b. First, the confidence intervals given by the training set up to July 20 are slightly larger than the other ones. This can be explained by the additional data in the larger training set having more fluctuations, so the SKF algorithm requires larger covariance matrices to accommodate the larger variance. Second, the model probability figures are slightly different, especially between April 5 and April 26, because the SKF algorithm is better tuned to the period with small variance for the smaller training set. Nevertheless, the overall behaviors of Fig. 3a and b are similar, especially for the dates after June 20, which are the most relevant for future decision-making.
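The model probabilities shown in the bottom sub-figures arise from re-weighting each candidate model by how well its one-step prediction explains the newest observation. The following is a minimal two-model sketch in the spirit of a switching Kalman filter; the transition matrix, innovations, and variances are illustrative assumptions, not the paper's learned parameters.

```python
import math

# Minimal two-model probability update, in the spirit of a switching
# Kalman filter: each model's probability is propagated through an assumed
# switching matrix and then re-weighted by the Gaussian likelihood of the
# newest observation under that model's one-step prediction.
# All numeric values below are illustrative assumptions.

def gaussian_likelihood(innovation, s):
    """Likelihood of a one-step prediction error with innovation variance s."""
    return math.exp(-0.5 * innovation ** 2 / s) / math.sqrt(2.0 * math.pi * s)

def update_model_probs(mu, trans, innovations, s_vars):
    """One step of the model-probability recursion.

    mu          -- current probabilities [mu_model0, mu_model1]
    trans       -- trans[i][j]: probability of switching from model i to j
    innovations -- one-step prediction errors of each model
    s_vars      -- innovation variances of each model
    """
    # Predicted (mixed) probabilities: sum_i trans[i][j] * mu[i]
    pred = [sum(trans[i][j] * mu[i] for i in range(2)) for j in range(2)]
    # Re-weight by each model's likelihood of the new observation
    post = [pred[j] * gaussian_likelihood(innovations[j], s_vars[j])
            for j in range(2)]
    total = sum(post)
    return [p / total for p in post]

# Example: model 0 explains the new observation better (smaller
# innovation), so its probability rises above model 1's.
mu = update_model_probs(mu=[0.5, 0.5],
                        trans=[[0.95, 0.05], [0.05, 0.95]],
                        innovations=[0.5, 3.0],
                        s_vars=[1.0, 1.0])
```

This mechanism also explains the observation made later in Sect. 4.2 that the model probability is a combined outcome of the learned parameters and the test data: the same transition matrix produces different probability paths when fed different innovation sequences.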

This part shows that the SKF is able to accurately capture the trend component, and represents the data well with the additional seasonal component. It also shows that the future dynamics of the daily new cases series can be inferred from the available data with good accuracy. Hereinafter, the training set will always be the data up to July 20.

4.2 Shared dynamics by different locations

Revisiting Table 1 with a focus on the first three parameters that define the trend component, we see that the parameters for CA, NY, FL, and TX (referred to as the first group) are similar, as are the parameters for NC, GA, and AL (referred to as the second group). The parameters of the two groups are clearly different if we look at \(\sigma _r^2\). The locations in the first group have large populations and population densities, while those in the second group have low populations and population densities. In other words, locations with the same level of population and population density seem to share the same dynamics. Hence, we can possibly use the parameters learned from one location to estimate the hidden states of another. To verify this, the estimations of the hidden states of CA and NY using the parameters learned from their own data are shown in Figs. 4 and 5, respectively, and the estimations of the hidden states of FL and NY using the parameters learned from CA data are shown in Figs. 6 and 7, respectively.

Fig. 4 The hidden state estimation of California with trained parameters learned from its own data

Fig. 5 The hidden state estimation of New York with trained parameters learned from its own data

Fig. 6 The hidden state estimation of Florida with trained parameters learned from California data

Fig. 7 The hidden state estimation of New York with trained parameters learned from California data

Comparing Fig. 4 with Fig. 5, the model probabilities of CA and NY are very different, and the evolutions of the daily new cases of these two states differ as well. However, the estimation of the hidden states of NY based on the parameters learned from CA in Fig. 7 closely resembles Fig. 5, and the 95% intervals of the filtered trend and seasonal components cover the test data well. This not only shows that the hidden states of NY can be estimated via CA data, but also indicates that the model probability is an outcome of the combined effects of the learned parameters and the test data (the model probabilities of Figs. 4 and 7 are very different even with exactly the same dynamics). For FL, the estimations based on the parameters learned from its own data (Fig. 3b) and those based on the parameters learned from CA data (Fig. 6) are also close.

The hidden state estimations of the three locations with low populations and population densities are shown in Figs. 8, 9 and 10. For NC in Fig. 8, the constant velocity model is dominant most of the time, though its advantage is not large, which is consistent with the steady increase of daily new cases. The evolution patterns of the model probability for GA and AL are similar: both saw the dominance of the constant velocity model before June 20 (or near this date), and a switch to the more aggressive constant acceleration model after that. Figure 11 presents the hidden state estimations for NC again, but with the parameters learned from GA data. Comparing Fig. 11 with Fig. 8, the trend and seasonal behaviors are well captured, the model probability has the same pattern of evolution, and the switches between models are also represented. Figure 11 is thus an example showing that the dynamics are similar within the second group.

Fig. 8 The hidden state estimation of North Carolina with trained parameters learned from its own data

Fig. 9 The hidden state estimation of Georgia with trained parameters learned from its own data

Fig. 10 The hidden state estimation of Alabama with trained parameters learned from its own data

Fig. 11 The hidden state estimation of North Carolina with trained parameters learned from Georgia data

In fact, the primary route of transmission of COVID-19 is close person-to-person contact [36], which makes population density a vital factor in the spread of the disease. Given the shared dynamics within the first and second groups, respectively, our results indicate that population density might be a driving force of the spread of COVID-19.

4.3 The role of model probability

From the previous analysis, we already know that the model probability indicates the dominance of either the constant acceleration model or the constant velocity model, which are associated with quadratic growth and linear growth of the data, respectively.
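The link between the two models and linear versus quadratic growth follows directly from their state transition matrices. The sketch below propagates each model deterministically with a unit time step (process noise omitted for clarity); the matrices are the standard constant velocity and constant acceleration kinematic forms, used here for illustration.

```python
# Why the constant velocity (CV) model implies linear growth and the
# constant acceleration (CA) model implies quadratic growth of the level.
# Deterministic propagation with unit time step; noise omitted for clarity.

def propagate(A, x, steps):
    """Apply x <- A x repeatedly; return the first state component (level)."""
    levels = []
    for _ in range(steps):
        x = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
        levels.append(x[0])
    return levels

A_cv = [[1.0, 1.0],
        [0.0, 1.0]]                  # state: [level, velocity]
A_ca = [[1.0, 1.0, 0.5],
        [0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0]]             # state: [level, velocity, acceleration]

cv_levels = propagate(A_cv, [0.0, 1.0], 10)       # unit velocity
ca_levels = propagate(A_ca, [0.0, 0.0, 1.0], 10)  # unit acceleration
# After k steps: CV level = k (linear); CA level = k^2 / 2 (quadratic).
```

When the SKF's model probability shifts from the CV model to the CA model, the implied growth of the level therefore shifts from linear toward quadratic, which is exactly the warning sign discussed next.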

Another important feature is that the switch from linear growth to quadratic growth usually indicates a change of growth rate. When the data is increasing both before and after the switch, it is a warning sign that new cases could increase rapidly. Take the US as an example (see Fig. 2b): the switch at around March 20 from linear growth to quadratic growth warned of a rapid increase of new cases, and the later evolution confirmed it. The switch at around June 20 also gave a clear sign of an increased growth rate, and another rapid increase of new cases followed. From the estimations of FL in Fig. 3b, CA in Fig. 4, GA in Fig. 9, AL in Fig. 10, and some other locations not shown here, for instance TX and AZ, we can see switches from linear growth to quadratic growth at around June 20 for all these locations. This non-accidental similarity gave a strong sign of rapidly increasing infections, which could be useful for decision-making.

Conversely, the switch from quadratic growth to linear growth usually signals either a less aggressive increase or a stable decreasing stage. The US data in Fig. 2b experienced this type of switch at around the end of April, and the data entered a steady decreasing stage after that. In addition, there is a sign of such a switch at July 24, which means the data will hopefully move from a rapid increase to a less aggressive stage. For CA in Fig. 4, the switch at around July 5 gave a short break from a rapid increase to less aggressive growth, and the series remained in the linear growth stage, which indicates that CA could experience steady growth at a low rate. The same analysis can be carried out for other locations.

From the above analysis we see that the model probability plays an important role in exposing the hidden dynamics and making general, non-quantitative predictions. The switch from one model to another provides useful information about the overall trend.

4.4 Forecasting

Algorithm 2 described in Sect. 3.3 is used for forecasting the COVID-19 data. Figures 12 and 13 present the 20-day forecasts of the US and CA at different times, respectively. The top sub-figures, the second sub-figures from the top, the third sub-figures from the top, and the bottom sub-figures show the forecasts starting from June 28, July 8, July 17, and July 24, respectively. The model probabilities along the forecasts are also presented. The top two sub-figures, where the data of the predicted dates are available, show that the forecasts capture the future trend well. The predictions with seasonal components are able to recover the cyclic behaviors in the data to some extent. Moreover, the 95% confidence interval of the forecast is able to include the measurement data, but the width of the interval grows very quickly with the number of prediction steps, which means the forecasts by the SKF are more reliable in the short to middle term.
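The widening of the interval has a simple mechanical explanation: in multi-step-ahead prediction there are no measurement updates, so the state covariance only accumulates process noise at each step. A scalar random-walk sketch (with illustrative values for the initial covariance and process noise, not the paper's learned parameters) makes this explicit.

```python
# Why the forecast interval widens with the prediction horizon: k-step-ahead
# prediction applies only the predict step (no measurement updates), so the
# covariance grows by Q each step. Scalar random-walk case; p0 and q are
# illustrative values, not the paper's learned parameters.

def forecast_interval_widths(p0, q, horizon):
    """Half-widths of the 95% interval for k-step-ahead random-walk forecasts."""
    widths = []
    p = p0
    for _ in range(horizon):
        p = p + q                        # predict only: P- = P + Q
        widths.append(1.96 * p ** 0.5)   # 95% half-width grows like sqrt(k)
    return widths

widths = forecast_interval_widths(p0=1.0, q=0.5, horizon=20)
# The half-width increases monotonically with the horizon, so long-range
# forecasts carry much wider intervals than short-range ones.
```

For the full SKF state-space model the covariance recursion is matrix-valued, \({\varvec{P}}^- = {\varvec{A}}{\varvec{P}}{\varvec{A}}^\top + {\varvec{Q}}\), but the qualitative effect is the same: uncertainty compounds with every predicted step.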

For the US in Fig. 12, the trend showed a relatively rapid increase within the 20 days of forecasting starting from June 28. The growth rate decreased for the forecasts starting from July 8, and the forecast even switched to decreasing for the forecasts starting from July 17. The model probability of constant acceleration in this case dropped from over 0.8 to approximately 0.6, which resembles a switch from the quadratic growth stage to the linear growth stage; that is, the rapid growth was tempered. However, this trend was not maintained, since the constant acceleration model regained its dominance afterward; see the bottom sub-figure, which shows the forecasts starting from July 24. It suggests that the new infections stopped decreasing on July 24 and will enter a flat-curve stage.

For CA in Fig. 13, the forecast starting from June 28 indicated a rapid increase within 20 days, though the model probability suggested a potential switch from quadratic growth to linear growth. However, compared with the trend several steps earlier, where the daily new cases had just experienced nearly exponential growth, the increase rate of the forecasts is still tempered, which is consistent with the change in model probability. The growth rate was much reduced for the forecast starting from July 8. The model probability at around July 8 suggested a switch from the quadratic growth stage to the linear growth stage, and it stayed at the latter. This indicates that the daily cases of California are more likely to remain in a linear growth stage after July 8, with a growth rate approximately equal to the slope of the trend component. This is verified by the forecasts starting from July 17 and July 24, both of which gave linear growth forecasts.

For both the US and CA, we can see that the forecast changes as time progresses. The limitation of the forecasting lies in the observation that the prediction can give a deviated trend when there is a switch from one model to another. For example, in the third sub-figure of Fig. 12, the predicted trend indicated a decrease in new cases, and the dominance of the constant acceleration model dropped to the same level as the constant velocity model. However, the constant acceleration model regained its dominance afterwards, and the predicted trend in the last sub-figure of Fig. 12 is flattened. In addition, though the forecasts under stable dynamics (no switch between models) can predict the future trend well, the prediction with seasonal components cannot always capture the cyclic behaviors well. For instance, in the first sub-figure of Fig. 13, the trend of the data is well predicted but the cyclic behavior is not represented well in the predictions of the first several days.

The more quantitative forecasting of this section and the non-quantitative overall prediction by the model probability can be combined for mutual verification, providing more reliable forecasts.

Fig. 12 Daily new cases forecasting of the US by SKF at different times

Fig. 13 Daily new cases forecasting of California by SKF at different times

5 Concluding remarks

In this paper, the SKF with a seasonal component is introduced and applied to learning the dynamics and forecasting the daily new cases of COVID-19 in the US. The optimal parameters of the SKF learned from the data are able to capture both the trend and seasonal components, in the sense that the 95% confidence interval includes the data with narrow width. The resemblance of dynamics in neighboring periods of time is also embedded in the SKF parameters; hence the dynamics learned from previous data is sufficient to estimate the hidden states of future time steps. It is also discovered that locations with the same level of population and population density have similar dynamics, so the dynamics of one location can accurately estimate the hidden states of another. The model probabilities give implications of how the new cases could evolve as well as how the growth rate could change. The switching between models indicates a change of dynamics and, in turn, provides useful information for inference and prediction of the overall trend. The multi-step-ahead predictor of the SKF provides quantitative forecasts of new cases for both trend and seasonal components. The forecasts update as time progresses and have narrow 95% confidence intervals for short- to middle-term predictions. The quantitative forecasting can be combined with the overall prediction given by the model probabilities to offer more insight into the future trend.

We remark that the effects of social interventions on the pandemic are embedded in the dynamics of the daily new cases data. Changes of public policies can be reflected in the switches of the dynamics given by the SKF. The consequences of new major policies can, hence, be observed and predicted by the SKF. The state space matrices \({\varvec{A}}\), \({\varvec{H}}\), \({\varvec{Q}}\) and \({\varvec{R}}\) are assumed to be constant in the present paper; however, the dynamics do change over time. For instance, the variations of the daily new cases data after June 20, especially for states such as CA, FL, GA and others not presented here, were obviously larger than before. Using an online algorithm to learn the time-varying parameters could improve the SKF method. Moreover, one could also combine the SKF method with the SIR family of models. Different from the SKF method, where the epidemic features are assumed to be implicitly embedded in the learned dynamics, the SIR family of methods describes the dynamics of COVID-19 by explicit epidemic parameters, for instance the infection rate, recovery rate and others. However, different social scenarios are supposed to have different parameters; for instance, the infection rates with and without lockdown policies are different, and the infection and recovery rates for young and elderly people are different. Different models can be generated with different parameters, and these models can be incorporated into the SKF scheme. The model probabilities of these models can provide rich information on the effectiveness of social interventions, as discussed above. Linearization techniques for the SIR family models would be required for this incorporation.