Background

The Coronavirus Disease 2019 (COVID-19) pandemic has disrupted every aspect of human society. Because of the highly infectious nature of the disease, state governments in the United States (US) have implemented social distancing measures (e.g., closure of non-essential businesses, regional lock-down, and face-covering mandates) to contain the virus spread and flatten the epidemic curve (epi curve) [1]. However, since these state-level measures have differed in the strength and timeline of policy enforcement, it is impractical to rely on a simple rubric to evaluate policy effectiveness. An alternative is to analyze the time series of COVID-19 cases, which can eventually assist stakeholders with proactive health policymaking, such as determining the optimal timing to relax social distancing.

One critical variable in time series analysis is the change point, also called the inflection point, which is the point where a sudden change occurs in chronologically ordered observations. Change point detection has long been employed in statistical theory [2], but its applications to COVID-19 are relatively underexplored. For example, when modeling COVID-19 cases, the majority of studies have defined change points as key dates of policy interventions or social events [1, 3]. Other studies have employed parametric models, such as the linear regression model [4, 5] and the Bayesian model [6, 7], to derive change points. However, most of these parametric models require that the observations are normally distributed and that the trend line does not have extreme variability. In situations where the observations show large variability over time and the trend line cannot be well fitted, parametric models become less reliable. These situations are not uncommon in fitting the COVID-19 epi curve, as the disease progression has a considerable degree of uncertainty and variability [1].

To overcome the limitations of the parametric model, we have applied a nonparametric model, called the Mann-Kendall-Sneyers (MKS) test, to change point detection in the COVID-19 epi curve. The MKS test, developed from a prototype model by Mann [8], is used to detect monotonic trends (e.g., upward, downward) and their corresponding change points in time series data. The model has been primarily employed in earth science research to characterize the fluctuation of climatic and environmental variables, such as rainfall, air temperature, and surface runoff [9, 10, 11]. Recently, some COVID-19 studies have used the Mann-Kendall (MK) test, which is an earlier version of the MKS test, for trend detection [12, 13]. While the MK test is useful in detecting monotonic trends, it cannot detect changes in the trends and the corresponding change points, making it less useful for disease tracking and monitoring in the mid to long term. The MKS test, as a sequential extension of the MK test [14], fills this gap. It can become a valuable tool for long-term disease monitoring and can thus support public health decision-making.

The contributions of the paper are as follows.

  • The paper is the first to apply the MKS test to COVID-19 time series analysis.

  • The paper identifies six change point patterns for state COVID-19 cases.

  • The paper develops an open-access tool for model implementation.

Methods

The nonparametric MKS test [15], often called the sequential Mann-Kendall-Sneyers test, has been applied to change point detection in long-term time series data (e.g., hydrological changes, climatic changes). According to the Centers for Disease Control and Prevention (CDC) report, both social distancing and mass gathering can potentially lead to an abrupt change in regional COVID-19 cases, albeit in different directions [16]. We have therefore evaluated the potential of the MKS test for change point detection in short-term time series data, namely the COVID-19 cases of infection.

In this section, we first articulate the MKS test. Then, we use an example to demonstrate the model implementation.

Method description

The MKS test applied to the COVID-19 time series data can be completed in three major steps.

Step 1: Deriving test statistics (Sk)

We have treated each week's new cases as an independent observation in a 45-week time series. Under the null hypothesis that the development of new cases remains stable, for each state, we have a time series of the weekly new cases: X = {x1, x2, x3, …, xN}, where N is the total number of weeks under observation (N = 45 in our case study). mi (i = 1, 2, …, N) represents the total number of elements xj preceding xi (j < i) where xj < xi.

Based on mi, the test statistic Sk derives the cumulative mi for each week, as shown in Eq. (1).

$${S}_k=\sum_{i=1}^k{m}_i\ \left(k=1,2,3,\dots, N\right)$$
(1)

The mean of Sk can be derived by Eq. (2).

$$E\left({S}_k\right)=k\left(k-1\right)/4$$
(2)

The variance of Sk can be derived by Eq. (3).

$$VAR\left({S}_k\right)=k\left(k-1\right)\left(2k+5\right)/72$$
(3)
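As a minimal illustration of Step 1, the Python sketch below computes mi, Sk, E(Sk), and VAR(Sk) for a short series. It is not the Excel implementation described in this paper; the function name `mk_rank_statistics` and the sample values are hypothetical, and the variance uses the standard form k(k − 1)(2k + 5)/72.

```python
def mk_rank_statistics(x):
    """Return (m, S, E, VAR) for a series x, following Eqs. (1) to (3)."""
    n = len(x)
    # m[i]: number of earlier observations strictly smaller than x[i]
    m = [sum(1 for j in range(i) if x[j] < x[i]) for i in range(n)]
    # S[k-1]: cumulative sum of m up to week k (Eq. 1)
    S, running = [], 0
    for mi in m:
        running += mi
        S.append(running)
    # Mean and variance depend only on k = 1..N (Eqs. 2 and 3)
    E = [k * (k - 1) / 4 for k in range(1, n + 1)]
    VAR = [k * (k - 1) * (2 * k + 5) / 72 for k in range(1, n + 1)]
    return m, S, E, VAR


# Made-up weekly new cases, for illustration only
weekly_cases = [120, 340, 560, 480, 610]
m, S, E, VAR = mk_rank_statistics(weekly_cases)
```

Because E(Sk) and VAR(Sk) depend only on k, they could be computed once and reused for every state.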

Step 2: Deriving two sequences (Uf and Ub)

Next, we derive two sequences, the forward sequence Uf and the backward sequence Ub, based on the three variables (Sk, E(Sk), and VAR(Sk)) in Eqs. (1) through (3). Specifically, the forward sequence Uf of the time series is derived by Eq. (4).

$${U}_f=\left({S}_k-E\left({S}_k\right)\right)/\sqrt{VAR\left({S}_k\right)}$$
(4)

Then, we reverse the sequence of the original time series X and term it Xr. An intermediate sequence Ufr is derived by applying Eq. (4) to the reversed time series Xr. We reverse the sequence of the values in Ufr (i.e., the first value appears the last, and vice versa). We generate the backward sequence Ub by adding a negative sign to the reversed values.
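The two sequences can be sketched in Python as follows. This is an illustrative sketch rather than the Excel implementation used in this study; the function names are hypothetical, the variance uses the standard form k(k − 1)(2k + 5)/72, and setting the statistic to 0 at k = 1 (where VAR(Sk) = 0) is an assumed convention.

```python
import math


def forward_sequence(x):
    """Standardized statistic of Eq. (4) for k = 1..N."""
    u, s = [], 0
    for k in range(1, len(x) + 1):
        i = k - 1
        # m_i: earlier observations strictly smaller than x[i]
        s += sum(1 for j in range(i) if x[j] < x[i])
        mean = k * (k - 1) / 4
        var = k * (k - 1) * (2 * k + 5) / 72
        # Assumption: define the statistic as 0 where the variance is 0 (k = 1)
        u.append((s - mean) / math.sqrt(var) if var > 0 else 0.0)
    return u


def backward_sequence(x):
    """Apply Eq. (4) to the reversed series, reverse the result, flip the sign."""
    u_fr = forward_sequence(x[::-1])
    return [-v for v in reversed(u_fr)]


# Made-up weekly new cases, for illustration only
series = [120, 340, 560, 480, 610]
uf = forward_sequence(series)
ub = backward_sequence(series)
```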

Step 3: Deriving change points

Lastly, we identify the change points of the time series X based on the two generated sequences (Uf and Ub). We first identify the initial set of change points as the points of intersection between the two sequences. Previous studies show that not all of these points can be confidently recognized as abrupt changes, as a change point can be induced by a sudden shift of the mean value over two stable periods [17]. These outlier points could be reevaluated by using additional detection methods, such as the double mass curve [18]. To avoid miscounting the change points while making the proposed method more applicable, we employ a statistical filter: the points of intersection falling beyond the 95% confidence intervals (CIs), which correspond to Z-scores = ±1.96, are rejected. This filter has been used in relevant MKS studies [19]. It is worth noting that the MKS test can also identify the monotonic trend or the change of direction: if a point of intersection is between the Z-scores of 0 and 1.96, the change is upward; if the point is between the Z-scores of −1.96 and 0, the change is downward.
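Step 3 can be sketched in Python as below. The text does not specify how a point of intersection between two discrete sequences is located, so the linear interpolation between adjacent weeks is an assumption for illustration, as are the function name `find_change_points` and the toy input sequences.

```python
def find_change_points(uf, ub, z_crit=1.96):
    """Find crossings of Uf and Ub, then apply the Z-score filter."""
    diff = [f - b for f, b in zip(uf, ub)]
    crossings = []
    for i in range(len(diff) - 1):
        d0, d1 = diff[i], diff[i + 1]
        if d0 == 0.0:
            frac = 0.0                    # curves meet exactly at week i + 1
        elif d0 * d1 < 0:
            frac = d0 / (d0 - d1)         # sign change between two weeks
        else:
            continue
        z = uf[i] + frac * (uf[i + 1] - uf[i])
        crossings.append((i + 1 + frac, z))   # 1-based, possibly fractional week
    if diff and diff[-1] == 0.0:
        crossings.append((float(len(diff)), uf[-1]))

    results = []
    for week, z in crossings:
        results.append({
            "week": week,
            "z": z,
            "retained": abs(z) <= z_crit,     # reject crossings beyond the 95% CIs
            "direction": "upward" if z > 0 else "downward" if z < 0 else "none",
        })
    return results


# Toy sequences (not real data): one crossing inside the thresholds
points = find_change_points([0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0])
```

Passing a different `z_crit` widens or narrows the statistical filter, which changes how conservative the detection is.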

Model implementation

In this section, we take the state of Virginia as an example to further elaborate on the model implementation. The MKS test can be implemented in Microsoft Excel by calling embedded functions. The datasets and code are available on GitHub (https://github.com/peterbest52/mks).

Data cleaning

Daily confirmed cumulative COVID-19 case data between March 23, 2020 and January 31, 2021 (45 weeks in total) were obtained from the USAFacts website (https://usafacts.org/data/). Then, we aggregated the data on a weekly basis, generating a 45-week time series of new weekly cases for each state. Lastly, to demonstrate the method, we extracted the data for Virginia as the time series X.
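The weekly aggregation can be sketched in Python as follows. This is an illustrative sketch, not the authors' data-cleaning code: the function name, the `baseline` parameter (the cumulative count on the day before the series starts), and the choice to drop an incomplete trailing week are assumptions.

```python
def weekly_new_cases(cumulative_daily, baseline=0):
    """Convert daily cumulative counts into weekly new-case totals."""
    weekly = []
    previous = baseline
    # Step through the series seven days at a time; a trailing partial week is ignored
    for end in range(7, len(cumulative_daily) + 1, 7):
        week_end_total = cumulative_daily[end - 1]
        weekly.append(week_end_total - previous)   # new cases during this week
        previous = week_end_total
    return weekly


# Made-up cumulative counts for 15 days: two full weeks plus one extra day
example = weekly_new_cases(list(range(1, 16)))
```

With 315 daily values, the function returns the 45 weekly observations used as the time series X.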

MKS test

For time series X, we derived mi, the number of preceding weeks whose case value is smaller than that of the current week. Following this step, Sk was derived as the cumulative mi (i = 1, 2, …, k), according to Eq. (1); then, the mean value of Sk or E(Sk) and the variance of Sk or VAR(Sk) were derived by Eqs. (2) and (3), respectively. It is worth noting that, since k is the only independent variable in Eqs. (2) and (3), E(Sk) and VAR(Sk) are the same for all states in this study. Based on Eq. (4), we derived the forward sequence Uf for Virginia (solid line in Fig. 1).

Fig. 1

MKS test of new weekly cases in Virginia with the forward sequence (solid line) and the backward sequence (dashed line). The black dot is the identified change point, and the white dot is the excluded change point

Then, we reversed the time series X and derived Xr. We derived the intermediate sequence Ufr by applying Eq. (4) to Xr. Lastly, we derived the backward sequence Ub (dashed line in Fig. 1) by first reversing the sequence of values in Ufr and then adding a negative sign to these values.

Change point detection

The forward sequence (Uf) and the backward sequence (Ub) were plotted as the solid line and dashed line, respectively (Fig. 1). The points of intersection between the two sequences became the initial set of the change points. The thresholds of 95% CIs (Z-scores = ± 1.96) were set as the statistical filter. Only change points within the thresholds were retained. Specifically, in the case of Virginia, three points of intersection were initially detected. Week 4 (Point A in Fig. 1) and Week 43 (Point C in Fig. 1) were identified as the final change points with statistical confidence. Week 8 (Point B in Fig. 1) was excluded (Z-score = 2.72), as it fell beyond the thresholds. Since both Point A and Point C were between Z-scores of 0 and 1.96, these changes were upward.

Results

By applying the MKS test to weekly new COVID-19 cases in 50 states, we identified that 30 states (60.0%) have at least one change point within the 95% CIs. Most of the remaining 20 states have no change points within the 95% CIs but have at least one change point beyond the 95% CIs. Only the state of Vermont has no change points either within the 95% CIs or beyond, meaning that there is no abrupt case decrease or increase during the entire study period.

To characterize the temporal distribution of these change points, we further divided the study period into three disease development stages, namely, Weeks 1–10 (March 23 through May 31, 2020), Weeks 11–30 (June 1 through October 18, 2020), and Weeks 31–45 (October 19, 2020 through January 31, 2021). These three stages were determined by the three clusters of chronologically ordered change points, as shown in Fig. 2. Based on the three development stages, we then mapped out the emergence of the change point for each state, as shown in Fig. 3.

Fig. 2

The three development stages based on clusters of chronologically ordered change points

Fig. 3

The emergence of the change point for each state (a) at the first stage (Weeks 1–10), (b) at the second stage (Weeks 11–30), and (c) at the third stage (Weeks 31–45). The map is created by the authors

Figure 4 shows the change points detected by the MKS test for the 30 states with at least one change point within the 95% CIs. Among these states, we identified that a single change point exists for 25 states, two change points exist for 4 states (i.e., LA, OH, VA, and WA), and three change points exist for one state (i.e., GA). Then, we further derived 6 change patterns based on the emergence and direction of the change point at the three stages, as shown in Table 1.

Fig. 4

States with at least one change point identified. The horizontal axis is the week; the vertical axis is the weekly new cases normalized to 0–100% with respect to the maximum weekly new cases in each state

Table 1 Summary of change patterns based on the emergence and direction of change points at three stages

Discussion

Two epidemiologic patterns can be identified in Table 1. First, the downward changes at the first stage (Pattern 4) appear only in Northeastern states (e.g., CT, MA, NJ, NY), as confirmed in Fig. 3a. This pattern can be explained by the immediate state policy actions on social distancing in this region during the early outbreak. After COVID-19 was declared a national emergency by the presidential proclamation on March 13, 2020 [20], most Northeastern states enforced social distancing regulations in late March and early April, including the closure of non-essential businesses and schools [21]. These policies largely restricted face-to-face interactions, slowed the virus diffusion, and eventually suppressed the epi curves. Second, the upward changes at the third stage appear mostly in the Western states (e.g., AZ, CA, CO, NM, WA, WY) and the Midwestern states (e.g., IL, IN, MI, MN, OH, WI), as shown in Fig. 3c. This result is consistent with the observation that most Western and Midwestern states experienced an abrupt case surge in the late summer and fall [22]. The rising trend could be linked to their less restrictive reopening policies, especially reopening indoor dining without a statewide face-covering mandate [23].

To further validate the MKS test, we compared it with two other change point detection methods, the pruned exact linear time (PELT) method and the regression-based method (Table 2), both of which are commonly used for detecting multiple change points in time series data. Specifically, the PELT method searches for change points by minimizing a cost function over possible numbers and locations of change points, and it implements an efficient pruning to increase the computational efficiency [24, 25]. The regression-based method analyzes the time series using a regression model with multiple segments, where the coefficients shift from one stable regression relationship to another. It implements a dynamic programming approach to find segments that can minimize the residual sum of squares [26, 27]. We implemented the PELT method using the ‘changepoint’ package in R [25] and the regression-based method using the ‘strucchange’ package in R [28].
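To make the PELT idea concrete, the Python sketch below implements the penalized recursion with pruning for a squared-error (mean-shift) cost. It is not the 'changepoint' R package used in this study; the cost function, the hand-picked penalty, and the function name `pelt_mean_shift` are assumptions chosen for illustration.

```python
def pelt_mean_shift(y, penalty):
    """Return change point indices t such that a new segment starts at y[t]."""
    n = len(y)
    # Prefix sums give the cost of any segment in constant time
    c1, c2 = [0.0], [0.0]
    for v in y:
        c1.append(c1[-1] + v)
        c2.append(c2[-1] + v * v)

    def cost(s, t):
        """Sum of squared deviations from the mean of y[s:t] (s < t)."""
        total = c1[t] - c1[s]
        return (c2[t] - c2[s]) - total * total / (t - s)

    F = [0.0] * (n + 1)      # F[t]: best penalized cost of y[:t]
    F[0] = -penalty
    last = [0] * (n + 1)     # last[t]: start of the final segment in the best split
    candidates = [0]
    for t in range(1, n + 1):
        values = [(F[s] + cost(s, t) + penalty, s) for s in candidates]
        F[t], last[t] = min(values)
        # Pruning step: drop starts that can no longer be optimal
        candidates = [s for v, s in values if v - penalty <= F[t]]
        candidates.append(t)

    # Backtrack through the stored segment starts
    cps, t = [], n
    while last[t] > 0:
        t = last[t]
        cps.append(t)
    return sorted(cps)


# Toy series with one obvious level shift after the fifth value
detected = pelt_mean_shift([0, 0, 0, 0, 0, 10, 10, 10, 10, 10], penalty=1.0)
```

A larger penalty yields fewer change points, so the number of change points reported by this kind of method depends on how the penalty is chosen.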

Table 2 Summary of the identified change points (CP) by the three methods

The validation tested whether the MKS-identified change points can be confirmed by the two other methods. A confirmation is accepted if an MKS-identified change point is validated by another method within a two-week window. The comparison results are shown in Table 2. Based on the 36 MKS-identified change points, the MKS test reaches 41.7% agreement (15/36) with the PELT method and 47.2% agreement (17/36) with the regression-based method. It is also worth mentioning that the other two methods identified at least one change point for every state, even when there is no obvious change of direction. The comparison results signify that the MKS test is a relatively conservative method for change point detection, as it can only detect abrupt changes and can thus avoid false-positive results.
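The confirmation rule can be sketched in Python as below. Interpreting the "two-week window" as an absolute difference of at most two weeks is an assumption, and the function name and the week numbers in the example are hypothetical.

```python
def count_confirmed(mks_points, other_points, window=2):
    """Count MKS change points that have a counterpart within `window` weeks."""
    return sum(
        1 for p in mks_points
        if any(abs(p - q) <= window for q in other_points)
    )


# Hypothetical single-state example (week numbers are made up)
mks_weeks = [4, 43]
other_weeks = [5, 20, 40]
confirmed = count_confirmed(mks_weeks, other_weeks)   # only week 4 has a match (week 5)
agreement = confirmed / len(mks_weeks)

# Agreement rates reported in the text, recomputed from the counts
pelt_agreement = round(15 / 36 * 100, 1)
regression_agreement = round(17 / 36 * 100, 1)
```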

Conclusions

To sum up, the MKS test has several advantages in change point detection. First and foremost, it is characterized by high computational efficiency and easy implementation. Users can easily implement this method in Microsoft Excel without any prior statistical knowledge or modeling skills. Second, the method can detect the change of direction, whereas some other methods (e.g., PELT) can only identify the existence of a change without specifying the direction. Third, since the MKS test is a nonparametric model, it can be applied to time series data where the distribution is not normal or has extreme variability. However, due to its conservative nature and moderate agreement with the other slower but more sensitive methods, we recommend using the MKS test primarily for initial pattern identification and data pruning, especially in large datasets. For example, to identify the change points in a long sequence of COVID-19 infection data, we can first use the MKS test to narrow down the time window where changes are likely to occur, and then use a second method (which has a higher computational cost but is more sensitive) to reconfirm the change pattern. In addition, as the conservativeness of the MKS test can be easily modified by adjusting the width of the statistical filter, future studies should examine how the quality of the results derived from the MKS test may vary as a function of the statistical filter.

This pilot study is the first to implement the MKS test for COVID-19 studies. An open-access tool is developed to facilitate the model implementation. With further validation and modification, the method can be applied to other health data, such as injuries, disabilities, and mortalities. By identifying key time points where chronologically ordered observations have a drastic change, the method can eventually contribute to revealing the etiology of these health outcomes and supporting public health decision-making.