Abstract
A practical algorithm has been developed for closeness analysis of sequential data that combines closeness testing with algorithms based on the Markov chain tester. It was applied to reported sequential data for COVID-19 to analyze the evolution of COVID-19 over given time periods (week, month, etc.).
1 Introduction
The COVID-19 coronavirus has spread worldwide, and as of May 31, 2021, the number of confirmed cases was 170M, and the number of deaths was 3.54M. A fourth wave of infections due to the emergence of variants with strong infectivity began hitting a number of countries in Spring 2021. Coping with a worldwide pandemic such as COVID-19 requires understanding the infection situation. This requires development of techniques for analyzing the various types of sequential data that are available. These data include the number of confirmed infections, the number of deaths, and the number of polymerase chain reaction tests and rapid antigen tests by location and time.
As the availability of various types of data has increased in recent years, faster and more sample-efficient algorithms have been developed for statistical testing. In particular, for data collected by sensors, closeness testing of distributions to infer information from the underlying probability distributions is rapidly evolving (Chan et al. 2014; Canonne 2020; Daskalakis et al. 2018). Wolfer and Kontorovich, for example, developed an identity tester that determines whether sequential data represented by two Markov chains are identical (Wolfer and Kontorovich 2020). Although the theory in this area is quite rich, there have been few reports of the proposed algorithms being tested on actual applications or in simulation studies. Moreover, the algorithms are suitable only for discrete distributions, so a quantization technique is needed to transform continuous distributions into discrete ones. Canonne and Wimmer discussed the difficulties inherent in binning and segmentation and their limitations (Canonne and Wimmer 2020). The main criticism of these algorithms is that the domain of the distribution is generally taken to be \([n]=\{1, \dots , n\}\), which is not always realistic or representative of the true data. To overcome this limitation, a suitable quantization is needed, as suggested by Canonne and Wimmer (2020). In this work, since we had no prior information on the data distribution, we adopted a uniform discretization with a number of bins determined empirically by numerical search.
We have developed a practical algorithm for closeness analysis of sequential data by combining distribution testing and algorithms based on Wolfer and Kontorovich’s identity tester (Wolfer and Kontorovich 2020). We tested it by using it to analyze the evolution of COVID-19 over given time periods (week, month, etc.). Although Markov switching models and Markov agent models have been widely used for general compartmental models in epidemiology, such as SIR (susceptible-infectious-recovered) and SEIR (susceptible-exposed-infectious-recovered) models, to represent the state transition (Bestehorn et al. yyy; Boukanjime et al. 2020; Gribaudo et al. 2021; Larsen et al. 2020; Raherinirina et al. 2021), there has been little use of Markov chains to model COVID-19 data. Ma et al. (2021) recently proposed using a Markov process combined with an LSTM (long short-term memory) model to categorize the reported COVID-19 cases. To our knowledge, there have been no reports of applying Markov chains to COVID-19 data using testing techniques.
In the following section, we briefly describe related work on distribution testing and Markov chain testing. Our analysis methods are described in Sect. 3, and their usage for analyzing spatiotemporal data like that for COVID-19 is described in Sect. 4. We discuss the testing sensitivity in Sect. 5 and conclude with a summary of the key points in Sect. 6.
2 Related work
2.1 Distribution testing
Distribution testing is a field of computer science concerned with statistical (composite) hypothesis testing questions, with a focus on finite sample guarantees and efficient algorithms. While the findings of many distribution tests have been reported, the main focus has been on three problems: the uniform testing problem, the identity testing problem, and the closeness testing problem. Let D be a distribution over a (countable) domain \(\varOmega \). The uniform testing problem is to determine whether \(D = U_{\varOmega }\) (the uniform distribution on \(\varOmega \)) or D is \(\varepsilon \)-far from \(U_{\varOmega }\), that is, at distance at least \(\varepsilon \in (0,1)\) from it (Batu et al. 2001; Goldreich and Ron 2011; Paninski 2008). The identity testing problem is to determine whether \(D = D^*\) (a fixed distribution over \(\varOmega \)) or D is \(\varepsilon \)-far from \(D^*\) (Valiant and Valiant 2011; Acharya et al. 2015; Valiant and Valiant 2017). The closeness testing problem is to determine whether D and \(D'\) (another distribution on \(\varOmega \)) are equal or \(\varepsilon \)-far from each other (Batu et al. 2013; Diakonikolas and Kane 2016; Valiant 2011). Here, we focus on tolerant closeness testing as it is useful for analyzing the COVID-19 situation. The tolerant closeness testing problem is stated as follows.

Given sample access to distributions D and \(D'\) over \(\varOmega \), and bounds \(\eta _1 \ge 0\), \(\eta _2 > 0\), \(\delta \in (0,1)\), distinguish with probability at least \(1 - \delta \) between \(d_1(D, D') \le \eta _1\) and \(d_2(D, D') \ge \eta _2\) whenever \(D, D'\) satisfy one of these two inequalities.
Here, \(d_1\) and \(d_2\) are distances between two distributions. Depending on the purpose of the analysis, the total variation distance, the \(\ell _2\) distance, the \(\chi ^2\) distance, or the Hellinger distance is generally used as \(d_1\) and \(d_2\) in distribution testing. The total variation distance is standard, and the properties of the other distances have been theoretically and comparatively studied (Daskalakis et al. 2018). The \(\chi ^2\)-type statistics defined by Chan et al. (2014) are used here.
2.2 Markov chain testing
Learning and testing discrete distributions has been a hot research area, especially for sample complexity problems in identity testing and closeness testing (Canonne 2020). Most work in this area assumes independent and identically distributed (iid) samples, which is often unrealistic. Recent work has started to address the three testing problems described above for data generated from a finite Markov chain (e.g., Wolfer and Kontorovich 2019, 2020). Since COVID-19 data observations are obviously not iid in time and space, we assume here that the observed proportions \(\pi \) (where the distribution D is estimated by \(\pi \)) are generated by a Markov chain over a discrete state space \([s]=\left\{ s_1, \dots , s_B\right\} \); that is, the chain satisfies the Markov property
\[ {\mathbb {P}}\left( \pi _{t+1}=s_j \mid \pi _{t}=s_i, \pi _{t-1}, \ldots , \pi _{0}\right) = {\mathbb {P}}\left( \pi _{t+1}=s_j \mid \pi _{t}=s_i\right) = p_{i j}, \tag{1} \]
where \(p_{i j}\) denotes the transition probability from state \(s_i\) to state \(s_j\). Given an observed trajectory \(\varvec{\pi }=\left( \pi _{0}, \ldots , \pi _{T}\right) \) from some unknown Markov chain up to time T, we are interested in testing the transition probabilities from this trajectory alone. Two strategies can be adopted for Markov chain testing: (i) naive use of distribution testing techniques (closeness testing, identity testing, and so on) to compare conditional transition probabilities and (ii) a less obvious comparison of the stationary distributions of the two Markov chains. With the first strategy, the discrete conditional probability distributions \( p_{i .}=(p_{i 1}, \ldots , p_{i B})\) and \(q_{i .}=(q_{i 1}, \ldots , q_{i B})\) as defined in (1) are compared for each fixed state \(s_i\). The second strategy requires conditions for the existence of the stationary distribution, which are expressed through the concept of mixing time.
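The first (naive) strategy starts from row-wise estimates of the transition probabilities. The following is a minimal sketch, assuming states are encoded as integers 0 to B-1; the function names `transition_counts` and `transition_rows` are illustrative and do not come from the original algorithms.

```python
def transition_counts(states, num_states):
    """Count observed transitions i -> j in a single trajectory.

    `states` is a list of integer state labels in 0..num_states-1
    (an encoding assumed here for illustration).
    """
    counts = [[0] * num_states for _ in range(num_states)]
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return counts


def transition_rows(counts):
    """Normalize each row of the count matrix into a conditional distribution.

    Rows for states that were never left stay None, because no
    conditional distribution can be estimated for them.
    """
    rows = []
    for row in counts:
        total = sum(row)
        rows.append([c / total for c in row] if total else None)
    return rows


trajectory = [0, 1, 1, 2, 0, 1, 2, 2, 0]
counts = transition_counts(trajectory, 3)
rows = transition_rows(counts)
```

Each non-empty entry of `rows` plays the role of an estimated conditional distribution \(p_{i .}\), which can then be passed to any test for discrete distributions.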
Wolfer and Kontorovich’s identity tester (Wolfer and Kontorovich 2020) constructs a tester \({\mathcal {T}}\) that can determine whether a given trajectory was generated from an unknown ergodic Markov chain M having B states. The following distance between Markov chains \(M_1\) and \(M_2\) is used.
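A distance consistent with this description, and with the total variation norm referred to next, is the largest row-wise discrepancy between the two transition matrices. The row notation \(M(i,\cdot )\) and the symbol d are assumptions introduced here for illustration:

\[ d(M_1, M_2) = \max _{i \in \{1, \dots , B\}} \left\Vert M_1(i,\cdot ) - M_2(i,\cdot ) \right\Vert _{TV} \]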
where \(\Vert .\Vert _{TV}\) stands for the total variation norm (see Wolfer and Kontorovich 2020). They showed that the tester can determine with a probability of at least \(1-\delta \) whether the sample trajectory was generated from M or from a chain that is \(\varepsilon \)-far from M.
This issue has also been studied by Daskalakis, Dikkala, and Gravin (2018), who, inspired by the early work of Kazakos (1978), proposed a difference measure that captures the scaling behavior of the total variation distance between growing trajectories of the Markov chains. They then presented efficient identity testers and gave corresponding information-theoretic lower bounds. Recently, Cherapanamjeri and Bartlett (2019) succeeded in removing the dependence of the sample complexity on the hitting time for symmetric chains. Fried and Wolfer (2021) extended these results (Daskalakis, Dikkala, and Gravin 2018; Cherapanamjeri and Bartlett 2019) from symmetric to general reversible chains. More details about the tightness of the bounds and the link to the hitting time of the Markov chain can be found in the original papers (Daskalakis, Dikkala, and Gravin 2018; Cherapanamjeri and Bartlett 2019; Fried and Wolfer 2021).
3 Analysis methods using distribution testing and Markov chain testing
Focusing on COVID-19, we investigated whether the pandemic evolved in the same way in different regions and for different segments of the population. We tested three analysis methods based on distribution testing and Markov chain testing that can be applied to the spatiotemporal data of COVID-19 and potentially any novel coronavirus.

1. Closeness analysis
2. Periodical evolution analysis
3. Key factor analysis

In the following sections, we first formulate the problem and then describe these analysis methods.
3.1 Observation model formulation
Let us consider a population \({\mathcal {P}}\) and suppose that \({\mathcal {P}} = \bigcup P_\ell \), where \(\{P_\ell \}_{\ell =1,\dots , L}\) is a partition of the population, that is, the \(P_\ell \) are pairwise disjoint. This segmentation can be linked to geographic regions, sociodemographic categories, age, and other relevant auxiliary variables. We are interested in monitoring the dynamic distribution of a coronavirus like COVID-19. We are especially interested in the evolution of the distribution \(D_{\ell }(t)\) of the number of infected people in segment \(P_\ell \) at time t.
Our testing framework is applicable only to discrete distributions, so we need to quantize the state space into B bins. Let us denote the discretized states by \([s]=\left\{ s_1, \dots , s_B\right\} \) (in the univariate case); they discretize the interval \([0,p_{\max }]\), where \(p_{\max }\) is the maximum allowed proportion (in the experiments, the segmentation is uniform and \(p_{\max }\) is less than 1). To investigate the severity of COVID-19, the proportion \(\pi _t^\ell \) of infected people in segment \(P_\ell \) at time t is assigned the state \(s_i\) if \( s_i< \pi _t^\ell \le s_{i+1} \). The observed proportion is \({\hat{\pi }}_t^\ell = {n_t^\ell }/{N_\ell }\), where \(n_t^\ell \) is the number of infected people in population \(P_\ell \) at time t, and \(N_\ell \) is the size of the population segment \(P_\ell \). For each t and \(\ell \), \({\hat{\pi }}^\ell _t\) is a random variable taking values in \({\mathcal {M}}[s]\), the set of discrete probability measures on [s].
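The uniform quantization described above can be sketched as follows. This is only an illustration under stated assumptions: bins are indexed from 0 to B-1, a proportion of exactly 0 is placed in the first bin, and the function name `quantize` and the example figures are hypothetical.

```python
import math


def quantize(proportion, num_bins, p_max):
    """Map a proportion in [0, p_max] to a bin index in 0..num_bins-1.

    Bin i covers (i * width, (i + 1) * width] with width = p_max / num_bins,
    mirroring the half-open rule s_i < pi <= s_{i+1}. A proportion of
    exactly 0 is assigned to bin 0 (an assumption made here). Values that
    sit exactly on an interior edge may fall in either neighbouring bin
    because of floating-point rounding.
    """
    if not 0.0 <= proportion <= p_max:
        raise ValueError("proportion must lie in [0, p_max]")
    width = p_max / num_bins
    index = math.ceil(proportion / width) - 1
    # Clamp to guard the two end points against rounding.
    return min(max(index, 0), num_bins - 1)


# Hypothetical example: 85 new cases in a segment of 340,000 people.
observed = 85 / 340_000          # 0.00025
state = quantize(observed, num_bins=20, p_max=0.002)
```

Applying `quantize` to every daily proportion of a segment yields the discrete state sequence on which the Markov chain tests operate.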
3.2 Closeness analysis
We designed an algorithm for closeness analysis by combining distribution testing (closeness testing) and Markov chain testing in order to analyze the closeness of two sequential data sets. Distribution testing generally assumes oracle access to the distributions. For closeness testing, according to Theorem 1 of Chan et al. (2014) and Theorem 5.9 of Canonne (2020), tight upper \(\mathrm{O}\) and lower \(\varOmega \) bounds for sample complexity with the total variation distance in Eq. (2) are given by
\[ \mathrm{O}\left( \max \left( \frac{B^{2/3}}{\varepsilon ^{4/3}}, \frac{B^{1/2}}{\varepsilon ^{2}}\right) \right) \quad \text{and} \quad \varOmega \left( \max \left( \frac{B^{2/3}}{\varepsilon ^{4/3}}, \frac{B^{1/2}}{\varepsilon ^{2}}\right) \right) . \]
The algorithm we designed for closeness analysis satisfies the following two conditions under the assumption of oracle access (Canonne 2020; Chan et al. 2014). On input \(\varepsilon \in (0,1)\) (a constant), \(C \in {\mathbb {R}}^+\) (an absolute constant) and \(B \in {\mathbb {N}}\) (the number of states), it takes \(C \cdot \max (\dfrac{B^{2/3}}{\varepsilon ^{4/3}}, \dfrac{B^{1/2}}{\varepsilon ^{2}})\) samples from the distributions and,

if the distributions are equal, it outputs ACCEPT with probability at least 2/3;

if the total variation distance between the distributions is greater than \(\varepsilon \), it outputs REJECT with probability at least 2/3.
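The sample budget stated above can be evaluated directly. The sketch below simply computes \(C \cdot \max (B^{2/3}/\varepsilon ^{4/3}, B^{1/2}/\varepsilon ^{2})\); rounding up to an integer and the function name `sample_budget` are assumptions made for illustration.

```python
import math


def sample_budget(num_states, epsilon, c):
    """Return ceil(C * max(B^(2/3) / eps^(4/3), B^(1/2) / eps^2)).

    Rounding up to an integer is a choice made in this sketch.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie in (0, 1)")
    first = num_states ** (2.0 / 3.0) / epsilon ** (4.0 / 3.0)
    second = math.sqrt(num_states) / epsilon ** 2
    return math.ceil(c * max(first, second))


# B = 20 states, epsilon = 0.1, C = 100.
m = sample_budget(20, 0.1, 100)
```

For small \(\varepsilon \) the second term dominates, so the budget grows roughly as \(1/\varepsilon ^2\); this is why strict testing with small \(\varepsilon \) and large C is computationally expensive.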
As shown in Algorithm 1, five parameters are input: \(\varepsilon \), C, B, \(N \in {\mathbb {N}}\) (the number of testing iterations), and \(\mu \in {\mathbb {N}}\) (the minimum number of samples for testing). The d-dimensional sequential data (\(\mathbf{x} \) and \(\mathbf{y} \)) are first quantized into B bins (or B states). Algorithm 1 follows the naive use strategy described in Sect. 2.2. For each state b, the discrete conditional probability distributions (\( p_{b .}=(p_{b 1}, \ldots , p_{b B})=(\frac{T_b^{x}(1)}{\sum _{k=1}^{B} T_b^{x}(k)}, \ldots , \frac{T_b^{x}(B)}{\sum _{k=1}^{B} T_b^{x}(k)})\) and \(q_{b .}=(q_{b 1}, \ldots , q_{b B})=(\frac{T_b^{y}(1)}{\sum _{k=1}^{B} T_b^{y}(k)}, \ldots , \frac{T_b^{y}(B)}{\sum _{k=1}^{B} T_b^{y}(k)})\)) are compared. In accordance with Theorem 1 of Chan et al. (2014) and Theorem 5.9 of Canonne (2020), \(m_0\) is drawn from a Poisson distribution with mean m (line 21), and \(m_0\) samples are then drawn from each distribution (lines 23 and 24). For the acceptance probability, the \(\chi ^2\)-type statistic z(n) defined by Chan et al. is calculated for each sample n (line 28) and compared with a threshold (Canonne 2020) (line 30). The statistic can be viewed as a modification of the empirical triangle distance applied to \(c^{x}\) and \(c^{y}\). For the reject probability, the total variation distance d(n) is calculated for each sample n (line 29) and compared with the threshold \(\varepsilon \).
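The two quantities compared with thresholds can be sketched as follows. The statistic uses the form usually attributed to Chan et al. (2014), \(\sum _i ((X_i - Y_i)^2 - X_i - Y_i)/(X_i + Y_i)\), computed over per-state sample counts; this form, the function names, and the omission of the Poisson sampling and threshold steps are assumptions of this sketch, which is not a reproduction of Algorithm 1.

```python
def chi2_type_statistic(x_counts, y_counts):
    """Chi-square-type closeness statistic on two count vectors.

    Sums ((X_i - Y_i)^2 - X_i - Y_i) / (X_i + Y_i) over the states
    that were observed at least once in either sample.
    """
    total = 0.0
    for x, y in zip(x_counts, y_counts):
        if x + y > 0:
            total += ((x - y) ** 2 - x - y) / (x + y)
    return total


def empirical(counts):
    """Turn a count vector into an empirical distribution."""
    n = sum(counts)
    return [c / n for c in counts]


def total_variation(p, q):
    """Total variation distance: half the l1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


same = chi2_type_statistic([5, 3, 2], [5, 3, 2])
different = chi2_type_statistic([10, 0], [0, 10])
tv = total_variation(empirical([10, 0]), empirical([0, 10]))
```

Identical count vectors give a negative value (each observed state contributes -1), whereas disjoint supports give a large positive value, which is what makes a threshold comparison meaningful.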
After application of Algorithm 1, the acceptance \(P_A\) and reject \(P_R\) probabilities, the distance of the \(\chi ^2\)-type statistic Z, and the total variation distance D for closeness testing between \(\mathbf{x} \) and \(\mathbf{y} \) can be calculated as the mean, median, or minimum value over all states. The minimum value is the most conservative; the mean value was used in the experiments. The \(\chi ^2\)-type statistic is an estimate of the \(\chi ^2\)-divergence. The relation between the divergence and the total variation distance is as follows; for distributions p and q, the following inequalities hold.
Additional details and discussion can be found elsewhere (see, for instance, Daskalakis et al. 2018). These inequalities show that the \(\chi ^2\)-divergence \(d_{\chi ^{2}}\) is more conservative than the Hellinger distance \(d_{\mathrm {H}}\) and the total variation distance \(d_{\mathrm {TV}}\). This motivated our use of the \(\chi ^2\)-type statistic.
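One standard chain of this kind, written under the normalization \(d_{\mathrm {TV}}(p,q) = \frac{1}{2}\sum _i |p_i - q_i|\), \(d_{\mathrm {H}}(p,q) = \frac{1}{\sqrt{2}}\sqrt{\sum _i (\sqrt{p_i} - \sqrt{q_i})^2}\), and \(d_{\chi ^{2}}(p,q) = \sum _i (p_i - q_i)^2/q_i\) (the choice of normalization is an assumption here), reads:

\[ d_{\mathrm {H}}^2(p,q) \le d_{\mathrm {TV}}(p,q) \le \sqrt{2}\, d_{\mathrm {H}}(p,q) \le \sqrt{d_{\chi ^{2}}(p,q)} \]

Read from right to left, a small \(\chi ^2\)-divergence forces both the Hellinger and the total variation distances to be small, while the converse does not hold in general.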
Note that the distance also depends on the mixing properties of the Markov chains and the stationary distribution, particularly when the number of states is small (Wolfer and Kontorovich 2020). For such a case, the mixing time should be estimated, for example, according to Algorithm 1 in Wolfer (2020) and confirmed to be smaller than m (line 21) in Algorithm 1.
3.3 Periodical evolution analysis
For sequential data such as COVID-19 data, it is often necessary to analyze how the situation evolves over time. Here, we investigate a method of periodical evolution analysis based on closeness analysis. As shown in Algorithm 2, the input sequence \(\mathbf{x}\) is first divided into L segments. Then, for each pair of segments, closeness is tested using Algorithm 1. The periodical properties can then be analyzed from the resulting \(L \times L\) matrices of acceptance probabilities and distances.
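The segment-and-compare structure just described can be sketched as below. The comparison function is pluggable; `histogram_tv` is a deliberately simple stand-in (total variation between state frequencies) and is not Algorithm 1. All names are hypothetical.

```python
def split_segments(sequence, num_segments):
    """Split a sequence into contiguous segments of near-equal length."""
    n = len(sequence)
    bounds = [(k * n) // num_segments for k in range(num_segments + 1)]
    return [sequence[bounds[k]:bounds[k + 1]] for k in range(num_segments)]


def pairwise_matrix(segments, compare):
    """Apply `compare` to every ordered pair of segments (L x L matrix)."""
    size = len(segments)
    return [[compare(segments[i], segments[j]) for j in range(size)]
            for i in range(size)]


def histogram_tv(a, b, num_states=3):
    """Stand-in comparison: total variation between state frequencies."""
    freq_a = [a.count(s) / len(a) for s in range(num_states)]
    freq_b = [b.count(s) / len(b) for s in range(num_states)]
    return 0.5 * sum(abs(x - y) for x, y in zip(freq_a, freq_b))


states = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
segments = split_segments(states, 3)
matrix = pairwise_matrix(segments, histogram_tv)
```

In this toy input the first and third segments are identical, so `matrix[0][2]` is 0 while `matrix[0][1]` is 1; a recurring pattern of this sort in the matrix is what periodical evolution analysis looks for.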
3.4 Key factor analysis
When planning measurements such as those for COVID-19, it is important to analyze the key factors, i.e., the factors that correlate with changes in, for example, the number of infections. We investigated a method for analyzing the key factors that uses a generalized additive model (GAM) (T.J. Hastie 1990), in which the response variable depends linearly on unknown smooth functions of some predictor variables and the focus is on making inferences about these smooth functions. The benefit of a GAM is that it takes advantage of smoothed transforms of the predictor variables using basis functions such as smoothing splines. The distances obtained by the closeness analysis are used as the response variables. The data for the key factor candidates, e.g., vehicle and public transport increase rates, are used as predictor variables. The best model is then selected in a stepwise fashion using either the Akaike information criterion or the model residual deviance (Hastie 1992).
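The additive structure described above is commonly written as follows. The link function g, the intercept \(\beta _0\), the smooth functions \(f_j\), and the number of predictors J are generic symbols introduced here for illustration, not the fitted models of this study:

\[ g\left( {\mathbb {E}}[Y]\right) = \beta _0 + \sum _{j=1}^{J} f_j(x_j) \]

Here Y would be a distance obtained from the closeness analysis and each \(x_j\) a key factor candidate. A predictor that enters the model linearly corresponds to the special case \(f_j(x_j) = \beta _j x_j\).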
4 Experiments and results
4.1 COVID-19 sequential data
We used reported data for the number of newly infected people \(n_t^\ell \) for each of the 53 cities on the main island of Japan as reported daily by the Tokyo metropolitan government from April 1, 2020, to May 6, 2021, along with the population \(N_\ell \) of each city. Segmentation \(\{P_\ell \}_{\ell =1,\dots , L}\) (described in Sect. 3.1) was linked to each city in Tokyo (which is a prefecture, not a city). The observed proportion \({\hat{\pi }}_t^\ell (= {n_t^\ell }/{N_\ell })\) was quantized into B states, and B was set to 20.
4.2 Closeness analysis of COVID-19 infection situation between cities
Figure 1 shows \(53 \times 53\) city-by-city matrices of acceptance probabilities (the mean of \(P_A(b)\) over all states in Algorithm 1) and distances of \(\chi ^2\)-type statistics (the mean of Z(b) over all states in Algorithm 1) between all pairs of the 53 cities in Tokyo for each month from April 2020 to April 2021, calculated using Algorithm 1. C and \(\mu \) were chosen empirically and set to 100 and 3, respectively. As of June 2021, there had been four waves of COVID-19 infection; the peak months are roughly indicated by red stars.
For the acceptance probabilities, the matrices between the waves tend to be darker; that is, many cities are considered to have had similar characteristics in the changes in the number of infected people in each of those months. In fact, for such cities, the number of infected people was small and stable during those months.
For the distances, the overall matrix color is the darkest for January 2021, when the third wave peaked and the number of infected people was the largest. Many cities experienced an explosion of infections and different characteristics of the changes in the number of infected people for the month.
Figure 2 shows the k-means clustering for the distance matrices in Fig. 1. To facilitate recognition of the differences in the level of increases in infection, the number of color codes was set to three: red indicates a relatively high level, yellow a moderate level, and blue a low level. For April 2020, two cities in the heart of Tokyo, Shinjuku-ku and Minato-ku, had the highest level. This is attributed to Shinjuku-ku and Minato-ku each having a popular entertainment district. Until October 2020, most cities had the lowest level. Starting with the third wave, roughly from December 2020 to February 2021, the levels of the nearby cities increased to moderate and then to high. These figures illustrate how the characteristics of the changes in the number of infected people were transformed.
4.3 Periodical COVID-19 evolution analysis
Figure 3 shows the matrices of acceptance probabilities, distances of \(\chi ^2\)-type statistics, reject probabilities (mean of \(P_R(b)\) over all states in Algorithm 1), and total variation distances (mean of D(b) over all states in Algorithm 1) between all pairs of 13 months for Shinjuku-ku and Tachikawa-shi calculated using Algorithm 2. C and \(\mu \) were chosen empirically and set to 100 and 3, respectively. Tachikawa-shi is located in the middle west of Tokyo, in a suburban area. For Shinjuku-ku (in the heart of Tokyo), as in Fig. 1, almost all the pairs are different, while the pairs of months from May to October 2020 are similar. For Tachikawa-shi, the pairs from April to November 2020 and for February and March 2021 are similar. The number of infected people in these months was small and stable. This figure illustrates the characteristics of monthly COVID-19 evolution for both cities.
Figure 4 shows the matrices of acceptance probabilities, distances of \(\chi ^2\)-type statistics, reject probabilities, and total variation distances between all pairs of 57 weeks from 1 April 2020 to 5 May 2021 for all of Tokyo, calculated using Algorithm 2 on the numbers accumulated over all the cities in Tokyo. C and \(\mu \) were chosen empirically and set to 100 and 3, respectively. The acceptance probabilities show that the weeks from April to June 2020 and in August and September 2020 tended to be similar among the cities. The distances show that the weeks in January, April, and May 2021 were very different. This indicates that the number of infected people in the weeks in January 2021 changed dynamically, probably because of an increase in contacts between people due to year-end and beginning-of-year parties and meetings. In April and May 2021, variants of the COVID-19 virus with higher infectivity began to gradually spread, so the characteristics of the changes in the number of infected people differed from those in previous weeks.
4.4 Key factor analysis for COVID-19 evolution
For the key factor analysis, we used the distances of the \(\chi ^2\)-type statistic Z and the total variation distances D between all pairs of 52 weeks from 6 May 2020 to 4 May 2021 for all of Tokyo; these are a subset of the 57 weeks shown in Fig. 4. Table 1 lists the key factor candidates used in the experiments, such as vehicle and public transport increase rates and average temperature in Tokyo, which are considered to affect the rate of new infections. We set a delay of zero (no delay), one week, or two weeks between the distances and the key factor candidates.
For the distances of the \(\chi ^2\)-type statistic, the R-squared (adjusted) values are listed in Table 2. R-squared measures how well the model explains the response, and R-squared (adjusted) is a version adjusted for the number of predictors in the model to favor parsimony. The table shows that the fitting was fairly accurate. The best model for a delay of two weeks was selected; it is shown in Eq. (3). The \(s({{ term}})\) indicates a smoothed transform in which \({{ term}}\) is computed using a smoothing spline, as mentioned in Sect. 3.4. All the terms were significant: 0.001 significance level for \(\mathbf{vehicle} \), \(s({\mathbf{temperature}})\), and \(s({\mathbf{deathTokyo}})\), 0.01 for \(s({\mathbf{week}})\), \(s({\mathbf{patientHospital}})\), and \(s({\mathbf{roomHospital}})\), and 0.05 for \(\mathbf{pedestrian} \) and \(s({\mathbf{deathWorld}})\).
For the total variation distances, the fitting accuracy in terms of the R-squared (adjusted) values was fairly good, as shown in Table 2. The best model for a delay of two weeks was selected; it is shown in Eq. (4). All the terms were significant except for \(s({\mathbf{patientHospital}})\): 0.001 significance level for \(s({\mathbf{week}})\), \({\mathbf{vehicle}}\), \(s({\mathbf{temperature}})\), \(s({\mathbf{deathTokyo}})\), and \(s({\mathbf{infectedWorld}})\) and 0.01 for \(\mathbf{pedestrian}\) and \(s(\mathbf{roomHospital})\).
Moreover, we divided the 52 weeks from 6 May 2020 to 4 May 2021 into two periods: (i) the 30 weeks from May to November 2020 and (ii) the 22 weeks from December 2020 to May 2021. For the first period, the R-squared (adjusted) values for both the \(\chi ^2\)-type statistic and the total variation distance in Table 2 were low, making it difficult to find a correlation between the distances and the key factors. For the second period, the R-squared (adjusted) values for both distances were high. As mentioned in Sect. 4.2, the third wave started roughly in December 2020 in Tokyo, and stronger correlations between the distances and the key factors are evident for the second period.
For the distances of the \(\chi ^2\)-type statistic, the best model for a delay of two weeks was selected; it is shown in Eq. (5). All the terms were significant: 0.001 significance level for \({\mathbf{week}}\), \(\mathbf{vehicle} \), \(s({\mathbf{deathTokyo}})\), \(s({\mathbf{deathWorld}})\), and \(s({\mathbf{patientHospital}})\), 0.01 for \(s({\mathbf{transport}})\) and \({\mathbf{infectedWorld}}\), and 0.05 for \(s(\mathbf{temperature} )\).
For the total variation distances, the best model for a delay of two weeks was selected; it is shown in Eq. (6). All the terms were significant except for \(s({\mathbf{patientHospital}})\): 0.001 significance level for \({\mathbf{week}}\), \({\mathbf{vehicle}}\), \(s({\mathbf{deathTokyo}})\), and \(s({\mathbf{patientHospital}})\) and 0.05 for \(s({\mathbf{transport}})\).
These results indicate that the increase rates for vehicles and public transport can be used in the COVID-19 measurements, especially for the second period. The temperature, numbers of deaths, and number of patients in hospitals in Tokyo should be considered key factors that can be correlated with a change in COVID-19 infection rates.
5 Discussion
We first discuss the properties of Algorithm 1 as a Markov chain tester and the sensitivity of its parameters. We do this using simulated data: (i) sequence \(Q^x\) randomly generated from a transition probability matrix with 5 states (Markov chain), (ii) sequence \(Q^y\) generated by sorting sequence \(Q^x\), and (iii) sequence \(Q^z\) consisting of a \((100 - \alpha )\)% portion identical to \(Q^x\) and an \(\alpha \)% portion different from \(Q^x\). All sequences had a length of 100 with state components \(s_1=1,\ldots ,s_5=5\) (see appendix A). Note that although sequences \(Q^x\) and \(Q^y\) contained the same proportion of each state, \(Q^y\) had no Markovian property.
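The first two simulated sequences can be generated along the following lines. The transition matrix below is hypothetical (the one used in the experiments is given in appendix A and is not reproduced here), states are labeled 0 to 4 rather than 1 to 5, and the seed is arbitrary.

```python
import random
from collections import Counter


def simulate_chain(transition, length, rng, start=0):
    """Sample a trajectory of the given length from a transition matrix."""
    num_states = len(transition)
    states = [start]
    for _ in range(length - 1):
        row = transition[states[-1]]
        states.append(rng.choices(range(num_states), weights=row, k=1)[0])
    return states


# Hypothetical 5-state chain that tends to stay in its current state.
P = [[0.6 if i == j else 0.1 for j in range(5)] for i in range(5)]

rng = random.Random(0)
q_x = simulate_chain(P, 100, rng)   # Markovian sequence
q_y = sorted(q_x)                   # same state frequencies, order destroyed
```

Because `q_y` is a permutation of `q_x`, any test that looks only at the marginal state frequencies sees no difference between them, while their transition counts differ sharply.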
Figure 5 shows the acceptance probabilities, the distances of the \(\chi ^2\)-type statistic, and the threshold values of closeness analysis between two sequences (\(Q^x\) and \(Q^y\)) with and without the Markovian property and with various values of \(\varepsilon \) and C in Algorithm 1. When \(\varepsilon \) was smaller than 0.3, the algorithm could accurately distinguish \(Q^x\) and \(Q^y\) for all values of C. However, when \(\varepsilon \) was 0.4 or 0.5 and C was 1 or less, the test results were incorrect, although the inaccuracy was less than \(4\%\). These results show that strict testing can be conducted with small values of \(\varepsilon \) and large values of C, although with these settings, m (line 21 in Algorithm 1) becomes large and the computation cost is higher. However, the required level of strictness in closeness analysis differs between applications, meaning that the values can be set accordingly, especially that of \(\varepsilon \). Moreover, both C and \(\varepsilon \) should be set in accordance with the available computation power.
Figure 6 shows the acceptance probabilities, the distances of the \(\chi ^2\)-type statistic, and the threshold values of closeness analysis between two identical sequences (\(Q^x\) and \(Q^x\)) with various values of \(\varepsilon \) and C in Algorithm 1. For \(\varepsilon \) from 0.1 to 0.9 and C from 1 to 100, the algorithm correctly determined that the two sequences were the same.
Table 3 lists the acceptance probabilities, the distances of the \(\chi ^2\)-type statistic, the reject probabilities, and the total variation distances of closeness analysis between \((100 - \alpha )\)% similar sequences (\(Q^x\) and \(Q^z\)) with \(\varepsilon = 0.1\) and \(C = 100\) in Algorithm 1. \(\alpha \) was varied from 0 to \(5\%\). The algorithm was able to distinguish the similar sequences when \(\alpha \) was \(2\%\) or more. In contrast, the classical hypothesis tests for two distributions (Wilcoxon rank-sum test and Kolmogorov–Smirnov test) could not reject the null hypothesis for any value of \(\alpha \). The proposed algorithm thus has strong testing power for sequential data.
6 Conclusions
We have designed a practical algorithm for testing the closeness of sequential data by combining distribution testing and Markov chain testing. We used it to analyze the closeness, the periodical evolution, and the key factors for the number of people infected with COVID-19 for each city in Tokyo. The results show whether the epidemic evolved in the same way in different cities, months, or weeks, quantified by the acceptance and reject probabilities and the significance levels. Examination of the properties of the algorithm as a Markov chain tester and the sensitivity of the parameters showed that strict testing can be conducted with small values of \(\varepsilon \) and large values of C under the constraint of the available computation power. Comparison with the classical Wilcoxon rank-sum test and Kolmogorov–Smirnov test demonstrated that the algorithm has strong testing power for sequential data.
References
Acharya, J., Daskalakis, C., & Kamath, G. (2015). Optimal Testing for Properties of Distributions. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., URL https://papers.nips.cc/paper/2015/hash/1f36c15d6a3d18d52e8d493bc8187cb9Abstract.html.
Batu, T., Fischer, E., Fortnow, L., Kumar, R., Rubinfeld, R., & White, P. Testing random variables for independence and identity. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451, a. https://doi.org/10.1109/SFCS.2001.959920. ISSN: 15525244.
Batu, T., Fortnow, L., Rubinfeld, R., Smith, W. D., & White, P. Testing closeness of discrete distributions. 60(1):4:1–4:25, b. https://doi.org/10.1145/2432622.2432626.
Bestehorn, M., Riascos, A. P., Michelitsch, T. M., & Collet, B. A. A markovian random walk model of epidemic spreading. 33(4):1207–1221. ISSN 14320959. https://doi.org/10.1007/s0016102100970z.
Boukanjime, B., Caraballo, T., El Fatini, M., & El Khalifi, M. (2020). Dynamics of a stochastic coronavirus (COVID-19) epidemic model with Markovian switching. Chaos, Solitons, and Fractals, 141:110361, Dec. 2020. ISSN 09600779. https://doi.org/10.1016/j.chaos.2020.110361. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7566849/.
Canonne, C. L. (2020). A Survey on Distribution Testing: Your Data is Big. But is it Blue? Number 9 in Graduate Surveys. Theory of Computing Library, 2020. https://doi.org/10.4086/toc.gs.2020.009. URL http://www.theoryofcomputing.org/library.html.
Canonne, C.L., & Wimmer, K. (2020). Testing data binnings. In APPROX/RANDOM, volume 176 of LIPIcs, pages 24:1–24:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL https://drops.dagstuhl.de/opus/volltexte/2020/12627/.
Chan, S.O., Diakonikolas, I., Valiant, P., & Valiant, G. (2014). Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1193–1203. Society for Industrial and Applied Mathematics, 2014. ISBN 9781611973389 9781611973402. https://doi.org/10.1137/1.9781611973402.88.
Cherapanamjeri, Y., & Bartlett, P. L. (2019). Testing symmetric Markov chains without hitting. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 758–785. PMLR, 25–28 Jun 2019. URL https://proceedings.mlr.press/v99/cherapanamjeri19a.html.
Daskalakis, C., Dikkala, N., & Gravin, N. Testing symmetric markov chains from a single trajectory. In Proceedings of the 31st Conference On Learning Theory, pages 385–409. PMLR. URL https://proceedings.mlr.press/v75/daskalakis18a.html. ISSN: 26403498.
Daskalakis, C., Kamath, G., & Wright, J. (2018). Which distribution distances are sublinearly testable? In A. Czumaj, editor, Proceedings of the TwentyNinth Annual ACMSIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 710, 2018, pages 2747–2764. SIAM, 2018. https://doi.org/10.1137/1.9781611975031.175.
Diakonikolas, I., & Kane, D. M. A new approach for testing properties of discrete distributions. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 685–694. https://doi.org/10.1109/FOCS.2016.78. ISSN: 02725428.
Fried, S., & Wolfer, G. (2021). Identity testing of reversible Markov chains. arXiv:2105.06347 [cs, math, stat].
Goldreich, O., & Ron, D. On testing expansion in bounded-degree graphs. In Electronic Colloquium on Computational Complexity (ECCC), volume 20.
Gribaudo, M., Iacono, M., & Manini, D. (2021). COVID19 Spatial Diffusion: A Markovian Agent-Based Model. Mathematics, 9(5):485. https://doi.org/10.3390/math9050485. URL https://www.mdpi.com/22277390/9/5/485.
Hastie, T. (1992). Generalized additive models. Chapter 7 of Statistical Models in S. Wadsworth & Brooks/Cole.
Kazakos, D. (1978). The Bhattacharyya distance and detection between Markov chains. IEEE Transactions on Information Theory, 24(6), 747–754.
Larsen, J. R., Martin, M. R., Martin, J. D., Kuhn, P., & Hicks, J. B. (2020). Modeling the Onset of Symptoms of COVID19. Frontiers in Public Health, 8:473. ISSN 22962565. https://doi.org/10.3389/fpubh.2020.00473. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7438535/.
Ma, R., Zheng, X., Wang, P., Liu, H., & Zhang, C. (2021). The prediction and analysis of COVID19 epidemic trend by combining LSTM and Markov method. Scientific Reports, 11(1), 1–14.
Paninski, L. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755. ISSN 15579654. https://doi.org/10.1109/TIT.2008.928987.
Raherinirina, A., Fandresena, T. S., Hajalalaina, A. R., Rabetafika, H., Rakotoarivelo, R. A., & Rafamatanantsoa, F. (2021). Probabilistic Modelling of COVID19 Dynamic in the Context of Madagascar. Open Journal of Modelling and Simulation, 9(3):211–230. https://doi.org/10.4236/ojmsi.2021.93014. URL http://www.scirp.org/Journal/Paperabs.aspx?paperid=109274.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall/CRC.
Valiant, G., & Valiant, P. The power of linear estimators. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pages 403–412. https://doi.org/10.1109/FOCS.2011.81. ISSN: 02725428.
Valiant, G., & Valiant, P. (2017). An Automatic Inequality Prover and Instance Optimal Identity Testing. SIAM Journal on Computing, 46(1):429–455. ISSN 00975397, 10957111. https://doi.org/10.1137/151002526.
Valiant, P. Testing symmetric properties of distributions. 40(6):1927–1968. Society for Industrial and Applied Mathematics. ISSN 00975397. https://doi.org/10.1137/080734066.
Wolfer, G. Mixing time estimation in ergodic Markov chains from a single trajectory with contraction methods. In Algorithmic Learning Theory, pages 890–905. PMLR. URL https://proceedings.mlr.press/v117/wolfer20a.html. ISSN: 26403498.
Wolfer, G., & Kontorovich, A. Estimating the mixing time of ergodic Markov chains. In Conference on Learning Theory, pages 3120–3159. PMLR. URL http://proceedings.mlr.press/v99/wolfer19a.html. ISSN: 26403498.
Wolfer, G., & Kontorovich, A. (2020). Minimax testing of identity to a reference ergodic Markov chain. In International Conference on Artificial Intelligence and Statistics, pages 191–201. URL http://proceedings.mlr.press/v108/wolfer20a.html.
Appendices
A Simulated data
The simulated data \(Q^x\), \(Q^y\), and \(Q^z\) are as follows.

\(Q^x\) = (1 4 1 2 2 5 1 2 2 5 5 5 1 2 5 5 3 3 4 5 4 2 4 4 5 3 4 4 5 5 5 5 4 3 2 2 5 1 4 3 2 4 5 3 5 5 1 5 2 3 5 3 2 4 1 2 4 4 5 5 1 2 2 1 2 2 1 5 5 3 5 3 5 1 2 4 5 3 4 4 4 5 4 3 1 4 5 4 5 4 3 2 1 3 2 3 5 1 3 4)

\(Q^y\) = (1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5)

\(Q^z\) = (1 4 1 2 2 5 1 2 2 5 5 5 1 2 5 5 3 3 4 5 4 2 4 4 5 3 4 4 5 5 5 5 4 3 2 2 5 1 4 3 2 4 5 3 5 5 1 5 2 3 5 3 2 4 1 2 4 4 5 5 1 2 2 1 2 2 1 5 5 3 5 3 5 1 2 4 5 3 4 4 4 5 4 3 1 4 5 4 5 4 3 2 1 3 2 2 2 2 2 2) (\(\alpha = 5\%\))
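For readers who want to work with these sequences, the following is a minimal sketch (not the authors' code) of how the empirical state frequencies and first-order transition frequencies of such a sequence could be tabulated. The helper names `parse`, `transition_counts`, and `row_normalize` are hypothetical, and the five-state space \(\{1, \dots, 5\}\) is assumed from the values appearing in the data. The resulting matrix is only an empirical estimate from 100 observations; it is not the transition probability matrix used to generate \(Q^x\). Only \(Q^x\) and \(Q^z\) are typed in here; \(Q^y\) can be handled the same way.

```python
from collections import Counter

# Q^x and Q^z copied from the appendix (Q^y can be processed identically).
Q_X_TEXT = (
    "1 4 1 2 2 5 1 2 2 5 5 5 1 2 5 5 3 3 4 5 "
    "4 2 4 4 5 3 4 4 5 5 5 5 4 3 2 2 5 1 4 3 "
    "2 4 5 3 5 5 1 5 2 3 5 3 2 4 1 2 4 4 5 5 "
    "1 2 2 1 2 2 1 5 5 3 5 3 5 1 2 4 5 3 4 4 "
    "4 5 4 3 1 4 5 4 5 4 3 2 1 3 2 3 5 1 3 4"
)
Q_Z_TEXT = (
    "1 4 1 2 2 5 1 2 2 5 5 5 1 2 5 5 3 3 4 5 "
    "4 2 4 4 5 3 4 4 5 5 5 5 4 3 2 2 5 1 4 3 "
    "2 4 5 3 5 5 1 5 2 3 5 3 2 4 1 2 4 4 5 5 "
    "1 2 2 1 2 2 1 5 5 3 5 3 5 1 2 4 5 3 4 4 "
    "4 5 4 3 1 4 5 4 5 4 3 2 1 3 2 2 2 2 2 2"
)


def parse(text):
    """Turn a whitespace-separated string of states into a list of ints."""
    return [int(tok) for tok in text.split()]


def transition_counts(seq, n_states=5):
    """Count transitions i -> j between consecutive states (states are 1-based)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(seq, seq[1:]):
        counts[a - 1][b - 1] += 1
    return counts


def row_normalize(counts):
    """Convert each row of counts to relative frequencies (empty rows stay zero)."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total else 0.0 for c in row])
    return result


q_x = parse(Q_X_TEXT)
q_z = parse(Q_Z_TEXT)

for name, seq in (("Q^x", q_x), ("Q^z", q_z)):
    print(name, "state counts:", sorted(Counter(seq).items()))
    for row in row_normalize(transition_counts(seq)):
        print("   ", " ".join(f"{p:.2f}" for p in row))
```

Comparing the two printed matrices shows where \(Q^z\), which departs from \(Q^x\) only in its last five entries, differs in its empirical transition frequencies.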
The transition probability matrix used to generate \(Q^x\) is as follows.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Matsui, T., Azzaoui, N. & Murakami, D. Analysis of COVID19 evolution based on testing closeness of sequential data. Jpn J Stat Data Sci 5, 321–338 (2022). https://doi.org/10.1007/s4208102100144w
DOI: https://doi.org/10.1007/s4208102100144w