Datasets
COVID-19 incidence and population data
Based on cumulative COVID-19 case data from the Johns Hopkins Coronavirus Resource Center (https://coronavirus.jhu.edu/), we compiled time series of daily new cases for more than 300 U.S. counties from 32 states and the District of Columbia, and matched them by unique five-digit FIPS code or county name to the dynamic and static variables collected from the additional data sources described below. Since a single county may contain multiple individual cities, we include the list of all city labels within each aggregate group to represent a greater metropolitan area. A total of 151 such metropolitan areas with at least 1,000 reported COVID-19 cases by May 31, 2020, were selected for this study. Population covariates for these areas were collected from the online resources of the U.S. Census Bureau and the U.S. Centers for Disease Control and Prevention (CDC) (https://www.census.gov/quickfacts/, https://svi.cdc.gov/).
Human mobility index data
Anonymized, geolocated mobile phone data from several providers, including Google and Apple, timestamped with local time, were made available for the analysis of human mobility patterns during the pandemic. Based on geolocation pings from a collection of mobile devices reporting consistently throughout the day, anonymous aggregated mobility indices were calculated for each county by Descartes Labs. Mobility traces are aggregated as nodes representing typical members of a given population. For each node, the maximum distance moved from its first reported location is calculated after excluding outliers; the median of this value across all devices in the sample yields a county-level mobility metric for selected locations. Descartes Labs further defines a normalized mobility index as the ratio of this median maximum-distance mobility to the “normal” median during an earlier baseline period, multiplied by a factor of 100. The mobility index thus provides a baseline against which relative changes in population behavior during the COVID-19 pandemic can be evaluated (Warren and Skillman, 2020).
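As a minimal sketch of this normalization (the function and sample values below are illustrative, not Descartes Labs' actual pipeline):

```python
import numpy as np

def mobility_index(max_distances_km, baseline_median_km):
    """Normalized mobility index: the median of per-device maximum
    distances moved, expressed as a percentage of the pre-pandemic
    baseline median for the same location."""
    return 100.0 * np.median(max_distances_km) / baseline_median_km
```

An index of 100 means mobility equal to the baseline period; values below 100 indicate reduced movement.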
Methods
Below, we list the steps of the overall workflow of our framework, and briefly describe each in the following paragraphs of this section.
Temporal patterns of mobility
To better understand the temporal patterns of mobility, in addition to the given non-negative mobility index M, we also use two variants: delta mobility ΔM and local derivative \(M^{\prime }\) defined as follows:
$$ {\varDelta} M(t)= M(t)-M(t-1) $$
(1)
and
$$ M^{\prime}(t)=\{(M(t)-M(t-1))+0.5*(M(t+1)-M(t-1))\}/2. $$
(2)
Here ΔM is the first difference of the time series M, and \(M^{\prime }\) approximates its local derivative (Keogh and Pazzani, 2001); unlike M, these variants are not restricted to be non-negative.
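These two variants can be computed directly from the mobility series; a sketch (our own code, with Eq. 2 applied at interior time points only):

```python
import numpy as np

def delta_mobility(M):
    """First difference (Eq. 1): dM(t) = M(t) - M(t-1)."""
    M = np.asarray(M, dtype=float)
    return M[1:] - M[:-1]

def local_derivative(M):
    """Keogh-Pazzani local derivative estimate (Eq. 2):
    M'(t) = {(M(t) - M(t-1)) + 0.5 * (M(t+1) - M(t-1))} / 2,
    defined at interior points t = 1, ..., len(M) - 2."""
    M = np.asarray(M, dtype=float)
    return ((M[1:-1] - M[:-2]) + 0.5 * (M[2:] - M[:-2])) / 2.0
```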
Representing a city as a discrete set of points
With the above definitions, the temporal relationship between mobility (and its variants) and new cases of each city in our data can be depicted as triplets (\(M/{\varDelta } M/M^{\prime }\), N, t). We represent each time series by a normalized ranking of the variables, so that each city becomes a discrete set of points in the unit cube \([0,1]^{3}\). This normalized ranking is frequently used as an estimator for empirical copulas with good convergence properties (Deheuvels, 1980). Each of the three mobility metrics yields a different representation of the cities, and hence a potentially different grouping of them. A comparative analysis of all groupings can reveal the correlation structure between groups of cities from these different perspectives.
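A sketch of this rank-based representation (using `scipy.stats.rankdata`; the function name is ours):

```python
import numpy as np
from scipy.stats import rankdata

def city_to_unit_cube(mobility, new_cases):
    """Map a city's (mobility, new cases, time) triplets to points in the
    unit cube [0,1]^3 via normalized ranks (the empirical-copula estimator)."""
    n = len(mobility)
    t = np.arange(1, n + 1)
    coords = [mobility, new_cases, t]
    # rank each coordinate, then divide by n so every value lies in (0, 1]
    return np.column_stack([rankdata(np.asarray(c, float)) / n for c in coords])
```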
Comparing cities using optimal transport
To compare the temporal dependence between mobility and new cases across pairs of cities, we used the Wasserstein distance from optimal transport theory. We computed the Wasserstein distance between the two discrete sets of points in the unit cube corresponding to two cities as the minimum cost of transforming the discrete distribution of one set of points into the other. It can be computed without intermediate steps, such as fitting kernel densities or arbitrary binning, that can introduce noise and artefacts into the data.
The Wasserstein distance between two distributions on a given metric space is conceptualized as the minimum “cost” of transporting, or morphing, one pile of dirt into another – the so-called ‘earth mover’s distance’. This “global” minimization over all possible ways to morph takes into consideration the “local” cost of morphing each grain of dirt across the piles (Peyré et al. 2019).
Given a metric space \(\mathcal M\), the Wasserstein distance measures the minimal cost of optimally transporting a probability measure μ defined over \(\mathcal M\) to turn it into ν:
$$ W_{p}(\mu,\nu)=\left( \inf_{\lambda \in \tau(\mu,\nu)} \int_{\mathcal M \times \mathcal M} d(x,y)^{p} \, d\lambda (x,y)\right)^{1/p}, $$
(3)
where p ≥ 1 and τ(μ,ν) denotes the collection of all measures on \(\mathcal M\times \mathcal M\) with marginals μ and ν. The intuition and motivation for this metric come from the optimal transport problem, a classical problem in mathematics first introduced by the French mathematician Gaspard Monge in 1781 and later formalized in a relaxed form by Leonid Kantorovich in 1942. More recently, the use of Wasserstein distances (also known as Earth Mover's Distances) in machine learning has highlighted the advantages of cross-bin distances between histograms, especially in computer vision (Rubner et al. 2000). Here, we used the Wasserstein distance to cluster temporal dynamics because it preserves the overall geometry of the compared distributions without being sensitive to small variations or “wiggles” therein.
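For two cities observed over the same number of days, the point sets have equal size and uniform weights, so the discrete optimal transport problem reduces to an optimal assignment; a sketch under that assumption (our own code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_p(X, Y, p=2):
    """p-Wasserstein distance between two equal-size discrete point sets
    X and Y (each n x d, uniform weights 1/n): solve the optimal
    assignment on the matrix of pairwise costs d(x, y)^p."""
    C = cdist(X, Y) ** p                    # pairwise Euclidean costs
    rows, cols = linear_sum_assignment(C)   # optimal one-to-one matching
    return C[rows, cols].mean() ** (1.0 / p)
```

For unequal-size sets or non-uniform weights, a general solver such as `ot.emd2` from the POT library can be used instead.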
Clustering the cities
Upon computing the optimal-transport-based distance for each pair of cities, hierarchical clustering of the cities was performed using Ward's minimum variance method (Nielsen, 2016). For the three variants of mobility (\(M/{\varDelta } M/M^{\prime }\)), we obtained three different hierarchical clusterings: HC1, HC2 and HC3, respectively. Given a dendrogram and a prescribed number of clusters k, we can “extract” from the dendrogram a flat partition of the data into k clusters using dynamic programming (Nielsen, 2016). The dendrogram is drawn in the plane using the height function arising from the linkage function. A typical cut consists of finding a height h such that the line y = h crosses exactly k tree edges of the dendrogram. A “best cut” (e.g., one minimizing the sum of cluster variances, as in k-means) can be computed efficiently by dynamic programming, yielding an x-monotone polyline that cuts the embedded dendrogram at k locations (Nielsen, 2016).
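A sketch of this step with SciPy, assuming `D` is the precomputed city-by-city Wasserstein distance matrix (note that Ward linkage on non-Euclidean distances is a common heuristic rather than the textbook setting):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_cities(D, k):
    """Ward hierarchical clustering from a symmetric pairwise distance
    matrix D, cut into k flat clusters; returns labels in 1..k."""
    Z = linkage(squareform(D, checks=False), method="ward")
    return fcluster(Z, t=k, criterion="maxclust")
```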
Comparing the clusterings
The resulting clusterings are compared using a visualization tool called HCMapper (Marti et al. 2015). HCMapper compares a pair of dendrograms from two different hierarchical clusterings computed on the same dataset. It aims to find clustering singularities between two models by displaying multiscale partition-based layered structures. The three clustering results are compared with HCMapper to seek out structural instabilities of the clustering hierarchies. In particular, the display graph of HCMapper has n columns, where n is the number of hierarchies to compare (here n = 3). Each column consists of the same number of flat clusters, depicted as rectangles within the column. The size of a rectangle is proportional to the number of cities in the cluster, while an edge between two clusters depicts the number of cities they share. Thus, a one-to-one mapping between the clusters of two columns indicates highly similar clusterings, whereas many edges crossing between two columns indicate dissimilar structures.
We also checked the spatial homogeneity of a clustering in terms of the average number of clusters to which the cities of each state were assigned, over all states represented in our data. Moran's I statistic was also computed to assess the spatial correlation among the cluster labels.
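Moran's I for a vector of values x under a spatial weight matrix W can be sketched as follows (our own implementation; applying it to categorical cluster labels requires a numeric encoding, e.g., one indicator variable per cluster):

```python
import numpy as np

def morans_i(x, W):
    """Moran's I: (n / sum(W)) * sum_ij W_ij z_i z_j / sum_i z_i^2,
    with z = x - mean(x). Values near +1 (-1) indicate strong positive
    (negative) spatial autocorrelation; the expectation under no
    spatial correlation is -1/(n-1)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    n = len(x)
    return (n / W.sum()) * (W * np.outer(z, z)).sum() / (z ** 2).sum()
```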
Summarizing the distinctive cluster patterns
We summarize the overall pattern of each identified cluster by computing its barycenter in Wasserstein space. The barycenter efficiently describes the underlying temporal dependence between the mobility measure (here we use \(M^{\prime }\)) and disease incidence within each cluster. Wasserstein distances have several important theoretical and practical properties (Pele and Werman, 2009; Villani, 2008). Among these, the barycenter in Wasserstein space is an appealing concept that has already shown high potential in applications across artificial intelligence, machine learning and statistics (Benamou et al. 2015; Carlier et al. 2015; Cuturi and Doucet, 2014; LeGouic and Loubes, 2017).
A Wasserstein barycenter (Agueh and Carlier, 2011; Cuturi and Doucet, 2014) of N measures ν1,…,νN in \(P(\mathcal M)\) is defined as a minimizer of the function f over \(P(\mathcal M)\), where
$$ f(\mu)=\frac{1}{N}\sum\limits_{i=1}^{N} {W_{p}^{p}}(\nu_{i},\mu). $$
(4)
A fast algorithm (Cuturi and Doucet, 2014) minimizes the sum of optimal transport distances from one measure (the variable μ) to a set of fixed measures using gradient descent, with the gradients computed by matrix scaling algorithms at a considerably lower computational cost. We used the method proposed in Cuturi and Doucet (2014), as implemented in the POT library (https://pythonot.github.io/), to compute the barycenter of each cluster.
Analysis of the clusters using static covariates
To characterize the composition of the identified clusters, i.e., what could explain the similarity in the temporal dependence between mobility and new cases among the cities of a cluster, we used several city-specific population covariates from the U.S. Census and CDC data, and checked their relative contributions to discriminating the clusters. These covariates include (a) date of stay-at-home order, (b) population size, (c) persons per household, (d) senior percentage, (e) Black percentage, (f) Hispanic percentage, (g) poor percentage, (h) population density in 2010, (i) SVI ses, (j) SVI minority, (k) SVI overall, and (l) the Gini index of income inequality (Farris, 2010). Here SVI stands for the CDC's Social Vulnerability Index, and “ses” for socioeconomic status. In addition, we compute the ‘reaction time’ (RT) of each city as the number of days between the city's stay-at-home order and a common reference starting date, taken as March 15, 2020.
This step also provided a form of external validation of the clustering results as none of the covariates were used for our unsupervised clustering. We demonstrated this step with the clustering results of HC3.
Using the covariates as features of the cities, a random forest classifier was trained to learn the cluster labels, the aim being to see how well the clustering can be explained by the covariates. To find which features contribute most to discriminating the clusters of cities, we computed mean Shapley values (Lundberg and Lee, 2017). A Shapley value quantifies the magnitude of a feature's impact on the classification task; ranking the covariates/features by their mean Shapley values identifies the most relevant ones.