This section introduces the models and results for data imputation, as well as the clustering model for patients included in the CDC surveys. The next two subsections cover: (1) imputation using correlations (Pearson and others) to fill in variables that can point to preventable diseases; and (2) a multivariate clustering model of patients based on all of the variables mentioned above. The goal of the clustering model is to group patients into multiple clusters for preventive recommendations and management.
Multivariate imputation by chained equation
Data wrangling is often said to account for about 80% of the effort in a data science lifecycle. Its most challenging aspect is data incompleteness. Incompleteness leads to data imbalance and, in turn, to biased model outputs. Incomplete data often create a Garbage-in Garbage-out (GIGO) situation; therefore, imputation is deployed. The most commonplace imputation methods are listed below. None of the four methods is useful in the case of CDPP; the reasoning is given in parentheses:
1. Providing means, modes, and medians of a column to replace missing data (averages across patients do not mean anything, because patients respond to diseases in unique ways). It is worth noting that, since NHANES data are cross-sectional, stratification is not possible. Such imputations are applied when monitoring patients longitudinally.
2. Providing means, modes, and medians of the data points surrounding the missing data, usually three or more on each side (same reason as in #1: patients who are adjacent in row order are not closer health-wise).
3. Using correlations with a few predefined columns to derive new values (mostly used in small datasets; this study uses more than 1,000 columns, which makes it impossible to single out one or a few columns for correlation, so all columns ought to be used instead).
4. Deleting rows with missing data (deleting rows, i.e. patients, leads to bias problems, so we avoid this option).
Instead, we introduce a different process for data imputation of CDPP variables. The method used includes six main steps:
1. Stream data from the SQL database into the R environment and split the data into learning and testing sets. The learning data are used for all steps except Step #6.
2. Measure correlations between every column and all other columns to find the most highly correlated columns for every CDPP variable.
3. Apply Pearson correlations for numerical columns, Analysis of Variance (ANOVA) for pairs of one numerical and one categorical column, and Cramér's V coefficients for categorical columns:
   (a) Cramér's V is used when both x and y are nominal. The result is a decimal value between 0 and 1. The formula is $\phi = \sqrt{\chi^2 / \left(N (K - 1)\right)}$, where $\phi$ denotes Cramér's V, $\chi^2$ is the Pearson chi-square statistic, $N$ is the sample size involved in the test, and $K$ is the lesser number of categories of either variable.
   (b) ANOVA is used when x is numeric and y is nominal (or vice versa). The result is a decimal value between 0 and 1.
4. Pass the top 10% of correlations for every column as inputs to the Multivariate Imputation by Chained Equation (MICE) model, which is available as an R library. The MICE model then uses the columns that have a correlation > 0.5.
5. Run through all columns with missing CDPP data and impute data points based on the most highly correlated columns and MICE.
6. Validate the imputed CDPP data against the actual data (using the testing dataset from Step #1).
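Steps #2 to #4 can be illustrated with a small sketch. The study itself was carried out in R; the Python below is only an illustration, and every function name in it is hypothetical. It assumes that the ANOVA-based score is the eta-squared effect size (the text says only that the result lies between 0 and 1), that each nominal variable has at least two categories, and that "top 10%" means the highest-ranked tenth of candidate columns for a given target.

```python
import math
from collections import Counter

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cramers_v(x, y):
    """Cramer's V for two nominal sequences: sqrt(chi2 / (N * (K - 1)))."""
    n = len(x)
    rows, cols, cells = Counter(x), Counter(y), Counter(zip(x, y))
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (cells[(r, c)] - expected) ** 2 / expected
    k = min(len(rows), len(cols))  # lesser number of categories
    return math.sqrt(chi2 / (n * (k - 1)))

def eta_squared(groups, values):
    """Assumed ANOVA score: between-group sum of squares / total sum of squares."""
    grand = sum(values) / len(values)
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(len(v) * (sum(v) / len(v) - grand) ** 2
                     for v in by_group.values())
    ss_total = sum((v - grand) ** 2 for v in values)
    return ss_between / ss_total

def select_predictors(scores, top_fraction=0.10, threshold=0.5):
    """Keep the top fraction of columns by |score|, then those above threshold."""
    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return [name for name, s in ranked[:keep] if abs(s) > threshold]
```

Here `scores` would map each candidate column name to its association with one CDPP target column, computed with whichever of the three measures matches the pair of column types.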
A sample code from the MICE algorithm used for this study is as follows (used in Steps #4 and #5):
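The sketch below is not the study's R code. It is a minimal Python illustration of the chained-equation idea behind Steps #4 and #5, under strong simplifying assumptions: all columns are numeric, each incomplete column is regressed on a single predictor column, and the fill-in is deterministic. The `mice` R package, by contrast, draws imputations stochastically and produces several completed datasets. The names `fit_line`, `chained_impute`, and `predictor_of` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx

def chained_impute(data, predictor_of, n_iter=10):
    """Fill None entries column by column, cycling n_iter times.

    data:         {column name: [float or None, ...]}
    predictor_of: {incomplete column: column used to predict it}
    """
    # Remember which cells were missing in the original data.
    missing = {c: [i for i, v in enumerate(vals) if v is None]
               for c, vals in data.items()}

    # Start from a simple fill: the observed mean of each column.
    filled = {}
    for c, vals in data.items():
        observed = [v for v in vals if v is not None]
        mean = sum(observed) / len(observed)
        filled[c] = [mean if v is None else v for v in vals]

    # Chained step: refit each column's model on rows where the target was
    # observed, using the current fill-ins of its predictor, then re-impute.
    for _ in range(n_iter):
        for target, predictor in predictor_of.items():
            if not missing[target]:
                continue
            gaps = set(missing[target])
            rows = [i for i in range(len(filled[target])) if i not in gaps]
            slope, intercept = fit_line([filled[predictor][i] for i in rows],
                                        [filled[target][i] for i in rows])
            for i in missing[target]:
                filled[target][i] = intercept + slope * filled[predictor][i]
    return filled

# Toy example: y is exactly 2 * x, with one gap in each column.
toy = {"x": [1.0, 2.0, 3.0, 4.0, None], "y": [2.0, 4.0, None, 8.0, 10.0]}
completed = chained_impute(toy, {"x": "y", "y": "x"})
```

In the study's setting, the predictors for each column would be the ones that survive the correlation filter of Step #4 rather than a single hand-picked column.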
As an example, the 6-step process is applied to imputing missing longitudinal smoking data. Smoking is known to be one of the most common risk factors for many chronic diseases; therefore, completeness of these data is critical to providing health recommendations. The following successful results are produced:
The imputation error rate is 0.071, which is deemed to be very low (result #3).
The process produced similar statistical distributions for the predicted data and the actual data.
Top correlations for disease and smoking data are identified (Fig. 11), with low error rates. Such variables aid medical teams in identifying parameters for preventive healthcare among the subgroup of the population taking the NHANES survey. This breaks SMQ down into specific practices and can aid in making a quick change. For example, cigarette filter type appears to have a strong effect on a patient's smoking numbers (correlation = 77%). Cigarette length and tar content are two other high-effect variables (result #4).
The listed measures (survey questions) serve as preventive pointers for smokers, helping them avoid certain smoking-related diseases.
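The text does not state how the error rate in result #3 was computed. As one possible reading of Step #6, the sketch below assumes it is the share of held-out values whose imputed category differs from the actual one, and it adds a simple distance between the actual and predicted category distributions. The function names and the toy smoking categories are hypothetical.

```python
from collections import Counter

def error_rate(actual, predicted):
    """Fraction of held-out values where the imputed value differs from the actual one."""
    mismatches = sum(1 for a, p in zip(actual, predicted) if a != p)
    return mismatches / len(actual)

def category_shares(values):
    """Relative frequency of each category."""
    counts = Counter(values)
    return {k: counts[k] / len(values) for k in counts}

def distribution_gap(actual, predicted):
    """Total variation distance between the two category distributions (0 = identical)."""
    pa, pp = category_shares(actual), category_shares(predicted)
    return 0.5 * sum(abs(pa.get(k, 0.0) - pp.get(k, 0.0))
                     for k in set(pa) | set(pp))

# Toy hold-out comparison (values invented for illustration only).
actual = ["never", "daily", "daily", "former"]
predicted = ["never", "daily", "former", "former"]
rate = error_rate(actual, predicted)
gap = distribution_gap(actual, predicted)
```

A low error rate together with a small distribution gap would correspond to the two outcomes reported above: few mispredicted values and similar distributions of predicted and actual data.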
The 6-step imputation process for CD data can provide more detailed pointers to preventive sub-variables (such as the cigarette filter example). Another example concerns the missing clinical periodontal measures used to estimate the prevalence of periodontitis. Periodontitis is a host-inflammatory oral disease characterized by lengthy exposure to pathogens. For periodontitis, the imputations produced group distributions similar to those in the actual data. See Fig. 12 for a visual comparison.
Periodontitis in mild, moderate, or severe form affects over 64 million Americans today, i.e. above 42% of the US population aged 30–79. The disease is diagnosed and monitored largely through clinical parameters; thus, where resources are limited or clinical examination is deemed costly or inaccessible, imputation can serve as a first pass to identify patients who require further screening or preventive or interventional measures. Periodontitis can be prevented and managed; therefore, a data-driven approach can offer a state-of-the-art ML method to apply to a population, i.e. to a large number of individuals. Here, this was implemented on missing NHANES periodontal data for 2015 onward. When comparing actual vs. predicted imputation results for periodontitis, and comparing the most highly correlated variables between both sets, many variables ended up with similar correlations, such as OHDPDSTS dentition status, OHXCJID dentition status, and OXHXPCM dentition status. Figure 13 shows the five variables with the highest correlations.
The filters applied to the dental variables in the study are illustrated in Fig. 14 (showing the criteria for collecting periodontal data in NHANES). Only complete and partial dental tests are included; tests that were not done or are missing are not considered in the imputations or the correlation tests.
Refer to Corr Order #, for instance; it has filters 1 and 2. Corr#1 in the actual data is the highest-correlation variable (0.972), corr#2 is the second highest (0.612), and so on. Similarly, in the predicted values, corr#1 is equal to 0.973 and corr#2 is equal to 0.079 (result #5).
The goals of the periodontitis example are to present initial pointers to the disease and to point to factors that would enable preventive dental care (providing focused dental cleaning and cures). The 6-step imputation model presented in this section provides completeness and balance to the datasets. However, to place a patient into a group of similar patients, a clustering model is required; it is presented next.
The k-modes clustering algorithm is an extension of the well-known k-means clustering model. Instead of distances, the k-modes model is based on dissimilarities (that is, a count of the total mismatches between two data points: the smaller this number, the more similar the two objects). K-prototype (where k is the number of clusters) is used for clustering data with both numerical and categorical values. It is a simple combination of k-means and k-modes. K-prototype has the following steps:
Select k initial prototypes from the dataset X.
Choose the number of clusters (Fig. 15 illustrates the Elbow diagram for recommendations of k).
Allocate each data point in X to a cluster whose prototype is the nearest.
Retest the similarity of objects against the current prototypes. If the algorithm finds that an object is allocated such that it is nearest to another cluster prototype, it updates the prototypes of clusters.
Repeat step #3, until no object changes its cluster (after fully testing X).
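The steps above can be sketched in a few lines of Python. This is an illustrative batch variant, not the study's implementation: prototypes are recomputed after each full pass over the data rather than after each individual reallocation, the first k points serve as initial prototypes, and the weight `gamma` that balances categorical mismatches against numeric distance is an assumed parameter. Each point is a pair of tuples, one numeric and one categorical.

```python
from collections import Counter

def dissimilarity(point, proto, gamma=1.0):
    """Squared Euclidean distance on numeric parts plus gamma * categorical mismatches."""
    (num_a, cat_a), (num_b, cat_b) = point, proto
    numeric = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    mismatches = sum(1 for a, b in zip(cat_a, cat_b) if a != b)
    return numeric + gamma * mismatches

def update_prototype(members):
    """Prototype = per-attribute mean (numeric) and mode (categorical)."""
    nums = [m[0] for m in members]
    cats = [m[1] for m in members]
    mean = tuple(sum(col) / len(col) for col in zip(*nums))
    mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cats))
    return (mean, mode)

def k_prototypes(points, k, gamma=1.0, max_iter=100):
    prototypes = list(points[:k])  # assumption: first k points as initial prototypes
    labels = None
    for _ in range(max_iter):
        # Allocate every point to the cluster with the nearest prototype.
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(p, prototypes[j], gamma))
            for p in points
        ]
        if new_labels == labels:  # no object changed its cluster
            break
        labels = new_labels
        # Update each prototype from its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                prototypes[j] = update_prototype(members)
    return labels, prototypes

# Toy data: (numeric attributes, categorical attributes), invented for illustration.
toy = [((1.0,), ("a",)), ((10.0,), ("b",)), ((1.2,), ("a",)), ((9.8,), ("b",))]
labels, prototypes = k_prototypes(toy, k=2)
```

In practice the numeric attributes would usually be standardized first, since otherwise a variable with a large numeric range dominates the mismatch term.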
A variety of cluster descriptions can be drawn from the model developed. The model includes 519 variables; we used only columns in which more than 10% of the values were present (that is, fewer than 90% nulls), because columns dominated by null values could skew the model. The clusters can be described differently depending on the application and its purpose. For example, they can be defined based on smoking habits and blood pressure, but they will be categorized differently if they are defined by chronic diseases. The seven CDs collected in our results (a total of 1711 cases) are:
Hypertension (477 cases)
Diabetes (188 cases)
Arthritis (373 cases)
Cancer (111 cases)
Asthma (173 cases)
Coronary Disease (41 cases)
Periodontitis (348 cases)
The best k-model has 11 clusters (based on the elbow method and clinical heterogeneity). It is worth mentioning that cluster 5 is the "healthy cluster". Cluster 8 contains patients who do not have high cholesterol levels (6 mmol/L or higher). Clusters 3 and 11 have the highest probability of oral health disease (i.e. periodontitis); a sample of more than 300 patients had periodontitis. As one can notice, clusters 3, 7, and 11 are the least healthy, while clusters 5, 6, and 8 are the healthiest. Figure 16 presents the distribution of clustering results by periodontitis projections. More importantly, Table 2 presents the counts of CD patients within every cluster (result #6).
Most chronic pro-inflammatory conditions share common risk factors, such as smoking and co-existence with other systemic diseases. In the case of periodontitis, type II diabetes mellitus is a chief risk factor for exacerbation of the disease. The two diseases have even been claimed to have a bi-directional (two-way) relationship. Periodontitis is the sixth most common condition in diabetic patients, and susceptibility to periodontitis is around three times higher in diabetic patients. Evidence suggests that managing or preventing one condition could help alleviate the other. Thus, treating or preventing periodontitis may improve the status of other chronic conditions.
Once a patient is assigned to a health group, a primary care physician can get immediate expectations about the patient's health and about which preventive tests the patient might need. The multiple factors of a CD can be rather time-consuming for most physicians to collect, so our ML model helps by considering hundreds of variables comprehensively. We aim to experiment with this model at small-scale facilities first, such as a university healthcare facility or a clinic. Further clustering results and code are available to researchers upon request from the authors.
Validating unsupervised models is an intricate task. For the clustering model, we applied three main means of evaluation: (1) Relative clustering validation, which evaluates the clustering model by varying parameter values for the same algorithm. In this study, we evaluated different numbers of clusters (k); as part of future work, we would like to test including and excluding other healthcare variables and observe the changes in the model outputs. (2) External clustering validation, which compares the results of a cluster analysis to an externally known result, such as externally provided NHANES clusters (which are not provided by NHANES). Since we know the best cluster number in advance (k = 11 or k = 5), this approach is mainly used for selecting the right clustering algorithm. (3) Internal clustering validation, which uses the internal workings of the clustering process without reference to external knowledge. This type aims to minimize the distance between data points in the same cluster and maximize the distance between points in different clusters. Different diseases require different measures; therefore, a breakdown of CD patients within every cluster is presented in Table 3 (result #7).
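The text does not name the internal validation index that was used. As one common choice that matches the description (small distances within a cluster, large distances between clusters), the sketch below computes a mean silhouette score for an arbitrary dissimilarity function. The function name `silhouette` and the toy data are hypothetical.

```python
def silhouette(points, labels, dist):
    """Mean silhouette score: near 1 = compact, well-separated clusters; below 0 = poor."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean dissimilarity from point i to the other members of each cluster.
        mean_to = {}
        for c in clusters:
            ds = [dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            if ds:
                mean_to[c] = sum(ds) / len(ds)
        own = labels[i]
        others = [m for c, m in mean_to.items() if c != own]
        if own not in mean_to or not others:
            scores.append(0.0)  # singleton cluster or only one cluster present
            continue
        a = mean_to[own]   # cohesion: mean distance within own cluster
        b = min(others)    # separation: mean distance to nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy one-dimensional example with an absolute-difference dissimilarity.
values = [0.0, 0.1, 10.0, 10.1]
abs_diff = lambda x, y: abs(x - y)
good = silhouette(values, [0, 0, 1, 1], abs_diff)
bad = silhouette(values, [0, 1, 0, 1], abs_diff)
```

For mixed numerical and categorical data, the `dist` argument could be the same dissimilarity used by the clustering model, so that cohesion and separation are measured on the scale the model optimizes.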
As noted in the table, if a patient ends up in cluster #3, for instance, it is safe to assume that their health parameters are very similar to those of a group of other patients with coronary disease or asthma. Likewise, if a patient belongs to cluster #5 or #8, their health parameters are similar to those of a group of healthy patients. The next section presents conclusions and implications for healthcare policy.