Introduction

Drug utilisation research traditionally involves characterising pre-defined patient features, usually related to the indications and contraindications of a drug or drug class, for regulatory purposes and post-marketing surveillance. This approach is hypothesis-driven and may fail to identify particular sub-groups of drug users that could be of interest for further studies on the risk-benefit of the drugs.

For some time, it has been possible to use information extracted from electronic health records (EHRs) and claims data for drug utilisation research. EHRs are data-rich and, crucially, represent real-world use of the drugs in question within the community. In parallel, recent advances in data science have led to the development of techniques that can mine these large-scale data and enable data-driven analysis. This approach has the potential to capture patterns that may not have been considered in a hypothesis-driven setting and therefore reveal more about the actual use of drugs in the real world.

EHR mining has been studied widely [1, 2], and a range of techniques are used, including data mining [3], natural language processing [4] and, more recently, deep learning [5••]. Most studies use a two-step approach of phenotyping and discovery [6], shown in Fig. 1.

Fig. 1

The process of analysis of electronic health record (EHR) data for phenotyping and discovery. In the first stage, raw patient data are analysed to extract meaningful features. These features are used to build models (e.g. cluster analysis, classification and prediction models) in the second stage.

Phenotyping and Discovery

EHR data can be complex in terms of both volume (number of patient records) and variety (types of information stored). A typical EHR database can include patient information such as clinical diagnosis codes, demographic data, laboratory and imaging results, vital signs and free-text notes. The first step in analysing EHR data is phenotyping, which involves selecting clinically relevant features from the raw data (Fig. 1). Once a set of features has been extracted, knowledge discovery can follow. The features are analysed at a population or patient level to determine whether there are sub-groups (cluster analysis), or whether they can be used to derive new diagnostic criteria (classification) or to determine prognosis (prediction).

Machine learning methods have recently been used in EHR mining [5••, 7]. In particular, unsupervised learning methods have shown promise for feature selection [8••, 9••, 10] before clustering and prediction tasks.

In this paper, we explain feature selection and cluster analysis in the context of EHR-based drug utilisation research. We illustrate the methods using a case study on anti-osteoporosis drugs.

Anti-osteoporosis drugs are commonly prescribed as preventative therapies to patients at risk of fragility fractures [11]. The case study is based on data from the SIDIAP database (www.sidiap.com), which provides anonymised electronic general practice-level data from Catalonia (Spain) [12].

As a first step, it is important to characterise the study population with respect to key variables that are relevant in the clinical context. In our case study, which included 37,996 patients using anti-osteoporosis drugs, 32 variables were extracted for each patient, including clinical diagnosis codes, lab tests and demographics. The characteristics are summarised in Table 1.

Before any analysis, the data can be examined for obvious correlation structure. In our example, an initial assessment did not suggest strong correlations (defined here as a Pearson correlation > 0.6 for any variable pair). Another statistic for assessing grouping structure is the Hopkins statistic [13]; a value of 0.44 indicated that the data were distributed in a random manner.
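As a rough illustration, the Python sketch below computes pairwise Pearson correlations and a Hopkins statistic for a patient-by-variable matrix. The DataFrame `df`, the synthetic data and the sampling parameters are hypothetical, and conventions for the Hopkins statistic vary between implementations, so this is a sketch rather than the exact procedure used in the case study.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=200, seed=0):
    """Hopkins statistic: values near 0.5 suggest randomly distributed data,
    values near 1 suggest clusterable structure (conventions vary)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(m, n - 1)
    # m uniform points drawn from the bounding box of the data
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    # m points sampled without replacement from the data itself
    sample = X[rng.choice(n, size=m, replace=False)]
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()  # uniform point -> nearest data point
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]     # data point -> nearest *other* data point
    return u.sum() / (u.sum() + w.sum())

# df is a hypothetical DataFrame with one row per patient and one column per variable.
df = pd.DataFrame(np.random.binomial(1, 0.3, size=(1000, 32)))

corr = df.corr(method="pearson").abs()
np.fill_diagonal(corr.values, 0.0)                         # ignore self-correlation
print("strongly correlated pairs:", int((corr > 0.6).to_numpy().sum() // 2))
print("Hopkins statistic:", hopkins(df.to_numpy(dtype=float)))
```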

Table 1 Prevalence of features in the dataset*

Feature Selection and Cluster Analysis

When performing cluster analysis, an underlying assumption is that there are sub-groups or clusters within the population (in our example, sub-groups of anti-osteoporosis drug users). The “true” number and nature of these clusters are unknown, which makes this an unsupervised learning problem. Therefore, expert knowledge is often used as a basis for determining the number of clusters (k) to be derived. We also do not know which of the variables (and combinations thereof) are most characteristic of the population and can thus be considered as features. Our task is first to identify the most characteristic variables of the given population (feature selection) and then to learn the structure of the k clusters based on these features (cluster analysis).

Feature Selection

The autoencoder [14] is an unsupervised learning algorithm for feature selection using unlabelled data. It is a feedforward neural network; in its simplest form, its architecture comprises an input layer that feeds into a hidden layer, which in turn feeds into an output layer. Consider a D-dimensional dataset \( X=\left\{{x}_1,{x}_2,\dots, {x}_D\right\} \), where D is the number of variables, presented at the input layer. The autoencoder attempts to reconstruct X at the output layer; in other words, it models the identity function f(x) = x [15]. To do so, the hidden layer is forced to learn a compressed, weighted representation of the data X presented at the input layer, which is then reconstructed at the output layer as \( \hat{X} \). The autoencoder is suitable for tasks such as dimensionality reduction and feature selection because it produces this compressed representation of the data.

The learning process depends on the architecture of the autoencoder (the number of nodes in the hidden layer) and the sparsity parameter ρ, which enables the compressed representation. The optimal architecture is the one for which these two parameters result in the smallest root-mean-square reconstruction error (RMSE) between X and \( \hat{X} \).
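The sketch below illustrates such an autoencoder in Python using Keras. The hidden-layer size J, the L1 activity penalty (used here as a stand-in for the sparsity parameter ρ, whose exact formulation may differ in the original study) and the synthetic binary data are all assumptions made for the purpose of the example.

```python
import numpy as np
import tensorflow as tf

D, J = 32, 5                       # number of variables and hidden nodes (example values)

inputs = tf.keras.Input(shape=(D,))
# Hidden (encoding) layer; the L1 activity penalty stands in for the
# sparsity constraint that the text denotes by rho.
hidden = tf.keras.layers.Dense(
    J, activation="sigmoid",
    activity_regularizer=tf.keras.regularizers.l1(1e-4))(inputs)
outputs = tf.keras.layers.Dense(D, activation="sigmoid")(hidden)    # reconstruction X_hat

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.binomial(1, 0.3, size=(1000, D)).astype("float32")    # stand-in binary data
autoencoder.fit(X, X, epochs=50, batch_size=64, verbose=0)          # learn f(x) = x

rmse = float(np.sqrt(autoencoder.evaluate(X, X, verbose=0)))        # reconstruction error
W = autoencoder.layers[1].get_weights()[0]                          # D x J weight matrix [w_dj]
```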

\( {w}_{dj} \), the weight assigned to the dth variable at the jth node of the hidden layer, can be used to generate a measure of the “importance” of that variable in the reconstruction of the dataset, where d ∈ ℤ: 1 ≤ d ≤ D and j ∈ ℤ: 1 ≤ j ≤ J.

It is not necessarily straightforward to interpret the weights at the hidden layer in combination. In our example, we took a simple approach. The weight of variable d at a given node j signified its importance in activating that node: the greater the weight, the more important the variable was for the activation. We therefore considered the average weight of each variable across all J nodes, \( {\hat{w}}_d=\frac{1}{J}{\sum}_{j=1}^J{w}_{dj} \). A variable with a low \( {\hat{w}}_d \) would have less importance than a variable with a higher \( {\hat{w}}_d \). A selection threshold can then be defined such that variables with average weights above the threshold are considered selected features.

In our example, a two-layer autoencoder was constructed with D-dimensional input and output layers and one hidden layer containing J nodes. To identify the number of hidden nodes, J, and the sparsity parameter, ρ, of the optimal autoencoder, we performed a grid search over 1 < J < 20 and 0.001 < ρ < 0.995. For the optimal model, \( {\hat{w}}_d \) was estimated for each variable, and variables with \( {\hat{w}}_d > 0.5 \) were considered features of the dataset.
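A hedged sketch of this grid search and thresholding is shown below, reusing the autoencoder structure from the previous sketch. The candidate penalty values (standing in for ρ), the use of absolute weights when averaging, and the synthetic data are assumptions, not the exact settings of the case study.

```python
import numpy as np
import tensorflow as tf

def build_autoencoder(D, J, penalty):
    """Two-layer autoencoder with J hidden nodes; `penalty` stands in for rho."""
    inputs = tf.keras.Input(shape=(D,))
    hidden = tf.keras.layers.Dense(
        J, activation="sigmoid",
        activity_regularizer=tf.keras.regularizers.l1(penalty))(inputs)
    outputs = tf.keras.layers.Dense(D, activation="sigmoid")(hidden)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

X = np.random.binomial(1, 0.3, size=(1000, 32)).astype("float32")   # stand-in data
D = X.shape[1]

best_rmse, best_model = np.inf, None
for J in range(2, 20):                          # candidate hidden-layer sizes
    for penalty in (1e-4, 1e-3, 1e-2):          # candidate sparsity penalties
        model = build_autoencoder(D, J, penalty)
        model.fit(X, X, epochs=30, batch_size=64, verbose=0)
        rmse = float(np.sqrt(model.evaluate(X, X, verbose=0)))
        if rmse < best_rmse:
            best_rmse, best_model = rmse, model

W = best_model.layers[1].get_weights()[0]       # D x J matrix of weights w_dj
w_bar = np.abs(W).mean(axis=1)                  # average (absolute) weight per variable
selected = np.where(w_bar > 0.5)[0]             # indices of the selected features
```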

It is useful to apply feature selection to different subsets of the data to assess whether the results differ with the number of input variables. In our example, we applied feature selection to a subset of the data containing 12 variables that are considered risk factors in the osteoporosis literature and to the full dataset containing 32 variables.

Evaluating Feature Selection

As feature selection in unsupervised learning is purely data driven, it is often compared with other statistical approaches, and expert opinion is typically sought to evaluate the results. In our example, the features selected by the autoencoder were compared with those obtained using principal component analysis (PCA). PCA is another method commonly used for feature selection and dimensionality reduction; it ranks variables according to how well they explain the variability in the data.
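For comparison, a minimal PCA-based ranking might look like the sketch below. Ranking variables by their absolute loadings on the first principal component is one common convention and is an assumption here, not necessarily the rule used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.binomial(1, 0.3, size=(1000, 12)).astype(float)   # stand-in feature matrix

pca = PCA(n_components=2).fit(X)
loadings = np.abs(pca.components_[0])            # loadings on the first principal component
ranking = np.argsort(loadings)[::-1]             # variables ranked from most to least important
print("variables ranked by PCA loading:", ranking)
print("variance explained by the first two components:",
      round(float(pca.explained_variance_ratio_.sum()), 3))
```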

Independently, we polled experts to ask which variables they believed were taken into consideration by general practitioners when assessing someone’s risk of needing treatment for osteoporosis, i.e. were risk factors. The poll identified nine variables: age, female gender, obesity, smoking, alcohol use, comorbidity, steroid use, sedative use and fracture history. The expert-identified risk factors were used as the reference against which the features selected by the autoencoder were compared.

In the subsequent discovery stage, variables selected as features by the autoencoder were used in cluster analysis to derive sub-groups of anti-osteoporosis drug users.

Cluster Analysis

There are several algorithms for cluster analysis [16], and k-means [17] and hierarchical [18] clustering are two of the best-known and most commonly used. They produce “hard” clustering, in which each participant is assigned to exactly one cluster. It is also possible to produce “soft” clustering, in which a participant may belong to more than one cluster, each with a degree of membership. Gaussian mixture models and fuzzy c-means clustering can perform “soft” clustering [19]; however, this approach is often not applicable to data that are binary in nature.

Hierarchical Clustering

Recalling the dataset \( X=\left\{{x}_1,{x}_2,\dots, {x}_D\right\} \), we may consider the ith participant to be represented in the D-dimensional data space by \( {X}_i \), where i ∈ ℤ: 1 ≤ i ≤ n and n is the number of participants in the dataset. First, each participant is treated as a cluster of its own, so there are n clusters to begin with. Next, the closest clusters are merged, using a measure of closeness (e.g. the Euclidean distance in D-dimensional space). This merging continues until all of the participants have been merged into either a pre-specified number of clusters (k) or a single cluster.
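A minimal sketch of this agglomerative procedure using SciPy is shown below; the Ward linkage, the choice of k = 5 and the synthetic feature matrix are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.binomial(1, 0.3, size=(1000, 8)).astype(float)   # stand-in selected features

# Bottom-up merging of the closest clusters based on Euclidean distance (Ward linkage).
Z = linkage(X, method="ward", metric="euclidean")
labels = fcluster(Z, t=5, criterion="maxclust")                # cut the tree into k = 5 clusters
```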

k-means Clustering

First, the number of clusters k has to be pre-specified. Then k randomly selected points in the D-dimensional space are initialised as cluster centroids. Next, each participant is assigned to the nearest cluster centroid. As with hierarchical clustering, closeness may be calculated using, for example, the Euclidean distance (other distance measures, such as the Hamming, Mahalanobis and city-block distances, are also used). The position of each cluster centroid is then re-calculated based on the positions of the participants assigned to it. Participants continue to be re-assigned and centroids re-calculated until the positions of the k centroids no longer change.
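The corresponding k-means step is sketched below with scikit-learn: assignments and centroids are updated until they stabilise. Note that scikit-learn's KMeans uses Euclidean distance only; the city-block or Hamming variants mentioned above would require a different implementation. The feature matrix and k = 5 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.binomial(1, 0.3, size=(1000, 8)).astype(float)   # stand-in selected features

# k-means with k = 5: participants are assigned to the nearest centroid and
# centroids are re-estimated until the assignments no longer change.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = km.labels_                  # cluster assignment for each participant
centroids = km.cluster_centers_      # final centroid positions
```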

Internal Cluster Evaluation

The purpose of evaluation is to judge, in an objective way, how well the data fit within the candidate k clusters. This can be done by checking (a) whether individual clusters are homogeneous and (b) whether they are well separated from the other clusters. The optimal number of clusters \( \hat{k} \) is the value of k that best satisfies these criteria.

In the case study, in search of \( \hat{k} \), hierarchical and k-means clustering were performed on the dataset using the features selected by the autoencoder model. Based on expert opinion, it was reasonable to derive as many as 10 sub-groups of anti-osteoporosis drug users. Nevertheless, we set the candidate number of clusters from k = 2 through to k = 20. For each value of k, the clustering process was repeated 100 times, using a random sample of 1000 participants each time. To measure the closeness of points in the data space, we used the squared Euclidean and city-block distance measures. To evaluate the resulting clusters, we used standard criteria [20, 21], including the silhouette, Calinski-Harabasz (CH) and gap statistics.
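A sketch of this internal evaluation loop is given below for the silhouette and CH criteria (the gap statistic is not provided by scikit-learn and would require a separate implementation). The sub-sample size and number of repeats follow the description above, but the data and the code itself are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = np.random.binomial(1, 0.3, size=(5000, 8)).astype(float)    # stand-in feature matrix
rng = np.random.default_rng(0)

results = {}
for k in range(2, 21):                              # candidate numbers of clusters
    sil, ch = [], []
    for _ in range(100):                            # repeat on random sub-samples of 1000 patients
        idx = rng.choice(len(X), size=1000, replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        sil.append(silhouette_score(X[idx], labels))
        ch.append(calinski_harabasz_score(X[idx], labels))
    results[k] = {"silhouette": np.mean(sil), "CH": np.mean(ch)}
```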

External Cluster Evaluation

Internal evaluation alone is often not sufficient for evaluating clusters. External evaluation can be performed using information about the population that was not used as a feature in the cluster model. In our example, information on bone mineral density and incident hip fracture risk was available in the SIDIAP database. These are key proxies of osteoporosis risk and indication for anti-osteoporosis drug therapy. Since bone mineral density and hip fracture risk were not used in the model generation process, we could use them for evaluating the resulting clusters.

$$ \text{Fracture risk}=\frac{\text{number of fractures since start of study}}{\text{follow-up time, totalled over all persons}}\times 1000, $$

expressed per 1000 person-years (py).

The bone mineral density and fracture risk of the k clusters were examined.
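Computed per cluster, this incidence rate might be obtained as in the sketch below; the table `cohort` and its columns are hypothetical stand-ins for the per-patient follow-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical per-patient table: assigned cluster, fractures during follow-up,
# and follow-up time in years.
cohort = pd.DataFrame({
    "cluster":   np.random.randint(1, 6, size=1000),
    "fractures": np.random.poisson(0.05, size=1000),
    "years":     np.random.uniform(1, 10, size=1000),
})

totals = cohort.groupby("cluster")[["fractures", "years"]].sum()
totals["rate_per_1000py"] = 1000 * totals["fractures"] / totals["years"]   # per 1000 person-years
print(totals)
```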

Presenting the Results of Feature Selection and Cluster Analysis

The results of feature selection should ideally reflect how the model(s) used compare with expert opinion. The results of cluster analysis should not only be assessed in light of internal and external evaluation, but also by considering whether they are clinically plausible. We illustrate this using the results of our case study.

Feature Selection

The minimum error between the original and reconstructed datasets (RMSE = 0.08) was obtained when the number of hidden nodes in the autoencoder was set to J = 5 and the sparsity parameter was set to ρ = 0.5.

Selecting From 12 Variables

When selecting from the subset of 12 variables, the autoencoder assigned high weights (\( {\hat{w}}_d>0.5 \)) to 8 of the 12 variables (Fig. 2a). These 8 variables were the same variables independently identified by clinical experts as risk factors. The only clinically identified risk factor not ranked highly by the autoencoder was alcohol consumption.

Fig. 2

Weights assigned by the autoencoder to the variables in the datasets containing (a) 12 and (b) 32 variables. The dashed line represents the threshold of \( {\hat{w}}_d>0.5 \). A green circle indicates a feature selected by both expert opinion and the autoencoder. A red circle indicates a feature selected by expert opinion but not by the autoencoder. A blue circle indicates a feature not selected by the autoencoder in the 12-variable dataset but selected in the 32-variable dataset

Selecting From 32 Variables

When selecting from all 32 variables, 7 of the 8 variables were selected again; steroid use was no longer ranked highly (Fig. 2b). Three additional variables were selected: varicose veins, type 2 diabetes and cancer.

In both the 12-variable subset and the full 32-variable dataset, the ranking of the selected variables did not exactly match their prevalence in the dataset. For instance, obesity was the highest-ranking variable although it was present in only 34% of the population, whereas being female was the seventh highest-ranking variable. However, in both scenarios, the features with the highest prevalence in the dataset were all selected.

Comparison With PCA

The principal components used to rank the variables explained 19% of the variation in the dataset. When selecting from the 12-variable subset, the variable ranking by the autoencoder agreed with the variable ranking by PCA, with some exceptions. Obesity was ranked first by the autoencoder and second by PCA, and vice versa for comorbidity. Fracture history was ranked fourth by the autoencoder and seventh by PCA. Being female was ranked seventh by the autoencoder and fourth by PCA.

There was less agreement between PCA and the autoencoder when selecting from the 32-variable dataset.

Cluster Analysis and Evaluation

Number of Clusters

Figure 3 shows the results for k-means and hierarchical clustering applied to the features selected by the autoencoder model (with alcohol consumption added). In general, higher CH, gap and silhouette values indicate better within-cluster homogeneity and between-cluster separation.

Fig. 3

Internal evaluation of hierarchical (black) and k-means (blue) clustering solutions using the CH (Calinski-Harabasz) (left), gap (centre) and silhouette (right) criteria. Error bars (grey) show the standard deviation over 100 iterations. (Reprinted, with permission, from IEEE, Cluster Analysis to Detect Patterns of Drug Use from Routinely Collected Medical Data, June 1, 2018)

For both k-means and hierarchical clustering, CH decreased from k = 2 onward, suggesting \( \hat{k}=2 \). The smallest gap resulted at k = 2 for both k-means and hierarchical clustering and, as k increased, no clear elbow point was found to show where the gap was maximised. The silhouette criterion showed similar results to the gap criterion for k-means (indicating better clustering as k increased). However, for hierarchical clustering, it suggested that the clustering solution initially became worse as k increased from k = 2 to k = 4, stabilised from k = 5 and became worse again at k = 8. These results were based on the Euclidean distance. Similar results were obtained using the city-block metric (data not shown).

Figure 3 demonstrates that hierarchical and k-means clustering did not necessarily agree in their clustering solutions. It also shows that there was no clear elbow point or stable state in the evaluation curves. The error search methods [22] typically used to determine \( \hat{k} \) were therefore not appropriate here, and we cannot conclude that there was an obvious optimal number of clusters in our data. This highlights an important complexity of cluster analysis: it is not always straightforward to determine an optimal grouping. We regarded this as an unsurprising consequence of the challenging nature of our real-world EHR data and continued to examine the structure of the clusters.

Cluster Structure

Figure 4 shows the composition of the clusters obtained using hierarchical clustering. At k = 2, the main feature distinguishing the two clusters was gender. With increasing k, the predominantly female cluster was divided further, whereas the male cluster remained as it was. The figure demonstrates the changing composition of the clusters as they are further divided, up to k = 5. Table 2 summarises the characteristics of the five clusters. This characterisation often helps researchers to assign “names” to the derived clusters and to interpret them further; in our example, cluster 1 could be referred to as the male cluster, cluster 5 as elderly women with a fracture history, and cluster 4 as younger women with no fracture history.

Fig. 4

Distribution of the risk factors within each cluster obtained using hierarchical clustering. A feature can take values between 0 and 1, where, for example, female = 0 indicates that all of the participants in a cluster are male and female = 1 indicates that all are female. The clusters corresponding to k = 2, k = 3, k = 4 and k = 5 are shown in the panels from top to bottom, respectively. The label "charlson>2" corresponds to the risk factor "comorbid" and the label "over60" corresponds to the risk factor "elderly". (Reprinted, with permission, from IEEE, Cluster Analysis to Detect Patterns of Drug Use from Routinely Collected Medical Data, June 1, 2018)

Table 2 Prevalence of features in each of the five clusters detected using hierarchical clustering, where the minimum and maximum prevalence can be 0 and 1, respectively

For the external evaluation, the fracture risk of the cluster of elderly women with fracture history (cluster 5) was, as expected, the highest (10.5/1000 py) and their hip bone mineral density was the lowest (T score = −2.2). Clusters 1, 2 and 3 had fracture risks of 4, 6 and 6.5 per 1000 py, respectively. Cluster 4 had a lower fracture risk (1.5/1000 py) than even the general source population (2.23/1000 py) [23]. This also appears plausible, since this cluster comprised younger people with no previous fractures, and their hip bone mineral density was higher than that of cluster 5 (T score = −1.6). However, despite having a “healthy” hip T score, cluster 4 participants had an average spine bone mineral density low enough to be considered osteoporotic (T score = −2.7), which may explain why this seemingly healthy cluster was prescribed anti-osteoporosis drugs. In this manner, external evaluation can aid in the interpretation of the derived clusters and in judging their clinical plausibility.

Discussion and Conclusion

In this review article, we have explained and demonstrated the use of unsupervised machine learning methods for feature selection and cluster analysis of real-world EHR data for drug utilisation research. Although the case study presented had a limited number of variables, our intent was to show how the methods perform for both small and larger numbers of variables. A consistent set of features was selected regardless of the number of variables entered into the model.

Feature selection and cluster analysis are difficult to assess in the absence of a gold standard. In the case study, we were able to compare the selected features with clinical expert opinion. In reality, it might not always be possible to have a reference to compare against, which is precisely what makes the learning task unsupervised in the first place.

Internal evaluation of the results of clustering the dataset using the selected features exposed the difficulty in deriving an optimal number of clusters. It showed that extracting patterns from real-world data with complex underlying structures may require examining the clusters using information not included in the clustering model. Examining the bone mineral density and fracture risks for the detected groups, for instance, aided in understanding the structures of the clusters and demonstrated how cluster analysis can help to develop and characterise sub-group profiles.

When interpreting the results, it is important to consider whether the analysis was based on complete data, and if not, how any missing data were handled. In the example presented here, the derived clusters only represent the sub-population of anti-osteoporosis drug users who reported their data in full.

Another constraint when analysing EHR data is that some clinical variables are often recorded in a dichotomised fashion, e.g. smoking status is recorded as “yes” or “no”. If all variables are binary, the choice of cluster analysis methods to be used can become limited.

A final note on why one might conduct such an analysis: as demonstrated, feature selection can provide a starting point when it is not known a priori which of many variables should be chosen as features. Once reasonable features are selected, they can be used for discovery analyses, such as detecting clinically plausible sub-groups. In this manner, phenotyping and discovery can reveal new or hidden drug use patterns and sub-groups.

Despite interest in machine learning methods, their uptake in drug utilisation research has been slow. It is hoped that by considering the strengths and limitations showcased here, researchers will be better positioned to make informed decisions on the suitability of these methods for tasks such as sub-group detection.