Introduction

The clinical assessment of patients suffering from back and leg pain due to lumbar degenerative disease has recently been supplemented by tests for objective functional impairment (OFI) [1,2,3,4,5,6]. Tests that have been well validated include the timed-up-and-go, 6-min-walk, and five-repetition sit-to-stand (5R-STS) tests [1, 7, 8]. In addition to clinical examination and questionnaire measures for pain and subjective functional impairment, these tests have been shown to be robust to mental status as a confounder and add the ability to capture deficits and complications, such as foot drop or limping [2, 9]. Patients also prefer objective tests over a battery of questionnaires to assess functional impairment [10, 11]. When applied together with questionnaires for pain severity, subjective functional impairment and health-related quality of life, these tests provide a holistic capture of a patient’s health state for scientific and clinical purposes [12,13,14].

A 5R-STS test time of 10.5 s or greater has been shown to correspond to a diagnosis of OFI based on normative data [1]. Baseline severity stratifications have also been constructed, specifying cut-offs for mild, moderate, and severe OFI [15, 16]. However, these cut-offs assume a similar performance among normative populations across all sociodemographic groups. In reality, older patients, those with higher BMI, active smokers, taller patients, and many other groups do worse on the 5R-STS. Cut-offs should be calculated from normative data across all of these groups, but the cut-offs should be flexible and adjustable to an individual’s characteristics.

Such “personalized” cut-offs for OFI could be obtained by calculating cut-offs for specific populations, e.g. for individuals above or below 65 years of age and for male and female individuals [1]. However, this would result in a great number of different cut-offs that would be hard to implement in clinical practice. In the era of “personalized/precision medicine”, a more elegant option is to predict the expected upper limit of normal (ULN) for individual patients, based on their sociodemographic characteristics, in order to diagnose OFI [17, 18]. This works well for single cut-offs based on normative data, e.g. for the binary presence or absence of OFI, but is not suitable for identifying mild, moderate, and severe impairment. These subgroups should instead be defined according to real-world data of patients with established OFI and should reflect specific hallmarks of these subgroups. For example, classifying severity only according to 5R-STS results (e.g. based on test time cut-offs for mild, moderate, or severe disease) would not take into account inter-individual differences among patients. Unsupervised machine learning techniques, such as clustering, are well suited for identifying clusters of observations that exhibit high similarity, without requiring labels (e.g. “mild”, “moderate”, “severe”) [19,20,21,22]. Clusters defined by a machine learning algorithm would not be based on disease-specific parameters and could then be used to classify new patients into relevant subsets that may also exhibit differences in treatment response. We aimed to identify clusters of OFI in objectively functionally impaired individuals based on the 5R-STS and unsupervised machine learning methods.

Materials and methods

Study design

Pooled data from two prospective studies were used (ClinicalTrials.gov Identifiers: NCT03303300 and NCT03321357) [1, 23]. Both studies were approved by the local institutional review board (Medical Research Ethics Committees United, Registration Numbers: W17.107 and W17.134) and were conducted according to the Declaration of Helsinki. Informed consent was obtained from all participants.

Patients scheduled for lumbar spine surgery for degenerative disease at a Dutch specialized short stay spine clinic were included between October 2017 and June 2018 and were assessed during outpatient consultations. Participating patients completed a variety of questionnaires, as well as the 5R-STS test. The pooled data from both studies were used to train an unsupervised machine learning model to automatically identify clusters of OFI. Subsequently, we compared the identified clusters to identify their hallmarks for further interpretation.

Inclusion and exclusion criteria

Inclusion criteria were the presence of lumbar disc herniation, lumbar spinal stenosis, spondylolisthesis, or discogenic chronic low back pain. Patients with synovial facet cysts causing radiculopathy, hip or knee prosthetics, and those requiring walking aids were excluded to eliminate these confounders. We also excluded all healthy volunteers, who had been recruited as the control group. In addition, we excluded all patients without OFI (i.e. a 5R-STS test time of < 10.5 s, as defined by Staartjes et al. [1]) in order to cluster only those patients with established OFI.

Data collection

The 5R-STS was performed according to the protocol described by Jones et al. [5] and Staartjes et al. [1]. If the patient was unable to perform the test in 30 s, or not at all, this was noted and the test score was recorded as 30 s [1]. The baseline severity stratification for the 5R-STS, validated by Klukowska et al. [15], was used. Patients also filled in questionnaires containing baseline sociodemographic data including age, gender, smoking status, body mass index (BMI), prior spine surgery, indication and index level, history of complaints, education, work type and ability, analgesia, and symptom satisfaction, as well as numeric rating scales for back and leg pain severity and validated Dutch versions of the Oswestry Disability Index (ODI), Roland-Morris Disability Questionnaire (RMDQ), and EuroQOL-5D-3L (EQ-5D) to capture subjective functional impairment and HRQOL [24,25,26]. The EQ-5D included its single domains as well as the composite EQ-5D index and the EQ-5D thermometer on current subjective health status [26]. Participants filled out the questionnaires right after initially performing the test during outpatient consultation. It has been established that the mood component of the EQ-5D correlates well with clinical depression [27].
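The scoring rule described above can be sketched in a few lines of Python. This is only an illustration and not the study's own code (the analyses were carried out in R). The function names `record_5rsts_time` and `has_ofi` are hypothetical; only the 30 s ceiling and the 10.5 s cut-off are taken from the text.

```python
from typing import Optional

MAX_TEST_TIME_S = 30.0  # score recorded when the test cannot be completed in 30 s
OFI_CUTOFF_S = 10.5     # test times of 10.5 s or greater are taken to indicate OFI [1]


def record_5rsts_time(measured_s: Optional[float]) -> float:
    """Return the recorded score in seconds.

    None stands for a patient who could not perform the test at all
    (an assumption about how such cases might be encoded).
    """
    if measured_s is None or measured_s > MAX_TEST_TIME_S:
        return MAX_TEST_TIME_S
    return measured_s


def has_ofi(recorded_s: float) -> bool:
    """Apply the fixed normative cut-off to a recorded score."""
    return recorded_s >= OFI_CUTOFF_S
```

Under these assumptions, a patient who needs 42 s is recorded at 30 s, and only patients for whom `has_ofi` returns `True` would enter the clustering analysis.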

Analytical methods

Data were reported as mean (standard deviation) for continuous and numbers (percentages) for categorical data. Variables with missingness over 25% were not included in the analysis. When data were assumed to be missing at random (MAR) or missing completely at random (MCAR), imputation was performed using a k-nearest neighbour (KNN) imputation with k = 5 [28]. Pearson’s product-moment correlation was applied to provide an overview of correlations within the dataset—that is, to identify which variables appear to be most highly correlated with 5R-STS results, and which variables demonstrate multicollinearity. One-way analysis of variance (ANOVA) or Pearson’s Chi-Square tests were performed to test for differences among the identified clusters, for continuous and categorical variables, respectively. A p ≤ 0.05 on two-sided tests was considered significant. Analyses were carried out using R version 4.0.3 (The R Foundation for Statistical Computing, Vienna, Austria) [29].
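To make the imputation step more concrete, the following is a minimal pure-Python sketch of k-nearest neighbour imputation. The text does not state which implementation or distance metric was used, so the rescaled Euclidean distance over shared features, the use of the neighbours' mean, and the names `knn_impute` and `_distance` are all assumptions made for illustration.

```python
import math
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing value


def _distance(a: Row, b: Row) -> float:
    """Euclidean distance over features observed in both rows,
    rescaled to the full feature count; inf if nothing is shared."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows: List[Row], k: int = 5) -> List[List[float]]:
    """Fill each missing value with the mean of that feature among the
    k nearest rows in which the feature is observed."""
    completed = []
    for i, row in enumerate(rows):
        filled = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that has feature j observed.
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if not nearest:
                raise ValueError(f"no usable neighbours for row {i}, feature {j}")
            filled[j] = sum(nearest) / len(nearest)
        completed.append(filled)
    return completed
```

In practice the features would need to be on comparable scales before distances are computed, and categorical variables would call for a vote among neighbours rather than a mean.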

We chose k-means clustering to carry out unsupervised clustering of patients with OFI. The optimal number of clusters was chosen using the “elbow method” based on within-cluster sum of squares. Briefly, this method identifies the number of clusters k from which onwards adding further clusters yields only a marginal, approximately linear increase in the similarity of observations within clusters. The version of the k-means clustering algorithm described by Hartigan and Wong was used [30]. Pre-processing included centring and scaling (standardization), as well as one-hot encoding of categorical variables. The algorithm was run for a maximum of 1000 iterations, with 100 initial configurations. Only the 5R-STS test time, 5R-STS baseline severity stratification, patient age, gender, height, weight, BMI, and smoking status were provided as inputs to the model, as these variables are not specific to the disease, as opposed to e.g. back pain severity or index level. A KNN algorithm with k = 5 was subsequently trained to classify new patients into the corresponding clusters.
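The pipeline described above (standardization, k-means with repeated random starts, and a KNN classifier for new patients) might be sketched as follows. Note that this sketch uses Lloyd's algorithm, not the Hartigan and Wong variant used in the study, and assumes that categorical inputs have already been one-hot encoded; all function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Point = List[float]


def standardize(rows: Sequence[Sequence[float]]) -> List[Point]:
    """Centre each column to mean 0 and scale it to unit sample SD."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    sds = [(sum((x - m) ** 2 for x in c) / (n - 1)) ** 0.5
           for c, m in zip(columns, means)]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(r, means, sds)]
            for r in rows]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _lloyd(points: List[Point], k: int, max_iter: int, rng: random.Random):
    """One run of Lloyd's algorithm from a random initial configuration."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(_sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


def kmeans(points: List[Point], k: int, n_starts: int = 100,
           max_iter: int = 1000, seed: int = 0) -> Tuple[List[int], List[Point], float]:
    """Keep the run with the lowest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (_lloyd(points, k, max_iter, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[2])


def knn_classify(new_point: Point, points: List[Point],
                 labels: List[int], k: int = 5) -> int:
    """Assign a new observation to the majority cluster of its k nearest neighbours."""
    nearest = sorted(range(len(points)),
                     key=lambda i: _sq_dist(new_point, points[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```

The defaults of 100 starts and 1000 iterations mirror the settings reported in the text. A new patient would have to be scaled with the training cohort's means and standard deviations before being passed to `knn_classify`, a step this sketch leaves out.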

Results

Patient cohort

We included 173 patients with OFI fulfilling the inclusion criteria. Detailed characteristics are provided in Table 1. Data missingness was 3.5%. Mean age was 46.72 years (12.65), and 78 patients (45.1%) were male. According to the validated baseline severity stratification, 95 patients (54.9%) had mild, 45 (26.0%) had moderate, and 33 (19.1%) had severe OFI. A correlation matrix of all variables included in the model is shown in Fig. 1.

Table 1 Baseline characteristics of the overall patient cohort
Fig. 1

Correlation matrix for 5R-STS test time, baseline severity stratification (BSS), age, gender, height, weight, body mass index (BMI), and smoking status. Pearson’s product-moment correlation is demonstrated. 5R-STS five-repetition sit-to-stand test, OFI objective functional impairment, BMI body mass index

Clustering analysis

A plot of within-cluster sum of squares against the number of clusters (Fig. 2) indicated that a number of clusters between 3 and 6 would constitute the optimal k, as this is the point from which onwards the similarity among observations within the clusters only increases marginally. For the analysis, k = 3 was chosen.
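As a rough illustration of how such a curve can be read, the sketch below computes the relative reduction in within-cluster sum of squares gained by each additional cluster and locates the sharpest bend using the discrete second difference. In the study the elbow was judged from the plot; the second-difference rule is just one common heuristic, the function names are hypothetical, and the numbers in the usage note are invented for illustration and are not study results.

```python
from typing import Dict, Optional


def marginal_reductions(wcss_by_k: Dict[int, float]) -> Dict[int, float]:
    """Fraction by which WCSS drops when moving from the previous k to k."""
    ks = sorted(wcss_by_k)
    return {k: (wcss_by_k[prev] - wcss_by_k[k]) / wcss_by_k[prev]
            for prev, k in zip(ks, ks[1:])}


def elbow_by_second_difference(wcss_by_k: Dict[int, float]) -> Optional[int]:
    """Return the k with the largest bend in the WCSS curve.

    Assumes consecutive values of k; returns None with fewer than three.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = wcss_by_k[prev] - 2 * wcss_by_k[k] + wcss_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k
```

For a made-up curve such as `{1: 1000, 2: 700, 3: 500, 4: 450, 5: 410, 6: 380}`, the heuristic points to k = 3, after which each further cluster reduces the WCSS by roughly 10% or less. When the curve flattens gradually, as in a range of plausible values between 3 and 6, interpretability of the resulting clusters remains a legitimate tie-breaker.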

Fig. 2

Plot of within-cluster sum of squares (WCSS) against number of clusters. The number of clusters at which the decrease in WCSS becomes linear ought to be chosen as the number of clusters for k-means clustering based on the “elbow” method. In more detail, the WCSS describes the distance between each observation and the centroid within each cluster, i.e. how well the cluster fits the single observations within it. With an increasing number of clusters, WCSS decreases because clusters become more specific. However, after reaching a certain number of clusters, WCSS starts to decrease much more slowly. This “elbow” point often provides an optimal balance between a low number of clusters (allowing for sensible interpretation of clusters) and an adequate representation of the data.

The three identified clusters (Types 1 to 3) contained 57 (32.9%), 81 (46.8%), and 35 (20.2%) patients, respectively. Within-cluster sum of squares values were 209, 363, and 167, respectively. The ratio of between-cluster sum of squares and total sum of squares was 34.1%.
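The ratio reported above relies on the decomposition of the total sum of squares into a within-cluster and a between-cluster part. A minimal sketch of that calculation, with hypothetical function names and not taken from the study's code, could look as follows.

```python
from typing import List, Sequence


def _centroid(points: Sequence[Sequence[float]]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def _sum_of_squares(points: Sequence[Sequence[float]],
                    centre: Sequence[float]) -> float:
    """Sum of squared Euclidean distances from each point to centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)


def between_to_total_ratio(points: Sequence[Sequence[float]],
                           labels: Sequence[int]) -> float:
    """Between-cluster sum of squares divided by total sum of squares."""
    total_ss = _sum_of_squares(points, _centroid(points))
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within_ss += _sum_of_squares(members, _centroid(members))
    # Between-cluster SS is what remains once within-cluster SS is removed.
    return (total_ss - within_ss) / total_ss
```

As a back-of-the-envelope check under this decomposition, the three reported within-cluster values sum to 739, so a ratio of 34.1% would imply a total sum of squares of roughly 739 / (1 - 0.341) ≈ 1121. This figure is derived here from rounded values and is not reported in the study.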

Cluster hallmarks

Clustered variables

Table 2 provides an overview of the differences between the three clusters in terms of the variables that were included in the model. The clusters of impairment are illustrated in Fig. 3 for continuous variables and Fig. 4 for categorical variables. In terms of raw test times, Type 1 and Type 2 were comparable with mean test times between 14 and 15 s, while Type 3 demonstrated a mean test time of 27.1 (4.4) seconds. The proportion of moderate and severe OFI according to the validated 5R-STS baseline severity stratification increased steadily from Type 1 to Type 3 [15]. Age was similar across all clusters. When comparing Type 1 and Type 2 OFI, the rates of smokers and males were significantly lower in Type 1, as were mean BMI and body height.

Table 2 Comparative analysis of the three types of objective functional impairment identified in the clustering analysis by means of those variables included in the clustering analysis (Clustered Parameters)
Fig. 3

Scatterplots demonstrating the hallmarks of the three different clusters (Type 1–3) of objective functional impairment (OFI) in terms of continuous variables. Important hallmarks demonstrated are the markedly lower HRQOL and subjective disability for Type 3 OFI, as well as the difference in body mass index between Types 1 and 2 OFI. HRQOL health-related quality of life, EQ-5D EuroQOL five-dimensions questionnaire

Fig. 4

Boxplots demonstrating the hallmarks of the three different clusters (Types 1–3) of objective functional impairment (OFI) in terms of categorical variables. Important hallmarks demonstrated are the steadily increasing rate of prior surgery, active smoking, functional impairment, and extreme anxiety and depression symptoms when comparing the three clusters. EQ-5D EuroQOL five-dimensions questionnaire

Unclustered variables

To further characterize the types of OFI, the variables not included in the clustering analysis were also analysed (Table 3). There were marked differences in all EQ-5D domains, as well as in the EQ-5D index, the EQ-5D thermometer, the ODI, and the RMDQ. Specifically, the rate of patients with extreme anxiety and depression increased steadily from 3.5% in Type 1 to 7.4% in Type 2 and 14.3% in Type 3, with statistical significance. In addition, mobility and ability to perform activities of daily life (ADL) were reduced in Type 3, with corresponding increases in subjective functional impairment scores (ODI, RMDQ).

Table 3 Comparative analysis of the three types of objective functional impairment identified in the clustering analysis by means of the variables that were not considered within the clustering analysis (Unclustered Parameters)

The proportion of patients who had undergone prior spine surgery increased steadily from Type 1 with 14.0% to Type 3 with 28.6%, although this progression was not statistically significant. There were no differences in back or leg pain severity in the three identified clusters. Similarly, indications for surgery, history of complaints, index levels, education, work type and ability, analgesic medication use, and satisfaction also remained constant across all three clusters. A qualitative overview of the hallmarks of each type is provided in Table 4.

Table 4 Qualitative overview of the hallmarks of the three types of impairment that were identified through unsupervised analysis

Discussion

Three characteristic clusters of patients with OFI were identified through unsupervised analysis. The clusters were termed Types 1, 2 and 3, and roughly correspond to mild, moderate, and severe impairment (Table 4).

Type 1 OFI was present in around a third of patients and was characterized by relatively rapid performance of the 5R-STS. It was only seldom associated with problems in performing ADL, reduced mobility, or clinical depression, indicating mild impairment. This is also supported by the low levels of subjective functional impairment found in these patients. As described in the Results section, the vast majority of Type 1 patients were female nonsmokers with a low BMI. The female gender also explains the lower average height in this group. It has been argued that female patients have a higher pain tolerance than male patients and are likely to also present later for surgical treatment for degenerative spinal conditions [31,32,33,34]. This could partially explain why this largely female group experiences low subjective and objective functional impairment. Active smoking has been shown to have no effect on 5R-STS performance [1] or on other short-duration objective functional tests [35]. This likely indicates that smoking, while not a significant predictor of 5R-STS performance, was picked up by the clustering algorithm as a confounder associated with other, possibly psychosocial factors that in turn influence performance.

Type 2 OFI occurred in almost half of our cohort and was linked with overweight in both genders, while test times were only slightly elevated compared to Type 1. This indicates mild impairment, which is also corroborated by mild subjective functional impairment. As stated in the Results section, the incidence of extreme anxiety and depression symptoms was over twice as high as in Type 1, a statistically significant difference. The rate of smokers corresponded to that of our patient cohorts and indeed of the Dutch population [36]. In addition, both genders were equally represented in this cluster. Type 2 likely indicates low levels of true functional impairment, but with a higher susceptibility for mood changes due to the mild or moderate impairment that is present.

In contrast to the mild/moderate levels of OFI observed in Types 1 and 2, Type 3 indicated extreme impairment with sequelae such as bedriddenness, high subjective functional impairment, mobility issues, and high rates of discomfort. Overall, patients with Type 3 impairment were of average BMI, mostly of male gender, and exhibited a significantly higher rate of active smoking. We also observed a doubling of the rate of extreme depression and anxiety compared to Type 2 and a quadrupling compared to Type 1.

Still, levels of pain were comparable among the three clusters, with the exception of the EQ-5D “pain and discomfort” domain, which also includes discomfort. Back and leg pain severity did not differ among the three clusters, demonstrating that the clusters represent true subgroups of impairment (including objective and subjective impairment), and are not influenced by pain severity as such. This is similar for sociodemographic factors, which could be assumed to influence the perception of impairment, such as level of education, work type, work ability, and age.

Up to now, grading of OFI was based on a fixed cut-off of 10.5 s on the 5R-STS test, though, realistically, obese and elderly, but otherwise healthy, individuals cannot be expected to perform as well as younger individuals with a BMI in the normal range [1]. Ideally, an otherwise healthy, but obese, 75-year-old person and a 22-year-old athlete should not have their level of impairment rated by the same static cut-off. As one potential solution, Gautschi et al. [2] calculated a range of cut-offs for patients of certain gender and ages for the timed-up-and-go test, but clinical implementation of a larger number of cut-offs that need to be remembered is cumbersome. Machine learning-based methods have the potential to suggest personalized “expected” cut-offs for each individual patient, based on sociodemographics, as has been alluded to in the initial validation of the 5R-STS in the spinal population [1, 37, 38]. Once a personalized cut-off has been established for the binary presence or absence of OFI, some form of impairment grading that again takes into account sociodemographics should be carried out, which the clustering algorithm developed in this study can do. Furthermore, machine learning methods in combination with motion tracking-based 5R-STS assessment [39] could lead to more intuitive and automated integration of objective functional testing in clinical practice. In future, it may become possible to calculate OFI immediately. In contrast to other technological advances such as robotics, imaging, or neuronavigation, algorithms can run server-side and even be applied on mobile devices, so that applications are far-reaching even in rural areas where patients cannot easily travel for in-person appointments [40].

Limitations

Although we used prospectively collected data exclusively, these are single-centre data only. Therefore, the generalizability of our findings, and specifically of the identified clusters of OFI, requires further external validation before the model is made available (e.g. in a web-app) and applied in clinical practice elsewhere. However, the data that were used (preoperative sociodemographic parameters and 5R-STS testing) are not centre-specific (unlike e.g. surgical treatment or length of stay), and the 5R-STS has been established as having extremely high inter-rater reliability [1, 23]. Inclusion of further parameters from patient history and clinical examination could possibly increase the distinctness of the clusters even further. However, this would come at the cost of clinical usability and parsimony of the algorithm and derived classification of OFI. Currently, only variables that are easily and objectively assessable such as age, BMI, and gender are included in the model, which enables clinical application in under one minute. Although we included a comparatively large and homogeneous cohort of patients with OFI, a larger number of patients would likely also lead to an increase in generalizability and distinctness of the clusters. Lastly, the model was not tested in separate diagnoses such as chronic low back pain and spondylolisthesis owing to insufficient statistical power for such subgroup analyses. However, the classification of OFI based on our model is independent of diagnosis (i.e. it is not a factor considered in this cluster analysis), and in addition, our analysis of unclustered parameters demonstrated that there is no interaction between diagnosis and cluster assignment, indicating robustness against different diagnostic categories.

Conclusions

In this study, we demonstrate that unsupervised machine learning techniques, in combination with the 5R-STS, identified three distinct clusters of patients with OFI that may represent a more holistic and objective clinical classification of patients than test times and baseline severity stratifications alone by taking into account individual patient characteristics. These findings may in future be integrated with higher levels of automation into clinical practice and may then also have diagnostic, prognostic, and predictive implications for surgical and nonsurgical treatment of degenerative spinal conditions.