Primary care provides “integrated, accessible health care services” for the majority of “personal health care needs” [1, 2]. A good primary care system is associated with a more equitable distribution of health in populations [3]. Primary care, as the foundation of the healthcare system, holds great potential to reduce differences in health across population subgroups and improve populations’ overall health [3,4,5]. In Singapore, the Ministry of Health (MOH) is committed to transforming the healthcare landscape in view of the evolving health care needs of its population in the community setting. This is timely against the background of an aging society and increasing healthcare expenditure. MOH Singapore has been working on initiatives to enable the appropriate management of patients in primary care, in which hospital specialists work with primary care physicians to manage patients with stable but complex conditions in a shared care program [6]. As the demand for primary care and the complexity of population health needs grow, primary care systems globally face tremendous challenges. One notable challenge is the heterogeneity in population health needs and the resultant difficulty in health care resource planning [7]. To address this challenge in population health management in the primary care setting, it is critical to better understand primary care utilizers’ heterogeneous health state profiles.

Population segmentation is an emerging approach that aims to address this issue. It divides a patient population with heterogeneous health profiles into distinct and relatively homogeneous subgroups (classes) that share similar healthcare needs [8,9,10]. It enables the development of targeted healthcare interventions for each population segment and facilitates healthcare resource planning [11, 12]. Population segmentation frameworks have been widely applied to provide quantitative overviews of population health characteristics and to guide population health policy and resource management. For example, the Ministry of Health of British Columbia, Canada, adopted a population segmentation framework that divides the entire provincial population into 13 classes representing different health statuses and healthcare needs [13].

Recently, data-driven population segmentation, which applies post-hoc statistical analysis to empirical data, has been gaining wide interest worldwide. It utilizes large volumes of patient data to support population health policy decisions by generating real-life, evidence-based, and quantitative insights into a population’s health status [14]. The rich healthcare data made accessible by the global adoption of electronic health records provide opportunities for population segmentation analysis using empirical data [15]. Additionally, recent advances in big data analytics for population health management offer more computational tools for accurate population segmentation. As an example, latent class analysis by Van der Laan et al. on self-reported data successfully segmented an elderly population into classes with different healthcare needs and demonstrated differential healthcare service utilization patterns across classes [16]. To date, data-driven population segmentation has been used on a wide range of populations, including geriatric [17] and pediatric [18, 19] populations, and gynecological [20], respiratory [21], and oncological patients [22]. However, to the best of the authors’ knowledge, data-driven population segmentation has not been applied to primary care utilizers.

The primary aim of this study is to segment a population of primary care utilizers into classes of unique disease patterns, and to report the disease patterns, one-year follow up healthcare utilizations and all-cause mortality across the classes. The secondary aim is to assess the predictive ability of class membership on one-year follow up healthcare utilizations and mortality.


Study design

In this retrospective cohort study, we retrieved de-identified administrative health data from the population health database at Singapore Health Services Regional Health System (SingHealth RHS), Singapore’s largest RHS, which provides comprehensive care in its primary care clinics, community hospitals, national specialty centers and tertiary hospitals for a specific geographic region. The data included in this study are patients’ baseline demographics, disease diagnoses according to International Classification of Diseases (ICD) 9 and 10 codes, longitudinal data on healthcare utilization (number of inpatient admissions to hospitals and visits to emergency departments, specialist outpatient clinics, and primary care clinics) in 2013, and one-year all-cause mortality. Inpatient admissions refer to patient visits to SingHealth hospitals that culminated in the patient being hospitalized; day surgeries were not included. A primary care visit was defined as a visit to a SingHealth primary care facility (polyclinic), and a specialist outpatient clinic visit as a visit to a hospital specialist clinic. Telephone visits were not included.

The inclusion criteria were 1) adult patients above 21 years old (age of majority in Singapore), 2) Singapore citizens or permanent residents, and 3) utilization of services in SingHealth RHS primary care clinics in Year 2012. The Charlson Comorbidity Index [23], Elixhauser Index [24] and Singapore Chronic Disease Management Program [25] were used to select the chronic diseases included in this study. For diseases that overlapped between the Charlson Comorbidity Index and the Elixhauser Index, the diseases as coded in the latter index were used, as they have been shown to provide better prediction of healthcare utilization and mortality [26, 27]. We excluded patients whose residential postal codes fell outside the SingHealth RHS catchment region, because these patients may tend to utilize care outside SingHealth RHS and their recorded utilization would therefore not reflect their true utilization patterns. The SingHealth Centralized Institutional Review Board (reference number: CIRB 2016/2294) issued the ethical approval for this study.

Latent class analysis (LCA)

LCA is a model-based tool that is widely used to identify unobserved (latent) subgroups within a heterogeneous population [28]. As a person-centered approach, LCA aims to divide individuals into categories, with individuals in the same category being relatively homogeneous and, at the same time, distinct from those in other categories [16, 29]. LCA estimates two sets of parameters by maximum likelihood: 1) class membership probabilities, which represent individuals’ probability of belonging to each class, and 2) item-response probabilities conditional on class membership (conditional response probabilities), which refer to the conditional probability of a particular response given that the individual is in a certain class [30,31,32]. Each individual is assigned exclusively to the class for which his or her latent class probability is highest. Within each class, individuals have similar conditional item-response probability patterns [31, 33].
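The class assignment step described above can be illustrated with a minimal sketch in Python. It is not the Mplus implementation used in this study; it assumes binary indicators (disease absent or present), local independence of indicators within a class, and parameters that have already been estimated. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
from math import prod


def posterior_class_probs(responses, class_probs, item_probs):
    """Return P(class | responses) for one individual.

    responses   : list of 0/1 values, one per indicator (assumed binary)
    class_probs : class membership probabilities, summing to 1
    item_probs  : item_probs[c][j] = P(indicator j = 1 | class c)
    """
    joint = []
    for pi_c, rho_c in zip(class_probs, item_probs):
        # Local independence: indicators are independent given the class.
        likelihood = prod(p if y == 1 else 1.0 - p
                          for y, p in zip(responses, rho_c))
        joint.append(pi_c * likelihood)
    total = sum(joint)
    # Bayes' rule: normalise the joint probabilities over classes.
    return [value / total for value in joint]


def modal_class(responses, class_probs, item_probs):
    """Assign the individual to the class with the highest posterior."""
    posterior = posterior_class_probs(responses, class_probs, item_probs)
    return max(range(len(posterior)), key=posterior.__getitem__)


# Hypothetical two-class model with three disease indicators.
toy_class_probs = [0.7, 0.3]
toy_item_probs = [
    [0.05, 0.05, 0.02],  # class 0: low prevalence of every disease
    [0.80, 0.70, 0.30],  # class 1: high prevalence of the first two
]

print(modal_class([1, 1, 0], toy_class_probs, toy_item_probs))  # 1
print(modal_class([0, 0, 0], toy_class_probs, toy_item_probs))  # 0
```

In a full analysis, the class membership and item-response probabilities would themselves be estimated from the data by maximum likelihood; the sketch covers only the assignment of individuals once those estimates are available.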

The latent classes derived from LCA can reflect many aspects of health, depending on the class indicators used. Here we focus on population health state profiles in the primary care setting and thus chose chronic disease status as the class indicators.

Mplus version 8 statistical modeling software was used for conducting LCA [28].

Model selection

We fit LCA models successively from k = 2 onwards (k is the number of classes) and stopped when the size of any class in a model fell below 1% of the population. Each class should have a substantial size (≥1% of the population) so that it can be targeted with distinctive health intervention strategies at the policy level. We assessed model fit using multiple criteria. Firstly, we used established statistical indexes, namely the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), where a smaller AIC or BIC indicates a better fit [34,35,36]. Secondly, to be clinically relevant, the model has to be clinically interpretable. Clinical interpretability of the classes was evaluated by integrating clinical expert knowledge and existing clinical guidelines, which are likely to predict differences in healthcare utilization and outcomes [37, 38].
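The statistical part of this selection procedure can be sketched as follows. This is an illustrative Python sketch, not the Mplus workflow used in the study. It assumes binary indicators, so that a k-class model has (k − 1) + k × (number of indicators) free parameters, and it picks the model with the smallest BIC among those in which every class holds at least 1% of the population. The log-likelihoods and class proportions are made-up toy values, and the helper names are hypothetical. Clinical interpretability, the second criterion, is not captured by the code.

```python
from math import log


def n_free_params(k, n_items):
    # (k - 1) class prevalences + k * n_items item-response probabilities,
    # assuming every indicator is binary.
    return (k - 1) + k * n_items


def aic(loglik, n_params):
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    return -2.0 * loglik + n_params * log(n_obs)


def select_model(fits, n_items, n_obs, min_share=0.01):
    """Pick the number of classes with the smallest BIC.

    fits : dict mapping k -> (log-likelihood, list of class proportions).
    Models are examined from the smallest k upward; the succession stops
    at the first model containing a class smaller than min_share.
    """
    candidates = {}
    for k in sorted(fits):
        loglik, shares = fits[k]
        if min(shares) < min_share:
            break  # a class is too small to be targeted at policy level
        p = n_free_params(k, n_items)
        candidates[k] = (aic(loglik, p), bic(loglik, p, n_obs))
    best_k = min(candidates, key=lambda k: candidates[k][1])
    return best_k, candidates


# Toy values for illustration only (not results from this study).
toy_fits = {
    2: (-2100.0, [0.60, 0.40]),
    3: (-2050.0, [0.50, 0.30, 0.20]),
    4: (-2045.0, [0.50, 0.30, 0.195, 0.005]),  # smallest class below 1%
}
best_k, table = select_model(toy_fits, n_items=4, n_obs=1000)
print(best_k)  # 3
```

Using BIC alone as the tie-breaker is a simplification made for the sketch; AIC is returned alongside it so that both indexes can be inspected.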

Statistical analysis

Firstly, to examine whether there were significant cross-class differences in disease diagnosis patterns, demographics and healthcare utilization in baseline Year 2012, we used the one-way ANOVA test (or the Kruskal-Wallis H test with Bonferroni correction) for continuous variables and the Chi-square test (or Fisher's exact test) for categorical variables, as appropriate.
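As an illustration of the categorical comparison, the Pearson Chi-square statistic for a class-by-category contingency table can be computed as in the following minimal Python sketch. The function name and the counts are hypothetical, and the sketch returns only the test statistic and its degrees of freedom; obtaining the p-value requires the Chi-square distribution, which a statistical package such as the one used in this study provides.

```python
def chi_square_statistic(table):
    """Pearson Chi-square statistic for a contingency table.

    table : list of rows, each a list of observed counts
            (e.g. rows = latent classes, columns = disease absent/present).
    Returns (statistic, degrees of freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table: two classes by disease status.
toy_table = [[10, 20],
             [20, 10]]
stat, dof = chi_square_statistic(toy_table)
print(round(stat, 3), dof)  # 6.667 1
```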

Then, we assessed the discriminative properties of class membership on healthcare utilization and mortality in 2013. We began by excluding patients who died in 2012 (n = 761), because they would have no healthcare utilization in 2013. We then ran the Kruskal-Wallis H test and the Chi-square test to compare healthcare utilization (nonparametric) and mortality in 2013, respectively, across the population classes. For count outcomes (e.g., one-year follow-up healthcare utilization), we examined the relationship between healthcare utilization in 2013 and class membership in a multivariable analysis using Poisson or negative binomial regression (with the offset/exposure option), as appropriate. Class membership was the exposure of interest, and the models were adjusted for ethnicity, age, and gender [9]. To account for patients who died during follow-up, an offset term was used, defined as the log of the follow-up time starting from 01 Jan 2013 and ending on 1) 31 Dec 2013 for participants who were alive at the end of 2013 or 2) the death date for those who died during 2013. We performed multivariable Cox proportional hazards regression to examine the relationship between class membership and mortality rate, and report the hazard ratio (HR) with its 95% confidence interval. These models were adjusted for age, gender, and ethnicity. Lastly, we used the Kaplan-Meier estimator to estimate the survival function, and the log-rank test to compare the survival distributions between the classes. Kaplan-Meier survival curves for one-year mortality (Year 2013) were plotted with 01 January 2013 as the time of entry into the follow-up period. Survival time was defined as the number of days from 01 January 2013 to death for patients who died on or before 31 December 2013, and as 365 days for patients who lived beyond 2013, who were censored. STATA SE 14.0 (Stata Corporation, College Station, Texas, 2016) was used for all the analyses.
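The follow-up time definition above feeds both the regression offset and the survival analysis. The following Python sketch shows one way to derive it and to compute a Kaplan-Meier estimate from it. It is not the Stata code used in the study. As an assumption of the sketch, days are counted inclusively, so that a full year of follow-up equals 365 days and a death on 01 January 2013 contributes one day, which keeps the log offset defined. The function names and the example dates are hypothetical.

```python
from datetime import date
from math import log

START = date(2013, 1, 1)
END = date(2013, 12, 31)


def follow_up(death_date):
    """Return (days of follow-up in 2013, died during follow-up).

    death_date is None for patients with no recorded death.
    Assumption: days are counted inclusively (full year = 365 days).
    """
    if death_date is not None and death_date < START:
        raise ValueError("patients who died before 2013 are excluded")
    if death_date is not None and death_date <= END:
        return (death_date - START).days + 1, True
    return (END - START).days + 1, False  # censored at the end of 2013


def log_offset(days):
    # Offset for the Poisson / negative binomial model: log of exposure time.
    return log(days)


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time at which a death occurred."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)  # deaths + censored
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical death dates for four patients (None = no recorded death).
toy_deaths = [None, date(2013, 1, 1), date(2013, 7, 1), date(2014, 3, 1)]
toy_follow_up = [follow_up(d) for d in toy_deaths]
toy_days = [days for days, _ in toy_follow_up]
toy_died = [died for _, died in toy_follow_up]

print(toy_days)  # [365, 1, 182, 365]
print(kaplan_meier(toy_days, toy_died))
```

In practice one curve would be estimated per latent class and the curves compared with the log-rank test, which is not implemented here.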


Patient demographics in baseline Year 2012

We included 100,747 patients in this study. Table 1 shows the disease prevalence and healthcare utilization of patients in baseline Year 2012. Patients’ mean age was 51.7 ± 17.4 years. Of these patients, 45.2% (n = 45,515) were male. The majority of patients were of Chinese ethnicity (n = 78,414, 77.8%).

Table 1 Baseline demographics, clinical characteristics and healthcare utilization of patients in Year 2012

Latent class model selection

For latent class model selection, LCA was run from k = 2 to k = 8. However, for k = 7 and k = 8, some of the class sizes fell below 1% of the population. Hence, further statistical analyses were performed only for k = 2 to k = 6.

A six-class model was selected for interpretation based on its better statistical fit, as indicated by the lowest AIC and BIC (Table 2). Figure 1 depicts the disease patterns across the six classes. The prevalence of diseases was generally low among patients in Class 1. Patients in Classes 2 and 3 had a higher prevalence of hypertension and hyperlipidemia. The prevalence of peripheral vascular disease, stroke, and coronary artery disease was the highest in Class 3 patients. The prevalence of asthma and chronic obstructive pulmonary disease was the highest among Class 4 patients. The prevalence of metabolic diseases such as diabetes mellitus, hypertension and hyperlipidemia was high among Class 5 and 6 patients. Class 6 patients had a higher prevalence of diabetes mellitus with complications, chronic kidney disease, heart failure and vascular complications such as peripheral vascular disease, stroke, and coronary artery disease. Hence, the six classes were named: Class 1 “Relatively healthy”, Class 2 “Stable metabolic disease”, Class 3 “Metabolic disease with vascular complication”, Class 4 “High respiratory disease burden”, Class 5 “High metabolic disease without complication” and Class 6 “Metabolic disease with multi-organ complications”.

Table 2 Criteria to assess model fit for latent class analysis models
Fig. 1 Graphical display of comorbidities of patients by latent classes (k = 6)

Healthcare utilization and all-cause mortality in follow-up year 2013

Table 3 shows the healthcare utilization and all-cause mortality among patients in the six classes in Year 2013. Class 6 “Metabolic disease with multi-organ complications” patients had the highest number of outpatient specialist clinic visits, emergency department visits and hospital admissions (p < 0.001). Additionally, they had the highest all-cause mortality (p < 0.001).

Table 3 Healthcare utilization of patients in 2013 and one-year all-cause mortality (k = 6)

Multivariable analyses of classes and healthcare utilization and mortality in follow-up year 2013

As shown in Table 4, Class 1 “Relatively healthy” was used as the reference group in the multivariable analyses. After adjusting for age, gender, and ethnicity, Class 6 “Metabolic disease with multi-organ complications” patients had significantly higher utilization of outpatient specialist clinics (incidence rate ratio (IRR): 6.60, 95% confidence interval (CI): 5.75–7.56), hospital admissions (IRR: 19.68, 95% CI: 16.41–23.61), and emergency department visits (IRR: 13.86, 95% CI: 11.74–16.37). Patients in Class 3 “Metabolic disease with vascular complication” and Class 5 “High metabolic disease without complication” had the highest utilization of primary care outpatient clinics (p < 0.001). Class 6 “Metabolic disease with multi-organ complications” patients had the highest risk of all-cause mortality (hazard ratio (HR): 27.97, 95% CI: 25.01–31.29), followed by patients in Class 3 “Metabolic disease with vascular complication” (HR: 14.57, 95% CI: 13.25–16.01) (Table 4).

Table 4 Multivariable negative binomial regression on healthcare utilization and Cox proportional hazards regression on mortality in Year 2013 (k = 6)

Analysis of one-year survival time

Kaplan-Meier curves were constructed for all-cause mortality stratified by latent class (Additional file 1). The one-year mortality of patients in Class 2 to Class 6 was significantly higher than that of Class 1 patients (p < 0.001), with Class 6 “Metabolic disease with multi-organ complications” patients having the highest one-year mortality rate.

Results for k = 2 to 5 are shown in Additional file 2.


Using latent class analysis, we successfully segmented the heterogeneous population of primary care utilizers into six patient classes with distinct disease patterns. We also demonstrated that the derived classes have predictive ability for mortality and long-term healthcare utilization. This supports the feasibility of applying a data-driven population segmentation technique in the primary care setting.

Our study provides a detailed and quantitative overview of the health status of a large population of primary care users. It can enable health policy makers to make informed decisions on the development of targeted health interventions for each unique segment. For example, a large proportion of primary care users (57.8%) in our study belong to the “Relatively healthy” class (Class 1) and have limited healthcare utilization. For this large segment, health strategies should focus on disease prevention and health promotion. This informs the allocation of appropriate health resources to the development of health promotion and education programs as well as preventive services such as screening tests by community-based service providers [14, 39, 40]. For the “Stable metabolic disease” (Class 2) and “High metabolic disease without complication” (Class 5) groups, health service planning should focus on patients’ disease management education, self-motivation and appropriate clinical monitoring to maintain adequate control of chronic diseases and delay (or prevent) subsequent complications. For the higher-utilizing, complex segments of metabolic disease with vascular or multi-organ complications (Classes 3 and 6), shared care with appropriate specialists and/or team-based care with community case coordinators is probably required to address the multiple determinants of health and optimize quality of life. One useful approach to developing such interventions is a six-step process involving needs assessment, definition of proximal program objective matrices, selection of theory-based methods and practical strategies, production of program components and design, a program adoption and implementation plan, and finally an evaluation plan [41].

The data-driven population segmentation approach is gaining momentum as it leverages large volumes of empirical healthcare data to generate quantitative, real-life insights into population health that support evidence-based population health policy [14]. With the rapid adoption and expansion of electronic health records globally, data-driven population segmentation has been applied in a wide range of populations. For example, Vuik et al. recently demonstrated that data-driven segmentation could be used on a general patient population’s data from healthcare administrative databases [14]. However, despite its wide application in the health science and policy literature, no previous study has examined primary care users by data-driven population segmentation. To the best of our knowledge, our study is the first to address this critical gap in the primary care literature using large-scale disease, long-term healthcare utilization and mortality data.

Compared to prior studies on segmentation of the general population, our segmentation solution generated different population segments. For example, Lafortune et al. used LCA to segment a general elderly population and identified four health state profiles: “Relatively healthy”, “Cognitively impaired”, “Physically impaired” and “Cognitively and physically impaired” [42]. The differences between the present segmentation solution and those of prior studies might be explained by the different segmenting variables used for LCA in the present study. We segmented by disease status to derive different multi-morbidity patterns that were validated by healthcare utilization, whereas Lafortune et al. [42] and Liu et al. [29] defined the segments using additional sensory, cognitive and functional data. A different choice of segmenting variables will inevitably result in different definitions and naming of the derived segments. Selecting segmenting variables requires careful consideration of clinical significance, policy relevance, and data availability. Our study adds to previous work by identifying the patterns of multi-morbidity that contribute to differential healthcare utilization and mortality. We also observed that mental health diseases such as dementia, depression, and anxiety have a low prevalence amongst primary care utilizers in Singapore compared to other diseases. This may be multifactorial, owing to a lower prevalence of mental health disease in Asia than in Western countries [43], biased diagnosis and reporting of mental health disease as a result of the cross-cultural application of criteria such as the American Psychiatric Association’s Diagnostic and Statistical Manual [44], and/or the preference of patients with mental illness for psychiatrists’ specialist services over primary care providers’. Future research is needed to understand these patients’ health behavioral preferences and patterns.

Selection of the most appropriate segmentation solution is a complex process that requires an interplay of subject matter expertise and data analytics. In the present study, we assessed each segmentation model for scientific robustness, practical utility and implications at the population health policy level [45]. First of all, a data-driven segmentation solution must be assessed by its statistical fit. In LCA, established diagnostic indexes include the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) [36, 46,47,48]. Beyond basic statistical fit, additional criteria are required to assess the relevance of a solution in a particular healthcare system. Currently, the criteria for an optimal segmentation framework in population health have not been established [49, 50]. In consumer market segmentation, segmentation effectiveness is assessed by the following proposed criteria, which could be adopted in healthcare settings: validity, interpretability, substantiality, stability, and actionability/accessibility [51, 52]. Additional criteria, such as parsimony in the number of classes, may be important to ensure easy use and widespread adoption of a segmentation framework. Additionally, the naming of each segment is a subjective process that aims to best represent the features of the segment. This may depend on the clinical expertise of the researchers as well as the policy context [9].

One of the limitations of this study is that data were collected from a single cluster of health service institutions (SingHealth RHS). Health service utilization outside SingHealth RHS was not captured in the database. We attempted to minimize this limitation by excluding residents whose postal codes fell outside the SingHealth RHS catchment region, because they are more likely to utilize services outside SingHealth RHS. Future research can expand to national-level data or link databases from other health service institutions to assess the external validity of our segmentation framework. Some large segments may still have a certain degree of heterogeneity and could be segmented further. The current study provides an initial broad segment archetype that can be refined with additional indicators such as behavioral risk factors, mental health, frailty and social functioning. Another limitation is the relatively short follow-up period. Long-term healthcare utilization and mortality patterns of the derived patient segments have important implications for health policy making. Further research efforts may focus on evaluating the long-term stability of the derived patient segments.


In conclusion, primary care users have heterogeneous health state profiles. They can be segmented into classes with unique, relatively homogeneous health characteristics using latent class analysis. Different classes have different health service utilization patterns and mortality risks. This information is critical to population-level health resource planning and population health policy formulation.