Background

Depression is the most common mental health condition globally, with one-year global prevalence rates ranging from 7 to 21% [1]. Quality of life can be seriously impaired by this disorder, with depression ranking as the second highest cause of Disability-Adjusted Life Years (DALYs) and Years Lived with Disability (YLDs) [2, 3]. Depression is also a major contributory factor in suicide, which accounts for hundreds of thousands of deaths per year [4, 5]. In addition to the significant personal and social impact of depression, there is a substantial economic cost. For example, in 2007 alone, the total annual cost of depression in England was £7.5 billion, of which health service costs comprised £1.7 billion and lost earnings £5.8 billion [6, 7]. More recently, in 2019, it was estimated that mental health problems cost the UK £118 billion per year, of which 72% was due to lost productivity and other indirect costs. At 22% prevalence, depression was identified as the third highest contributor to these costs [8, 9].

Depression, like most mental health disorders, can be difficult to diagnose, especially for non-specialist clinicians [10, 11]. Assessment by primary or secondary care clinicians typically relies on the World Health Organisation's International Classification of Diseases, version 10 or 11 (ICD-10/11) [12], the Diagnostic and Statistical Manual of Mental Disorders (DSM) [13], or an interview script such as the Composite International Diagnostic Interview (CIDI) [14, 15]. Diagnosis can also be aided by gathering self-reported symptoms via standardised questionnaires such as the Hospital Anxiety and Depression Scale (HADS) [16], Beck Depression Inventory (BDI) [17, 18] and Patient Health Questionnaire-9 (PHQ-9) [19, 20]. The PHQ-9 is considered a gold standard [21] for screening, rather than standalone clinical diagnosis [22], and has been validated internationally [20]. As such, it provides a sound benchmark for sensitivity (e.g., 0.92) and specificity (e.g., 0.78) and a good comparator for assessing alternative methods [23].

Considering mental health care pathways, early diagnosis could benefit patients by opening the possibility of early intervention. For example, Bohlmeijer et al. [24] observed reduced symptoms of depression for patients who engaged in acceptance and commitment therapy (ACT) as an early intervention compared to those on a wait list, both initially and at a three-month follow-up. Furthermore, a meta-analysis by Davey and McGorry [25] showed a reduction in the incidence of depression of about 20% in the 3 to 24 months following an early intervention. Conversely, late diagnosis of depression can result in longer-term suffering for the patient, in terms of symptoms experienced and disorder trajectory, together with increased resource consumption [10, 26].

Recently, attempts to support early medical diagnoses have benefited from a) the growing availability of electronic healthcare records (EHRs) that contain patients’ longitudinal medical histories and b) new advances in predictive modelling and machine learning (ML) approaches. The use of EHRs in primary care in the developed world is well established; for example, in the USA, UK, Netherlands, Australia and New Zealand, uptake in primary care has exceeded 90% [27, 28]. The wide availability of proprietary EHR coding systems such as SNOMED (Systematized Nomenclature For Medicine) in the UK [29] is enabling rapid and global implementation and their use for disorder surveillance [30]. For example, ML techniques applied to EHR data have led to predictive models for cardiovascular conditions [31, 32] and diabetes [33]. These studies have led to cardiovascular risk prediction becoming established in routine clinical care, and the UK QRISK versions 2 and 3 show significant improvements in discrimination performance over the Framingham Risk Score and atherosclerotic cardiovascular disease (ASCVD) score methods [34] that preceded them. Many of the recent advances were facilitated by the growing popularity of ML in medical data science. As a subfield of artificial intelligence (AI), ML allows computers to be trained on data to identify patterns and make predictions. This approach is well suited to developing algorithms that predict the likelihood of a patient having a disorder by analysing large volumes of medical data. Once trained, these algorithms can then be tested on new data to assess their performance outside of the training environment. There are a variety of ML techniques, but the two most common are supervised and unsupervised methods. In supervised learning, data are labelled with the desired outcome; in unsupervised learning, the data are not labelled and the algorithms look for patterns within the data without external guidance. Further information on these methods in relation to mental health and EHRs is provided in Cho et al. [35] and Wu et al. [36], but here we note that many existing applications combine unsupervised and supervised methods to train algorithms on datasets with large numbers of predictors. A scoping review by Shatte et al. [37] on the general use of ML in mental health identified the use of ML with EHRs for identifying depression as a research area. Similarly, Cho et al. [35] included depression amongst the conditions they identified in their “Review of Machine Learning Algorithms for Diagnosing Mental Illness”. In the examples they cite, which are also covered in the results of this systematic review, ML algorithms were trained on EHR data that included a variety of symptoms and conditions. These algorithms were then assessed on their ability to distinguish between those who did and did not have clinical depression. If EHR/ML methods are to be considered, a suitable benchmark comparator is needed. Studies assessing the diagnosis of depression in primary care suggest that approximately half of all cases are missed at first consultation but that this improves to around two thirds being diagnosed at follow-up [38,39,40]; this is a useful minimum comparator for any diagnostic system based on a combination of ML and EHR data. There is clear potential to develop predictive models of depression using EHR/ML applications, and it is necessary to critically evaluate models developed in recent years.
This is particularly important in the context of rapidly developing ML techniques and the growing accessibility and richness of EHR data. Our starting point for this systematic review was, “Is there a case for using EHRs with machine learning to predict/diagnose depression?” From this we derived the objectives of identifying and evaluating studies that have used such techniques. As part of the evaluation, we specifically focus on identifying key features of the data and ML methods used. Accordingly, our primary focus is to provide a comprehensive overview of the types of ML models and techniques used by researchers, as well as the types of data on which these models were trained, how the models were validated and, where done, how they were then tested. By summarizing the data used, identifying and summarising the predictors used, describing diagnostic benchmarks, and outlining the validation and testing approaches employed, our review offers an important source of information for those who wish to build on existing efforts to improve the predictive accuracy of such models.
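To make the supervised/unsupervised distinction above concrete, the minimal sketch below (our own illustration, not drawn from any of the included studies; all predictor names, values and settings are assumed) trains a scikit-learn classifier on synthetic, labelled EHR-style predictors, evaluates it on held-out data, and then clusters the same data without using the labels.

```python
# Hypothetical sketch of supervised vs. unsupervised learning on EHR-style data.
# Synthetic data only; predictors and the depression label are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Binary predictors standing in for EHR codes (e.g., anxiety code, chronic pain code)
X = rng.integers(0, 2, size=(n, 3))
# Synthetic outcome loosely associated with the first two predictors
p = 1 / (1 + np.exp(-(-2.0 + 1.2 * X[:, 0] + 0.8 * X[:, 1])))
y = rng.binomial(1, p)

# Supervised: the labels y are used during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Held-out AUC-ROC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 2))

# Unsupervised: clusters are found without any reference to the labels
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```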

Methods

Search strategy and search terms

Systematic literature searches were conducted within the arXiv, PubMed, PsycINFO, Science Direct, SCOPUS and Web of Science electronic databases. Searches were restricted to information published after 2010 (from 1st January 2011 onwards) and were updated prior to the final synthesis of data on 27th January 2022. Initial searches were made based on titles/keywords (where the latter were available) and papers were selected based on the inclusion criteria summarised in Table 1. These were searched as (#1) AND (#2) AND (#3) AND (#4). The resulting papers were evaluated by reading the abstract and then by evaluating the main body of each manuscript. Next, a backward citation search for all the selected papers was completed, both a) as a quality check to see if other selected papers were included and b) to identify any missing papers. The last search step was a forward search pass in which papers that cited the selected papers were identified, again to catch any missed papers. The same time period and inclusion/exclusion criteria were applied to these additional searches. The initial searches, together with the primary assessment for inclusion, were conducted by DN. 10% of the searches were sampled by LW. The inclusion/exclusion results for the selected papers were audited by LW, and joint discussions were held to resolve any issues. Had this not been possible, CT would have been involved as the final arbiter.

Table 1 Search terms for study identification

This systematic review was prospectively registered with the PROSPERO international database of systematic reviews (CRD42021269270) [41].

Inclusion/exclusion criteria

Table 2 shows the inclusion and exclusion criteria that were adopted to define the publications that came within the scope of the review.

Table 2 Inclusion/exclusion criteria

Data extraction

Data extraction was informed by requirements detailed in: ‘Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis’ (TRIPOD) [42]; ‘Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies: The CHARMS Checklist’ [43]; and ‘Protocol for a systematic review on the methodological and reporting quality of prediction model studies using machine learning techniques’ [44]. Table 3 details the data extraction categories. Primary data extraction was conducted by DN and then validated by LW.

Table 3 Data extraction summary

Quality of studies

The Oxford Centre for Evidence-Based Medicine (OCEBM) system [45], previously used by Bernert et al. [46] for a systematic review of artificial intelligence and suicide prevention, was used to assess quality, as many of the models were developed and evaluated in a clinical setting and so merit a level of formal assessment. This system ranks evidence on a scale of 1 to 5, from highest to lowest. The results were added to the data extraction table. OCEBM is designed to provide a hierarchy of levels of evidence for researchers and clinicians whose time is limited; it is well established and widely used. For further information, see Howick et al. as reported in [47].

Results

The search protocol, together with the numbers of studies identified, selected, assessed, and included/excluded, is presented in Fig. 1, compatible with the PRISMA standard [48].

Fig. 1
figure 1

PRISMA flow diagram with results for systematic review study selection [48]. Note: reasons for excluding full-text articles (for example, relating to disorder focus, scope, data sources, specially selected cohorts, or disorder trajectory rather than diagnosis) are included in supplementary data, Table S1

Searches

A total of 744 research papers were identified in the first stage of the literature search (711 after duplicates were removed). Screening the content of abstracts and, subsequently, the main body of each article reduced the sample to 18 eligible articles. The backward citation search of the selected papers identified 22 papers (including duplicates) that were rejected, 10 that were already in the original selection, and two (duplicates of each other) that were added to the selection, resulting in one additional paper (giving 19 in total). The forward citation search did not produce additional papers at the time of the review.

Review articles are not included in the final total but were used for supporting research and were recorded.

Selected studies overview

This review summarises studies that used ML methods to train, validate, and test ML models for predicting depression based on individual-level EHR data from primary care (11 studies) and from a combination of primary and secondary care (8 studies). Table 4 summarises the key features of each study. We now turn to a detailed overview of each of the components described in Table 4.

Table 4 Methods, performance, demographics, evaluation summary for the 19 selected papers [49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67]

Depression definition

The definition of depression and the method of its classification varied across the studies in this review. A combination of depression diagnosis definitions based on NHS Read codes [68], SNOMED (Systematized Nomenclature For Medicine) [29] codes, ICD [12] or DSM [13] based assessments and/or the prescription of antidepressants (ADs) was used in 16 of the 19 studies. Only one study, by Xu et al. [65], used antidepressant prescription alone as a case definition. Three other studies relied on the use of a validated questionnaire such as the PHQ-9 [69] or HADS [16].
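As an illustration of the combined case-definition approach described above (a diagnostic code and/or an antidepressant prescription), the following hypothetical sketch flags patients accordingly; the code lists, drug names and record structure are invented for illustration and are not taken from any of the included studies.

```python
# Hypothetical sketch of a combined depression case definition.
# All code lists and drug names below are illustrative placeholders.
from dataclasses import dataclass

DEPRESSION_CODES = {"F32", "F33"}                # illustrative ICD-10-style codes
ANTIDEPRESSANTS = {"sertraline", "citalopram"}   # illustrative drug names

@dataclass
class PatientRecord:
    diagnosis_codes: set
    prescriptions: set

def is_depression_case(record: PatientRecord) -> bool:
    """Flag a patient as a case if a depression code and/or an AD prescription is present."""
    has_code = bool(record.diagnosis_codes & DEPRESSION_CODES)
    has_antidepressant = bool(record.prescriptions & ANTIDEPRESSANTS)
    return has_code or has_antidepressant

print(is_depression_case(PatientRecord({"F32"}, set())))         # True: diagnostic code
print(is_depression_case(PatientRecord(set(), {"sertraline"})))  # True: AD prescription only
print(is_depression_case(PatientRecord({"I10"}, set())))         # False: unrelated code
```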

Predictors

Here we report on aspects of the predictors, including their definitions, how we grouped them, and their frequency of use.

Definitions

Most predictors were derived from a combination of variables present in the EHR databases (e.g., SNOMED/NHS Read codes and/or the prescription of a drug, in a similar way to the definition used for depression) and were typically categorical. In some cases, additional parameters specifying a time frame for the predictor were also available. Some predictors were derived by pre-processing clinical notes or other textual information. A few studies used non-categorical predictors, such as physiological measurements, for example Body Mass Index (BMI), blood pressure, and cholesterol. This was usually where participants were receiving some form of secondary care, such as in pregnancy for the prediction of postpartum depression (PPD).
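The sketch below illustrates, under assumed field and code names, one simple way such categorical predictors can be derived from coded, time-stamped EHR events within a look-back window; it is not taken from any of the included studies.

```python
# Hypothetical sketch: deriving binary predictors from coded, time-stamped EHR events,
# restricted to a look-back window (field names and codes are invented).
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "code":       ["anxiety", "bmi_high", "anxiety", "smoker"],
    "date":       pd.to_datetime(["2020-03-01", "2021-06-10", "2019-01-05", "2021-11-20"]),
})

index_date = pd.Timestamp("2021-12-31")
lookback = pd.DateOffset(years=2)

# Keep only events inside the look-back window, then pivot to one binary column per code
recent = events[events["date"] >= index_date - lookback]
predictors = (recent.assign(flag=1)
                    .pivot_table(index="patient_id", columns="code",
                                 values="flag", fill_value=0, aggfunc="max"))
print(predictors)
```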

Groups

No formal method for grouping predictors was evident in the studies and, due to the large number of diverse predictors used in different papers, for clarity these were organised into the following groups: comorbidity, demographic, family history, other (e.g., blood pressure), psychiatric, smoking, social/family, somatic, obstetric specific, substance/alcohol abuse, visit frequency and word list/text. Due to this flexibility in definition, there are overlaps between studies concerning into which category a predictor might fall; for example, a blood test may be in “other” or “obstetric specific”. Table 5 shows the predictor groups and commentary on their content.

Table 5 Grouping of predictors from the studies

Figure 2 indicates frequency of predictor use across the selected studies.

Fig. 2
figure 2

The approximate number of studies using different groups of predictors. Note 1: Some papers used multiple categories of predictors and not all categorised them. Note 2: The total number of predictors used was difficult to determine at a summary level, as multiple models used different combinations; in some cases no exact number was provided, only a reference to a set of definitions used as a starting point

Data

The studies in this review used data sets from EHR systems, insurance claims databases and health service (primary and secondary) providers. As such, they store, organise, and define data in a variety of ways that are not expected to be consistent with each other. Most of these data are categorical in nature, though some predictors, such as blood pressure, are continuous variables within a range. In this section we report how each of the included studies dealt with missing or erroneous data and potential sources of bias. We also report whether the authors made their data and/or code publicly available.

Missing or erroneous data

Missing data related either to missing patients and/or missing predictor data; in both cases it may not be possible to know that the data are missing. For missing patients, Koning et al. [55] excluded patients whose records did not identify gender or had no postcode registered. Huang et al. [52] removed entries where patients had less than 1.5 years of visit history. Wang et al. [64] excluded from the analysis PPD patients for whom there were no third-trimester data.

With regard to missing predictor data, Nemesure et al. [58] estimated that, for their data set, missing values were present in 5% of the data overall and for 20 of the 59 predictors they used. In some studies, missing data led to exclusion of cases from the analysis. In Nichols et al. [59], missing smoking status was used to infer non-smoking, on the basis that this information was less likely to be missed for smokers or those with smoking-related disorders. Missing data also led to the exclusion of predictors: again in Nichols et al. [59], the authors did not use ethnicity as it was missing in over 63% of patients, and Zhang et al. [67] excluded ethnicity from their USA dataset for the same reason. Many studies (e.g., Koning et al. [55], Meng et al. [57], Nichols et al. [59]) raised concerns that errors in predictor data could affect the performance, generalizability, and reliability of the models. Errors and missing data were identified as being due to misclassification, measurement errors, data entry and bias, all of which can be difficult to identify and/or correct in EHR data, as noted by Wu et al. [36]. Other studies varied in the strategies used for dealing with missing data. Common approaches were to estimate the value of a missing point or simply to acknowledge that remedial action was not available. Nemesure et al. [58] used an imputation approach for their numerical data, such as blood pressure. Where remedial action was not possible, the patient might be excluded from the study, e.g. Hochman et al. [51].
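As a minimal sketch of one common remedial strategy noted above, the example below imputes missing numerical predictor values (e.g., blood pressure) using median imputation with scikit-learn; the values are synthetic and the approach is illustrative rather than a reproduction of any study's method.

```python
# Hypothetical sketch of imputing missing numerical predictors (synthetic values).
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[120.0, 24.5],
              [np.nan, 31.0],    # missing systolic blood pressure
              [135.0, np.nan],   # missing BMI
              [128.0, 27.2]])

imputer = SimpleImputer(strategy="median")   # fill each column's gaps with its median
X_filled = imputer.fit_transform(X)
print(X_filled)
```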

Sources of bias

Many of the studies (12), for instance Hochman et al. [51], Huang et al. [52] and Koning et al. [55], raised the question of data bias due to cohort selection or collection processes, such as diagnosis, data interpretation and system input. Other studies (12) recognised sources of bias impacting accuracy and generalizability. Jin et al. [53] identified that, as the population in their study was mainly Hispanic and there was incomplete comorbidity predictor data (e.g., for diabetes), both performance and generalizability would be affected. Zhang et al. [67] acknowledged that sourcing their data from an urban academic medical centre could result in limited generalizability of their findings. Hochman et al. [51] suggested that their use of an exclusion criterion removing severely depressed patients, based on the prescription of specific drugs, could also create bias. Zhang et al. [66] chose to exclude ethnicity from their models due to coding inconsistencies and errors, making bias in that area a potential issue. Huang et al. [52] defined depression based solely on antidepressant usage and suggested their sample would be skewed towards the more severely depressed, because it excluded those whose condition was treated with psychotherapy only or those without any treatment. A similar concern regarding changing definitions for the detection of depression during their study period was expressed by Xu et al. [65]. At a broader level, most of the studies were from “WEIRD” (Western, Educated, Industrialised, Rich, Democratic) countries, with the majority (15) from the USA. The remainder were from countries with highly developed IT and healthcare industries such as Brazil, Israel, and India.

Data sharing

The nature of the data, data protection and requirements for anonymity, and privacy issues limited access to source data, though details of the sources themselves were more often made available (e.g., Hochman et al. [51], Nichols et al. [59]).

Modelling

In this review, we identified a wide array of statistical techniques used on EHR data (see Table 4). Many different types of supervised ML were used for classification of depression versus control; regression models (13 studies), Random Forest (8 studies), XGBoost (8 studies) and SVM (7 studies) were the most common techniques. The use of multiple techniques in a single paper was also common; for instance, Xu et al. [65] and Zhang et al. [66] used four or more methods. Geraci et al. [50] was the only study to use a deep neural network-based deep learning approach as the primary component of their model. Figure 3 summarises the methods used in the selected studies.

Fig. 3
figure 3

Machine Learning/Artificial Intelligence Methods for pre-processing and modelling (note: LR variants add up to 11). Abbreviations: ARM, Association Rule Mining; BRTLM, Bidirectional Representation Learning model with a Transformer architecture on Multimodal EHR; DNN/ANN, Deep Neural Network/Artificial Neural Network; KNN, K Nearest Neighbours; LASSO, Least Absolute Shrinkage Selection Operator; LR, Logistic Regression; MLP, Multilayer Perceptron; M SEQ, multiple-input multiple-output Sequence; NB, Naïve Bayes; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting
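The hypothetical sketch below compares several of the supervised techniques listed above on the same synthetic data set using AUC-ROC; scikit-learn's GradientBoostingClassifier stands in for XGBoost-style boosting, and nothing here reproduces any included study's pipeline.

```python
# Hypothetical comparison of common supervised classifiers on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting (XGBoost-style)": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC-ROC = {auc:.2f}")
```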

Temporal sequence was referred to in two studies [49, 60], though other studies refer to the time between predictors and diagnosis (e.g., Meng et al. [56]). In other studies, patterns of predictors were used to determine their predictive probabilities of depression, sometimes using time constraints, such as a primary care visit “within the last twelve months”, or specifically including time-distant events such as birth trauma (Koning et al. [55], Nichols et al. [59]). Only one study, Półchłopek et al. [60], implemented temporal sequence in the EHRs, whereby the order of presentation of symptoms was considered, though Abar et al. [49] speculated that temporal sequence might be used to improve performance by taking causal sequence into consideration.

Most studies (17 out of 19) validated their models, most commonly (12) by splitting the data into a training and a testing set. Cross-validation was also used for model testing (11 out of 19). Generally, testing and validation were carried out by the same team that created the models; only Sau and Bhakta [62] had diagnostic accuracy checked by an independent team. Only one study, Zhang et al. [67], used a separate data set for testing rather than splitting the original data set.
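The following minimal sketch, on synthetic data and assuming scikit-learn, illustrates the two validation strategies described above: a single hold-out train/test split and k-fold cross-validation.

```python
# Hypothetical sketch of hold-out and cross-validated model evaluation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Strategy 1: hold-out split (train on 70%, test on the remaining 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out AUC:", round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 2))

# Strategy 2: 5-fold cross-validation over the full data set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
print("Cross-validated AUC: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))
```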

Code sharing

Code was made available by the majority (12) of studies. In some cases, just the details of the packages that implemented the ML algorithm were provided; for example, Jin et al. [53] reference the R package MASS rather than providing the complete code.

Performance

Several performance metrics were used to evaluate ML models of depression. Among those, researchers reported confusion matrices; the area under the receiver operating characteristic curve (AUC-ROC); and odds ratios/variable importance for predictors.

Confusion matrices (true positives, true negatives, false positives and false negatives) were used in sixteen of the studies, usually in conjunction with other measures, particularly AUC-ROC. Many performance metrics are derived from this information, including accuracy, F1, sensitivity, specificity, and precision. Sensitivity (also known as recall) and specificity were commonly reported, possibly because they give information relating to the discriminative performance of the model and are well understood by practitioners [70].
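As a worked illustration (with invented counts), the snippet below shows how these metrics are derived from the four confusion-matrix cells.

```python
# Worked example of confusion-matrix derived metrics; counts are invented.
tp, fn, fp, tn = 80, 20, 30, 170   # true/false positives and negatives

sensitivity = tp / (tp + fn)                 # recall / true positive rate
specificity = tn / (tn + fp)                 # true negative rate
precision   = tp / (tp + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"precision={precision:.2f}, accuracy={accuracy:.2f}, F1={f1:.2f}")
```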

For sensitivity, reported values ranged from 0.35 (Hochman et al. [51]) to 0.94 (Geraci et al. [50]). For specificity, reported values ranged from 0.39 (Wang et al. [64]) to 0.91 (Hochman et al. [51]). Sensitivity was usually higher than specificity across the models, the exceptions being Hochman et al. [51], who reported a high specificity of 0.91 with a low sensitivity of 0.35 using a gradient boosted decision tree algorithm, and Nemesure et al. [58], who reported a specificity of 0.7 and a sensitivity of 0.55. The highest accuracy, at 0.91, was reported by Sau and Bhakta [62] and the lowest was 0.56 (Zhang et al. [67]). This metric only gives a broad overall picture of correctly predicted results versus all predictions made and gives no indication of the more useful true/false positive rates; it was presented in only six studies.

For the studies that reported performance in terms of the AUC-ROC metric (14), the lowest value for any model was 0.55, from a benchmark model predicting depression in the 12–15 years age group (Półchłopek et al. [60]). The highest AUC-ROC score was 0.94 (Zhang et al. [67], Kasthurirathne et al. [71]). Most reported AUC-ROC values fell between 0.70 and 0.90, with an average of 0.78 and a standard deviation of 0.07. Figure 4 shows the average AUC values achieved in each study.

Fig. 4
figure 4

Average AUC performance across the studies reporting it (mean AUC = 0.78, standard deviation = 0.07)

Generalizability and interpretability

Generalizability was mentioned in 14 studies, for example Jin et al. [53] and Zhang et al. [67]. The points already illustrated under “Sources of bias” (for example, demographically specific participants, factors relating to missing data, and the granularity of data, such as only having social deprivation data at practice level) have negative consequences for generalizability.

Interpretability was identified as a concern in only 3 studies (Koning et al. [55], Nemesure et al. [58], Meng et al. [56]). For interpretability, Nemesure et al. [58] used SHAP (Shapley Additive Explanations) scores, which offer a decision chart and other visualisations for model predictors [72]. None of the included studies provided visualisations other than AUC-ROC diagrams and bar charts; as such, interpretability was not significantly addressed in the selected studies.

Quality of studies

All the included studies achieved a level of 3 (11 studies) or 4 (8 studies) based on the OCEBM hierarchy of levels of evidence (1 to 5, from highest to lowest), insofar as the criteria could be applied to the selected studies, i.e., those areas relating to diagnostic tests only (no interventions). This represents a moderate level of evidence. Overall, the studies had large sample sizes, were usually case series or cohort studies, and applied a clinically recognised benchmark; had there been randomized trials, studies could have been promoted to level 2.

Only 3 studies referenced the use of a formal assessment method such as TRIPOD [42], suggesting that following standards is not yet widespread or that the frameworks are not yet sufficiently established or appropriate. This lack of consistent reporting is a limitation, and the use of standardised frameworks should become the expectation rather than the exception.

Discussion

In this review we have identified three areas of interest as key components to consider for predictive models of depression built on ML with EHR data: generalizability (can the model be reused with, e.g., different populations), interpretability (is the model’s information readily understandable to its users), and performance (does the model meet the needs, e.g. in AUC-ROC, of the purpose for which it is intended). All three would need careful evaluation before moving from research to a clinical application environment.

Generalizability

Generalizability is a significant consideration for medical ML applications: whilst a model may work well in its development and testing environment, this does not guarantee that it will work in a new context [73, 74]. To be widely deployed clinically, the models in the studies would need to be generalizable, i.e., able to work reliably outside of their development environment. Kelly et al. [73] identified the ability to deal with new populations as one prerequisite for clinical success. Areas identified in the studies that could impact generalizability included demographics, sources of bias, inclusion/exclusion criteria, missing/incomplete data, and the definitions of depression and of predictors. All of these were identified in the included studies; for instance, Jin et al. [53] noted that Hispanic participants were highly represented in their data and Zhang et al. [66] excluded ethnicity from their models.

As noted in the Performance sub-section of the Results, the ML method itself did not seem to be overly critical for outcome performance using the EHR data sets in the included studies and it is provisionally suggested that the method itself may be more generalizable than the data to which it is fitted.

Another area that can limit generalizability is the wide variety of EHR data. This varies depending on the source, for example insurance-derived data, a state health service such as the NHS, or a proprietary coding standard such as SNOMED. The coding may, or may not, incorporate a recognised medical standard such as the ICD [12] or DSM [13], amongst others found in the included studies. Although not derived from the studies directly, we note that individual EHR systems are proprietary in nature and there is no universally accepted extant standard detailing how data should be categorised, stored, and organised for them. There are organisations developing, promoting, and gaining accreditation for such standards, for example Health Level Seven International [75] with ANSI (American National Standards Institute) [76]. However, none of these are globally adopted, and the only accepted standard developed by the World Health Organization (E1384) was withdrawn in 2017 [77]. Lack of standardisation is currently a barrier to portability for individual applications. Consequently, it is likely that models are data-source specific to a greater or lesser extent. Further work needs to consider how this can be addressed.

The studies in this review differed in how depression was defined and in the range of predictors selected and their definitions. As mentioned, a commonly used approach was to combine EHR diagnostic codes with the prescription of an antidepressant. This can result in too many cases being classified as depressed, because antidepressants are used for a wider range of conditions. Similar issues apply to the definition of predictors. In combination, this restricts the generalizability of any models produced.

Another factor for generalization is the robustness of the models and their replicability. None of the studies included replication of their results; only Sau and Bhakta [62] used an independent team for the verification of results, though the majority employed recognised validation techniques and 12 used a separate hold-out data set. This last point is also relevant to establishing whether models have been overfitted to their data; this possibility was not reported in any of the studies, despite being known as a serious potential issue for ML models in general. Reducing bias and conducting independent validation and testing are recommended for future work involving the prediction of depression using ML with EHRs.

Interpretability

Interpretability was only identified as a concern in a few studies. However, clinical practitioners may wish to know the explanation for an ML algorithm’s predicted diagnosis so they can fit it into a broader diagnostic picture rather than treating it as a “black box”, as described by Cadario et al. [78]. Similarly, Vellido [79] and Stiglic et al. [80] also considered that interpretability and visualisation are important for the effective implementation of medical ML applications. This may be as simple as listing the specific predictors that contributed to the outcome, for example anxiety, low mood, chronic pain or similar. Of the included studies, Nemesure et al. [58] used SHAP (Shapley Additive Explanations) scores, which have been used in clinical applications [81] to aid interpretability, again by identifying the most important predictors. Techniques such as SHAP and LIME (Local Interpretable Model-agnostic Explanations) [82] offer visualisations which may be more intuitive and provide more easily digested information. However, none of the other included studies provided visualisations other than AUC-ROC diagrams and bar charts of predictors. That said, there is a long-standing, unsettled debate regarding interpretability going back to the 1950s: providing interpretive data to support a practitioner, as opposed to a “black box” approach where the diagnosis made by the application is simply accepted, can lead to lower diagnostic performance overall [83, 84]. It is recommended that future studies not only develop predictive models but also trial their use, for example with primary practitioners, support staff and/or patients, offering different forms of interpretable/black box output and assessing acceptability. This need not be done, initially, in a clinical setting, but can be piloted and demonstrated in prototype form in a controlled environment. It can then be assessed using a combination of qualitative and quantitative methods, e.g., surveys, interviews, focus groups and panels, prior to moving to clinical trials.
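As a hedged illustration of the SHAP approach referred to above (assuming the shap package and scikit-learn, with synthetic data and a generic gradient-boosting model rather than any study's actual model), the sketch below ranks predictors by mean absolute SHAP value, the kind of global importance summary that could support interpretability.

```python
# Hypothetical sketch of ranking predictors by SHAP importance (synthetic data).
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # fast, exact explanations for tree ensembles
shap_values = explainer.shap_values(X)  # one value per sample and predictor

# Rank predictors by mean absolute SHAP value (global importance)
importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(importance)[::-1]:
    print(f"feature_{i}: mean |SHAP| = {importance[i]:.3f}")
# shap.summary_plot(shap_values, X) would produce the beeswarm visualisation
```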

Performance

Here we consider what may be limiting the performance of the models with respect to their intended use as a means of identifying depression. One limiting factor on performance in the included studies relates to the definition of depression itself and the predictors used. Defining depression accurately is critical, as this definition is used to train the ML application, a point raised by Meng et al. [57]. In the studies reviewed here, a combination of diagnostic and drug codes within the EHRs was typically used. Using the prescription of antidepressants as part of the definition may misidentify too many cases, a point raised in the selected studies by, for example, Qiu et al. [61] and Nichols et al. [59]: ADs are prescribed for other conditions including anxiety [85, 86], chronic pain [87, 88], obsessive compulsive disorder [89, 90], post-traumatic stress disorder [91, 92] and inflammatory bowel disease [93]. Of the included papers, Xu et al. [65] suggested that under-identification of depression cases could also occur for patients receiving treatment via private care or an alternative service provider.

The prevalence of predictors can be artificially boosted, as suggested by Koning et al. [55] and Nichols et al. [59], where primary care physicians who think a patient has depression may identify or suspect a precursor or comorbidity, for example other mental health conditions such as low mood or anxiety. There is strong evidence that family history of depression; alcohol, drug, physical and sexual abuse; and comorbidity with other mental health conditions are strong predictors of depression [94,95,96,97]. However, these data appear to be under-recorded, resulting in the removal of important predictors due to low prevalence; for example, Nichols et al. [59] removed family history data due to its low prevalence (< 0.02%). This would be expected to have a negative impact on performance. Identifying consistent and valid definitions for depression and any predictors used is a necessity.

The studies in this review reported an overall average AUC-ROC of 0.78 with a standard deviation of 0.07 (Fig. 4). This compares well with primary care, where up to half of depression cases are missed at baseline consultation, improving to around two thirds being diagnosed at follow-up [38, 40]. An earlier paper by Sartorius et al. [98] reported that only 39.1% of cases of ICD-10 current depression were identified by primary care practitioners. Based on the studies, we identified potential areas that might support improvements in the performance of the models. A key area is that of over/under diagnosis; as mentioned in our Background section, early diagnosis and thus early intervention can benefit those with depression [25, 99]. However, there is a broader argument with regard to over-diagnosis (i.e., false positives) in terms of potentially wasting resources or stigmatising patients.

Although some studies suggested that using more sophisticated techniques should improve performance, we noted that results from simpler methods such as logistic regression were often comparable to those obtained using more complex ones such as Random Forest and XGBoost (e.g., Zhang et al. [67]). Christodoulou et al. [100] echoed this conclusion in their systematic review of clinical prediction using ML, where they saw similar performance for logistic regression compared with ML models such as artificial neural networks, decision trees, Random Forest, and support vector machines (SVM). Geraci et al. [50] employed a deep neural network (deep learning) as their main modelling technique and Nemesure et al. [58] used one as a component in a larger ensemble model; however, neither demonstrated performance benefits from its use. Even if higher performance could be obtained using deep learning, it is important to note that small amounts of noise or small errors in the data can cause significant reliability issues, with misclassification arising from very small perturbations in the data [101, 102]. The use of more sophisticated techniques to improve performance is not supported by this review.

How else might performance be improved? The use of non-anonymised data, sourced from within a primary or secondary care facility, something that is more achievable in a clinical than a research setting, could be beneficial. For example, in the Nichols et al. [59] study social deprivation indices were only available at a regional/practice level and inspection of their model suggests that social deprivation has little impact on prediction of depression. This is inconsistent with expectation, as supported by Ridley et al. [103] who showed that there is a link between increased social deprivation and the probability of developing depression. Having this data at an individual level might be expected to increase the performance of a model. However, this is likely to only be achievable in a clinical trial of an application. Alternatively, the use of synthetically generated EHR data [104, 105] removes the patient confidentiality and related ethical constraints that come with real data and would allow all aspects of a model to be fully evaluated as if with non-anonymous patient data.

Another approach is to use more information relating to time in predictive models; EHRs typically time-stamp entries, so it is known when a predictor is activated. Półchłopek et al. [60] considered temporal sequence in EHRs. They were concerned that techniques including support vector machines and random forests identify predictors that affect the outcome but do not identify the effect of sequence on that outcome. They looked at the improvement that could be found by using temporal patterns in addition to non-time-specific predictors and noted a small positive effect. Abar et al. [49] also speculated that temporal sequence might be used to improve model performance. There are techniques that might be used to do this; for example, time series analysis methods such as Gaussian processes, which are capable of coping with the sparse nature of EHR data [106], have been used to make predictions for patients with heart conditions. We recommend exploring the use of more time-dependent factors in building predictive ML models for depression.
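One simple, illustrative way of adding temporal information, sketched below under assumed field names (and not the approach of Półchłopek et al. [60]), is to derive a "days since most recent event" predictor for each code from time-stamped EHR entries.

```python
# Hypothetical sketch: deriving a "days since most recent event" temporal predictor
# per code from time-stamped EHR entries (field names and codes are invented).
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "code":       ["anxiety", "anxiety", "low_mood", "anxiety"],
    "date":       pd.to_datetime(["2019-02-01", "2021-05-20", "2021-08-02", "2018-07-15"]),
})
index_date = pd.Timestamp("2021-12-31")

latest = events.groupby(["patient_id", "code"])["date"].max().reset_index()
latest["days_since"] = (index_date - latest["date"]).dt.days
temporal_predictors = latest.pivot(index="patient_id", columns="code", values="days_since")
print(temporal_predictors)   # NaN where a patient never had the code
```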

Although missing data is more of a concern in terms of generalizability, some studies identified it as an opportunity to improve performance. Kasthurirathne et al. [54] noted that missing EHR data can reduce model performance and suggested that this could be mitigated by merging with other data sources, for example related insurance claims. Nichols et al. [59] used missing smoking data as a predictor, and it had a positive effect in their model. Missing data is potentially of significance in itself and is an opportunity for further study.

Strengths and limitations

As far as we are aware, this is the first systematic review focussed on the use of EHRs to predict depression using ML methods. The choice of journal databases and the date range covered by the searches means that the studies identified provide a sound basis for comparison. The data extraction protocol was informed by established standards [42,43,44] to best identify the data needed to support meaningful and repeatable analyses.

A limitation of this study is that the inclusion criteria focused on study titles and keywords, which may have led to some ML studies using EHRs being missed; this was mitigated using backward and forward citation searches. Additionally, the variety of study designs, including case control, cohort, and longitudinal studies, precluded the possibility of using some of the more traditional quality assessment tools; we did, however, as stated in the Methods, use OCEBM, which has been used in previous ML systematic reviews. The categorisation, definition, and identification of the numbers of predictors used within models was sometimes difficult to establish, limiting the scope of the information presented. It is also likely that the included studies are culturally specific, as they focused on “WEIRD” populations.

Conclusions

In conducting this systematic review, we have shown that there is a body of work supporting the potential use of ML techniques with EHRs for the prediction of depression. This approach can deliver performance that is comparable to, or better than, that found in primary care. It is clear there is scope for improvement, both in the adoption of standards for conducting and reporting the research and in the data itself. The development of an acceptable global standard for EHRs would improve generalizability and portability. This would involve greater promotion and development of standards for research, such as TRIPOD [42], and for data interchange, such as Health Level Seven International [75], to support ML/EHR applications. Future work could pay more attention to generalizability and interpretability, both of which need to be addressed prior to trialling implementation in the clinic. It is also worth investigating areas where performance can be improved, for example by including temporal sequence within the models, better selection of predictors, and the use of non-anonymised or synthetic data. Our review suggests depression prediction using ML/EHRs is a worthwhile area for future development.