Introduction

The importance of early detection of a severe mental illness (SMI), such as schizophrenia or bipolar disorder, is widely recognized. Substantial literature suggests that prompt interventions improve the clinical outcome of individuals with psychotic symptoms and may even prevent or at least delay the onset of psychosis [1,2,3]. However, there are currently no “gold standard” instruments to identify the emergence of SMI. Two clinical instruments have been widely recognized for early detection: the Comprehensive Assessment of at-Risk Mental States (CAARMS) [4] and the Structured Interview for Prodromal Syndromes (SIPS) [5]. It has been reported that the transition rate to psychosis among individuals identified as high-risk according to these instruments is approximately 36% after 3 years of follow-up [6]. Although these instruments are clearly sensitive in detecting susceptibility to developing SMI, they lack specificity: a large percentage of high-risk individuals will not transition to a full psychotic episode and may instead present poor functional outcomes or other comorbid mental disorders [7]. Moreover, those labeled as high-risk who do not transition will bear the burden of psychiatric stigma and/or may receive inappropriate care. Therefore, further work is needed to establish a detection system for SMI that increases the accuracy of disease prediction, thereby minimizing the risk of unnecessary stigmatization and enabling clinicians to offer appropriate management according to the needs of each individual.

Another challenge is that diagnostic instruments for SMI are based on symptomatic criteria depending mainly on patients’ reports. Unlike other diseases such as cancer, where diagnosis and prognosis assessments rely on specific biomarkers, there is a lack of biological tests approved for clinical use [8] in psychiatry. This hampers efforts to define reliable clinical phases of psychotic illnesses [9], thereby complicating the implementation of appropriate screening and monitoring approaches. Hence, an increasing number of researchers are turning their interest to developing accurate biomarkers for mental health diseases such as SMI.

Among potential biomarkers, the retina has gained particular interest in recent years because it shares an embryonic origin with the brain, which suggests that structural and functional changes in the brain may be mirrored by retinal changes [10, 11]. The electroretinogram (ERG), a well-established instrument commonly used to assess the functional electrical response of retinal photoreceptors (i.e., rods for scotopic vision and cones for photopic vision [12]), has been shown to be a promising tool to identify SMI, given that ERG anomalies in patients with psychotic disorders were found in several studies [13,14,15]. Moreover, our research team has already reported very high accuracies when distinguishing patients from healthy control subjects (91% for schizophrenia and 89% for bipolar disorder) using ERG measurements [16]. Our previous studies also reported ERG anomalies even in young offspring at genetic risk of SMI [17], and we observed an association between ERG anomalies and cognitive impairment in offspring at an early age, before the appearance of symptoms [18].

The present study aimed to advance the development of the ERG as a biological tool to monitor susceptibility to SMI. For this purpose, a predictive regression model was developed using ERG measures. Only photopic (cone) responses were included in the model because, beyond the effectiveness of the ERG as a biological tool, we were also interested in the efficiency of its clinical use (the photopic ERG requires only 10 minutes of light adaptation, in contrast to the scotopic ERG, which requires 20–30 min of dark adaptation). Because our focus was to predict vulnerability to SMI before the occurrence of first symptoms, whether for schizophrenia or bipolar disorder, both diseases were grouped together and considered as SMI. This grouping is also supported by the recognized genetic overlap between schizophrenia and bipolar disorder, which suggests that they may share susceptibility [19, 20]. Since special attention should be given to the harms to false-positive individuals (who may be unnecessarily targeted) and false-negative individuals (who might miss further monitoring), the clinical utility of the ERG regression model was assessed using a technique that computes a “net benefit” value, which weights true-positives and false-positives differently [21]. Additionally, three levels of certainty (i.e., probable SMI, uncertain, and no disease) instead of two were established so that the intermediate category would identify individuals for whom the ERG may be inconclusive. Hence, individuals in the uncertain intermediate level will not immediately receive a psychiatric label but will still benefit from further monitoring.

Methods

Data source and study population

This is a cross-sectional study approved by the Neuroscience and Mental Health Research Ethics Committee of our institution (CIUSSS Capitale-Nationale). The database was previously analyzed to show the high accuracy of ERG prediction [16]. However, because the present objective is to enhance the development of a preliminary screening instrument to detect susceptibility to SMI, whether it is schizophrenia or bipolar disorder, subjects with the two diagnoses were combined, obtaining a total sample of N = 301 SMIs who were unrelated and stabilized outpatients. Participants were referred by their treating psychiatrists from a university hospital or the regional psychiatric department from Quebec City and Beauce region of the Province of Quebec. Inclusion criteria were having a diagnosis of schizophrenia or bipolar disorder according to the DSM-IV criteria, being between 21 and 55 years old, and having normal vision with no known ocular pathology. Exclusion criteria were having brain and metabolic disorders, being pregnant, having used drugs including cannabis in the past 24 hours, having traveled two time zones within 1 month before the experiment, and working on night shift (which could disrupt the retinal internal clock) [17, 22].

As detailed in our previous work [16], healthy control subjects were recruited through advertisements from the same population of Quebec. Exclusion criteria were the same as for patients, with the addition of having any Axis I DSM-IV diagnosis and having a positive family history of schizophrenia, bipolar disorder, or major depressive spectrum disorders. Signed consent was obtained from all participants.

ERG measurements included as predictors in the regression model

ERG recordings were performed in nondilated eyes as per Gagne et al., 2010 [23] using Espion (E2, E3) Systems and a color dome ganzfeld (Diagnosys LLC, Lowell, MA) with a background set at 80 cd/m², and recordings from both eyes were obtained with a DTL electrode placed into the conjunctival sac. The reliability and reproducibility of the ERG protocol and acquisition techniques used in this study have been extensively demonstrated [23,24,25]. Further details about the protocol can be found in an earlier publication [16]. Briefly, two components of a typical ERG waveform were measured: the a- and b-waves. For each component, two parameters were recorded: the amplitude (amp) and the latency (lat). Each of these parameters was measured under two conditions: at a fixed luminance of 7.5 cd·s/m² and at Vmax (defined as an average of ERG responses obtained at luminances of 13.3, 23.7, and 50.0 cd·s/m² as per Hébert et al. 2017) [26]. ERG technicians were blinded to the participants’ diagnosis. In addition, the acceptability of participating in the study and of being assessed with ERG was very high (95%) among affected participants.

To enhance the clinical usefulness of the ERG, which has already been shown to be a potential biomarker [16], special attention was given to the practicality and ease of use of the instrument. Because the cone ERG assessment requires a light adaptation of only 10 minutes, as opposed to the rod assessment, which requires 20 to 30 minutes of dark adaptation, two logistic regression models were developed a priori (using the backward and forward stepwise method). One model included both cone and rod ERG parameters, and the other included only cone ERG parameters. Both models yielded almost identical performance accuracies (see Fig. S1 in the supplements); thus, only the cone ERG parameters were used in this study in an attempt to minimize the discomfort of participants.

Statistical analysis

All statistical analyses were conducted in R version 4.0.3. The first portion of the analysis was the development of the regression model (using the glm [27] and the stepAIC functions [28]). For this, the total sample was randomly split using an 80:20 ratio into a training dataset (241 SMI and 160 healthy controls) and a testing dataset (60 SMI and 40 healthy controls). A logistic regression model was then developed in the training dataset to predict the clinical status (SMI vs. control), using cone ERG measurements as predictive variables. The covariates pupil size, age, and sex were selected according to our previous publications [22, 26, 29, 30]. The backward and forward stepwise method was applied for variable selection. All regression model assumptions (independence, normality, and absence of multicollinearity or extreme outliers) were adequately met.
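As an illustration of the 80:20 split described above, the following minimal Python sketch reproduces the reported group sizes. It is not the authors' R code: the function name `stratified_split`, the seed, and the assumption that the split was stratified by clinical status are ours (the reported counts are consistent with a stratified split, but the text only states that the split was random).

```python
import random

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices into train/test, sampling each class separately (assumed design)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(train_fraction * len(idx))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)

# 301 SMI (label 1) and 200 healthy controls (label 0), as in the study sample.
labels = [1] * 301 + [0] * 200
train_idx, test_idx = stratified_split(labels)
train_smi = sum(labels[i] for i in train_idx)
test_smi = sum(labels[i] for i in test_idx)
print(len(train_idx), train_smi, len(test_idx), test_smi)
```

With these inputs the training set contains 401 participants (241 SMI, 160 controls) and the testing set 100 participants (60 SMI, 40 controls), matching the counts given in the text.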

In the second portion, the quality of the predictive ERG model was evaluated. For this, model stability and possible overfitting were assessed using leave-one-out cross-validation with the caret package [31]. Additionally, internal validation was evaluated by estimating the apparent performance (in the training dataset) using two indicators: Nagelkerke’s R2 and the Brier score. Calibration was assessed visually and with the Hosmer–Lemeshow test as an indicator. Then, external validation and the discriminative ability (represented by the area under the ROC curve, AUC-ROC) were evaluated using the testing dataset. This study follows the TRIPOD statement criteria for reporting a prediction model [32, 33].
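Two of the indicators named above, the Brier score and the AUC-ROC, can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R code; the function names and the toy outcomes and probabilities are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 is best)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc_roc(y_true, y_prob):
    """Probability that a random case scores higher than a random control (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = SMI, 0 = control; y_prob = predicted probability of SMI.
y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.5, 0.2]
print(round(brier_score(y_true, y_prob), 3), round(auc_roc(y_true, y_prob), 3))
```

The pairwise definition of the AUC used here is quadratic in sample size, which is acceptable for samples of a few hundred participants.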

The third portion of the analysis assessed the clinical utility of the regression model using the decision curve analysis technique [34, 35]. Under this technique, a “net benefit” value is calculated using Formula 1, where pt represents a threshold probability of developing SMI and n is the total sample size.

$$\text{Net Benefit} = \frac{\text{True-Positive Count}}{n} - \frac{\text{False-Positive Count}}{n}\left(\frac{\text{pt}}{1 - \text{pt}}\right)$$
(1)

By weighting the false-positives according to pt, it is possible to represent a theoretical relationship between the pt of the predicted disease and the relative value of false-positive and false-negative results. Then, to interpret the potential clinical value of the regression model, two reference net benefit values are calculated for two extreme hypothetical clinical situations [36]: 1. All participants are considered positive (hence, 0 false-negatives), so they all receive further intervention, and 2. All are considered negative (hence, 0 true- and false-positives), so no intervention is offered. The optimal strategy is the one with the highest net benefit value. This technique assumes that pt represents the threshold at which a practitioner or a patient would decide to pursue a future intervention (e.g., early treatment or monitoring of changing symptoms). Thus, a “reasonable range of risk threshold” will be defined; this “reasonable” range means that no one would reasonably use a pt outside that range to decide upon treatment [36].
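Formula 1 and the two reference strategies can be illustrated with a short Python sketch. The function names and toy data are hypothetical, and the rule that a participant is classified as positive when the predicted probability is at least pt is the usual convention in decision curve analysis, assumed here rather than taken from the text.

```python
def net_benefit(y_true, y_prob, pt):
    """Formula 1: TP/n - (FP/n) * pt / (1 - pt), calling y_prob >= pt positive."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    return tp / n - fp / n * (pt / (1 - pt))

def net_benefit_treat_all(y_true, pt):
    """Everyone is positive: TP = all cases, FP = all non-cases."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * (pt / (1 - pt))

def net_benefit_treat_none():
    """No one is positive: zero true- and false-positives, so net benefit is 0."""
    return 0.0

# Hypothetical toy data: 1 = SMI, 0 = control.
y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.5, 0.2]
pt = 0.3
nb_model = net_benefit(y_true, y_prob, pt)
nb_all = net_benefit_treat_all(y_true, pt)
print(round(nb_model, 3), round(nb_all, 3), net_benefit_treat_none())
```

A decision curve is obtained by evaluating these three quantities over a grid of pt values and plotting them against pt; the strategy with the highest curve at a given pt is preferred at that threshold.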

Finally, the final ERG regression model was applied to the testing dataset, and three levels of predictive certainty were established: 1. Most likely SMI, 2. Uncertain, and 3. Most likely no disease. Using these results, the following predictive accuracy measures were calculated: sensitivity, specificity, and accuracy. The cutoff values defining the three levels were obtained by comparing two trichotomization methods according to their accuracy measures. The first method is a modified ROC analysis called two-graph ROC (TG-ROC) [37]. This method selects the ranges of model scores that are most certain, and therefore best suited, for deciding for or against a diagnosis. To this end, two thresholds with a preselected sensitivity and specificity of 90% were established. As a result, an intermediate or borderline range between the two thresholds was identified, and only the results outside the intermediate range were considered as certain. The second method is called the interval of uncertainty [38]. It defines an interval around the intersection where the “health” and “disease” distributions are equal. To do so, an R function [38] counts the true negatives and false negatives for all possible decision thresholds that are lower than the intersection and counts the true positives and false positives for all the decision thresholds above the intersection. Then, it searches all possible lower and upper combinations and chooses uncertain intervals with specificities and sensitivities below a given value of 0.55; thereby, it defines the model scores that are better not used for a diagnosis (i.e., the uncertain level).
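The two-threshold idea behind TG-ROC can be sketched as follows. This is a simplified illustration under our own assumptions, not the published TG-ROC procedure or the authors' code: the lower cutoff is taken as the highest observed score at which sensitivity is still at least 90%, the upper cutoff as the lowest observed score at which specificity reaches at least 90%, and all function names and toy scores are hypothetical.

```python
def sensitivity(y_true, scores, cutoff):
    """Fraction of cases with score >= cutoff."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s >= cutoff for s in pos) / len(pos)

def specificity(y_true, scores, cutoff):
    """Fraction of controls with score < cutoff."""
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    return sum(s < cutoff for s in neg) / len(neg)

def two_cutoffs(y_true, scores, target=0.90):
    """Return (lower, upper) cutoffs meeting the target sensitivity and specificity."""
    candidates = sorted(set(scores))
    lower = max(c for c in candidates if sensitivity(y_true, scores, c) >= target)
    upper = min(c for c in candidates if specificity(y_true, scores, c) >= target)
    return lower, upper

def classify(score, lower, upper):
    """Map a model score to one of the three levels of predictive certainty."""
    if score >= upper:
        return "most likely SMI"
    if score < lower:
        return "most likely no disease"
    return "uncertain"

# Hypothetical toy scores: 10 cases (1) followed by 10 controls (0).
y_true = [1] * 10 + [0] * 10
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4, 0.2,
          0.05, 0.1, 0.15, 0.25, 0.3, 0.35, 0.45, 0.55, 0.65, 0.78]
lower, upper = two_cutoffs(y_true, scores)
levels = [classify(s, lower, upper) for s in scores]
print(lower, upper, levels.count("uncertain"))
```

When the model separates the groups well enough that the lower cutoff is not below the upper cutoff, `classify` simply returns no "uncertain" labels, so the intermediate zone appears only where the two score distributions overlap.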

Results

Table 1 summarizes the clinical and demographic characteristics of the subjects and shows that no significant differences were found between the 301 patients with SMI and the 200 healthy controls. Fifty percent of the SMI subjects were diagnosed with bipolar disorder, while the other 50% had schizophrenia. Prescribed medications are also described in Table 1. There were no missing data for the ERG measures, and as expected, the unadjusted associations with the outcome (SMI) showed statistically significant relationships for most of the ERG parameters (i.e., a-wave amplitude, b-wave amplitude, and b-wave latency); further details are presented in Table S1 in the supplemental section.

Table 1 Clinical and demographic characteristics and comparison between SMI and control subjects

Quality assessment of the predictive model

The final best model yielded by the stepwise method included the following variables: a-wave amp fixed, b-wave lat fixed, b-wave lat Vmax, a-wave lat Vmax, age, and sex. The full model that includes all the variables is presented in Table S2 in the Supplements. Table 2 displays the internal validation of the best model, showing that the apparent performance was good, with a Brier score of 0.16 (1.0 would be the worst score) and a Nagelkerke pseudo-R2 of 0.45 (the higher, the better). Visually, the calibration was good (see Fig. 1), and the Hosmer–Lemeshow test showed a p value of 0.74, which also indicates a good regression fit. The best model presented a high AUC-ROC of 0.85 and an accuracy of 0.77. All the parameters remained stable after leave-one-out cross-validation, with a pseudo-R2 of 0.41, a Brier score of 0.16, and a Hosmer–Lemeshow p value of 0.66.

Table 2 Model performance, discriminative ability, and internal and external validation
Fig. 1 Calibration plot for the best model on the training data (n = 401)

External validation using the testing dataset is also presented in Table 2; interestingly, with this dataset, better performance measurements were found (pseudo-R2 of 0.48, Brier score of 0.15, and good calibration visually and statistically: p value = 0.72). The discriminative ability remained high, with an accuracy of 0.81 and an AUC-ROC of 0.87 (CI: 0.80–0.94). Since medication may represent a confounder to consider in our results, a sensitivity analysis was performed, including medication in the final best model (see Table S3 in the supplements); the results remained robust and thus are not presented in this study. In addition, our previous publications showed no important impact of this variable in the regression analysis [16].

Clinical utility of the regression model

Figure 2 displays the decision curve analysis assessed on the testing dataset. This illustrates that the net benefit of using the ERG predictive model to make a clinical decision exceeded that of the hypothetical situation of intervening with all participants and exceeded the net benefit of no one receiving any intervention. The clinical utility of the ERG remained superior for predictions ranging from 0.14 to 0.95, which can be assumed to comprise the “reasonable range of risk threshold” [39] for most clinical practitioners.

Fig. 2 Clinical usefulness of the ERG regression model for SMI prediction: decision curve. Note. Training data was used (n = 401). Clinical utility of the ERG regression model in terms of net benefit, compared with providing intervention to all participants and with providing intervention to none

The three levels of predictive certainty are presented in Table 3. The two trichotomization methods applied to the model using the testing dataset yielded very similar cutoff points with very high predictive performance (all accuracy measures between 0.88 and 0.91). The TG-ROC method performed slightly better (accuracy = 0.90, sensitivity = 0.91, and specificity = 0.89) than the uncertainty interval method (accuracy = 0.89, sensitivity = 0.89, and specificity = 0.88).

Table 3 Accuracy of the regression model according to the TG-ROC and uncertainty interval methods

Discussion

Prior work has revealed that ERG parameters provide a very accurate distinction between patients with schizophrenia or bipolar disorder and healthy controls [16]. The present study proposes a simplified regression model that further supports the utility of the ERG as a biological instrument to monitor the risk of SMI (grouping both schizophrenia and bipolar disorder). The results confirmed the very high accuracy and improved the efficiency of the clinical use of the ERG by relying only on the cone ERG assessment, which is less time consuming and makes the experience more comfortable for the patient.

The apparent performance of the predictive model showed very good discrimination, which remained robust after external validation using a testing dataset. Discriminative values after trichotomization (sensitivity = 0.91, specificity = 0.89, and accuracy = 0.90) and an AUC-ROC of 0.87 remained notably high, especially when compared to the other proposed biomarkers: event-related potential (accuracy = 0.79, sensitivity = 0.78, specificity = 0.80) [40], electroencephalography (sensitivity = 0.89, specificity = 0.47) [41], blood-based laboratory test (sensitivity = 0.83, specificity = 0.83, AUC-ROC = 0.89) [42], and eye movement abnormalities (accuracy = 0.98) [43]. The results also outperformed other detection instruments based on symptomatology, such as CAARMS and SIPS, for which the pooled sensitivity and specificity estimates were 0.66 and 0.73, respectively, as reported by a meta-analysis [44]. Compared to all of the above instruments, the ease and speed of recording of the ERG make it a strong candidate for clinical use.

The potential of photopic ERG as a biomarker has also been supported by other publications that reported significant differences in cone functions between SMI patients and healthy controls [13, 45, 46], although the protocols of ERG measurements may differ across studies. In addition, there is growing evidence suggesting that structural and functional retinal changes may reflect progressive brain neurodegeneration in mental disorders [10, 11, 47], as seen in multiple sclerosis and Alzheimer’s disease [48,49,50].

This study also provides insights into the clinical utility of the ERG as an instrument that can be used for decision-making (e.g., monitoring the risk of SMI or offering intervention). The net benefit of the ERG model remained superior to that of the extreme hypothetical scenario of assuming all participants are positive for SMI and hence require further intervention. The superiority was evident for threshold probabilities between 0.14 and 0.95. However, there are some limitations with this technique. First, there is no “gold standard” approach yet against which to compare the decision curves of Fig. 2, and the use of decision curve analysis is still very new to psychiatric research, which could explain the absence of other potential biomarkers assessed with this technique. Thus, the decision curve analysis presented in this study should be interpreted cautiously as an illustration of the potential value of our model. An intervention based on this type of decision tool still needs further investigation before clear decision guidelines can be reached. Second, the decision curve analysis relies on the prevalence of the disease of interest [51], meaning that future research targeting a different population at an earlier stage of the disease is expected to have a different prevalence, which will have an important impact on the resulting decision curves.
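The dependence on prevalence can be made explicit with a short worked equation. Writing π for the prevalence of SMI in the sample (our notation, not used elsewhere in the article), the “all positive” strategy has a true-positive count of πn and a false-positive count of (1 − π)n, so Formula 1 reduces to:

$$\text{Net Benefit}_{\text{all}} = \pi - (1 - \pi)\left(\frac{\text{pt}}{1 - \text{pt}}\right), \qquad \text{Net Benefit}_{\text{all}} = 0 \iff \text{pt} = \pi$$

For a sample with 60 cases among 100 participants, as in the testing dataset, π = 0.60, so the “all positive” reference curve falls to zero at pt = 0.60. In a population with a lower prevalence, that curve would reach zero at a correspondingly lower pt, which shifts the range of thresholds over which any model can show an advantage.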

Another limitation is the cross-sectional nature of this study which, despite providing valuable support for the development of the ERG as a biomarker, is still considered a preclinical exploratory phase in the sense that the present model predicts group membership (SMI patients or healthy subjects) rather than future development of SMI. Finally, overfitting was a central concern; however, external validation was attempted using a testing dataset, and the results remained robust. Future longitudinal studies and replication in different samples are needed to address all the limitations cited above.

Nevertheless, the main strength of this study is the large sample of 501 participants (301 with SMI and 200 healthy controls); this allowed us to obtain more precise estimations and capture the diversity of the population. Another major strength is the introduction of uncertainty in the diagnostic levels. It is well known that a psychiatric diagnosis carries a social stigma that can worsen mental health and functional impairment. The uncertainty level is represented here by an intermediate zone where the prediction values are not precise enough to make a diagnosis. Therefore, it allows practitioners to make a clinical decision about the next course of action with minimal misclassification rates and improves the accuracy of the correct classifications. In other words, it provides an option to offer further monitoring to inconclusive patients without the burden of a psychiatric stigma and increases diagnostic confidence.

Conclusion

The ERG predicted SMI risk with a high level of accuracy when uncertainty was accounted for. Given that ERG is a noninvasive instrument already available in clinical settings and that a short photopic protocol may be sufficient, it could have the potential to become a useful clinical decision tool to intervene among at-risk subjects. Nevertheless, details on the moment of introduction during the developmental trajectory of SMI and the corresponding type of clinical decision need to be further investigated in longitudinal cohorts.