Introduction

Given the generally weak associations between clinically defined psychiatric diagnoses and specific neurobiological alterations of the central nervous system, the development and validation of biomarkers has been a major goal in psychiatric research for decades1. Many studies have combined a large number of variables and/or multiple biomarkers using multivariate pattern recognition approaches2,3,4,5,6,7,8,9. There is growing interest in parameters affecting the stability of these results, including internal and external validation procedures as well as the sizes of training and validation samples, as validation sample size must be regarded as a major risk factor for misestimation10,11,12,13. This finding might explain why larger data sets tend to display weaker (presumably closer to “true”) accuracies (e.g. in the classification of depressive patients vs. healthy controls, 60–65% accuracy based on structural MRI data in N = 2240 participants11 or 54–56% accuracy based on different neuroimaging modalities in N = 1809 participants14) than many previous findings in small samples (e.g. Refs. 15,16).

One well-established and quantifiable biomarker in psychosis research is smooth pursuit eye movements (SPEM). SPEM testing involves having individuals visually track a small moving object, relying on continuous sensorimotor processing of perceptual motion signals into dynamic adjustments of motor actions17. Thus, specific SPEM parameters reflect the ability of the brain to continuously receive visual motion information and simultaneously generate, monitor and adjust motor output accordingly to provide a clear visual percept of a moving object of interest. Since as early as 1908, numerous studies have emphasized SPEM dysfunctions as a biomarker for schizophrenia and other psychotic disorders, indicating specific impairments of visual sensorimotor processing not only in stable but also in early states of the disorder18,19,20,21,22,23,24,25,26.

The assessment of SPEM was recently included in studies initiated by the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP) consortium aiming to develop a biologically valid framework (e.g. biologically defined phenotypes) for psychotic disorders (i.e. stable probands with schizophrenia, schizoaffective disorder, or psychotic bipolar-I disorder)9,27,28,29,30. With regard to psychosis symptoms in the B-SNIP1 sample, Reininghaus and colleagues31 reported evidence of a transdiagnostic dimension underlying affective and non-affective psychotic symptoms. In line with this, results from the first recruitment period of the B-SNIP study (B-SNIP1, N = 674) indicate SPEM deterioration not only in schizophrenia but also in probands with schizoaffective and bipolar disorder with psychotic symptoms20. These findings, consistent with smaller sample studies25, imply that SPEM deficits can be regarded as a transdiagnostic biomarker for psychosis.

To determine the specificity of the relationship between psychotic symptoms and SPEM performance, it is essential to study probands with disorders that lack psychotic symptoms, such as non-psychotic affective, substance, attention-deficit/hyperactivity, and obsessive–compulsive disorders. Such studies revealed either intact SPEM performance or only minimal SPEM deficits32,33,34,35,36. As sample sizes were rather small for most of these studies, however, conclusions remain unclear. Underlining its usefulness as a biomarker, subtle SPEM deficits were found not only in chronically ill but also in first-episode patients23,37,38 and in unaffected first-degree relatives of schizophrenia patients20,39. Such subtle SPEM deficits are reflected by specific impairments of certain SPEM measures, e.g. during SPEM initiation, while other SPEM measures, e.g. sustained eye velocity, appear unimpaired, indicating that certain compensation mechanisms within the oculomotor systems, e.g. derived from prediction, are in play20,40. Thus, we would expect that similar SPEM disturbances would also be present in a clinical high-risk state for psychosis41, which has not been investigated so far.

To provide comprehensive internal and external validation in the present study, we developed a machine-learning based model that was trained on a set of traditional measures characterizing specific SPEM subfunctions, i.e. mean eye velocity, initial eye acceleration and initiation latency20, in the large sample of the B-SNIP1 study. We then applied several external validation steps to determine stability and specificity in an independent sample of psychosis probands (external validation-1: B-SNIP2 sample), in bipolar probands with and without psychosis symptoms (external validation-2: Psychosis and Affective Research Domains and Intermediate Phenotypes (PARDIP) sample), in probands with predominately affective disorders as well as psychosis probands (external validation-3: DFG-Forschergruppe 2107 (FOR2107) sample) and, following an exploratory approach, in clinical high risk as well as recent-onset psychosis and depression states (external validation-4: Personalised Prognostic Tools for Early Psychosis Management (PRONIA) sample), see Fig. 1. Our aim was to develop an algorithm based on SPEM characteristics which allows evaluation of psychosis-related sensorimotor transformation function on an individual level. Ideally, such SPEM characteristics can be assessed in a short 5-min test ensuring practical utility.

Figure 1

Overview of study samples that were included in the machine training and validation procedures.

Results

Demographics, clinical characteristics, and SPEM descriptive information for proband groups by study can be found in Tables 1, 2. With regard to the B-SNIP1 sample, there were no significant differences for age (T(977) = 1.20, p = 0.23) and sex (χ²(1) = 1.16, p = 0.28) between psychosis probands and healthy controls. However, healthy controls yielded higher cognition scores than psychosis probands (T(953) = 5.93, p < 0.001). Correlations between SPEM performance and possible confounds, i.e. cognition scores or chlorpromazine equivalents, were negligible, see Supplementary Tables 8–10.

Table 1 Descriptive information and clinical characteristics for proband groups by study.
Table 2 Descriptive results of smooth pursuit eye movements for proband groups by study.

Machine training and internal validation: B-SNIP1

The model distinguished psychosis probands from healthy controls by SPEM variables with a mean balanced accuracy of 63.96% (p < 0.001, Table 3; for further result parameters, refer to Supplementary Table 4). On average, 53% of the psychosis probands and 75% of the control subjects were correctly classified (sensitivity = 52.97%, specificity = 74.96%, Table 3). Mean likelihood ratios42 were 2.18 for a positive test result and 0.63 for a negative test result.
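The relationship between the reported sensitivity, specificity, balanced accuracy and likelihood ratios can be illustrated with a minimal Python sketch. It uses the standard textbook definitions; the function names are hypothetical, and the assumption that the reported likelihood ratios were averaged across cross-validation folds is ours, not stated in the text.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity (both given as fractions)."""
    return (sensitivity + specificity) / 2.0


def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a binary test, using the standard definitions."""
    lr_pos = sensitivity / (1.0 - specificity)   # how much a positive result raises the odds
    lr_neg = (1.0 - sensitivity) / specificity   # how much a negative result lowers the odds
    return lr_pos, lr_neg


# Mean B-SNIP1 sensitivity and specificity as reported in the text.
sens, spec = 0.5297, 0.7496
bac = balanced_accuracy(sens, spec)
lr_pos, lr_neg = likelihood_ratios(sens, spec)
print(f"balanced accuracy = {bac:.4f}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Computed from the mean sensitivity and specificity, the positive likelihood ratio comes out at roughly 2.1, slightly below the reported mean of 2.18. Such a gap would be expected if the reported value is a mean of per-fold ratios rather than a ratio of means, but this is an assumption. The negative likelihood ratio of about 0.63 matches the reported value.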

Table 3 Prediction accuracies for all samples and model results for the comparison of chronic psychosis probands vs. controls.

External validation-1: B-SNIP2

Validation in the B-SNIP2 sample included n = 666 psychosis probands and n = 289 healthy controls (n = 64 participants could not be entered into the machine due to at least one missing value). Emphasizing high validity, the B-SNIP1 derived model discriminated psychosis probands from healthy controls in the independent B-SNIP2 sample with a balanced accuracy of 65.03% (see Table 3 and Supplementary Table 4). About 56% of the psychosis probands and 74% of the control subjects were correctly classified (sensitivity = 56.01%, specificity = 74.05%, Table 3).

External validation-2: PARDIP

For the PARDIP sample, n = 44 bipolar probands with psychosis symptoms, n = 33 bipolar probands without psychosis symptoms and n = 70 healthy controls were included in the validation procedure (n = 9 participants were excluded due to at least one missing value). Our trained model could distinguish bipolar probands with psychosis symptoms from healthy controls with a balanced accuracy of 65.52% (Table 3 and Supplementary Table 4). About 68% of the bipolar probands with psychosis were correctly classified as psychosis probands and 63% of the control subjects were correctly classified as healthy controls (sensitivity = 68.18%, specificity = 62.86%, Table 3). Furthermore, about 61% of the bipolar probands without psychosis symptoms were classified as controls (which means that they are closer to the healthy non-psychotic than the psychosis category, Table 3).

External validation-3: FOR2107

To validate the machine in predominately affective psychopathology, data from n = 94 probands with major depression and n = 25 probands with bipolar disorder, both groups without psychotic symptoms, from the FOR2107 consortium were entered into the analyses. Using the B-SNIP1 machine, nearly 81% of the probands with major depression and 60% of the probands with bipolar disorder were classified as being closer to the healthy non-psychotic than the psychosis category, Table 3.

As proof of principle, we also validated the B-SNIP1 machine on n = 51 psychosis probands and n = 72 healthy controls from FOR2107 revealing a balanced accuracy of 58.37% (Table 3 and Supplementary Table 4). In detail, about 43% of the psychosis probands and 74% of the control subjects were correctly classified (sensitivity = 43.14%, specificity = 73.61%, Table 3).

External validation-4: PRONIA

Validation in high-risk and recent-onset psychotic or depressive disorder could be computed in n = 11 probands with recent-onset psychosis, n = 17 probands with recent-onset depression, n = 19 participants with clinical high risk of psychosis, and n = 16 controls (PRONIA study). Emphasizing the validity of the machine, about 94% of the controls were categorized as healthy. However, in contrast to previous results in chronically ill psychosis probands, only 18% of the recent-onset psychosis probands were classified as psychosis patients (Table 3). Interestingly, the machine labeled nearly 42% of the participants with clinical high risk of psychosis as psychosis probands (Table 3). Of the probands with recent-onset, non-psychotic depression, 76% were classified as healthy controls (Table 3).

Effects of sample size on model performance

Training models in reduced (randomly selected 50% of the B-SNIP1 sample) and larger (combined B-SNIP1 and B-SNIP2) samples showed that balanced accuracies (50% B-SNIP1 = 62.28%, B-SNIP1 = 63.96%, B-SNIP1 + B-SNIP2 = 65.87%), specificities (50% B-SNIP1 = 78.48%, B-SNIP1 = 74.96%, B-SNIP1 + B-SNIP2 = 80.94%) and sensitivities (50% B-SNIP1 = 46.08%, B-SNIP1 = 52.97%, B-SNIP1 + B-SNIP2 = 50.80%) were rather unaffected by sample size (Supplementary Tables 6, 7 and Supplementary Fig. 1).

Discussion

In the current study we examined a set of traditional SPEM measures (i.e. predictive eye velocity maintenance gain, early eye velocity maintenance gain, initial eye acceleration, and eye latency; Leigh & Zee17; Lencer et al.20) and their interactions as quantifiable biological indicators of psychosis-related visual sensorimotor dysfunction in large samples of probands with psychotic disorders. This is an important approach since identified SPEM deteriorations point to specific deficits in the transformation of sensory motion signals into motor action being associated with alterations in occipito-parieto-frontal networks24,43.

To overcome limitations of classical frequentist statistics, we implemented multivariate pattern analyses (e.g. supervised machine learning approaches)44 using internal (i.e. a hold-out subsample consisting of participants that were not used for training) and external (i.e. an independent dataset) validation in sufficiently large data samples11 to allow for clinically relevant single-subject statements pointing to sensorimotor transformation deficits. Most importantly, we not only trained and internally validated the machine-learning algorithm in a single sample but also applied and externally validated the machine in an independent large sample of psychosis probands and healthy controls (external validation-1: B-SNIP2), in a sample of bipolar probands with and without psychotic symptoms (external validation-2: PARDIP), in a sample of probands with affective disorders without psychotic symptoms and psychosis probands (external validation-3: FOR2107), and in a sample with recent-onset psychosis or depression and clinical high risk of psychosis (external validation-4: PRONIA). Our main finding shows high consistency for the identification of psychosis probands vs. healthy controls by these sensorimotor indicators throughout the four different samples (balanced accuracies: B-SNIP1: 63.96%, B-SNIP2: 65.03%, PARDIP: 65.52%, FOR2107: 58.37%). However, it is important to consider that our model performed notably better in accurately classifying controls as controls (specificities in the different samples ranged from 63 to 75%) than psychosis probands as psychosis probands (sensitivities ranged from 43 to 68%).

Although a balanced accuracy score of nearly 64% as derived from our training sample (B-SNIP1) may be regarded as insufficient for SPEM performance to be used as a single screening instrument for determining psychosis-related sensorimotor transformation function, it significantly exceeds chance level and remains within the range of results that can be expected in similarly sized heterogeneous psychiatric samples11. Additionally, a likelihood ratio for a positive test result of 2.18 could be interpreted as a small (but important) change in probability42. Our second key finding emphasizes the generalization to new data when applying the model to an independent cohort of chronically ill psychosis probands and healthy controls. Regarding the first external validation in the B-SNIP2 sample (external validation-1), our machine yielded a comparable (even slightly higher) balanced accuracy of 65.03% when discriminating the two groups. This result is particularly meaningful due to (a) the independence of both data sets and (b) slight differences in the SPEM task design, underlining the robustness of classification results by our model. A third cohort with chronically ill psychosis probands and healthy controls was derived from the FOR2107 consortium (external validation-3) and could be classified correctly with a balanced accuracy of 58.37%.

Our findings support the original suggestions by Diefendorf and Dodge45 to use SPEM as a neurobiological diagnostic tool, which comes with multiple advantages including standardized measurements and brief 5-min testing that is feasible even for severely impaired patients. Here, we applied a constellation of SPEM tasks consisting of full-ramp and foveo-petal step-ramp trials at a constant velocity of 18.7 degrees of visual angle per second. These specific SPEM tasks allow the computation of the four key measures to evaluate SPEM performance and can be recommended for future studies. Our results add to previous findings based on traditional group analyses in indicating that SPEM is a valuable psychosis-related biomarker of sensorimotor integrity that is useful even at the single-subject level20. Besides its diagnostic value, this biomarker bears highly relevant information for establishing personalized treatment regimes.

Very recently, St Clair and colleagues46 applied a multiclass machine-learning model to differentiate patients with schizophrenia, bipolar affective disorder, major depressive disorder, and healthy controls on the basis of 98 eye movement symptoms (including several SPEM variables). The model was tested in two validation sets, achieving balanced accuracies for schizophrenia patients of 73% and 75%. Both validation sets were relatively small (test-1 internal validation: n = 30 schizophrenia, n = 35 bipolar, n = 33 depression, n = 35 controls; test-2 external validation: n = 60 schizophrenia, n = 184 controls), which entails an increased risk of misclassification11. To avoid this common shortcoming, we used a large internal validation sample and applied our machine to several extensive independent data sets. Of note, the task from St Clair and colleagues took about 15 min in total, yielding a total of 98 eye movement measures47 derived from free viewing, fixation duration, and smooth pursuit tasks46, which limits its clinical practicability.

To further determine the model’s specificity regarding the relationship between psychotic symptoms and SPEM performance, we applied the machine to other patient groups. In this regard, there has been an extensive discussion concerning similarities and differences between schizophrenia and bipolar disorder48. Machine-learning models based on brain data have been used to discriminate both patient groups49, though often merging data from bipolar patients with and without a history of psychotic episodes50.

Similarly, St Clair and colleagues46 did not specify psychosis symptoms in those patients suffering from bipolar disorder and major depression, which we found to have a significant impact as demonstrated by our external validation-2 sample from the PARDIP study. In line with the idea of a relationship between SPEM deterioration and psychotic psychopathology, our machine classified about 68% of the bipolar probands with psychosis correctly as psychosis patients, while 61% of the bipolar probands without psychosis symptoms were classified as healthy (which means that they are closer to healthy individuals). Underlining its generalizability, 60% of the bipolar probands without psychotic symptoms from the FOR2107 study (external validation-3) were also rated closer to the healthy non-psychotic category.

Broadening the perspective of specificity regarding SPEM deficits in affective disorders, we found that nearly 81% of probands suffering from major depression without psychotic episodes (FOR2107 study, external validation-3) were classified as healthy, indicating closer affiliation to the non-psychotic category. This result is in line with previous findings of only mildly impaired SPEM performance from traditional group statistics36 and with multivariate pattern analyses based on brain data indicating major depression and schizophrenia as two end points of an interjacent continuum50.

Our external-validation sample 4 from the PRONIA study was used to test our model in young probands at clinical high risk (CHR) for psychosis or experiencing a first psychotic or first depressive episode. Interestingly, about 42% of probands with clinical high risk of psychosis were categorized as psychosis probands, which might support the idea of an underlying susceptibility of SPEM deficits in the psychosis spectrum51. Indeed, the specific SPEM measures of predictive and early maintenance gain indicated the worst performance in this proband group compared to all three other PRONIA groups (see Table 2). However, this group is extremely heterogeneous, as indicated by large standard deviations in the early and predictive maintenance gains (see Table 2). Note that transition rates from CHR to recent-onset psychosis (ROP) are about 25% within 3 years, indicating a high heterogeneity of CHR subjects regarding susceptibility to psychosis52. In contrast, in the relatively small (n = 11) and heterogeneous sample of recent-onset psychosis probands, our machine only classified two probands (18%) as belonging to the psychosis group. Despite the small sample size, this observation points to possible differences in SPEM performance between recent-onset and chronic states of psychosis (see also Table 1 for information about illness duration) as discussed previously53. That study observed subtle impairments of immediate sensorimotor processing in first-episode psychosis patients with only short duration of treatment, e.g. after 6 weeks, which appeared to be compensated by predictive drive to pursuit. In more detail, first-episode patients demonstrated slightly worse performance in the pure-ramp task (comparable to the step-ramp task in the current study) but were unaffected in the oscillating task (comparable to the triangle wave task in the current study). Deficits were discussed as possible medication effects with regard to their serotonergic antagonism of brainstem sensorimotor systems. However, as in the present study, no associations between SPEM variables and medication dosage were found53. Indeed, in our ROP group (which might be comparable to the first-episode patients after short duration of treatment from the study by Lencer and colleagues53), early maintenance gain, which is driven by immediate sensorimotor processing, was considerably reduced while predictive maintenance gain was unaffected (see Table 2). Notably, 76% of probands with recent-onset depression and 94% of healthy controls from the PRONIA sample were correctly classified as not belonging to the psychosis group.

Despite the clear strengths of the study, some limitations need to be discussed: (1) SPEM results for initial eye acceleration and latency differed between laboratories/recording devices (Supplementary Table 11). To estimate the impact of these two variables on the prediction of our machine, we additionally trained a machine in the B-SNIP1 sample using only the two eye velocity gain measures as predictors. The machine was able to distinguish psychosis probands from healthy controls with a balanced accuracy of 61.90% (Supplementary Table 12), which is close to the main result using all SPEM variables (63.96%). However, laboratory conditions and/or recording devices may have an impact on the measurement of SPEM initial eye acceleration and latency that could have affected prediction results. (2) As we trained the machine in a sample of chronically ill psychosis probands, possible effects of medication have to be taken into account. Although we found only small and inconsistent correlations between SPEM and chlorpromazine equivalents, effects of medication cannot be fully ruled out53. (3) Furthermore, we found significant differences in cognition scores between psychosis probands and healthy controls in the B-SNIP1 sample. Effects of cognitive skills therefore cannot be entirely ruled out. (4) Despite our comprehensive validation samples, our machine was not validated in a group of probands with major depressive disorder (MDD) with psychosis. (5) There is a discrepancy between sensitivity (53%) and specificity (75%), implying that our model is particularly suited to correctly identifying healthy probands as healthy. (6) No follow-up data from the PRONIA samples are available to evaluate transition rates of those CHR participants with poor SPEM performance.

Our comprehensive findings support SPEM as an indicator of sensorimotor transformation impairments relevant to patients suffering from chronic psychosis. Thus, our machine learning algorithm based on the performance in a 5-min SPEM task can help to obtain an overview of sensorimotor transformation profiles on an individual level that might inform treatment decisions in rehabilitation contexts, e.g. regarding sensorimotor remediation strategies.

Future studies should broaden this biomarker approach by combining indicators of sensorimotor function with multiple other relevant neurobiological measures, e.g. brain structure indices, to improve individual prediction accuracies and to inform personalized therapeutic decisions for psychotic disorders. Additionally, future studies should address the question of whether SPEM impairments can indicate illness progression independently of illness duration.

Methods

Subjects

SPEM data from five independent samples were included in the following analyses (Fig. 1):

B-SNIP1

First, the machine was trained and internally validated with SPEM data from the B-SNIP1 sample consisting of n = 674 chronically ill psychosis probands (n = 265 schizophrenia, n = 178 schizoaffective, and n = 231 bipolar with psychotic symptoms) and n = 305 healthy controls. Participants were recruited by the B-SNIP consortium across five sites in the US (Baltimore, Boston, Chicago, Dallas, Hartford; Tamminga et al.27). Diagnoses were derived by a consensus of experienced clinicians based on all available clinical information and the Structured Clinical Interview for DSM-IV54. Inclusion criteria comprised (1) age between 15 and 65 years; (2) reading score of the Wide Range Achievement Test ≥ 60 (Ref. 55); (3) no history of a neurologic disorder; (4) normal or corrected-to-normal vision (minimum of 20/40 acuity); and (5) no history of substance abuse within the last month or substance dependence within the last three months, and negative urine toxicology on study day. Additionally, healthy controls were not allowed to have a personal or family history (first-degree) of psychotic or bipolar disorders, to have a history of recurrent mood disorder or to exhibit a history of psychosis spectrum personality traits56. The protocol of the study was approved by institutional review boards at each of the study sites and participants provided written informed consent. For group differences in SPEM performance see Lencer et al.20.

Second, the remaining study samples were used (a) as external validation data for the machine trained in the B-SNIP1 sample and (b) for investigating psychosis-related specificity of SPEM against probands with predominately affective disorders.

External validation-1: B-SNIP2

B-SNIP2 is the follow-up to B-SNIP1. SPEM data were available from n = 727 chronically ill psychosis probands (n = 288 schizophrenia, n = 264 schizoaffective, and n = 175 bipolar with psychotic symptoms) as well as n = 292 healthy controls recruited in Boston, Chicago, Dallas, Hartford, and Athens (GA). Inclusion criteria were identical to B-SNIP1. For further details on B-SNIP2 eye movements see Huang et al.62; the SPEM data have not been published so far.

External validation-2: PARDIP

SPEM data from the multisite PARDIP consortium were available from n = 49 bipolar probands with psychotic symptoms (BPwP), n = 36 bipolar probands without psychotic symptoms (BPwoP), and n = 71 healthy controls. The PARDIP study took place in Dallas, Boston, and Hartford. It was nested within the B-SNIP consortium using similar inclusion criteria but, importantly, there was no overlap between PARDIP and B-SNIP participants. For further information on inclusion criteria and group differences in SPEM performance, see Brakemeier et al.57.

External validation-3: FOR2107

In collaboration with the bicentric FOR2107 project (https://for2107.de/, Kircher et al.58), SPEM data were measured in n = 94 probands with major depressive disorder without psychotic symptoms (MDwoP), n = 25 bipolar probands without psychotic symptoms (BPwoP), n = 51 psychosis probands, and n = 72 healthy controls at the Münster site.

External validation-4: PRONIA

Following an exploratory approach for testing the validity of our machine developed in stable probands with chronic psychosis, SPEM data were also collected in collaboration with the multisite PRONIA consortium (https://www.pronia.eu/, Koutsouleris et al.59) from n = 11 probands with recent-onset psychosis, n = 19 probands with a high clinical risk for the development of psychosis, n = 17 probands with recent-onset depression, and n = 16 healthy controls at the Münster site.

All patients were medicated as prescribed by their doctors except for regular or current sedative medication, which was an exclusion criterion (see chlorpromazine equivalents at time of testing in Table 1). Note that, prior to inclusion, ROP patients from the PRONIA sample had not been allowed to take any antipsychotic medication for longer than 90 days (within the past 24 months) with a daily dose rate at or above the minimum dosage of the DGPPN S3-guidelines60.

Participants gave written informed consent according to the Declaration of Helsinki. Each study was approved by the respective local ethics committee.

Eye movement measurement and task

At all sites, the SPEM target consisted of a small stimulus (0.5°) moving back and forth in the horizontal plane at 18.7°/s constant velocity displayed on a monitor to constitute full-ramp trials within triangle wave tasks and foveo-petal step-ramp trials61. Participants were instructed to follow the stimulus with their eyes as accurately as possible while sitting in front of the monitor with their heads stabilized using a chin and forehead restraint. Across all studies, eye movements were recorded in a quiet and darkened room.

For B-SNIP1, B-SNIP2, and PARDIP samples (for further details refer to Brakemeier et al.57; Huang et al.62; Lencer et al.20), participants were seated 60 cm from a 22-inch CRT monitor (1360 × 768 resolution; 150 Hz refresh rate) and eye movements were recorded using an Eyelink II (SR Research Ltd., Ontario/Canada) recording device at 500 Hz sampling rate. The stimulus comprised a red cross in a box covering 0.5° moving horizontally between ± 12° across the screen.

In B-SNIP1 and PARDIP studies, 48 full-ramp and 32 foveo-petal step-ramp trials61 both at 18.7° of visual angle per second constant velocity, were applied in order to assess SPEM performance. In full-ramp trials, the stimulus moved back and forth with constant velocity in a triangular waveform. During step-ramp trials, the target started from the central position, stepped either to the right or the left (2.4° of visual angle in a randomized order) and afterwards moved towards the peripheral opposite direction at 18.7° of visual angle per second constant velocity. The stimulus re-crossed the central line after 133 ms allowing the initiation of SPEM without a necessary catch-up saccade61. Additionally, some trials with 9.7° of visual angle/second and 26.6° of visual angle/second target velocities as well as trials with intervals where the target was blanked were displayed to enhance attention but were not included into the analyses (30% of trials). In order to ensure data quality, additional calibration trials were presented between blocks of trials. SPEM measurement was conducted identically across sites.

For B-SNIP2 a slightly different task order was applied: Here, 48 full-ramp trials at 18.7° of visual angle per second constant velocity and a total of 48 foveo-petal step-ramp trials (32 × 18.7° of visual angle/second; 8 × 9.7° of visual angle/second; 8 × 26.6° of visual angle/second; randomized for direction; Rashbass61) were presented in two test sets (each consisting of 48 trials) including six alternating blocks of either eight full-ramp or step-ramp trials. Step-ramp trials at 9.7° of visual angle/second and 26.6° of visual angle/second velocities were shown to enhance attention but were not included into the analyses. To ensure data quality, additional calibration trials were displayed between blocks of trials. SPEM measurement was conducted identically across sites.

FOR2107 and PRONIA eye movements were recorded using an Eyelink 1000 (SR Research Ltd., Ontario/Canada) recording device at 500 Hz sampling rate. Participants were seated 60 cm from a 22-inch CRT monitor (1360 × 768 resolution; 150 Hz refresh rate). Stimulus and task were identical to B-SNIP2.

Eye movement data processing

All SPEM data were analyzed using identical routines in MATLAB (The MathWorks, Natick, MA) developed by one of the authors (AS). Eye position data were filtered using a one-dimensional Gaussian filter (30 Hz) and, subsequently, smoothed eye velocity was computed with central median differentiation of 9 ms20,57,63. Sections of saccades and blinks were automatically detected and excluded from computations of SPEM variables. To verify the automatic calculations, individual velocity traces were checked by visual inspection.

To assess the different sensorimotor aspects of SPEM performance, the following variables were computed20,36,57 (see Fig. 2, adapted from Ref.57):

Figure 2

Examples of pursuit stimuli with pursuit recordings (eye position and eye velocity) in a control subject and a psychosis proband. Foveo-petal step-ramp tasks (A) are used to measure saccade-free pursuit initiation. Variables of interest are pursuit latency (time between target step and green dot), initial eye acceleration (blue line) and early maintenance gain (blue line in grey shaded intervals). Triangular wave tasks (B) are used to measure sustained predictive maintenance gain in predefined intervals (blue line in grey shaded intervals) excluding artifacts induced by target reversals. The figure has been adapted from one of our prior publications by Brakemeier and colleagues57.

Predictive maintenance gain during continuous pursuit was calculated from triangular wave tasks as the ratio of median eye velocity to target velocity from middle sections (300–840 ms after stimulus direction reversal) over all full-ramp trials (the total duration of a ramp is 1200 ms). Predictive maintenance gain depends strongly on predictive drive, i.e. cognitive input to the pursuit system for sustained SPEM under closed-loop conditions.
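The gain computation can be illustrated with a minimal Python sketch. It is not the MATLAB routine used in the study. It assumes that eye velocity is signed in the direction of target motion, that samples excluded as saccades or blinks are marked as None, and that per-trial gains are summarized by their median; the function names and the synthetic trial are hypothetical.

```python
from statistics import median


def maintenance_gain(eye_velocity, target_velocity, sample_rate_hz,
                     window_ms=(300, 840)):
    """Median eye velocity inside the window divided by target velocity.

    eye_velocity: samples in deg/s, index 0 = stimulus direction reversal.
    None marks samples excluded as saccades or blinks (assumption).
    """
    start = int(window_ms[0] * sample_rate_hz / 1000)
    stop = int(window_ms[1] * sample_rate_hz / 1000)
    valid = [v for v in eye_velocity[start:stop] if v is not None]
    if not valid:
        return None  # no usable pursuit samples in this trial
    return median(valid) / target_velocity


def median_gain_over_trials(trials, target_velocity, sample_rate_hz,
                            window_ms=(300, 840)):
    """Summarize per-trial gains by their median (an assumed aggregation)."""
    gains = [maintenance_gain(t, target_velocity, sample_rate_hz, window_ms)
             for t in trials]
    gains = [g for g in gains if g is not None]
    return median(gains) if gains else None


# Synthetic 1200 ms ramp sampled at 500 Hz with one excluded saccade segment.
trial = [16.83] * 600
trial[200:220] = [None] * 20
gain = maintenance_gain(trial, target_velocity=18.7, sample_rate_hz=500)
```

Under the same assumptions, passing window_ms=(350, 550) to step-ramp trials would give the early maintenance gain described below.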

In contrast, measures from foveo-petal step-ramp tasks represent rapid sensorimotor transformations using immediate visual motion and early performance feedback.

First, early maintenance gain was calculated as the ratio of median eye velocity to target velocity from middle sections (350–550 ms after stimulus onset) over all unpredictable step-ramp trials, thus reflecting early eye velocity under visual feedback control64. Typically, early maintenance gain is considerably lower than sustained predictive maintenance gain.

Second, for the computation of initial eye acceleration under open-loop conditions, when visual feedback is not yet available, eye velocity was smoothed using a Savitzky-Golay finite impulse response filter (polynomial order of 3 and a frame length of 63). The onset of eye acceleration was defined as eye velocity exceeding a noise threshold (above 3.2 standard deviations of mean resting eye velocity, which was calculated from 200 ms before to 100 ms after ramp onset, Carl & Gellman65) for at least 20 ms. Initial eye acceleration was then computed as the slope of a robust linear regression (robustfit in MATLAB) in a 100 ms time window starting at the acceleration onset over all trials.

Third, eye latency was determined as the time elapsed between the onset of stimulus movement and the onset of eye acceleration65 over all trials.
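The onset criterion, the acceleration slope and the latency can be sketched as follows. This is an illustration under stated assumptions, not the original routine: the threshold is read as the baseline mean plus 3.2 standard deviations, the search for the onset starts at the end of the baseline window, the Savitzky-Golay smoothing is assumed to have been applied already, and an ordinary least-squares slope stands in for the robust regression used in the study. All function names and the synthetic trace are hypothetical.

```python
from statistics import mean, stdev


def detect_acceleration_onset(velocity, sample_rate_hz, ramp_onset_idx,
                              k_sd=3.2, min_duration_ms=20.0):
    """Index of the first sample of a supra-threshold run, or None."""
    per_ms = sample_rate_hz / 1000.0
    base_start = ramp_onset_idx - round(200 * per_ms)
    base_stop = ramp_onset_idx + round(100 * per_ms)
    if base_start < 0:
        raise ValueError("need 200 ms of data before ramp onset")
    baseline = velocity[base_start:base_stop]
    # Assumed reading of the criterion: mean + 3.2 SD of resting velocity.
    threshold = mean(baseline) + k_sd * stdev(baseline)
    needed = max(1, round(min_duration_ms * per_ms))
    run = 0
    for i in range(base_stop, len(velocity)):
        run = run + 1 if velocity[i] > threshold else 0
        if run >= needed:
            return i - needed + 1
    return None


def ols_slope(times_s, values):
    """Ordinary least-squares slope (stand-in for a robust regression)."""
    t_bar, v_bar = mean(times_s), mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times_s, values))
    den = sum((t - t_bar) ** 2 for t in times_s)
    return num / den


def initial_acceleration_and_latency(velocity, sample_rate_hz, ramp_onset_idx):
    """Return (acceleration in deg/s^2, latency in ms) or (None, None)."""
    onset = detect_acceleration_onset(velocity, sample_rate_hz, ramp_onset_idx)
    if onset is None:
        return None, None
    n = round(100 * sample_rate_hz / 1000.0)  # 100 ms regression window
    window = velocity[onset:onset + n]
    times = [i / sample_rate_hz for i in range(len(window))]
    latency_ms = (onset - ramp_onset_idx) * 1000.0 / sample_rate_hz
    return ols_slope(times, window), latency_ms


# Synthetic trace at 1000 Hz: resting noise, then a linear velocity increase.
trace = [0.1 if i % 2 == 0 else -0.1 for i in range(350)]
trace += [0.05 * (i - 350) for i in range(350, 600)]
accel, latency = initial_acceleration_and_latency(trace, 1000, ramp_onset_idx=200)
```

Replacing ols_slope with a robust estimator would bring the sketch closer to the described procedure; the least-squares version is used here only to keep the example free of third-party dependencies.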

Psychometric, cognitive, and clinical measures

Psychosis-related symptoms

For the B-SNIP1, B-SNIP2, PARDIP, and PRONIA studies, psychosis-related symptoms were rated using the Positive and Negative Syndrome Scale (PANSS)66, while the FOR2107 study used the Scale for the Assessment of Positive Symptoms (SAPS) and the Scale for the Assessment of Negative Symptoms (SANS)67. To provide comparability, SANS and SAPS scores were converted to PANSS scores68 (see Supplementary Table 1).

Depression

Depressive symptoms were quantified with the Montgomery–Åsberg Depression Rating Scale (MADRS; Montgomery & Åsberg69) in the B-SNIP1, B-SNIP2 and PARDIP studies and using the original Beck Depression Inventory in the 1978 version70 in the FOR2107 sample. For PRONIA, the Beck Depression Inventory-II (BDI-II)71 was applied. Severity gradation (MADRS72, BDI71) is given in Supplementary Table 2.

Mania

For the B-SNIP1, B-SNIP2, PARDIP and FOR2107 samples, mania was assessed using the Young Mania Rating Scale73. Mania was not assessed in the PRONIA sample.

Cognitive abilities

A total score indicating cognitive abilities was estimated using the Wide Range Achievement Test 4 (WRAT455) in the B-SNIP1, B-SNIP2, and PARDIP samples. For the FOR2107 study, the Multiple-Choice Vocabulary Test, version B (MWT-B74) was used. Scores were converted to the IQ scale74. For the PRONIA sample, the matrix reasoning subtest of the Wechsler Adult Intelligence Scale75 was applied to evaluate cognition.

Statistical analyses

Machine learning approach

The machine learning model was trained in the B-SNIP1 sample to distinguish psychosis probands from healthy (non-psychotic) controls using PHOTONAI software76 and scikit-learn toolboxes77. A k-fold nested cross-validation procedure was applied to separate the data used to train the model from the data used for internal validation. To obtain the most informative model, parameters were optimized in an inner cycle (10 folds), and the best performing model, chosen by highest balanced accuracy ([sensitivity + specificity]/2, which takes imbalanced data sets into account), was deployed to an outer cycle (3 folds). Special attention was given to ensuring (1) that there was no information leakage between training and validation data76 and (2) that the validation set was sufficiently large to provide stable and meaningful results for unseen (external) samples11. For specifications of the best model see Supplementary Table 3.
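As a brief illustration of why balanced accuracy was preferred over plain accuracy, the following Python sketch computes both for an imbalanced toy sample. The function names and the toy labels are hypothetical, and the coding of 1 for psychosis proband and 0 for healthy control is an assumption made for the example.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """(sensitivity + specificity) / 2 for a two-class problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def plain_accuracy(y_true, y_pred):
    """Proportion of correctly classified cases."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy example: 8 probands (1), 2 controls (0), classifier always predicts 1.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
acc = plain_accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
```

In this toy case the trivial classifier reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, i.e. chance level.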

For each of the models the following preprocessing steps were applied: (1) SPEM variables were standardized by scaling. (2) Missing values (predictive maintenance gain = 0%, early maintenance gain = 0.51%, initial eye acceleration = 1.23%, eye latency = 0.51%) were imputed with the median of the corresponding variable. (3) To account for the different group sizes (674 psychosis probands and 305 healthy controls), data were balanced by either randomly undersampling the majority class or oversampling the minority class using SMOTE78. (4) Principal component analysis was applied to reduce the dimensionality of the feature space.
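Steps (1) to (3) can be sketched in plain Python as below. This is a simplified illustration, not the PHOTONAI pipeline: it assumes z-standardization with the population standard deviation, handles a single variable with None marking missing values, shows only random undersampling (SMOTE and principal component analysis are omitted), and uses hypothetical function names. Fitting the parameters on training data only and reusing them for validation data reflects the requirement of avoiding information leakage.

```python
import random
from statistics import mean, median, pstdev


def fit_scaler_and_imputer(train_values):
    """Estimate mean, SD and the median fill value from training data only."""
    observed = [v for v in train_values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    fill = median([(v - mu) / sigma for v in observed])
    return mu, sigma, fill


def transform(values, params):
    """Standardize observed values and impute missing ones with the median."""
    mu, sigma, fill = params
    return [fill if v is None else (v - mu) / sigma for v in values]


def undersample_indices(labels, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(ix) for ix in by_class.values())
    keep = []
    for ix in by_class.values():
        keep.extend(rng.sample(ix, n_min))
    return sorted(keep)


# Toy variable with one missing value; parameters come from the training fold.
params = fit_scaler_and_imputer([1.0, 2.0, None, 3.0])
train_z = transform([1.0, 2.0, None, 3.0], params)
valid_z = transform([2.0, None], params)

# Group sizes as reported for B-SNIP1: 674 probands (1), 305 controls (0).
labels = [1] * 674 + [0] * 305
kept = undersample_indices(labels)
```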

Predictors included the four SPEM variables described above (i.e. predictive maintenance gain, early maintenance gain, initial eye acceleration, and eye latency). Multiple classifiers with default parameters (support vector machine, random forest, Gaussian naïve Bayes, logistic regression, AdaBoost) were then used to optimize representation of the underlying data and to predict group membership (i.e. psychosis proband or healthy control). Additionally, for the support vector machine, the kernel (linear, rbf) and regularization (C = [0.1, 0.3, 0.5, 0.7, 0.9, 1]) parameters were optimized.

Statistical inference was examined using permutation tests79. To this end, true results were compared to a permutation distribution created from 1000 random rearrangements of the two group labels (healthy controls vs. psychosis group) with respect to the predictors.
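The logic of the label permutation test is sketched below in Python. Several simplifications are assumed: the test statistic is the absolute difference in group means rather than the balanced accuracy of a retrained model, the p-value is computed as (hits + 1) / (permutations + 1), which is a common convention and not stated in the text, and the function names and data are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(score_fn, features, labels, n_permutations=1000, seed=0):
    """Compare the observed score with scores obtained under shuffled labels."""
    rng = random.Random(seed)
    observed = score_fn(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random rearrangement of the group labels
        if score_fn(features, shuffled) >= observed:
            hits += 1
    # Assumed convention: count the observed arrangement as one permutation.
    return observed, (hits + 1) / (n_permutations + 1)


def mean_difference(features, labels):
    """Stand-in statistic: absolute difference between the two group means."""
    g1 = [x for x, y in zip(features, labels) if y == 1]
    g0 = [x for x, y in zip(features, labels) if y == 0]
    return abs(mean(g1) - mean(g0))


# Toy gain values: 20 controls (0) with higher and 20 probands (1) with lower gain.
features = [0.80 + 0.01 * i for i in range(20)] + [0.60 + 0.01 * i for i in range(20)]
labels = [0] * 20 + [1] * 20
observed, p_value = permutation_p_value(mean_difference, features, labels)
```

In a full analysis, score_fn would refit and cross-validate the classifier for every permuted label vector, which is considerably more expensive than this stand-in statistic.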

Additionally, we trained machine learning algorithms to distinguish between the diagnostic groups of psychosis probands in the B-SNIP1 sample. In line with the idea of SPEM deterioration across the whole psychosis spectrum, results for distinguishing individual proband groups are close to chance level (balanced accuracies: schizophrenia vs. schizoaffective probands 52.65%, schizophrenia vs. bipolar probands 52.48%, schizoaffective vs. bipolar probands 51.00%, Supplementary Table 5).

External validation of the model was investigated by applying the best performing model from B-SNIP1 to the B-SNIP2 (external validation-1), PARDIP (external validation-2), FOR2107 (external validation-3), and PRONIA (external validation-4) samples. In accordance with the idea that there is a specific relationship between SPEM performance and psychosis syndromes, we also applied the model to other non-psychotic psychiatric patient groups, expecting them not to be classified as psychosis probands (and thus to be closer to the healthy non-psychotic control group).

To examine the effect of sample size on model performance, additional models were trained and internally validated in a randomly selected half of the B-SNIP1 sample and in the combined B-SNIP1 and B-SNIP2 samples.

Kendall’s Tau correlation coefficients were computed between SPEM measures and chlorpromazine equivalents80. Additionally, correlations were calculated between SPEM measures and WRAT4 scores as well as z-scores of the Brief Assessment of Cognition in Schizophrenia (BACS; Keefe et al.81). Analyses were computed in the B-SNIP1 sample. Results are reported using a Bonferroni–Holm-corrected alpha level, adjusted within each study across all four SPEM variables82,83.
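The Bonferroni–Holm step-down procedure for four tests can be illustrated with a short Python sketch. The function name and the p-values are hypothetical and serve only to show how the adjusted alpha levels are applied.

```python
def holm_reject(p_values, alpha=0.05):
    """Bonferroni-Holm step-down: one reject decision per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is tested at alpha/m, the next at alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are retained as well
    return reject


# Hypothetical p-values for the four SPEM variables.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
```

With these values, the two smallest p-values pass their thresholds (0.0125 and about 0.0167), whereas 0.03 exceeds 0.025, so the procedure stops and the remaining two hypotheses are retained.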