Introduction

While psychotherapy has proven effective in treating patients with a personality disorder (Cristea et al., 2017; Rameckers et al., 2021; Storebø et al., 2020), there remains a pressing need for improvement in its application. For instance, a recent meta-analysis focusing on borderline personality disorder revealed that nearly half of the patients do not respond adequately to psychotherapy (Leichsenring et al., 2024). Another challenge lies in the premature discontinuation of treatment, with 20 to 33% of patients dropping out early (Dixon & Linardon, 2020; Iliakis et al., 2021; Swift & Greenberg, 2014). Consequently, a substantial proportion of patients do not benefit from treatment.

Unfortunately, psychotherapists seem to have difficulty predicting which treatments will stagnate or worsen (Hannan et al., 2005; Hatfield et al., 2010) and overestimate their effectiveness (Walfish et al., 2012). Because of this bias, there is a risk that patients will continue to be treated in the same way for too long, even when a treatment is ineffective or symptoms worsen. However, adding progress feedback to treatments through Routine Outcome Monitoring (ROM) can help improve treatment outcomes. Progress feedback is the systematic measurement of the treatment process and progress using measurement instruments (De Jong et al., 2021). Progress feedback through ROM typically uses general assessment instruments, such as the Outcome Questionnaire-45.2 (OQ-45.2; Lambert et al., 2004), which can be used in various patient groups. Disorder-specific measurement instruments, such as a questionnaire measuring only depressive symptoms, are also used. Recent meta-analyses show that the use of ROM can have positive effects on symptom reduction, particularly with clients who fail to progress, i.e., clients who are not on track (NOT) (De Jong et al., 2021; Rognstad et al., 2023). However, limited knowledge exists regarding the use of progress feedback with patients diagnosed with a personality disorder. The findings from a study by De Jong et al. (2018) suggest that it may not be universally beneficial for all types of personality disorders, indicating a need for additional research to explore this further.

Progress feedback is used to intervene as early as possible in treatment when it is not on track. Research has indicated that initial improvement in symptoms (early change) predicts overall treatment outcome, especially in patients with anxiety and mood disorders (Lutz et al., 2009, 2014; Schibbe et al., 2014). Early change is seen as an indicator of success and is, therefore, meaningful in clinical practice (Schibbe et al., 2014; Tiemens et al., 2016, 2020).

To measure early change, measurement tools that are sensitive to change are required. In anxiety, mood, and eating disorders, disorder-specific questionnaires seem to be more sensitive to change and, therefore, better able to measure early change and predict treatment outcomes than general questionnaires (Dingemans & Furth, 2017; Nugter et al., 2017; Schibbe et al., 2014; Van Der Mheen et al., 2018). In patients with a personality disorder, however, disorder-specific questionnaires measure the degree of dysfunction rather than symptoms. It is therefore important to determine whether the difference between general and disorder-specific instruments also occurs in the case of personality disorders.

This study aimed to determine whether early changes in symptoms as measured with a general questionnaire and personality dysfunction as measured with specific questionnaires in the treatment of patients with a personality disorder would predict personality dysfunction post-treatment.

Method

Design and Setting

A cohort study was conducted using data from patients with a personality disorder who attended treatment at the Pro Persona mental health institution within the Center for Psychotherapy (CfP). The primary dependent variable was the level of personality dysfunction as measured by the domains on the Severity Indices of Personality Problems (SIPP); the secondary treatment outcome was the level of personality dysfunction as measured by the General Assessment of Personality Disorder (GAPD). The independent variables were changes on the SIPP, on the GAPD, and in general symptomatology as measured by the Outcome Questionnaire 45.2 symptom distress scale.

The CfP (which was closed in July 2022 due to financial factors) was a third line, in- and outpatient treatment center for patients with a personality disorder. The treatment team consisted of psychiatrists, clinical psychologists, psychotherapists, health psychologists, art therapists, psychomotor therapists, system therapists, and socio-therapists.

Patients

Patients were adults aged 18 and older with a personality disorder as the main diagnosis. They attended 2- or 4-day group treatment between September 2017 and March 2022 for 30 to 36 weeks. Within the CfP, there were two treatment clusters with a psychodynamic or a schema-therapeutic orientation. Within the psychodynamic treatment cluster, predominantly patients with internalizing personality problems were treated. In the schema-therapeutic treatment cluster, patients with predominantly externalizing personality problems were treated. At intake, diagnoses and treatment recommendations were made based on each patient’s history, the views of the patient’s family or good friend(s), and psychological test results. The main difference between the two treatment clusters, besides the difference in orientation, was the extent to which therapeutic pressure on patients could be increased. For example, the therapeutic treatment climate within the schema-therapy cluster was more structured, with an aim of reducing pressure on patients with more externalizing personality problems.

Measures

The Severity Indices of Personality Problems (SIPP; Verheul et al., 2008) is a self-report personality questionnaire that focuses on personality traits present in all personality disorders. It consists of 16 facets divided into five domains: Social Attunement, Relational Functioning, Identity Integration, Responsibility, and Self-Control.

On the SIPP, no total score is calculated, but scores on subscales indicate the severity of personality dysfunction in the five domains. High scores indicate a more adaptive and better functioning personality. Items are rated on a 4-point Likert scale (1 = completely disagree, 2 = partially disagree, 3 = partially agree, and 4 = completely agree). The SIPP has two versions, a diagnostic version (SIPP-118) and an abbreviated version consisting of 60 items (SIPP-SF). Items on the SIPP-SF were derived from the long SIPP version. At intake, the diagnostic version (SIPP-118) was administered. The abbreviated version (SIPP-SF) was administered at the interim assessment (during treatment) and at the final assessment (at the end of treatment) and was used as a treatment outcome measure. The reliability and validity of the SIPP-118 are adequate to good (Feenstra et al., 2011; Verheul et al., 2008). The internal consistency of the SIPP-SF is good, with Cronbach’s alpha between .81 and .88 (Rossi et al., 2017). The SIPP-118 is sensitive to changes in personality aspects during treatment (Feenstra et al., 2011).

The GAPD (Berghuis & Livesley, 2022) is a self-report questionnaire that measures core components of personality dysfunction and is based on Livesley’s (2003) Adaptive Failure model. The main scales of the GAPD are strongly related to the Self and Interpersonal Functioning elements of the alternative DSM-5 model of personality disorders (APA, 2013). The total score can be used to express overall personality dysfunction. Unlike the SIPP, high scores on the GAPD reflect greater personality dysfunction. Items are rated on a 5-point Likert scale (1 = completely disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = completely agree). As with the SIPP, at intake in this study we used the long, diagnostic version (83 items) of the GAPD. The abbreviated version, the GAPD-SF (28 items), was administered at the intermediate and post-treatment assessments. Items on the GAPD-SF were derived from the long version of the GAPD. Research on the psychometric characteristics of the GAPD, which included four Dutch and Canadian samples, showed that the GAPD has adequate internal consistency (α ranged between .78 and .98) and good test–retest reliability (r ranged between .89 and .96 for the main scales) and that it discriminated between patients with and patients without a personality disorder (Berghuis & Livesley, 2022; Berghuis et al., 2013). Research on the validity of the GAPD is ongoing. Regarding convergent validity, moderately strong correlations were found between the GAPD and the SIPP-118 (Berghuis & Livesley, 2022).

General symptomatology. The Outcome Questionnaire 45.2 (OQ-45.2) is a self-report questionnaire consisting of 45 items that ask about how the patient has felt in the past week (Lambert et al., 2004). The OQ-45.2 measures several domains across three subscales: Symptomatic Distress (SD), Interpersonal Relationships (IR), and Social Role (SR). A total score is calculated by summing all the items. Items are rated on a 5-point Likert scale (0 = never, 1 = rarely, 2 = sometimes, 3 = often, 4 = almost always). The reliability of the Dutch version of the questionnaire is good, its validity is adequate, and it is highly sensitive to change on all subscales (De Jong et al., 2008). In the current study, only the SD scale was used. Of the three subscales, it has the highest internal consistency, discriminates between a normal and a clinical population (Timman et al., 2017), and is the most sensitive to change (De Beurs et al., 2019).

Procedure

As part of the treatment at CfP, patients’ progress was monitored through ROM, and the ROM trajectory was linked to scheduled evaluation times. Data consisted of the ROM scores of 841 patients from September 2017 to March 2022. The pre-treatment, intermediate, and post-treatment scores on the SIPP-118, SIPP-SF, OQ-45.2, GAPD, and GAPD-SF were used in this study. Patients who objected to their data being used could report this to the Care Monitoring Department of Pro Persona, after which they were excluded from the study. All patients’ data were anonymized according to the k-anonymity method (Sweeney, 2002) before being made available for the study.

Definition of Early Change

In previous studies of the predictive value of early change in the treatment of patients with an anxiety or a mood disorder (e.g., Schibbe et al., 2014), early was often defined as sometime during the first half of treatment, and this rule was also followed in this study. Assessment times were defined as follows:

  • The initial assessment was undertaken before the intake appointment and as close as possible to the start of treatment.

  • The early change was measured at two time points: the first measurement fell between 2 and 8 weeks, and the second between 8 and 15 weeks.

  • The final assessment was taken between the 4 weeks before and the 4 weeks after treatment ended.
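The assessment windows listed above can be written as a small classification rule. The following Python sketch is illustrative only and is not the procedure used in the study; the function names and labels are hypothetical, and assigning a measurement taken at exactly 8 weeks to the first window is an assumption, because the text does not state to which window that boundary belongs.

```python
from typing import Optional


def classify_early_assessment(weeks_since_start: float) -> Optional[str]:
    """Map an assessment time (weeks after treatment start) to an early-change window."""
    if 2 <= weeks_since_start <= 8:
        # First early-change measurement: between 2 and 8 weeks.
        # Assumption: week 8 itself is counted in this first window.
        return "early_1"
    if 8 < weeks_since_start <= 15:
        # Second early-change measurement: between 8 and 15 weeks.
        return "early_2"
    return None  # outside both early-change windows


def is_final_assessment(weeks_since_start: float, treatment_weeks: float) -> bool:
    """True if the assessment falls within 4 weeks before or after the end of treatment."""
    return abs(weeks_since_start - treatment_weeks) <= 4
```

For a 36-week treatment, for example, an assessment at week 33 would count as a final assessment under this rule, whereas one at week 41 would not.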

Statistical analyses

G*Power 3.1 (Faul et al., 2009) was used to calculate the number of participants needed. With a medium effect size of 0.10 (for multiple regression), an α of 0.05, and a power of 0.80, the required sample size was 151. A medium effect size was estimated based on previous studies (Schibbe et al., 2014).

Before running the main analyses, patterns of missing data were analyzed (using the VIM package for R) and missing data were imputed using a machine learning-based imputation method built on the Random Forest algorithm. The advantage of this method is that it can handle non-linear and interaction effects. After imputation, scores on the measures were converted to z-scores and the assumptions for regression were checked, including by inspecting residual plots and running a curve estimation. Furthermore, distributions of the diagnoses over treatment types and frequency were tested using a chi-square test, correlations between measures were explored, and pre-post Cohen’s d effect sizes were calculated for all the measures.
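To make the two descriptive computations mentioned above concrete, the following Python sketch standardizes a set of scores and computes a pre-post Cohen’s d. It is a minimal illustration under stated assumptions and not the authors’ R or SPSS code: the function names are hypothetical, the sample standard deviation is used for standardization, and d is computed here with the pooled pre and post standard deviations as the denominator, which is one common convention (the text does not specify which variant was used).

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(values):
    """Standardize scores to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def cohens_d_pre_post(pre, post):
    """Pre-post effect size: mean change divided by the pooled pre/post SD.

    Assumption: pooled SD = sqrt((SD_pre^2 + SD_post^2) / 2).
    """
    pooled_sd = sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    return (mean(post) - mean(pre)) / pooled_sd


# Hypothetical scores for five patients (not study data).
pre_scores = [1, 2, 3, 4, 5]
post_scores = [2, 3, 4, 5, 6]
standardized = z_scores(pre_scores)
effect_size = cohens_d_pre_post(pre_scores, post_scores)
```

The sign of d depends on the scoring direction of the instrument: on the SIPP higher scores indicate better functioning, whereas on the GAPD and the OQ-45.2 higher scores indicate more problems.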

Early change, the change that occurred in the first half of treatment, was measured in three ways: Early change in symptoms measured by participants’ scores on the SD subscale of the OQ-45.2; early change in personality dysfunction measured on the SIPP-SF domains and scores on the GAPD. As described above, early change was measured at two time points. The outcome measures, or dependent variables, were residualized post-treatment scores on the SIPP and the GAPD, case-mix corrected for the following covariates: age, gender, pre-treatment scores on all the measures, treatment frequency (2- or 4-day treatment), and treatment type (a psychodynamic or a schema-therapeutic orientation).
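The idea of a residualized post-treatment score can be illustrated with a single covariate. The Python sketch below regresses post-treatment scores on pre-treatment scores and keeps the residuals, that is, the part of the post-treatment score not explained by the baseline. It is a simplified illustration, not the study’s procedure: the study corrected for several covariates at once (age, gender, pre-treatment scores, treatment frequency, and treatment type), and the function name and data are hypothetical.

```python
from statistics import mean


def residualize(post, pre):
    """Residuals of post-treatment scores after a simple linear regression on pre-treatment scores."""
    mean_pre, mean_post = mean(pre), mean(post)
    # Ordinary least squares slope and intercept for one predictor.
    sxx = sum((x - mean_pre) ** 2 for x in pre)
    sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = mean_post - slope * mean_pre
    # Residual = observed post score minus the score predicted from baseline.
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


# Hypothetical scores for four patients (not study data).
pre_scores = [1, 2, 3, 4]
post_scores = [2, 4, 5, 9]
residuals = residualize(post_scores, pre_scores)
```

A positive residual means the patient ended treatment with a higher score than expected from baseline, a negative residual a lower one; by construction the residuals are uncorrelated with the covariate.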

Multiple regression analyses were performed to compare the predictive value of early change in general symptoms with that of early change in personality functioning. Because the SIPP as an outcome measure could not be calculated as a mean total score, analyses were conducted separately for each domain. In each regression analysis, the early change on the same domain/measure as the dependent variable was entered in the first model; in the second model, the other SIPP domains, the GAPD, and the OQ-45 SD were added. Separate analyses were performed for the two intermediate time points. The standardized beta (β) was used to compare the variables in the second models to see which had the strongest relationship with the dependent variable. Because of the multiple comparisons, a Bonferroni correction of the significance level was used to protect against a Type I error. The p value was set at .005.
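A Bonferroni correction divides the nominal significance level by the number of comparisons. The Python sketch below shows the arithmetic; it is illustrative only. The number of comparisons is not stated in the text, and the value of 10 is an assumption chosen here solely because .05 / 10 reproduces the reported threshold of .005. The function names and p values are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def significant_after_correction(p_values, alpha=0.05, n_comparisons=10):
    """Flag which p values fall below the corrected threshold.

    Assumption: n_comparisons=10 is not taken from the study.
    """
    threshold = bonferroni_threshold(alpha, n_comparisons)
    return [p < threshold for p in p_values]


# Hypothetical p values (not study results).
flags = significant_after_correction([0.001, 0.004, 0.02, 0.03])
```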

Analyses were performed in R and SPSS 29. The R package VIM was used for the analysis of missing data and missForest for imputation of missing data.

Missing Data

There were no missing data on the covariates age, gender, treatment frequency, and treatment type. However, there were missing data on the OQ-45, GAPD, and SIPP. The percentage of missing data varied between 23.1 and 68.7%, depending on the measurement instrument used (see Table 1 of the Supplementary materials for the percentages for each measure and time point). Analysis of the missing data indicated that the data were Missing at Random (MAR; see the Supplementary material for further information). A random forest algorithm (missForest) was used to impute the missing data using all the variables: the patient’s age, gender, diagnosis, treatment frequency and type, and scores on all the measures. The missForest package has been shown to perform well with a large percentage of missing values (Gómez-Méndez & Joly, 2023). The normalized root mean squared error (NRMSE) gives an indication of the model’s performance, with values close to 0 indicating good performance and values around 1 indicating bad performance (Stekhoven & Bühlmann, 2012; Stekhoven, 2022). The NRMSE values for this imputation ranged between .02 and .12. Table 2 of the Supplementary materials reports the means and standard deviations of the imputed versus the original data.
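The NRMSE can be illustrated with a short computation. The Python sketch below follows the definition given by Stekhoven and Bühlmann (2012), the square root of the mean squared imputation error divided by the variance of the true values; it is a stand-alone illustration with hypothetical data and does not reproduce the study’s missForest run. Using the sample variance as the denominator is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, variance


def nrmse(true_values, imputed_values):
    """Normalized root mean squared error of imputed versus true values."""
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    # Assumption: the sample variance of the true values is used for normalization.
    return sqrt(mean(squared_errors) / variance(true_values))


# Hypothetical values that were masked and then imputed (not study data).
true_scores = [1, 2, 3, 4, 5]
perfect = nrmse(true_scores, [1, 2, 3, 4, 5])        # exact recovery
mean_imputed = nrmse(true_scores, [3, 3, 3, 3, 3])   # every value replaced by the mean
```

In this example, exact recovery gives an NRMSE of 0, whereas replacing every value by the mean gives a value close to 1, which matches the interpretation of the index given above.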

Results

Descriptive Statistics

Table 1 shows the demographic and clinical characteristics of the 841 patients. The difference in the distributions of the diagnoses over treatment types (psychodynamic and schema therapy) was significant (χ2 (4, N = 764) = 162.45, p < .001), with almost all (94%) of the patients with borderline personality disorder treated in the schema therapy cluster. The difference in the distributions of the diagnoses over treatment frequency was not significant (χ2 (4, N = 764) = 8.51, p = .074).

Table 1 Demographic and clinical characteristics

Overall, treatments were effective, with pre-post Cohen’s d effect sizes ranging between d = .39 and d = 1.07 on the SIPP domains, d = .88 on the OQ-45 SD scale, and d = .54 on the GAPD.

Early Change as Predictive of Personality Dysfunction

First, the data were explored by viewing the z-scores on the measures (see Fig. 1) and the correlations between the measures (see Table 3 of the Supplementary materials). We ran a curve estimation to check for possible curvilinear relationships. This showed that the relationships between the independent and dependent variables in our models were linear, with one exception: early change on Social Attunement at measurement time 2 showed a curvilinear relationship. Thus, a quadratic term was added to this model.

Fig. 1

Z scores on the GAPD, OQ-45 SD and SIPP domains at four measurement points. GAPD General Assessment of Personality Disorder, OQ-45.2 SD Outcome Questionnaire-45.2 Symptomatic Distress scale, SIPP Severity Indices of Personality Problems, SC self-control, SA social attunement, R responsibility, RF relational functioning, II identity integration

Tables 2 and 3 show the results of the multiple regression analyses with the early change at the two intermediate time points on the separate SIPP domains, GAPD, and OQ-45 SD scales as predictors of the residualized post-treatment scores on the SIPP domains and GAPD scores. No indications of multicollinearity were found in any of the regression analyses, with the tolerance ranging between .61 and 1 and the VIF between 1.21 and 1.63.
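Tolerance and the variance inflation factor (VIF) are two expressions of the same quantity: for a given predictor, tolerance is 1 minus the R² obtained when that predictor is regressed on the other predictors, and the VIF is the reciprocal of the tolerance. The Python sketch below shows this relationship with hypothetical values; the function name is an assumption and the numbers are not study results.

```python
def tolerance_and_vif(r_squared_with_other_predictors: float):
    """Return (tolerance, VIF) for a predictor, given its R^2 on the other predictors."""
    if not 0 <= r_squared_with_other_predictors < 1:
        raise ValueError("R^2 must be in the interval [0, 1)")
    tolerance = 1 - r_squared_with_other_predictors
    vif = 1 / tolerance
    return tolerance, vif


# Hypothetical examples (not study results).
uncorrelated = tolerance_and_vif(0.0)   # predictor unrelated to the others
half_shared = tolerance_and_vif(0.5)    # half of its variance shared with the others
```

A predictor that is unrelated to the others has a tolerance and a VIF of 1; the more variance it shares with the other predictors, the lower the tolerance and the higher the VIF.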

Table 2 Results of Regression analyses of the effect of early change on SIPP post-treatment scores
Table 3 Results of Regression analyses of the effect of early change on GAPD post-treatment

The results showed that early changes in scores on a specific SIPP domain were the strongest predictors of the residual scores on that same SIPP domain, whereas early changes on the other SIPP domains were not significant predictors (when the Bonferroni-corrected p value of .005 is applied). The second most significant predictor for the SIPP domains of Self-Control, Relational Functioning, and Identity Integration was the early change on the OQ-45 SD. For the Self-Control domain, the third most significant predictor was the early change on the GAPD. The second strongest predictor of the SIPP domains Social Attunement and Responsibility was the early change on the GAPD; on these domains, the early change on the OQ-45 SD was not a significant predictor.

When the residual score on the GAPD was the outcome, the early change on the GAPD at both intermediate measurement points was the strongest predictor, followed by early changes on the OQ-45 SD scale and the SIPP domain Social Attunement, while the other SIPP domains did not predict the GAPD.

Discussion

The primary objective of this study was to determine whether change shown early in treatment by patients with a personality disorder could predict their ultimate treatment outcome, as measured by two outcome measures, the SIPP-SF and the GAPD. A secondary aim was to determine whether the predictive value of early change depends on whether it is assessed with a general or a disorder-specific measurement instrument.

The results showed that early changes on a specific domain of the SIPP were the strongest predictors of case-mix corrected final scores on that same domain, and the proportion of explained variance ranged from 10.9 to 33.4%. This indicates that improvements or declines in specific personality domains early in treatment are highly indicative of the final outcomes in those same domains. Changes in other SIPP domains were not significant predictors of case-mix corrected final scores on a specific domain. This suggests that cross-domain predictions within SIPP are weak or non-existent. The second most significant predictor for the domains of Self-Control, Relational Functioning, and Identity Integration was the early change on the OQ-45 SD scale. This indicates that early improvements in overall symptom distress are relevant for predicting outcomes in these three personality domains. The second most significant predictor for the domains of Social Attunement and Responsibility and the third most significant predictor of the domain of Self-Control was early change in the GAPD. Thus, early changes in the general severity of personality disorder symptoms also play a role in predicting outcomes in these domains.

For the GAPD as an outcome measure, early changes in the GAPD itself at both intermediate measurement points were the strongest predictors of case-mix corrected GAPD post-treatment scores. Early changes on the OQ-45 SD scale and the SIPP domain Social Attunement were also significant predictors, indicating that improvements in overall symptom distress and Social Attunement might be relevant for predicting outcomes on the GAPD. Interestingly, in the case of the GAPD as an outcome measure, the predictive value of early change on the OQ-45 SD scale was stronger than that of all domains of the SIPP. In fact, early changes in the SIPP domains of Relational Functioning, Identity Integration, Responsibility, and Self-Control were not predictive at all. The takeaway is that even though they both measure personality functioning, the SIPP domains and the GAPD cannot be used interchangeably to predict each other.

Another relevant finding was that no difference was found between the two early change moments. This means that, for patients with a personality disorder as well, it can be assessed early in treatment (within the first 8 weeks) whether they are responding well to treatment.

As to why early changes on the OQ-45 SD predict certain SIPP domains and the GAPD but not others, this might have to do with the relationships between general symptom distress and specific personality traits. Symptom distress may have a more immediate and direct impact on domains such as Self-Control, Relational Functioning, and Identity Integration because these areas are closely connected to an individual’s emotional and psychological state. For example, the items of the SIPP domain Identity Integration seem to overlap with those on the OQ-45 SD scale. For instance, the items “I often see no reason to continue living” and “I often feel that I am not as worthy as other people” are similar to the OQ-45 SD items “I have thoughts of ending my life” and “I feel worthless.” The highest correlation found between the OQ-45 SD and the SIPP domains in the present study was with the Identity Integration domain (see Table 3 in the Supplementary materials). Additionally, in a study with adolescents, the correlation between the general severity index of the Symptom Checklist (SCL-90) and the Identity Integration domain of the SIPP was r = −0.80, which is very high (Feenstra et al., 2011). The reason why early changes on the OQ-45 SD did not predict outcomes on the Social Attunement and Responsibility domains might be that these domains rely on more stable traits or skills that are less directly affected by general symptom distress and that require different therapeutic approaches for improvement.

Arguably, the explained variance added by early change on measures other than the measure or domain that was the outcome variable was limited (changes in explained variance ranged from 2.3 to 9.0%). This suggests, for instance, that the OQ-45 SD cannot be relied upon to be the sole or best predictor of outcome, and that disorder-specific measures should always be used alongside the OQ-45 SD when monitoring treatment response in patients with personality disorders.

Furthermore, another question for future research relevant to clinical practice is how much early change is indicative of good treatment outcomes. The reliable change indices (Jacobson & Truax, 1991) for the SIPP and the GAPD may be helpful in establishing clinical guidelines for this and in interpreting the scores. Future research should also focus more on the overlap and differences between the SIPP and the GAPD, as they might measure different aspects of personality functioning or differ in their sensitivity to change, also considering that it might be burdensome for patients to complete both the SIPP and the GAPD.
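The reliable change index of Jacobson and Truax (1991) divides the observed change by the standard error of the difference between two measurements, which is derived from the instrument’s standard deviation and reliability. The Python sketch below illustrates the computation; the function name is hypothetical, and the scores, standard deviation, and reliability are invented for illustration and are not norms for the SIPP or the GAPD.

```python
from math import sqrt


def reliable_change_index(pre: float, post: float, sd_pre: float, reliability: float) -> float:
    """Reliable change index (Jacobson & Truax, 1991)."""
    # Standard error of measurement of the instrument.
    se_measurement = sd_pre * sqrt(1 - reliability)
    # Standard error of the difference between two measurements.
    s_diff = sqrt(2 * se_measurement ** 2)
    return (post - pre) / s_diff


# Hypothetical patient and instrument values (not SIPP or GAPD norms).
rci = reliable_change_index(pre=40, post=30, sd_pre=10, reliability=0.9)
is_reliable = abs(rci) > 1.96  # change unlikely (p < .05) to reflect measurement error alone
```

In this hypothetical case, a 10-point decrease yields an index of about −2.24, which exceeds the conventional cut-off of 1.96 in absolute value and would therefore count as reliable change.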

There are several limitations of this study that need to be considered. The major limitation was the large percentage of missing data at various time points. We addressed this issue by using a Random Forest algorithm to impute the missing data. Imputing data is preferable to excluding participants with missing data, because exclusion can lead to biased results and conclusions (Madley-Dowd et al., 2019). Some of the missing data might be explained by the lack of checks to ensure that the questionnaires were completed during treatment and by the even more limited control over questionnaire completion after patients had been discharged from treatment. Additionally, the motivation of patients to complete the questionnaires might have diminished after the termination of their treatment. Moreover, some patients transitioned to another treatment program within the same mental health organization after having received treatment in the current program (CfP), and personality questionnaires are not routinely included in the ROM in other treatment programs.

Another limitation of this study was the relatively small percentage of patients with a borderline personality disorder. This might be due to the fact that at the same mental health organization (but at other treatment locations than the CfP), Dialectical Behavior Therapy (DBT) was available as another intensive treatment option for patients with externalizing personality and severe emotion regulation problems. Therefore, it is likely that some patients with a borderline personality disorder were referred to the DBT program.

Despite these limitations, the current study is, to the best of our knowledge, the first study to examine early change in a population of patients with a personality disorder. This study showed that, concerning personality dysfunction, early changes on a specific domain or measure predicted outcomes in that same domain or measure. Furthermore, early changes in symptom distress (OQ-45 SD) were significant predictors for several personality traits, especially Self-Control, Relational Functioning, and Identity Integration, and for the GAPD, although their effect was modest and they should not replace disorder-specific measures. In sum, in the context of personality disorder treatments, early assessments during the initial 8 weeks of inpatient care can reveal valuable insights into treatment responsiveness.