Introduction

Spinal disorders are a common clinical condition and lead to a substantial social and economic burden: every year, up to 15% of the population suffers a first-time back pain episode, and up to 80% of patients experience a recurrent episode [1]. Even though the complication rate is decreasing over time, late-onset complications after complex spine surgery remain considerable [2]. Beyond the social and economic impact of these factors [3], they can decrease the patients' quality of life, an aspect that should always be taken into account in the pre- and post-operative assessment of patients.

Patient-Reported Outcome Measures (PROMs) offer a useful tool to assess subjective clinical data such as pain or quality of life [4]. Still, it is fundamental to combine these parameters with objective data such as surgical data and clinical outcomes [5]. Registries that systematically collect data regarding clinical practice, users' safety data, and PROMs are thus an exceptional opportunity to monitor the impact and value of surgical procedures and to conduct research on the factors influencing the results and complication rate of different surgical techniques [6, 7].

The development of a clinical registry allows the merging of a large amount of data [8], and traditional paper-based systems seem expensive and inefficient compared to electronic ones [9]. The electronic administration of PROMs allows easier collection and management of a considerable volume of data, thus facilitating the workflow. Furthermore, an electronic format permits the collection of all data in a single database and makes them readily available for extraction and further analysis. To achieve the goal of merging all the data regarding a patient into one electronic dataset, a validation of the electronic version of the PROMs is required, as factors such as the graphic organization of the elements on a screen or the inability to skip questions may alter the way patients answer the questionnaire [9].

The study aimed to assess the reliability of, and the agreement between, PROMs administered via an electronic-based method and the same measures collected using a paper-based format.

Material and methods

This project was based on the retrospective analysis of patients prospectively enrolled in a spinal surgery registry, SpineReg [10]. The accuracy, reliability, and validity of the electronic-based data were evaluated by comparison with paper-based data collection, which is considered the gold standard.

The study was conducted at a single centre and included the analysis of patient-reported outcomes in subjects prior to or after spine surgery. The inclusion criteria were age ≥ 18 years, both genders, and the capability to read and understand the Italian language. Patients who were unable to understand and answer the questionnaires independently were excluded from the study. The study protocol was in accordance with the Declaration of Helsinki of 1964 as revised in 2000. The procedures followed the ethical standards of the responsible committee on human experimentation and were approved by the ethics committee of our Institution (second amendment to the SPINEREG protocol issued on 13/04/2016). The project was supported with funds from the Italian Ministry of Health (project code CO-2016-02364645). All patients gave their written informed consent for participation in this study.

The sample size (100 subjects) was determined based on the study conducted by Rankin et al. [11], who established the minimum number of cases required to test a musculoskeletal disorders questionnaire. All patients awaiting surgery or in follow-up at our institution who voluntarily agreed to participate in this study were divided into two groups: the Follow-Up group (FU), who received the questionnaires during an outpatient appointment for routine follow-up, and the Pre-Operative group (PO), who filled out the questionnaires at hospital admission, before surgery. Patients in the FU group filled in the questionnaires in paper form before (T1) and in electronic form after (T2) the clinical examination. The electronic version was administered on a tablet with the help of a supervisor, whose only role was to provide instructions on how to operate the tablet for patients who were not familiar with the device. Further instructions on how to interpret the questions (e.g. the Numeric Rating Scale, NRS: 0 = no pain, 10 = worst pain imaginable) were embedded in the questionnaire. Unlike the paper form, the electronic version showed only one item per screen, but users could scroll forwards or backwards. Questionnaire explanations were available at the top of the screen throughout completion. The current study was performed on a tablet; however, the software maintains a similar layout on multiple devices such as PCs, tablets and smartphones, except for the COMI pain scales, for which it was neither possible to add anchors at each end of the scale nor (in the smartphone presentation) to present the scale horizontally, due to the small screen size. An example of how the questionnaires appear to the patients is shown in Fig. 1. Patients in the PO group completed the electronic questionnaires first, using a tablet and under supervision (T1). After one hour, the patients completed the same PROMs in paper format (T2).
The questionnaires in electronic and paper versions were filled out on the same day in order to eliminate any score variations due to a change in the patient's clinical condition. A schematic representation of the study is presented in Fig. 2.

Fig. 1
figure 1

User interface: item presentation. The picture shows an item from the Italian version of the ODI questionnaire in the electronic format. The patients can go forward (Prossimo > >) or backward (Prec. < <). A progress bar (blue segmented bar) displays the percentage of questionnaire completion. The subjects can see the mean completion time (Tempo medio di compilazione ~ 3 min) and the number of questionnaires remaining in the survey (0 formulari rimasti da compilare). Paziente, patient's name and surname; Questionario, questionnaire; Data di compilazione, compilation date; Strumento di compilazione, instrument used to fill in the questionnaire (tablet, e-mail, outpatient interview, kiosk, phone or paper format)

Fig. 2
figure 2

Schematic representation of the study. The first evaluation, by either paper or tablet (electronic format), took place on the day of hospital admission (Pre-Operative group) or during the follow-up visit (Follow-Up group). The second evaluation, with the other format, took place 1 h later. PROMs = Patient-Reported Outcome Measures [Short Form-36 (SF-36), Oswestry Disability Index (ODI) and Core Outcome Measures Index for the back (COMI-back)]

The recruited subjects were asked to fill in three questionnaires using the already validated Italian versions of the Oswestry Disability Index (ODI) [12], the Short Form Health Survey (SF-36) [13] with its Physical Component Score (PCS) and Mental Component Score (MCS), and the Core Outcome Measures Index for the back (COMI-back) [14, 15]. These three questionnaires are routinely used to assess the disability associated with pain and to examine the general health status of the patients [16,17,18].

Numerical data were expressed as mean ± standard deviation (SD). The item-by-item correlation between the electronic and paper versions was estimated using the Gamma correlation coefficient. The correlation was defined as excellent for γ = 1.0, strong for 0.3 ≤ γ < 1, moderate for 0.09 ≤ γ < 0.3 and weak for 0.01 ≤ γ < 0.09. To assess the reliability between Likert-type item responses in the paper and electronic formats, the linear weighted Kappa statistic was used. The levels of the Kappa statistic were defined as follows: 0.00–0.20 poor or slight agreement; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 substantial or good; 0.81–1.00 very good or perfect [19].
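To make the item-level agreement statistic concrete, the following is a minimal sketch of a linear weighted Cohen's kappa in Python. It is an illustration only, not the study's actual SPSS analysis; the function name and the toy responses are hypothetical.

```python
def linear_weighted_kappa(ratings_a, ratings_b, n_categories):
    """Linear weighted Cohen's kappa for two ratings of the same items.

    ratings_a / ratings_b: paired integer responses (0 .. n_categories - 1),
    e.g. the paper and electronic answers to one Likert-type item
    (hypothetical example, not the study's data).
    """
    n = len(ratings_a)
    k = n_categories
    # observed proportions table
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        obs[a][b] += 1.0 / n
    # marginal distributions (expected table under independence is pa[i]*pb[j])
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # linear disagreement weights: |i - j| / (k - 1)
    num = sum(abs(i - j) / (k - 1) * obs[i][j]
              for i in range(k) for j in range(k))
    den = sum(abs(i - j) / (k - 1) * pa[i] * pb[j]
              for i in range(k) for j in range(k))
    return 1.0 - num / den
```

Perfect agreement between the two formats yields κ = 1, while any disagreement moves the coefficient below 1 in proportion to how far apart the two chosen response categories lie.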

The paired t-test for related samples was used to compare normally distributed parameters. The Mann–Whitney U test was performed to determine the differences in continuous variables between the two cohorts, while the Chi-square test (χ2) was used for categorical variables. Non-normally distributed variables were compared with the two-tailed Wilcoxon signed-rank test for dependent variables.

In accordance with previous studies on the same topic, an ICC value greater than or equal to 0.90 was taken to indicate excellent reliability between the electronic and paper forms of the PROMs [20, 21].

To determine the amount of variation in the measurement errors for the electronic and paper formats, the Standard Error of Measurement was calculated as \(SEM_{\mathrm{agreement}}=\sqrt{\sigma_{r}^{2}+\sigma_{pt}^{2}}\), where \(\sigma_{pt}^{2}\) represents the variance due to systematic differences between the two types of questionnaire administration (electronic and paper) and \(\sigma_{r}^{2}\) represents the residual variance (namely the part of the variance which cannot be attributed to specific causes). This measure also accounts for inter-rater variation, and thus provides a measure of agreement. Furthermore, the minimal detectable change, \(\mathrm{MDC}_{95}=1.96 \times \sqrt{2} \times \mathrm{SEM}\), was calculated to estimate the size of any change not likely to be due to measurement error [22]. The Bland–Altman analysis graphically showed the agreement by plotting the difference between the paper and electronic scores against their mean. The limits of agreement (LOA) were also used to estimate the absolute agreement between test and retest of the different questionnaires [23]. Furthermore, the whole set of agreement and correlation tests between the paper and electronic formats was calculated separately for the PO and FU groups. The statistical analysis was performed with SPSS (IBM SPSS Statistics, Version 22.0; IBM Corp., Armonk, NY) and conducted with a 95% confidence interval (CI). Statistical significance was set at p < 0.05.
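The agreement measures above can be sketched in a few lines of Python. This is an illustrative implementation under common assumptions (variance components estimated from the same two-way ANOVA as the ICC; LOA as mean difference ± 1.96 SD); the function name and any data passed to it are hypothetical, not the study's computation.

```python
import math

def agreement_stats(paper, electronic):
    """SEM_agreement, MDC95, bias and Bland-Altman 95% limits of agreement
    for paired paper/electronic scores (illustrative sketch)."""
    n, k = len(paper), 2
    rows = list(zip(paper, electronic))
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(x[j] for x in rows) / n for j in range(k)]
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sse = sum((rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))
    var_r = mse                              # residual variance (sigma_r^2)
    var_pt = max((msc - mse) / n, 0.0)       # systematic format variance (sigma_pt^2)
    sem = math.sqrt(var_r + var_pt)          # SEM_agreement
    mdc95 = 1.96 * math.sqrt(2) * sem        # minimal detectable change
    diffs = [e - p for p, e in rows]         # electronic minus paper
    bias = sum(diffs) / n
    sd_d = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    loa = (bias - 1.96 * sd_d, bias + 1.96 * sd_d)   # Bland-Altman limits
    return sem, mdc95, bias, loa
```

A bias near zero with narrow LOA, as reported in the Results, is the pattern that indicates the two formats can be used interchangeably.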

Results

From January 2017 to October 2019, 1564 patients requiring spine surgery were recorded in the electronic registry and were eligible for the study. Of these, 100 volunteered to participate in the study, gave their informed consent and were enrolled in the project (mean age 55.6 ± 14.9 years). Among these patients, 59% were women and 41% were men. Forty-six patients were enrolled in the PO group, and 56 in the FU group. The follow-up visits were performed 3 to 24 months after surgery. No statistically significant differences, in terms of age and gender, were observed between the eligible population and the patients enrolled in the study.

There were no significant differences in demographics between the PO and FU groups (female: PO 59.3%, FU 58.7%, p = 0.954; age: PO 54.0 ± 15.9, FU 57.5 ± 13.8, p = 0.294). The quality-of-life and disability scores in the PO group were as follows: ODI 41.3 ± 17.3, SF-36 MCS 46.7 ± 12.1, SF-36 PCS 35.4 ± 8.2, COMI-back 7.1 ± 1.9. The results of the same questionnaires in the FU group were ODI 30.8 ± 18.6, SF-36 MCS 48.8 ± 9.5, SF-36 PCS 37.5 ± 9.2, COMI-back 4.2 ± 2.2.

The values of the Gamma coefficients showed a correlation between the paper and electronic forms ranging from strong to excellent (minimum Gamma = 0.91, maximum = 1.00). The weighted Kappa coefficients showed a significant positive agreement between the paper and electronic forms for all examined items (minimum Kappa = 0.67, maximum = 0.95).

The reliability between the results of the paper and electronic questionnaires was excellent, with ICC values ranging from 0.963 (ODI, overall sample) to 0.982 (COMI-back, overall sample). The test–retest reliability was high for both the PO and FU groups (p < 0.001 for both). The mean absolute difference, SEM and MDC95% for paper–electronic reliability in the PO and FU groups are reported in Table 1. The 95% limits of agreement ranged from −9.72 to 9.04 for the ODI, from −0.82 to 0.75 for the COMI-back, from −4.01 to 4.45 for the SF-36 MCS, and from −3.38 to 3.37 for the SF-36 PCS. The mean ± SD and LOA of the Pre-Operative and Follow-Up groups are summarized in Table 1.

Table 1 Reliability and Agreement analysis

The Bland–Altman graphs (Fig. 3) showed that the mean differences were −0.337 for the ODI, −0.036 for the COMI-back, 0.244 for the SF-36 MCS and −0.012 for the SF-36 PCS. The differences between the two acquisition formats were distributed around the zero line for all questionnaires. No systematic bias between the electronic and paper forms was detected.

Fig. 3
figure 3

Bland–Altman plots. Bland–Altman plots of the ODI (A), COMI-back (B), SF-36 MCS (C) and SF-36 PCS (D) assessed for the electronic and paper formats. The bias (mean) between the two methods is marked by the full line, the overall upper and lower limits of agreement (LOA) by the bold lines, the Pre-Operative upper and lower LOA by the broken lines, and the Follow-Up upper and lower LOA by the dash-dotted lines. SF-36 MCS, Short Form-36 Mental Component Score; SF-36 PCS, Short Form-36 Physical Component Score; ODI, Oswestry Disability Index; COMI-back, Core Outcome Measures Index for the back. Triangles: Follow-Up cases; circles: Pre-Operative cases

Discussion

Several studies have demonstrated the correlation between the paper and electronic formats of PROMs [9]. Our results are consistent with the scientific literature examining the reliability and consistency of the electronic versions of PROMs, as well as the comparability of their results with the paper form [24]. The Gamma coefficients denoted a high correlation between the two methods used for filling out the questionnaires. Furthermore, the weighted Kappa coefficient showed good to perfect reliability between the paper and electronic versions.

The comparison between the results of the paper and electronic questionnaires revealed an excellent test–retest reliability for all questionnaires. The SEM for the overall scores was acceptable for all questionnaires: 3.39/100 points for the ODI, 0.28/10 for the COMI-back, 1.53/100 for the SF-36 MCS and 1.21/100 for the SF-36 PCS. Furthermore, the Bland–Altman limits of agreement were comparable with the MDC95% of our study; both were smaller than the MCIC values reported in the literature for all the questionnaires analysed (ODI = 15U [12], COMI-back = 3U [16], SF-36 = 4.9U [25]). Thus, the electronic form can be used for the same clinical purposes as the paper form. The mean bias between the paper and electronic forms was always plotted around zero, confirming the absence of a systematic error. Since the two analysed groups were uniform and there were no systematic differences between the electronic and paper collection methods, these can be used interchangeably.

Having assessed the absence of significant differences between the two methods, the authors believe that the electronic format will, over time, replace the paper form. This prediction is based on the obvious logistic and data-usability advantages of distributing, collecting and completing PROMs within electronic registries [26]. Several advantages follow from the systematic use of electronic formats. Patient scores can be calculated in real time and easily monitored and shared with the other members of the research team. Electronic-based data collection allows patients to complete the questionnaires remotely, thus reducing the time spent in hospital facilities; this represents a considerable advantage for preserving social connections while promoting physical distancing [27]. The distribution of electronic questionnaires is also favoured by the constant increase in users of electronic devices (smartphones, tablets or personal computers). According to epidemiologic surveys, 80% of people in Europe have internet access, with peaks of 88% and 87% in the UK and Spain, respectively [28]. In Italy, 76.1% of the population has an internet connection; this percentage grows to over 90% among young people (15–24 years), and the number of internet users among 65–74-year-olds has also been increasing in recent years (up to 41.9%) (ISTAT 2019, the Italian Statistics Institute). Despite these encouraging numbers, clinicians cannot yet assume that all patients have an internet connection and are familiar enough with electronic devices to fill out the questionnaires remotely. It is essential to discuss these topics during office visits, to provide patients with the support they may require (e.g. written instructions on how to fill out the questionnaires online, a user-friendly graphic interface, a contact in case further help is needed), and to offer alternative solutions such as telephone interviews.
In this study, the patients were aided by supervisors to optimize tablet use.

The advantages of collecting PROMs in electronic format can lead to a further spread in the use of this tool. It is established that patient-reported outcomes have considerable clinical value and, combined with objective clinical and surgical data, allow the clinician to obtain a complete and accurate understanding of a patient's condition [29]. Stored in a registry, all these data are readily available for use in personalized medicine [30]. This is particularly significant for spine conditions, where the numerous available surgical techniques and the differences in surgical strategies open a wide range of different scenarios [31, 32]. The storage and analysis of the outcomes of previously treated patients may help define a surgical approach tailored to the needs of a specific patient group [33,34,35]. In 2019, Ghogawala et al. [36] examined how clinical registries might be used to generate new evidence to support a specific treatment option when comparable alternatives exist. In particular, artificial intelligence can be used to build mathematical algorithms that approximate conclusions from real medical and patient-reported data, in order to develop data-driven personalized care [37]. In the era of "omic" data, with a large mixture of information coming from several medical fields (genomics, proteomics, nutrigenomics and phenomics), registries can make the difference in the capability to identify novel associations between biological and medical events [38]. An electronic registry allows the collection and storage of a large number of records in a single database, in turn permitting easy access to and extraction of the data for multiple purposes: follow-up and complication analysis, research [39], quality control and risk modelling, establishing standards, sharing best practices, and building trust among institutions [40, 41].

Interest in the predictive value of PROMs has been steadily increasing. A recent systematic review [42] showed that predictive models using PROMs in spine surgery could be used for the quantitative prediction of clinical outcomes. In particular, the authors were able to define a mathematical model capable of identifying several key predictors of surgical results, such as age, sex, BMI, ASA score, smoking status, and previous spinal surgeries. Another recent study [25] showed that a specific pre-operative threshold for the ODI and other PROMs was capable of predicting a significant clinical improvement after surgery for adult spinal deformity. These results highlight the usefulness of PROMs in pre-operative decision-making. Given the changing health-care environment, electronic-based data collection systems seem essential and can boost the development of surgical registries.

Despite the encouraging conclusions of our study, several limitations should be taken into account when evaluating the results. First, the lack of randomization and the low rate of eligible patients who agreed to participate may have introduced a selection bias. The patients filled in the questionnaires in paper and electronic form at two close moments in time, which entails a high risk of recall effects between the two completions. A longer pause between the two formats, or repeated administrations of the questionnaires, might have mitigated this limitation. An ideal scenario for assessing the reliability of repeated measurements would have been to administer the questionnaires a second and third time in the same way and at the same time intervals (i.e. T1 paper, T2 electronic, T3 electronic; or T1 electronic, T2 paper, T3 paper). This methodological approach could potentially prove that the "intra-method" and "inter-method" variability does not increase when questionnaires are administered electronically.

A substantial limitation of the study relates to the methodological approach in the FU population. Specifically, the test and retest were performed before and after the clinical examination, potentially introducing a distortion effect through the clinical interview. Based on our data, the measurement error was higher in the FU population. In particular, the LOA for the ODI were acceptable in the overall population but excessively wide in the FU group. This means that the agreement of these measures was not sufficient to allow them to be used interchangeably, an aspect that requires further investigation. Random allocation of the order in which the PROMs were administered (paper first or electronic first), and the absence of any intervention between the two administrations, might have addressed some of these limitations.

Even with proofreading and double-entry verification, the additional transcription steps required by the paper version represent a potential source of bias.

Furthermore, the influence of a supervisor while the patients filled in the questionnaires may have introduced a bias. The presence of a supervisor, and his or her attitude towards PROMs, has been shown to influence patients, above all the more vulnerable ones [43]. Further studies are required to investigate the effects of the presence of a supervisor on the patients' perspective and to analyse how this is reflected in study results. An analysis of possible differences in the time required for the patients to fill out the different versions of the questionnaires, and of the preferred format, was not performed. The fact that the PO and FU groups were administered the questionnaires in a different order may also have introduced a bias; however, a separate analysis of PROMs collected before surgery and at follow-up was not conducted. Considering the limitations of the study, our results provide only preliminary evidence, and further research is necessary to consolidate the acquired knowledge.

Conclusion

The results of our study showed excellent reliability and significant agreement between the paper-based and electronic-based data collection systems for three questionnaires relevant to spine surgery: the ODI, SF-36, and COMI-back. This validation supports the use of the electronic versions of these PROMs, which allows quick access to the data for daily clinical practice and research. Further studies will be necessary to consolidate the acquired knowledge and to prove the efficacy of electronic registries in paving the way for data-driven personalized care.