Patient-reported outcome measures (PROMs) have the potential to transform health care delivery through enhancing the clinical management of patients and assessing the quality of providers’ performance [1, 2]. To date, the use of PROMs in assessing the outcome of hospital admissions has inevitably been restricted to elective surgery in which before and after measurements of patients’ symptoms, functional status and health-related quality of life can be compared. The most ambitious example of this covers four elective surgical procedures in the NHS in England [3].

The challenge of using PROMs for emergency admissions, which account for 40% of hospital inpatients in England, has not been addressed and yet this is an area of increasing resource use, political importance and concern about variations in quality of care [4]. The methodological challenge is how to quantify outcome when a patient’s health status before their sudden and unexpected ill-health that led to an emergency hospital admission is, inevitably, not available. One potential solution would be if patients were able to recall accurately their health status before the admission. If they could, then a retrospective (or recalled) PROM would offer a means of obtaining their baseline health status in the absence of a prospectively collected contemporary report.

A recent literature review on the relationship between retrospective and contemporary health status reports found strong agreement when the recall period is short [5]. However, only six studies have been undertaken, of which only one was conducted in the UK [6]. The relevance of findings from other countries is uncertain given the potential influence of culture and other contextual factors. In addition, only two studies considered the influence of patients’ characteristics, such as socio-demographic factors, on the relationship. Both studies found that agreement was slightly weaker in older patients [7, 8].

Our aim was to investigate the relationship between retrospective and contemporary PROMs in England (inevitably, in elective conditions) and to explore the influence on the relationship of two patient characteristics (age, socio-economic status) and the length of time between the two data collection points. Contemporary reports are often considered the ‘gold standard’ so if retrospective reports differ, it is the latter that are judged ‘unreliable’. However, in the context of PROMs, from a patient’s point of view the way they recall their previous health may be of greater relevance to them and to assessing the quality of health care than how they assessed it at the time. Rather than assuming one as the ‘gold standard’ over the other type of PROM, we consider the extent to which they agreed. We hypothesise that if the two agree then one can substitute for the other without any impact on assessment of the impact of health care interventions. If they differ, it would be necessary to consider the reasons for this and its implications for the use of PROMs in clinical management and in provider comparisons in emergency admissions.



This is a multi-centre study of patients undergoing either hip or knee arthroplasty (primary operation or revision surgery) in four hospitals, which were part of the North Thames Academic Health Science Network (UCL Partners), and CLAHRC. Health Research Authority ethics approval was obtained from North East – Newcastle & North Tyneside 2 Research Ethics Committee (REC Ref: 16/NE/0081).

Patients were eligible if, as part of the National PROMs Programme, they had completed a PROM questionnaire before undergoing surgery (Q1), either at a pre-operative assessment clinic or on their day of admission. They were invited to complete a retrospective PROM questionnaire (QR) in the immediate post-operative period prior to discharge asking them to recall their health status during the 4 weeks prior to surgery. Written informed consent was obtained.

Patients’ QR was deterministically linked to their contemporaneous PROMs data (Q1) using a hierarchy of patient identifiers: NHS number, date of birth, postcode and date of birth and postcode combined.
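The linkage step can be sketched as follows. This is a minimal illustration, not the study's actual linkage code: the field names are hypothetical, and it assumes that the identifiers were tried in the order listed and that a link was accepted only when exactly one Q1 record matched.

```python
from typing import Dict, List, Optional

# Match keys tried in order, following the hierarchy described in the text.
# Field names are hypothetical.
MATCH_KEYS = [
    ("nhs_number",),
    ("date_of_birth",),
    ("postcode",),
    ("date_of_birth", "postcode"),
]


def link_record(qr: Dict, q1_records: List[Dict]) -> Optional[Dict]:
    """Return the single Q1 record matching a QR record, or None if unlinked."""
    for keys in MATCH_KEYS:
        # Skip a key when the QR record has no value for it.
        if any(not qr.get(k) for k in keys):
            continue
        candidates = [
            r for r in q1_records if all(r.get(k) == qr[k] for k in keys)
        ]
        # Assumption: only an unambiguous (unique) match counts as a link.
        if len(candidates) == 1:
            return candidates[0]
    return None


q1 = [
    {"id": 1, "nhs_number": "A", "date_of_birth": "1950-01-01", "postcode": "N1"},
    {"id": 2, "nhs_number": None, "date_of_birth": "1950-01-01", "postcode": "N2"},
]
by_nhs = link_record({"nhs_number": "A"}, q1)
by_postcode = link_record(
    {"nhs_number": None, "date_of_birth": "1950-01-01", "postcode": "N2"}, q1
)
unlinked = link_record({"nhs_number": "Z"}, q1)
```

In this toy example the second QR record cannot be linked on date of birth alone, because two Q1 records share it, so the sketch falls through to postcode.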


The self-reported questionnaires included socio-demographic information: age; sex; living arrangement (with family/friends, alone, other). Socio-economic status (SES) was measured with national quintiles of the Index of Multiple Deprivation based on patients’ residential postcode [9]. Self-reported health included co-morbidities (from a list of 12 conditions); duration of primary condition (< 1, 1–5, 6–10, > 10 years); primary or revision surgery; disease-specific PROM (Oxford Hip Score or Oxford Knee Score); and a generic PROM (EQ-5D-3L)—the latter was used as it was the version used in the National PROMs Programme in England for elective surgery at the time.

The Oxford Hip Score (OHS) is a disease-specific PROM for patients undergoing total hip replacement to capture symptoms and functional status [10]. It has good face validity, construct validity and reliability, and is sensitive to change. The Oxford Knee Score (OKS) is the knee arthroplasty equivalent [11]. For both PROMs, respondents answer 12 questions to assess pain and mobility in relation to the relevant joint. Each item can be scored from 0 (severe problem) to 4 (no problem). Summated scores provide an overview, from 0 (worst) to 48 (best) health statuses [12].
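As a minimal sketch of the scoring rule just described (12 items, each scored 0 to 4, summed to a total of 0 to 48), the function below is illustrative only; its name and the decision to reject incomplete questionnaires are assumptions, not the official scoring guidance.

```python
def oxford_score(items):
    """Sum 12 item scores (0 = severe problem, 4 = no problem) into a 0-48 total."""
    if len(items) != 12:
        raise ValueError("expected 12 item responses")
    if any(i not in range(5) for i in items):
        raise ValueError("each item must be scored 0, 1, 2, 3 or 4")
    return sum(items)


worst = oxford_score([0] * 12)  # all items 'severe problem'
best = oxford_score([4] * 12)   # all items 'no problem'
```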

For the Oxford Scores, instructions were adapted to enable usage for retrospective assessment (QR) by including a statement on the timeframe with the following wording: ‘We are interested in finding out about the problems you have had with the hip (knee) on which you have had surgery. Please let us know how you were before your operation’. This kept the wording similar to the instructions for the prospective version used in the National PROMs Programme (Q1): ‘We are interested in finding out about the problems you have had with the hip (knee) on which you are about to have surgery’. The tense of individual questions was also altered, e.g. Q1: ‘During the past 4 weeks…How would you describe the pain you usually have from your knee?’ was changed to ‘During the past 4 weeks before your operation…How would you describe the pain you usually had from your knee?’.

The EQ-5D-3L has five questions that investigate the domains of mobility, usual activities, self-care, pain/discomfort and anxiety/depression [13]. For each of these questions, the respondent chooses from three responses indicating the level of their function. A multi-attribute utility score, in which death and perfect health are represented by 0 and 1 respectively, is calculated [14]. Scores less than 0 are considered worse than death and 1 is the maximum score possible. The EQ-VAS (a visual analogue scale) was also included in the questionnaires but was not included in the analysis of the results, due to missing data and respondents not completing it according to instructions [15].

For the EQ-5D-3L, wording was adapted to provide instructions suitable for retrospective assessment with ‘before your operation’ in place of ‘today’. The full instructions on QR were: ‘By placing a tick in one box in each group below, please indicate which statements best describe your own health state before your operation’. Each statement of individual items was changed to past tense (e.g. ‘I have no problems walking about’ was changed to ‘I had no problems walking about’).

Sample size

Sample size was designed to achieve the required degree of precision in the estimation of the ICC. For example, a sample of 200 patients would give a two-sided 95% confidence interval with a width of 0.14 if the ICC was 0.7 (ICC CI 0.62–0.76). Consequently, we selected a total sample of 400 (200 for each procedure), which meant that the width of the CI (0.14) was less than the width of the bands used to define categories of agreement (see below). It also provided sufficient statistical power for some sub-group analyses [16, 17].
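The precision calculation can be illustrated with a short sketch. The text does not state which method was used; the version below assumes Fisher's z-transformation for an intraclass correlation with two measurements per patient, taking the variance of z as 1/(n − 3/2). Under that assumption it gives an interval of roughly 0.62 to 0.76 for an ICC of 0.7 and 200 patients.

```python
import math
from statistics import NormalDist


def icc_confidence_interval(icc, n, level=0.95):
    """Approximate CI for an ICC via Fisher's z-transformation.

    Assumption: two measurements per patient, so Var(z) is taken as 1 / (n - 3/2).
    """
    z = math.atanh(icc)
    se = 1.0 / math.sqrt(n - 1.5)
    crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


low, high = icc_confidence_interval(0.7, 200)
width = high - low
```

Because the transformation is nonlinear, the interval is not symmetric about 0.7, and its width shrinks as the assumed ICC rises.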

Statistical analysis

Agreement between patients’ retrospective and contemporaneous PROMs scores was judged both in terms of absolute agreement and consistency. It was assumed that both time points measure the same construct and should thus be in strong absolute agreement. However, while any systematic differences in recall could reduce absolute agreement, if patients retained their Q1 and QR ranking order, then there would still be consistency in the scores. We therefore also looked at consistency which could be useful from a policy perspective as even if scores lacked absolute agreement but remained consistent, then PROMs retrospective scores would be useful in assessing provider performance. Agreement was categorised as 0–0.20 weak, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 strong and 0.81–1 very strong [18].
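The agreement bands can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, values falling between two bands (for example 0.205) are assigned by upper cut-point, and negative coefficients, which the published bands do not cover, are labelled weak here as an assumption.

```python
def agreement_band(icc: float) -> str:
    """Map a coefficient to the agreement categories used in the text."""
    if not icc <= 1.0:  # also rejects NaN
        raise ValueError("coefficient must not exceed 1")
    bands = [
        (0.20, "weak"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "strong"),
        (1.00, "very strong"),
    ]
    for upper, label in bands:
        # Assumption: anything below 0 also falls into the lowest band.
        if icc <= upper:
            return label


hip_younger = agreement_band(0.88)
hip_older = agreement_band(0.78)
```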

We calculated separate intraclass correlations for absolute agreement (ICC(A,1)) and consistency (ICC(C,1)) using the definitions given by McGraw and Wong [16], as well as Pearson’s correlation coefficient as a measure of association. The analysis was conducted using Stata version 14 [17]. The ICCs were calculated using repeated-measures analysis of variance (ANOVA), which divides the variance into three components: between-subjects (patients), within-subjects (contemporaneous versus retrospective report) and error. They are presented with their 95% confidence intervals.
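For readers unfamiliar with the two coefficients, the sketch below computes ICC(A,1) and ICC(C,1) from the mean squares of a two-way ANOVA with one observation per cell, following the standard McGraw and Wong formulas. It is an illustration in plain Python rather than the Stata routine used in the study, and the function name and toy data are hypothetical.

```python
def icc_two_way(pairs):
    """Return (ICC(A,1), ICC(C,1)) for a list of (contemporary, retrospective) pairs."""
    n, k = len(pairs), 2
    grand = sum(a + b for a, b in pairs) / (n * k)
    row_means = [(a + b) / k for a, b in pairs]            # one mean per patient
    col_means = [sum(a for a, _ in pairs) / n,
                 sum(b for _, b in pairs) / n]             # one mean per time point

    ss_total = sum((v - grand) ** 2 for pair in pairs for v in pair)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-patient mean square
    msc = ss_cols / (k - 1)                 # between-time-point mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square

    icc_c = (msr - mse) / (msr + (k - 1) * mse)
    icc_a = (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))
    return icc_a, icc_c


# Toy data: every retrospective score is exactly 2 points above the contemporary one.
shifted = [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
icc_a, icc_c = icc_two_way(shifted)
```

In this toy example the ranking of patients is perfectly preserved, so consistency is 1, while the constant shift of 2 points pulls absolute agreement down to 5/9. That is the distinction the two coefficients are meant to capture.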

To explore patterns of differences in the contemporary and retrospective score visually, we used a version of the Bland–Altman plot that accounts for trend. Individual differences in scores were plotted against the mean of the two scores, and a regression model was used to calculate the limits of agreement [19]. As neither the contemporaneous nor the retrospective method is assumed to be a gold standard, the mean of the two is the best estimate of the true health status and most appropriate for the x-axis [20].
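One way to build limits of agreement that account for trend is sketched below. It regresses the individual differences on the pair means and places the limits at 1.96 residual standard deviations either side of the fitted line. Two assumptions are made for illustration: the difference is taken as retrospective minus contemporary, and the residual spread is treated as constant across the scale. The study's exact regression model may have differed.

```python
import math
from statistics import mean


def ols(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def bland_altman_with_trend(contemporary, retrospective):
    """Return a function giving (lower, centre, upper) limits at a given mean score."""
    means = [(c + r) / 2 for c, r in zip(contemporary, retrospective)]
    diffs = [r - c for c, r in zip(contemporary, retrospective)]  # assumed direction
    intercept, slope = ols(means, diffs)
    residuals = [d - (intercept + slope * m) for d, m in zip(diffs, means)]
    # Residual SD with n - 2 degrees of freedom (assumed constant across the scale).
    sd = math.sqrt(sum(e ** 2 for e in residuals) / (len(diffs) - 2))

    def limits(mean_score):
        centre = intercept + slope * mean_score
        return centre - 1.96 * sd, centre, centre + 1.96 * sd

    return limits


# Toy data: every patient recalls a score exactly 1 point lower.
limits = bland_altman_with_trend([10, 20, 30, 40], [9, 19, 29, 39])
lower, centre, upper = limits(25)
```

A flat fitted line, as in the toy data, indicates that the size of the difference does not depend on the severity of the patient's condition.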

Finally, linear regression analysis was conducted to explore whether a patient’s retrospective PROM is able to predict their contemporary PROM, judged from differences between their predicted (based on retrospective) and contemporary PROM (mean absolute error). Scatterplots of contemporary score (y-axis) against the retrospective score (x-axis) are shown in Fig. 2, along with the mean predicted score (linear fit) and 95% confidence intervals. The wider lines show the 95% intervals around individual predictions, taking into account the residual variation in individual scores.

Fig. 1

Patterns of differences in contemporary and retrospective PROMs (OHS, OKS and EQ5D) adjusting for trend. Each dot is a patient; shaded area is 95% limits of agreement for differences

The influence on the relationship between retrospective and contemporary PROMs of two patient characteristics (age and socio-economic status) and one logistical factor (length of time between the two data collection points) was explored using linear regression analysis; ICCs were also calculated for age subgroups.


Patient characteristics

The required sample size of 400 in total was exceeded. Of the 406 hip replacement patients who had completed a Q1 and were invited to complete a QR, 244 (60%) did so. Equivalent figures for knee replacement were 276 out of 486 (57%). It was not possible to link data from the two questionnaires for some patients (20 hip; 16 knee) and the disease-specific PROM was not fully completed by some patients (20 hip; 21 knee) (Appendix). This left 204 hip and 239 knee patients for the analysis.

The sample was broadly similar to the population of patients completing pre-operative PROM questionnaires in 2009–2010, the latest year for which published data exist [21, 22]. There were some small differences (Table 1). The hip replacement sample was slightly older (mean age 69.1 vs. 67.7 years) and more likely to be female (67 vs. 61%), and to live alone (34 vs. 28%). The knee patients were also more likely to live alone (29 vs. 25%). For both operations, patients reported having more severe conditions (mean OHS 15.1 vs. 18.2; mean OKS 17.4 vs. 19.3; knee symptoms for over 5 years 55 vs. 44%). This may reflect selection bias in the sample or a change between 2009/2010 and 2016 in the severity of patients’ conditions.

Table 1 Characteristics of samples compared with population of patients (2009–2010) [21, 22]

While most patients (75%) completed their QR within 50 days of having completed the contemporary Q1, for 3% it was over 3 months (due to delays in surgery following their pre-operative assessment). The median length of time was 30 days (IQR 14–54 days).

Comparison between retrospective and contemporary PROMs

The mean difference between retrospective and contemporary scores was small for all PROMs and both operations (Table 2). The direction of the difference was consistent: patients reported slightly lower scores (worse health) in the retrospective questionnaire compared to the contemporary reports. However, none of the differences were statistically significant.

Table 2 Agreement between contemporary and retrospective PROMs

Absolute agreement and consistency were very strong for both disease-specific PROMs. Agreement on the EQ-5D-3L was also strong, although weaker than for the disease-specific PROMs. The level of agreement was consistent across the range of severity of pre-operative health (i.e. there was little systematic bias), as shown by the flat trend lines (Fig. 1). The clustering seen for the EQ-5D-3L results from there being only three possible levels of response to each item and from the way one dimension, pain/discomfort, is weighted heavily in the index score. Therefore, patients who shifted their level in the pain dimension showed a more marked change in their index score, while the average of their two scores was in the middle (see Fig. 1, EQ-5D). In contrast, there was greater concordance between retrospective and contemporary scores in patients who reported either no or extreme pain/discomfort and who did not shift their responses (with the average of their two scores remaining at one extreme or the other, i.e. the responses seen in the clusters furthest left and furthest right on the horizontal axis).

Fig. 2

Contemporary PROM by retrospective PROM linear regression with 95% intervals for individual (solid line) and group (dotted line) contemporary PROMs predictions. Dots represent actual PROM scores, and the solid line the predicted contemporary PROMs scores with 95% intervals for individual and group predictions

Prediction of contemporary using retrospective PROMs scores

Patients’ retrospective PROMs were able to predict contemporary scores for all three PROMs. The mean absolute errors for the prediction model were 3.89 (Q1 SD 8.7) and 3.86 (Q1 SD 8.2) for the Oxford Hip and Knee Scores and 0.20 and 0.21 for the generic EQ-5D scores at the individual level (Table 3). At the group level, this would translate into an even smaller error. The 95% confidence intervals for the mean predicted score (group prediction) are extremely narrow (Fig. 2).

Table 3 Retrospective scores as a predictor of contemporary PROMs

Influences on relationship between contemporary and retrospective PROMs

Agreement between the retrospective and contemporary PROM was strong or very strong across the age range, although slightly weaker with increasing age. For hip patients, the ICC declined from 0.88 for those aged 60 years or younger to 0.78 for those over 75 years (p value < 0.05). The difference for knee patients was less (ICC 0.80 vs. 0.78). There was no evidence of any systematic differences in the magnitude and the direction of recall with patients’ age as well as socio-economic status for both Oxford Hip and Oxford Knee Scores. There was some evidence of a slight systematic difference with patients’ age on EQ-5D-3L for knee patients (Table 4).

Table 4 Mean difference and adjusted mean difference between retrospective and contemporary PROMs by patients’ socio-economic status (SES) and age

The difference in mean contemporary and retrospective scores was not associated with the time interval between Q1 and QR. The difference in Oxford Knee Score decreased by 0.013 (95% CI − 0.03 to 0.007) and the knee EQ-5D-3L score decreased by 0.0003 (95% CI − 0.001 to 0.0007) per additional day. The difference for Oxford Hip Score increased by 0.006 (95% CI − 0.01 to 0.02) and the hip EQ-5D-3L score increased by 0.0001 (95% CI − 0.0009 to 0.001) per additional day.


Main findings

In representative samples of patients undergoing elective hip or knee replacement, their retrospective assessment of their pre-operative health status was similar to their contemporaneous reports. Although patients tended to recall their health as being slightly worse than reported at the time across all measures, the differences were small and none was statistically significant. This could result in a slightly higher estimation of the benefits of surgery. The level of agreement between contemporary and recalled PROM scores was very strong for the disease-specific ones, and strong for the generic PROM.

The strength of agreement was consistent regardless of the severity of a patient’s primary condition. In addition, two social characteristics of patients, their age and their socio-economic status, had little or no significant influence on the relationship between retrospective and contemporary reports. It was also apparent that mean retrospective PROMs for groups of patients could reliably predict what mean contemporary reports of PROMs would have been.

Comparison to existing studies

These results confirm the findings of the four published studies that used continuous rather than categorical data, which also found strong or very strong agreement between retrospective and contemporary PROMs [8, 23, 24, 25]. These previous studies also found that agreement for disease-specific PROMs was stronger than for generic PROMs. One explanation for this is that generic measures tend to have a more restricted range of responses, leading to greater homogeneity (smaller between-patient variability) in scores. ICCs define agreement between scores (within patients) in relative terms, so smaller population variation in scores will necessarily limit the strength of agreement.

These results suggest the main factors that may influence the differences between contemporary and retrospective reports, namely recall bias and response shift (a change in perception that can occur when circumstances change), did not have a significant influence. This may partly reflect the short time interval between measurements. Recall bias may arise when details of events go unnoticed and are not stored; new information is added to stored memories altering the details; and, over time, events are systematically distorted [26]. Such bias is influenced by the time between the event and its assessment: the longer the interval, the higher the probability of recall bias [27].

The lack of association between agreement and the length of the recall time in our results suggests that recall bias was minimal. It may be the case, as implicit theories of memory suggest, that the act of asking people to recall how they were before their surgery provided an anchor of their pre-surgical condition and hence formed the basis for stable recollection [28]. There is also a possibility that the exposure to a prior PROMs questionnaire could have aided recall. However, as an event in the patient’s life, this is likely to pale in comparison with the subsequent hospital admission and operation in terms of a ‘significant event’ in the process of aiding the anchoring and assisting recollection of the patient’s prior health.

The weaker agreement observed with the EQ-5D-3L is consistent with two previous studies that showed only moderate agreement when using PROMs with categorical data rather than continuous data [6, 7]. Lingard et al. [7] found this when items were not evenly distributed (i.e. when responses were clustered at the severe end of the scales, e.g. severe pain and limited function).

As in this study, two previous studies observed the strength of agreement was high across age groups but decreased slightly with increasing age: OHS ICC for under 65 years 0.95 versus 0.85 for those older [8]; Western Ontario & McMaster Osteoarthritis Index for knee pain under 75 years 0.57 versus 0.47 for those older [7].

Strengths and limitations

This is the second largest such study ever undertaken. Agreement was assessed with ICCs, which allowed differentiation between perfect agreement, systematic bias and random bias [29]. Bland–Altman plots [20] provided a visual display of systematic bias or differences in relation to the scales of the PROMs, adding a further layer of understanding.

The one potential limitation concerns the representativeness of the sample who participated. Although they were broadly similar to the population of patients undergoing arthroplasty in England, they may have differed as regards some other unmeasured variables. It is possible that people who agreed to participate were more consistent in their recalled reports than the general population of patients.


These findings support the use of retrospective PROMs to obtain a baseline assessment of health status when contemporary collection is not feasible, such as with emergency hospital admissions. In addition, retrospective collection offers an alternative even when contemporary collection is possible, an option that could not only facilitate higher participation rates but also lower the cost of data collection.

While this study has demonstrated the feasibility of collecting retrospective PROMs in patients who are recovering from an elective procedure (and who have already agreed to participate in a pre-operative contemporary report), research is now needed to determine the feasibility in emergency admissions. The latter have experienced an unexpected, sudden episode of illness and may still be unwell some days later. Whether collection of retrospective PROMs is feasible needs to be investigated.