Background

Total hip replacement (THR) is an effective and cost-effective procedure for people with severe hip osteoarthritis (OA) that is unresponsive to conservative therapy [1, 2]. It has become a high-volume procedure throughout Europe, and annual rates continue to rise [2, 3].

It is generally accepted that the indications for THR are pain and disability in spite of the use of non-surgical interventions such as education, drugs, walking aids and physical therapy [4]. However, it is not clear how severe the pain or disability needs to be before surgery should be undertaken, or when in the course of OA it is most appropriate to perform a THR. Economic modelling [5], some outcome studies [6] and patient views [7] suggest that it might be appropriate to perform surgery early on, before the condition gets too severe. However, doctors tend to be cautious about the use of a potentially dangerous and irreversible intervention too early, particularly because of evidence that it is not universally successful [8] and the accumulating risk with time of late prosthesis loosening necessitating further, more complex surgery. The potential implications of operating relatively early or late in the course of the disease are large, as this could have massive effects on the volume of surgery undertaken, as well as on outcomes.

There have been a number of attempts to develop indications for THR through consensus procedures [9], appropriateness criteria [10], and criteria for the prioritisation of patients waiting for a THR [11, 12]. However, the resulting publications have done little more than emphasise the importance of pain and physical disability, and they struggle to take account of other factors that affect surgical decision making, such as co-morbidities, individual values and aspirations, and psycho-social circumstances. Furthermore, they do not address the issues of relative risk of complications or of a poor outcome, or how these might be affected by variations in THR indications.

The 'EUROHIP' consortium has been addressing the current debate on the utilisation and timing of THR, by studying the indications for primary hip replacement [13, 14], and the process of hip replacement in the different centres involved [15]. In addition, the group has developed a cohort of people undergoing primary hip replacement for primary OA. The overall aims of the 'EUROHIP' cohort study are threefold: 1) to describe the amount of variation in disease severity at the time of primary THR in Europe; 2) to look for determinants of any variation; 3) to examine the effects of variations and their determinants on short-term (one year) patient centred outcomes. In this paper we report on the first two of these aims – describing the cohort and the variations in disease severity observed at the baseline assessment, and exploring some of the determinants of variations in pre-operative disease severity.

Methods

Participating centres and patients

The 'EUROHIP' consortium includes 20 orthopaedic centres in 12 different European countries. The overall design was endorsed by representatives from each centre in 2002. It was agreed that we would recruit a large multi-centre cohort of patients undergoing primary THR willing and able to complete self-administered questionnaires recording demographic variables and pre-operative levels of pain, stiffness, mobility and quality of life (using standard, validated questionnaires) as well as their expectations of the operation. One year after surgery they would be sent a similar questionnaire in the post. In addition, we agreed to obtain pre-operative radiographs and to record data about the operative procedure, including the type of prosthesis used. Inclusion criteria included a diagnosis of OA of the hip, primary hip replacement, and signed informed consent; exclusion criteria included causes of hip disease other than OA, severe mental illness or dementia, and patients unwilling or unable to take part in the study.

Data collection

Each centre obtained local ethical approval, if required. The study protocol and data collection forms were designed in Bristol, UK and Ulm, Germany by the study PIs (PAD and KD) and the study co-ordinator (SW). The patient questionnaire was piloted for acceptability in Bristol and modified accordingly before being sent to Ulm for translation and distribution. Questionnaires were sent to each centre for translation and returned for checking before printing and distribution with a set of instructions. All data forms included the birth year and initials of the patient, as well as a centre ID and an individual patient number, to ensure unique identification while maintaining patient anonymity. Full identification records of all patients were kept separately in each centre. Completed questionnaires were photocopied locally and then returned to Bristol, where the database was constructed.

Prior to surgery, patients completed a questionnaire about age, sex, home circumstances, employment, education, current medications, duration of pain in the hip to be replaced and problems in other joints. In addition, they completed the WOMAC [16] and EQ5D [17] questionnaires (the EQ5D data have not been included in any of the analyses presented below).

The Western Ontario and McMaster Universities (WOMAC) OA Index

The WOMAC index (version 3.1) was used to assess the severity of symptoms. This consists of 24 items in 3 subscales: pain (5 items), stiffness (2), and physical function (17) [16]. Component items are measured on a 5-point Likert scale with higher scores indicating greater symptom severity (0 = none, 1 = slight, 2 = moderate, 3 = severe, and 4 = extreme). Missing data were treated as follows: if ≥ 2 pain, both stiffness, or ≥ 4 physical function items were not completed, the items were regarded as invalid, and the subscale score not calculated; where 1 pain, 1 stiffness, or 1–3 physical function items were missing, the average value for the subscale was used in place of the missing item. A total score was calculated for each subscale, and a normalised score (0 indicating no symptoms and 100 indicating extreme symptoms) then calculated for each subscale, by summing up the total score of each subscale, multiplying it by 100, and dividing by the possible maximum score for the scale. A total score out of 96 was created by combining the 3 subscales. This was then converted into a normalised score out of 100, as described in the WOMAC user's handbook.
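As an illustration only, the scoring and missing-data rules described above can be sketched in Python. This is a minimal sketch, not the code used in the study: the function names (`subscale_raw`, `normalise`, `womac_scores`) are hypothetical, missing items are represented by `None`, and the decision to leave the total score uncalculated whenever any subscale is invalid is an assumption, since the text does not state how such cases were handled.

```python
from typing import Dict, Optional, Sequence

# (number of items, maximum number of missing items tolerated) per subscale,
# following the rules in the paragraph above.
SUBSCALES = {"pain": (5, 1), "stiffness": (2, 1), "function": (17, 3)}
MAX_ITEM_SCORE = 4  # Likert items run from 0 (none) to 4 (extreme)


def subscale_raw(items: Sequence[Optional[int]], name: str) -> Optional[float]:
    """Raw subscale total after mean imputation, or None if the subscale is invalid."""
    n_items, max_missing = SUBSCALES[name]
    if len(items) != n_items:
        raise ValueError(f"{name} expects {n_items} items, got {len(items)}")
    answered = [x for x in items if x is not None]
    n_missing = n_items - len(answered)
    if n_missing > max_missing:
        return None  # too many unanswered items: subscale score not calculated
    mean = sum(answered) / len(answered)
    # Each missing item is replaced by the mean of the answered items.
    return sum(answered) + mean * n_missing


def normalise(raw: Optional[float], n_items: int) -> Optional[float]:
    """Rescale a raw total to 0 (no symptoms) .. 100 (extreme symptoms)."""
    if raw is None:
        return None
    return raw * 100 / (n_items * MAX_ITEM_SCORE)


def womac_scores(pain, stiffness, function) -> Dict[str, Optional[float]]:
    """Normalised pain, stiffness, function and total scores for one patient."""
    answers = {"pain": pain, "stiffness": stiffness, "function": function}
    raw = {name: subscale_raw(items, name) for name, items in answers.items()}
    scores = {name: normalise(raw[name], SUBSCALES[name][0]) for name in raw}
    if any(value is None for value in raw.values()):
        scores["total"] = None  # assumption: no total without all three subscales
    else:
        # 24 items x 4 points = 96 possible points in total.
        scores["total"] = sum(raw.values()) * 100 / 96
    return scores
```

For example, a patient answering "moderate" (2) to all five pain items would receive a normalised pain score of 50 under this sketch.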

The surgical teams were asked to complete a form recording the patient's height and weight (from which BMI was calculated), side of surgery, duration of arthritis, date wait-listed and date of surgery. The form also asked for prosthesis type and ASA status, a standard measure of fitness for surgery, scored in this study from 1 (normal, healthy) to 4 (life-threatening systemic disease) [18].

The Kellgren and Lawrence (K&L) Radiographic scores

The K&L score was used to assess structural disease severity. A pre-operative anteroposterior (AP) radiograph of the pelvis was obtained from all patients within 6 months of surgery. In order to standardise readings, all films were examined by one of three observers within the co-ordinating centre (Bristol) who undertook training sessions together. These three observers all read 20 randomly selected films on two occasions, in a random order, to test their inter- and intra-rater reliability using kappa scores. They collected data on the hip to be operated on, including the standard K&L grade (0–4) [19], the intra-articular pattern of disease distribution (supero-lateral, supero-medial, medial or concentric), and whether the hip disease appeared hypertrophic (excessive osteophytes and new bone formation) or atrophic (extensive bone loss).
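For readers unfamiliar with the kappa statistic, the following is a minimal sketch of how agreement between two readings of the same films can be computed. It assumes the unweighted Cohen's kappa for a pair of raters; the text does not say whether a weighted or unweighted form was used, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters grading the same films."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must grade the same non-empty set of films")
    n = len(rater_a)
    # Proportion of films on which the two raters gave the same grade.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal grade frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[grade] * freq_b[grade] for grade in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return (observed - expected) / (1 - expected)
```

The same function applies to intra-rater reliability if the two sequences hold one observer's first and second readings of the same films.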

As most of the radiographs showed advanced OA, we divided K&L grades 3 and 4 further by adding data from the individual scores of joint space narrowing and bone attrition (assessed using the OARSI atlas [20]). A K&L grade 3 radiograph with mild joint space narrowing (grade 1 on the OARSI atlas) was graded 3a, and one with more severe joint space narrowing (OARSI atlas grade 2) was graded 3b. Similarly, a K&L grade 4 radiograph (which has complete loss of joint space, graded 3 on the OARSI atlas) was graded 4a if no bone attrition was seen in either the femur or the acetabulum, and 4b if bone attrition was noted in any part of the joint.
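The sub-grading rule just described can be written as a small lookup. This is an illustrative sketch only: the function name `kl_subgrade` and its parameter names are hypothetical, and raising an error for combinations the text does not cover (for example, a grade 3 film without an OARSI narrowing grade of 1 or 2) is an assumption.

```python
from typing import Optional


def kl_subgrade(kl_grade: int,
                jsn_grade: Optional[int] = None,
                bone_attrition: Optional[bool] = None) -> str:
    """Map a K&L grade plus OARSI atlas features to the modified grade."""
    if kl_grade in (0, 1, 2):
        return str(kl_grade)  # grades below 3 are not subdivided
    if kl_grade == 3:
        if jsn_grade == 1:
            return "3a"  # mild joint space narrowing
        if jsn_grade == 2:
            return "3b"  # more severe joint space narrowing
        raise ValueError("grade 3 needs an OARSI joint space narrowing grade of 1 or 2")
    if kl_grade == 4:
        if bone_attrition is None:
            raise ValueError("grade 4 needs a bone attrition finding")
        # Complete loss of joint space; split by presence of any bone attrition.
        return "4b" if bone_attrition else "4a"
    raise ValueError("K&L grade must be between 0 and 4")
```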

Database management and statistical analysis

Each centre was given 18 months to collect data. A 'Microsoft Access' database was set up in Bristol where all patient data was entered and checked by trained staff. In 2006, when the study had almost been completed and the database was about to be closed, the analysis plan was agreed and some remaining missing data obtained from participants. The database was then closed and a professional database manager carried out routine data cleaning.

Stata 9.2 was used for all statistical analyses (Stata Corp., College Station, TX). The main outcomes in the analysis were the three WOMAC subscale scores (pain, stiffness, function) and the combined total WOMAC score. Exposure variables considered in analyses were: Age (< 50, 50–69, 70+), Gender, Obesity (not obese [BMI < 30], obese [BMI 30–39], morbidly obese [BMI > 40]), Employment status (employed, retired, retired early, other), Education after leaving school (none, diploma or equivalent, degree, postgraduate degree), ASA status (1, 2, 3, 4), and Kellgren & Lawrence grade (0, 1, 2, 3, 4) of the hip operated on.
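The age and obesity groupings listed above can be sketched as follows. This is a hedged illustration rather than the study's own code (the analyses were run in Stata). The function names are hypothetical, and because the stated bands leave BMI values between 39 and 40, and exactly 40, unassigned, the sketch assumes that 30 ≤ BMI < 40 is "obese" and BMI ≥ 40 is "morbidly obese".

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2, from the recorded weight and height."""
    return weight_kg / height_m ** 2


def age_band(age_years: float) -> str:
    """Age groups used as an exposure: < 50, 50-69, 70+."""
    if age_years < 50:
        return "<50"
    if age_years < 70:
        return "50-69"
    return "70+"


def obesity_band(bmi_value: float) -> str:
    """Obesity groups; the handling of values from 39 up to 40 is an assumption."""
    if bmi_value < 30:
        return "not obese"
    if bmi_value < 40:
        return "obese"
    return "morbidly obese"
```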

Univariate linear regression analysis was performed to explore the association between each outcome and each exposure, and a multivariate regression analysis was then carried out to control for confounders. The distribution of WOMAC scores was assessed to examine the assumption of normality. Wald tests were used to explore linear trends, by fitting models with the variable as a score. To assess for non-linear trends, likelihood ratio tests were used, comparing a model with a categorical variable to that with the variable as a score. Effect modification was considered using likelihood ratio tests to examine for interaction between age, sex and obesity.
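The test for departure from a linear trend can be illustrated with a simplified sketch. It covers only a single ordered exposure with no other covariates, and it is not the Stata procedure used in the study; the function names are hypothetical. For Gaussian linear models the likelihood ratio statistic reduces to n·ln(RSS_score / RSS_categorical), with degrees of freedom equal to the number of categories minus 2, and it is referred to a chi-squared distribution.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def rss_categorical(scores: Sequence[float], y: Sequence[float]) -> float:
    """Residual sum of squares when each category has its own mean."""
    groups = defaultdict(list)
    for s, value in zip(scores, y):
        groups[s].append(value)
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total


def rss_linear(scores: Sequence[float], y: Sequence[float]) -> float:
    """Residual sum of squares when the category enters as a numeric score."""
    n = len(y)
    mean_x, mean_y = sum(scores) / n, sum(y) / n
    sxx = sum((s - mean_x) ** 2 for s in scores)
    sxy = sum((s - mean_x) * (v - mean_y) for s, v in zip(scores, y))
    slope = sxy / sxx
    return sum((v - mean_y - slope * (s - mean_x)) ** 2 for s, v in zip(scores, y))


def lr_test_nonlinearity(scores: Sequence[float], y: Sequence[float]) -> Tuple[float, int]:
    """Likelihood ratio statistic and degrees of freedom, score vs categorical model."""
    rss_cat = rss_categorical(scores, y)
    if rss_cat == 0:
        raise ValueError("categorical model fits perfectly; statistic is undefined")
    statistic = len(y) * math.log(rss_linear(scores, y) / rss_cat)
    degrees_of_freedom = len(set(scores)) - 2
    return statistic, degrees_of_freedom
```

A large statistic relative to the chi-squared distribution on the returned degrees of freedom would suggest that the category means do not follow a straight line.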

Results

1. The cohort and demography of patients included in the analysis

A total of 1520 patients were entered from the 20 centres, an average of 76 patients/centre (range 41–167) (Table 1).

Table 1 EUROHIP Participating orthopaedic centres of excellence, and the number of patients entered into the cohort study reported here.

Scrutiny of the data showed that 193 cases needed to be omitted because of protocol violations. The most common violation was the collection of some of the 'baseline' patient-related information post-operatively (182 cases); in a further 11 cases it was unclear which hip had been operated on. At a EUROHIP group meeting in 2006 it was agreed that these cases should be omitted from the analysis. The demographic data analysed are therefore from a total of 1327 cases (87%). Unless otherwise stated, all data presented here refer to these 1327 cases.
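The cohort numbers reported above can be checked with a few lines of arithmetic. The variable names are illustrative; all figures are taken from the text.

```python
centres = 20
entered = 1520                    # patients entered from all centres
baseline_collected_postop = 182   # 'baseline' data collected after surgery
operated_side_unclear = 11        # unclear which hip was operated on

mean_per_centre = entered / centres                          # 76.0
excluded = baseline_collected_postop + operated_side_unclear # 193
analysed = entered - excluded                                # 1327
percent_analysed = round(100 * analysed / entered)           # 87
```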

In order to explore the data for differences within Europe, the participating centres were grouped into 5 regions (Table 1), providing sufficient numbers in each group to undertake statistical analyses.

While the age range of the included patients was wide, the majority were in their 6th or 7th decade (Table 2). There were more women than men, and on average the group was overweight. Only 25% were still employed at the time of surgery; the majority had retired because of age, but 8% reported that they had retired early because of their hip problems. The patients were generally fairly fit, the ASA status being recorded as grade 1 or 2 in 79%; only 1% of patients were ASA status 4.

Table 2 Demographic data

2. Joint disease and joint replacement

As shown in Table 3, the duration of hip pain was generally recorded as 1–5 years; only 30% reported that they had had their hip problem for 6 years or more. Unilateral hip disease was reported by 32%, a further 13% reported bilateral hip disease without involvement of other joints, but the majority (54.6%) had involvement of other joints, and many had undergone previous surgery on other joints (mostly the other hip or the knees). The right hip was operated on more often than the left, and the cemented type of prosthesis was used most commonly.

Table 3 Status of hip and other joint disease, and type of prosthesis used

3. Radiographic findings

The inter-rater reliability scores (kappa statistic) for the K&L grades ranged from 0.43 to 0.68, indicating moderate to substantial agreement between observers. Intra-rater kappa scores ranged from 0.63 to 0.89. Of the 1327 people included in the analyses, radiographs of the operated hip were available for reading in only 1107 cases (610 right side and 497 left), and the readers felt able to assign a reliable K&L score to only 1051 films, because of technical problems or the severity of pathological changes.

Results for the conventional K&L scoring were grade 0 or 1 – 12 cases (1%), grade 2 – 32 (2.4%), grade 3 – 510 (38.4%), grade 4 – 497 (37.5%), (missing data in 276 cases). We differentiated those with K&L grade 3 or 4 further, as described in the methods section. Of the 510 with grade 3 OA, only 69 had mild joint space narrowing, the remainder having extensive loss of joint space. Of the 497 with grade 4 OA (complete loss of joint space), 303 also had evidence of bone attrition in either the acetabulum or femoral head (Figure 1).
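As a consistency check, the K&L counts above can be tallied against the 1327 analysed cases. The sketch below uses illustrative variable names; the counts for grades 3b and 4a are derived by subtraction and are not stated explicitly in the text. The percentages are taken over all 1327 cases, which is the denominator that reproduces the reported figures.

```python
analysed = 1327
kl_counts = {"0-1": 12, "2": 32, "3": 510, "4": 497}

scored = sum(kl_counts.values())   # films with a reliable K&L score
missing = analysed - scored        # cases without a K&L score
percent = {grade: round(100 * n / analysed, 1) for grade, n in kl_counts.items()}

grade_3a = 69                          # grade 3 with mild joint space narrowing
grade_3b = kl_counts["3"] - grade_3a   # derived: grade 3 with extensive narrowing
grade_4b = 303                         # grade 4 with bone attrition
grade_4a = kl_counts["4"] - grade_4b   # derived: grade 4 without bone attrition
```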

Figure 1 K&L radiographic scores with the split into 3a and 3b, and 4a and 4b, as described.

The most commonly recorded intra-articular pattern of distribution of the OA radiographic changes was superolateral (44.3%); the other patterns, in order of frequency, were superomedial (23.5%), medial (19%) and concentric (9.1%). In total, 62 hips (5.9%) were thought to show hypertrophic changes and 102 (9.8%) atrophic features.

4. WOMAC scores

Total WOMAC scores were available in 94% of the 1327 patients included in this analysis and followed a normal distribution (Table 4, Figure 2). While the majority of the patients coming to hip replacement had WOMAC scores of 40 or more, a number of patients had relatively low total scores, indicating quite mild pain and disability. Overall pain scores were lower than the stiffness or function domain scores.

Table 4 Summary of WOMAC scores

Figure 2 Histogram of distribution of total WOMAC score.

5. Associations between demographic features and severity of joint disease

We looked for associations between the total WOMAC scores and age, sex, handedness, BMI, occupational and educational status, ASA status and the K&L score (Table 5). We carried out the same analysis for the WOMAC pain score domain alone, and obtained very similar results (data not shown). It is apparent that higher scores (worse disease) were present in older subjects, women, those with obesity, those with higher ASA status, those who had retired early, and most strikingly, those with no educational qualifications after leaving school. Radiographic scores showed no correlation with WOMAC scores.

Table 5 Results of linear regression analysis for Total WOMAC score

We then looked specifically at those with low pain scores (20 or less) in relation to their age and K&L radiographic status (Table 6).

Table 6 Number of people with low pain scores by age and K&L grade

Finally, we examined the WOMAC total score data to see if there were any obvious differences in scores in the 5 different European regions (Table 7). There were wide variations in each European region, with trends towards the highest scores in the Eastern Europe grouping (Hungary and Poland) and the lowest scores in Austria/Switzerland and the UK. However, there was no evidence to suggest that patients from any one particular region or centre were very different from those of the whole group presented above (data not presented).

Table 7 Summary of total WOMAC scores by region (not adjusted for potential confounders)

Discussion

This large cohort study of patients with hip joint OA coming to primary total hip joint replacement surgery in European orthopaedic centres shows that disease severity varies greatly at the time of surgery, indicating that there is no consensus on how 'bad' the patient's disease should be to warrant surgery. The variation was much wider for symptoms than for radiographic changes, and one of the striking findings from the cohort is the absence of any relationship between the two. Other key findings are that the severity of symptoms at the time of surgery is associated with differences in age, sex, weight, general health, employment status and educational status. These variations appear to be present throughout Europe, and not dependent on individual centres.

The demography of our cohort was as expected. The average age of 69, predominance of retired people, greater numbers of women than men, and slightly raised mean BMI are all in accord with other groups of patients coming to hip replacement [21, 22]. Similarly, the wide age range operated on is in keeping with other data suggesting a trend towards more older people undergoing surgery [3]. The majority of our cohort had disease of other joints (predominantly the other hip and the knees) but were otherwise relatively fit: very few had an ASA score of 3 or 4. The location of the hip OA is also in accord with previous descriptions, with a predominance of supero-lateral or supero-medial disease [23]. One feature of the cohort that did surprise us was the relatively short history of pain in the joint to be operated on: 70% of those for whom we have this information reported 5 years or less of joint pain, and only 12% said that they had suffered for 12 years or more. This suggests that the majority of cases coming to surgery progress fairly rapidly.

We used standard, validated instruments to assess disease severity. The WOMAC is one of the most commonly used disease-specific measures, which assesses pain, stiffness and function [16]. One of its problems is the fact that the disability domain, which dominates the total score, cannot differentiate between disability due to a single joint, or that caused by disease in other joints (common in our cohort) or co-morbidities. It has been suggested that people should have a total WOMAC score of 39 or more to be considered for joint replacement [24], but 155 patients (12.5% of our cohort) had a WOMAC score of 40 or less, and 16 patients (1.3%) had total WOMAC scores of 20 or less, which would generally be considered to be indicative of very mild disease. The variation in WOMAC scores shown in Figure 2 is striking. The K&L score is the oldest and most widely used index of radiographic severity of OA [19]. One of its many problems is lack of discrimination and a 'ceiling effect', as established OA can only be scored grade 3 or 4 [25]. We attempted to get round this problem by modifying it so that we have 4 grades of severity of established disease – 3a, 3b, 4a, and 4b. In contrast to the wide variation in the clinical severity of the disease in our cohort, it is clear that the vast majority had severe radiographic changes. This raises concerns that the radiographic findings may be having an undue influence on surgical decision-making. It is not clear if radiographic severity has an impact on the postoperative outcome, and if so, which radiographic features are most predictive, there being some conflicting findings amongst recent publications on this topic [26–28]. Follow-up of our cohort should contribute to this debate. In this context it is again interesting to note the lack of any significant relationship between the clinical severity on the WOMAC score and the radiographic severity of the disease.
It is well known from epidemiological studies that while radiographic OA is a clear risk factor for symptoms, the correlation between x-ray changes and symptoms is weak, and that some people with severe x-ray changes of OA remain asymptomatic [29, 30]. However, the relationship between x-ray changes and symptoms in people with severe arthritis coming to surgery has not been studied extensively, and doctors find it hard to believe that there is not some relationship when the disease is severe. We believe that our finding of no association between x-rays and symptoms in this patient group reinforces the need to assess patients on the basis of the impact of the condition on their lives, and not on x-ray severity, when considering them for surgery.

Some of our data on the determinants of the variation in the severity of pain and disability at the time of surgery are worrying. Sex differences like those we observed have been found in several other studies of surgical interventions (women always having more severe disease at the time of surgery than men), and remain largely unexplained [24]. The fact that those who were less fit (higher ASA status) or more obese had more severe symptoms than average before they were operated on is also neither new nor surprising. However, the strong associations with both employment status and educational status are new findings. There are several possible explanations for an association between socio-economic status and surgery, including access and willingness to undergo surgery, as well as bias amongst health care professionals. We believe this inequality needs further investigation.

Our initial exploration of the data to find out if the variations in severity observed, and their determinants, are dominated by events in a single region of Europe, or single centres (data not shown) suggests that these findings are fairly consistent across all countries and centres studied. However, further examination of the data centre by centre is being undertaken. The multi-centre, multi-country nature of the cohort is both a strength and a weakness. The strength is that it will allow us to examine differences between parts of Europe; the weakness is that the huge variations in the health care systems between, for example, northern and eastern Europe, introduce another source of variation that has nothing to do with the patients' need for surgery.

This cohort study has other strengths and limitations. Its strengths include the relatively large number of patients involved, use of validated outcome measures, and relative paucity of missing data. Limitations include the fact that we do not know how representative the cohort is, as it came from self-selected centres rather than a random selection of orthopaedic centres, and the fact that we do not know how many patients were excluded from the study in each centre, or the reasons for exclusion. In addition, we do not know which potential patients might have been triaged out of the system before being able to see a surgeon, or how many were not put on the waiting list for surgery. In other words, we do not know whether the patients in our cohort are truly representative of those in need of a THR in the community. It is known that the willingness of patients to undergo surgery determines the utilisation of THR [31], and it is likely that this varies in different European countries.

There are no clear indications for THR. Consensus statements simply emphasise pain due to hip disease and functional impairment, in spite of adequate non-surgical treatment [4, 9]. However, clinical practitioners are aware of various other reasons for undertaking or delaying a hip replacement, including age and weight, the social role of the patient (as a carer for example), their psychological status, and the presence or absence of any co-morbidities that might affect surgery or its outcome. In addition, surgeons are aware of the importance of the motivation and expectations of their patients when they undertake surgery. There is an increasing trend for those who pay for surgical interventions, such as total hip replacement, to want to use simple scoring systems in order to 'triage' patients as suitable or unsuitable for surgery. But simple scoring systems, such as the total WOMAC, cannot account for subtleties of the sort noted, and these are crucial to appropriate decision making. It has recently been suggested that a different approach, using the concept of the 'capacity to benefit' from a total joint replacement might be used instead of simple scoring systems [7]. This approach provides surgeons and patients with information on the likely risk of adverse events, as well as the likely degree of benefit, from which to make the judgment on suitability of surgery. Our data show clearly that if a score of severity of pain or function were used to assess suitability for surgery, many patients who are currently being operated on might not be allowed an operation. One of the aims of the 'EUROHIP' consortium is to explore this conundrum further, with more exploration of these and other data to help us understand when it is most appropriate to undertake a hip replacement.

Conclusion

This large prospective cohort study of patients coming to primary total hip replacement for primary osteoarthritis of the hip has shown that the severity of the disease varies widely at the time of surgery. The pre-operative scores on the WOMAC instrument, a widely used measure of osteoarthritis severity that assesses pain, stiffness and function, show that many of the patients appeared to have relatively mild disease, whereas others are very severely affected. In contrast, the radiographic findings showed that almost all patients coming to surgery had severe structural changes in the affected hip. There was no correlation between clinical severity and radiographic severity.

We interpret these findings as follows. First, we believe that the data indicate that surgeons require significant structural changes to be present on the radiographs of their patients before they are willing to suggest that hip replacement is advisable. Second, these data show that simple scoring systems of pain and disability, such as the WOMAC, should not be used to define thresholds for surgical intervention. The lack of correlation between radiographic severity and clinical severity on the WOMAC suggests that the decision needs to be based on patient-related variables rather than the x-ray, and those variables should include the degree of pain and disability. However, a large number of other aspects of the patients' lives need to be considered before surgery is undertaken, including their psychological status, their motivation and expectations, roles in society and social circumstances. We believe that the patients enrolled into this cohort study were being operated on for these complex reasons, and that this explains the huge variation seen in pre-operative WOMAC scores.