Introduction

Osteoarthritis (OA) is the most common bone and joint disease and is characterized by progressive cartilage degeneration. OA is a polygenic disease caused by genetic and environmental factors [1]. Epidemiologic studies have suggested that genetic factors strongly affect the onset and development of OA [2]. Genetic association studies are now uncovering the genetic factors responsible for OA, that is, its susceptibility genes. Candidate-gene approaches have identified several genes associated with OA, and genome-wide association studies have recently found several promising OA-susceptibility genes [1, 3].

Identification of OA-susceptibility genes would enable us to predict disease phenotypes based on genotype information. Such predictions would be a basis for personalized medicine and disease prevention. However, most previous OA genetic studies have evaluated only the effects of single susceptibility genes; these effects are mostly small, and hence predictions based on them are not always useful. An analysis of the combinatorial effects of the calmodulin 1 gene (CALM1) and other genes on hip OA susceptibility reported a combinatorial association between CALM1 and ASPN susceptibility alleles [4]; however, the association was not conclusive because of the small sample size of the study. Because OA is a polygenic disease, it is important to evaluate the combined effects of susceptibility genes, as well as those of susceptibility genes together with environmental/clinical factors.

To obtain a more comprehensive view of knee OA susceptibility and to build a prediction model for OA from genotype information, we investigated the combined effects of known susceptibility genes for knee OA in Japanese persons. We examined gene-gene interactions of the previously reported OA genes and estimated the population attributable risk (PAR) of the genes by using data from a case-control association study [5]. We constructed prediction models for OA by using genetic data only and by using a combination of genetic and clinical data, and evaluated their predictability statistically. The model using only genotype information had less than fair predictability, but integrating clinical data, with adjustment of those data, markedly improved the predictive power.

Materials and methods

Subjects

A case-control cohort (692 cases and 748 controls) was recruited from patients of several medical institutes in Japan. A population-based cohort (241 cases and 477 controls) was recruited from inhabitants of Odai and Minami-ise town in the Mie prefecture in Japan. We obtained written informed consent from each subject, and the study was approved by the ethics committee of the Center for Genomic Medicine in RIKEN. In total, 933 knee OA cases (81.1% female; mean age ± standard deviation (SD) = 71.8 ± 7.7 years) and 1,225 control subjects (75.4% female; mean age ± SD = 69.3 ± 9.2 years) were included. OA was diagnosed based on clinical and radiographic findings by using previously described criteria [6]. All OA patients were older than 40 years.

Genotyping

DNA was extracted from peripheral blood by using standard methods. ASPN D-repeat polymorphism [7], rs143383 in GDF5 [8], and rs7639618 in DVWA [6] were genotyped as described previously.

Statistical analysis

We used the software R [9] for all statistical analyses. To obtain PAR for multiple risk factors among case-control study data, we used an OR assessed by the logistic regression model adjusted for gender and age [10].

Prediction model

We used logistic regression to construct the prediction models. This method assumes a multiplicative allelic effect (each allele independently increases the odds of the disease). We used the Cochran-Armitage trend test to examine whether the odds increased with the number of risk alleles. The variables in the logistic regression model with clinical data were those that remained after stepwise selection based on the likelihood ratio and the Wald statistics [11, 12]. We used the receiver operating characteristic (ROC) curve to evaluate the performance of the prediction models. The area under the curve (AUC), which measures the power to distinguish case from control individuals, was calculated by using the ROCR package [13]. A perfect model would have an AUC of 1, whereas a model with no discriminative power would have an AUC of 0.5 [14]. The simulation study for the adjusted control group was performed 1,000 times.
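The AUC has a direct probabilistic reading: it is the chance that a randomly chosen case receives a higher predicted score than a randomly chosen control. A minimal Python sketch of that definition follows. The analyses in this study were run in R with the ROCR package, so the function name `auc` and the toy scores are illustrative assumptions, not the code used in the study.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve from its rank definition.

    Counts the fraction of (case, control) pairs in which the case has
    the higher score; tied pairs count one half.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted probabilities, for illustration only.
perfect = auc([0.9, 0.8], [0.2, 0.1])        # every case outranks every control
uninformative = auc([0.5, 0.5], [0.5, 0.5])  # all pairs are tied
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch; a rank-based implementation would be preferable for large data sets.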

Estimation of contribution of the genetic factor

We considered a statistical model for prediction as follows:

$$\text{Phenotype}\ (P) = \text{Genotype}\ (G) + \text{Clinical factor}\ (E).$$

By taking variances on the liability scale for both sides of the equation, this model becomes:

$$\mathrm{Var}(P) = \mathrm{Var}(G) + \mathrm{Var}(E) + 2\,\mathrm{Cov}(G, E).$$

As we found no evidence of gene-environment interactions, we could assume that Cov(G, E) = 0. We also estimated the contribution of genetic factors in our models by calculating the average proportion of variation, Var(G)/Var(P), over 1,000 simulation studies.
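As a rough illustration, the genetic share of the variance, Var(G)/(Var(G) + Var(E)), can be computed from per-subject contributions to the linear predictor. The Python sketch below does this under two assumptions that go beyond the text: the linear predictor of the logistic model is treated as the liability scale, and the names `genetic_part` and `clinical_part` are hypothetical. The covariance term is dropped, as in the text.

```python
from statistics import pvariance


def genetic_proportion(genetic_part, clinical_part):
    """Share of liability variance attributed to the genetic term.

    genetic_part:  per-subject genetic contribution (e.g. coefficient
                   times number of risk alleles).
    clinical_part: per-subject clinical contribution (e.g. weighted sum
                   of gender, age, and BMI).
    Assumes Cov(G, E) = 0, so Var(P) = Var(G) + Var(E).
    """
    var_g = pvariance(genetic_part)
    var_e = pvariance(clinical_part)
    return var_g / (var_g + var_e)


# Toy values, not study data: Var(G) = 2/3 and Var(E) = 8/3.
share = genetic_proportion([0.0, 1.0, 2.0], [0.0, 2.0, 4.0])
```

In a simulation setting, this ratio would be computed for each resampled data set and then averaged.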

Results

Selection of susceptibility genes

We selected three susceptibility genes, ASPN [7], GDF5 [8], and DVWA [6], for study because they had clear associations in Japanese subjects and because the associations were replicated in other ethnic populations and supported by functional evidence for OA causality [1, 3]. We obtained clinical data (Table 1) and genotyping data (Table 2) for 933 knee OA cases and 1,225 controls and examined whether the previously reported risk alleles were associated with knee OA in this population after genotyping. We observed significant associations in GDF5 and DVWA and marginal associations in ASPN (Table 3).

Table 1 Clinical data of subjects
Table 2 Genotyping data of subjects
Table 3 Association between three genes and knee OA

Association between number of risk alleles and knee OA

We tested the independence between risk alleles for all pair-wise combinations by applying a χ² test of association to 3 × 3 tables, using both case-only and case-control mixed designs. We did not find evidence of dependence. We next examined gene-gene interactions by using the logistic regression model. For each pair of genes, we compared two logistic regression models, with and without the gene-gene interaction term, by analysis of variance. After Bonferroni correction (P = 0.05/3 for three pair-wise combinations among three genes), we did not find any evidence of interaction (Table 4).
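The pair-wise independence test can be sketched in Python as a Pearson χ² statistic on a 3 × 3 genotype table, with the P value compared against the Bonferroni threshold of 0.05/3. The table below is made up for illustration (it is not taken from Table 4), the function names are hypothetical, and the closed-form tail probability holds only for even degrees of freedom, which covers the 4 degrees of freedom of a 3 × 3 table.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table.

    Assumes every row and column total is positive, so that all
    expected counts are non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_upper_tail(stat, df):
    """Upper-tail probability of a chi-square variable with even df."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = stat / 2.0
    term, series = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        series += term
    return math.exp(-half) * series


# Hypothetical genotype-by-genotype counts for one pair of genes.
toy_table = [[30, 45, 20], [50, 90, 40], [25, 40, 22]]
stat, df = chi_square_independence(toy_table)
p_value = chi_square_upper_tail(stat, df)
bonferroni_threshold = 0.05 / 3  # three pair-wise combinations
dependent = p_value < bonferroni_threshold
```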

Table 4 Test of independence and gene-gene interactions

We then studied the association between knee OA and the number of risk alleles carried by each subject (Table 5). The distributions of the subjects over the number of risk alleles differed between the case and control groups. The odds ratio (OR) adjusted for age, gender, and body mass index (BMI) increased with the number of risk alleles (Figure 1) (P = 4.15e-6). Because only 0.7% of subjects in the control group carried zero risk alleles, we used subjects with either zero or one risk allele as the reference (lowest-risk) group. The proportion of this group was 10.6%. Similarly, we combined subjects with five or six risk alleles to form the highest-risk group. This group had an OR of 2.67 (95% CI, 1.46 to 4.87; P = 0.0020), as compared with the reference group.
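A trend in the case proportion across allele counts can be checked with the Cochran-Armitage test named in the Methods. The Python sketch below uses the standard unadjusted form of the statistic with equally spaced scores; it does not reproduce the covariate-adjusted analysis reported here, and the counts are invented for illustration rather than taken from Table 5.

```python
import math


def cochran_armitage(cases, controls, scores=None):
    """Cochran-Armitage trend test for ordered categories.

    cases, controls: counts per category (e.g. per number of risk alleles).
    scores: category weights; defaults to 0, 1, 2, ...
    Returns the z statistic and a two-sided P value from the normal
    approximation.
    """
    if scores is None:
        scores = list(range(len(cases)))
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    p = sum(cases) / n  # overall proportion of cases
    t = sum(w * (r - m * p) for w, r, m in zip(scores, cases, totals))
    var = p * (1.0 - p) * (
        sum(w * w * m for w, m in zip(scores, totals))
        - sum(w * m for w, m in zip(scores, totals)) ** 2 / n
    )
    z = t / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts in three ordered risk groups (not study data).
z, p_value = cochran_armitage(cases=[10, 20, 40], controls=[40, 20, 10])
```

A positive z indicates that the proportion of cases rises with the category score.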

Table 5 Number of risk alleles possessed by subjects
Figure 1

Odds ratios for subjects with different numbers of risk alleles for knee OA. Effects of the number of ASPN, GDF5, and DVWA susceptibility risk alleles on knee osteoarthritis (OA) risk. Odds ratios (ORs) and their 95% confidence intervals according to the number of risk alleles were calculated by using a logistic regression model adjusted for gender, age, and body mass index. The reference group comprised subjects with zero or one risk allele. Each additional risk allele increased the OR by a factor of 1.22 (95% CI, 1.11 to 1.34), which indicated cumulative effects of the three alleles on knee OA.

Estimation of PAR

To investigate the contribution of genetic factors to the onset of OA, we estimated the PAR statistic [5]. PAR is the proportion of cases in the population that would be avoided if certain risk factors (here, risk alleles) were removed from the population. Hence, PAR provides a measure of how much a given factor contributes to the disease. We compared the reference group with the other groups and obtained an estimated PAR of 31.4% for the risk alleles of the three susceptibility genes. We also considered PAR for BMI because the Framingham study indicated that a higher BMI may increase the OA risk [2]. We classified the BMIs of all subjects into three categories (normal/underweight, <25; overweight, 25 to <30; and obese, ≥30) [15]. Setting normal/underweight as the reference group, we estimated the PAR of BMI as 28.4%.
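One widely used way to estimate PAR from case-control data with adjusted ORs is PAR = 1 − Σ_j (ρ_j / OR_j), where ρ_j is the fraction of cases in exposure level j and OR_j is the adjusted OR of that level relative to the reference (OR = 1 for the reference level). The exact estimator used in this study is not spelled out in the text, so the Python sketch below is an assumption about the general approach, and the input values are hypothetical rather than taken from the study tables.

```python
def attributable_risk(case_fractions, odds_ratios):
    """Population attributable risk from case fractions and adjusted ORs.

    case_fractions: share of all cases falling in each exposure level,
                    reference level first; must sum to 1.
    odds_ratios:    adjusted OR of each level versus the reference
                    (1.0 for the reference level itself).
    """
    if abs(sum(case_fractions) - 1.0) > 1e-9:
        raise ValueError("case fractions must sum to 1")
    return 1.0 - sum(f / r for f, r in zip(case_fractions, odds_ratios))


# Hypothetical example: half of the cases are unexposed, half carry an
# exposure with an adjusted OR of 2.
par = attributable_risk([0.5, 0.5], [1.0, 2.0])
```

With more than two exposure levels, such as groups defined by the number of risk alleles or by BMI category, each level contributes its own term to the sum.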

Application for prediction of OA susceptibility

We generated OA-prediction models by using genetic or clinical data or both. We first examined the interactions between genetic and clinical factors and found no evidence of interactions, as was the case for gene-gene interactions. We next considered two logistic regression models: the first (MODEL I) was constructed with only the number of risk alleles for the three susceptibility genes; the second (MODEL II) also incorporated the clinical information of individuals, including gender, age, and BMI. In MODEL I and MODEL II, each additional risk allele increased the odds of knee OA by a factor of 1.23 (95% CI, 1.12 to 1.34) and 1.22 (95% CI, 1.11 to 1.34), respectively. For MODEL II, the ORs for gender, age, and BMI were 1.35 for female subjects, 1.05 per year of age, and 1.18 per unit increase in BMI, respectively.

By a logit transformation, the probability induced by MODEL I is expressed as

$$P = \frac{\exp(-0.856 + 0.205 \times n)}{1 + \exp(-0.856 + 0.205 \times n)}$$

where n is the number of risk alleles.

In MODEL II, the probability is expressed with clinical data of the subjects as

$$P = \frac{\exp(-8.395 + 0.200 \times n + 0.302 \times \mathrm{gen} + 0.045 \times \mathrm{age} + 0.162 \times \mathrm{BMI})}{1 + \exp(-8.395 + 0.200 \times n + 0.302 \times \mathrm{gen} + 0.045 \times \mathrm{age} + 0.162 \times \mathrm{BMI})}$$

where gen is 1 for male and 2 for female subjects. We drew an ROC curve for each model and evaluated its predictive power by the AUC [16]. The AUCs of MODEL I and MODEL II were 0.554 and 0.685, respectively (Figure 2).
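The two fitted models can be evaluated for an individual with a few lines of Python. In this sketch the slope coefficients are those given in the text, while the intercepts are entered as negative values (−0.856 and −8.395); that sign is an assumption of the sketch, chosen because it yields predicted probabilities near the observed case fraction (933 of 2,158 subjects). The function names and the example subject are hypothetical.

```python
import math


def logistic(z):
    """Inverse logit: exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


def model_one(n_alleles, intercept=-0.856):
    """MODEL I: probability of knee OA from the number of risk alleles."""
    return logistic(intercept + 0.205 * n_alleles)


def model_two(n_alleles, gen, age, bmi, intercept=-8.395):
    """MODEL II: adds gender (gen: 1 = male, 2 = female), age in years, BMI."""
    z = (intercept + 0.200 * n_alleles + 0.302 * gen
         + 0.045 * age + 0.162 * bmi)
    return logistic(z)


# Exponentiated coefficients give per-unit odds ratios.
per_allele_or = (round(math.exp(0.205), 2), round(math.exp(0.200), 2))
clinical_or = tuple(round(math.exp(b), 2) for b in (0.302, 0.045, 0.162))

# Hypothetical subject: a 70-year-old woman with BMI 24 and three risk alleles.
example = model_two(3, gen=2, age=70, bmi=24)
```

Exponentiating the coefficients recovers the reported ORs, which is a quick consistency check between the equations and the ORs quoted for each model.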

Figure 2

ROC (receiver operating characteristic) curves of two prediction models. ROC curves of prediction models that used only genetic factors (MODEL I, AUC = 0.554) and that used both genetic factors and clinical data with age adjustment in the selected control group (MODEL II, AUC = 0.742). The curve for MODEL II was drawn through 1,000 simulation studies with rigorous age adjustment for control groups that reflected the Japanese elderly population. MODEL II had a significantly improved AUC. Curves were drawn by using the ROCR package.

To examine the contribution of clinical factors, we constructed a model by using only clinical data (MODEL III). We obtained an AUC of 0.678, which was very similar to that of MODEL II; however, comparison between MODEL II and MODEL III by a likelihood ratio test showed a significant difference (P = 1.67e-5), indicating that MODEL II was superior to MODEL III.
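The likelihood ratio test compares the maximized log-likelihoods of two nested models. The Python sketch below assumes that MODEL II differs from MODEL III by a single term (the number of risk alleles), so the statistic is referred to a χ² distribution with 1 degree of freedom; the log-likelihood values are hypothetical, because the fitted values are not reported in the text.

```python
import math


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio statistic and P value for one extra parameter.

    The statistic is 2 * (loglik_full - loglik_reduced); for 1 degree of
    freedom the chi-square upper tail equals erfc(sqrt(stat / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    if stat < 0.0:
        raise ValueError("full model must fit at least as well as reduced")
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for the clinical-only and combined models.
stat, p_value = likelihood_ratio_test_1df(-1350.0, -1340.0)
```

A small P value indicates that adding the genetic term improves the fit beyond what the clinical variables explain, even when the two AUCs are close.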

To examine the validity of the prediction model, we performed a validation study as follows. We divided our subjects into two sets: the population-based cohort (718 subjects) and the remaining subjects (1,440 subjects). By using data from the latter set, we constructed MODEL II and applied the model to the data of the former set. Consequently, we obtained an AUC of 0.717 for the validation study, whereas the AUC for the original study was 0.649.

Adjustment of clinical data for the control group

A prediction model is generally interpreted to be excellent, good, or fair, when its AUC is 0.9 to 1.0, 0.8 to 0.9, or 0.7 to 0.8, respectively [17]. By these criteria, our models were poor. This was partly because the proportion of male subjects in our control group was only 0.246, which does not reflect that of the general Japanese population. Hence, for constructing MODEL II, we adjusted the age by selecting subjects from the case and control groups whose ages were 60 years or older. Mean ages ± SD (years) of the case and control groups were 73.2 ± 6.2 and 72.0 ± 6.8, respectively. We then constructed a control group by random sampling of 171 male and 215 female subjects, which gave the same male-to-female ratio as that in the Japanese population in 2005. We also selected 400 case subjects at random and applied the logistic regression model to them. Through the simulation studies, we obtained a mean AUC of 0.867.
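The resampling step described above can be sketched in Python as stratified sampling without replacement, repeated many times with the statistic of interest averaged over repetitions. The target counts (171 men and 215 women) come from the text; the pool of controls, the function names, and the use of the male fraction as the statistic are hypothetical stand-ins. In the study, the statistic would be the AUC of the logistic regression model applied to each resampled data set.

```python
import random


def sample_controls(pool, n_male, n_female, rng):
    """Draw controls without replacement so that the sex ratio is fixed."""
    males = [s for s in pool if s["sex"] == "M"]
    females = [s for s in pool if s["sex"] == "F"]
    return rng.sample(males, n_male) + rng.sample(females, n_female)


def mean_over_resamples(pool, n_male, n_female, statistic,
                        repeats=1000, seed=0):
    """Average a statistic over repeated random control groups."""
    rng = random.Random(seed)
    values = [statistic(sample_controls(pool, n_male, n_female, rng))
              for _ in range(repeats)]
    return sum(values) / len(values)


def male_fraction(group):
    """Placeholder statistic: share of men in the sampled group."""
    return sum(s["sex"] == "M" for s in group) / len(group)


# Hypothetical pool of 300 male and 900 female controls (not study data).
pool = ([{"sex": "M"} for _ in range(300)]
        + [{"sex": "F"} for _ in range(900)])

adjusted = mean_over_resamples(pool, 171, 215, male_fraction, repeats=10)
```

The same pattern extends to matching an age distribution: stratify the pool by age band as well as sex and draw a fixed number from each stratum.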

The distribution of subject ages in our control group was slightly different from that of the general Japanese population. Thus, we selected 128 male and 163 female subjects so that the subjects in the control group had an age distribution equivalent to that of the Japanese population in 2005. We then applied the logistic regression model to these controls and 291 randomly selected cases. The simulation studies showed a mean AUC of 0.742 for MODEL II (Figure 2). For this model, we estimated the contribution of genetic factors by calculating the proportion of variation and found it to be very low (0.060), which indicated a nonsignificant contribution of the genetic factor in the model. Indeed, we obtained a mean AUC of 0.735 for MODEL III. To confirm the prediction models, we also performed cross-validation studies and obtained AUCs of 0.677 and 0.671, respectively, which were not different from the original ones.

Our case subjects mainly consisted of women (81.1%), and epidemiologic studies have reported that women have a higher risk of OA than do men [2]. Hence, we also applied our models to male and female subjects separately. With age adjustment for control subjects, we used 128 cases and 128 controls for men, and 462 cases and 462 controls for women. For men and women, we obtained AUCs of 0.737 and 0.696, respectively, and estimated the contribution of the genetic factor to be 0.110 and 0.091, respectively. The increased proportions of genetic factors in these models were attributed to the fact that the models did not contain gender information.

Discussion

Classic twin studies on OA heritability suggested that more than half of the variation in OA susceptibility can be explained by genetic factors [18]. Therefore, we first tested a model that considered genotype information only. The prediction model had low predictability; its AUC was only 0.554. The insufficient power of the model was partly due to the small number of integrated risk alleles. It has been reported that 20 to 25 risk alleles with frequencies greater than 0.1 and OR values of 1.5 are required to obtain an AUC of about 0.8 [19]. A previous study presented a model similar to our MODEL I and reported the additive effect of candidate genes associated with OA [20].

Based on our PAR analysis, the contribution to OA of the genetic factors, composed of the risk alleles of three genes, was similar to that of BMI (31.4% versus 28.4%). In contrast, the proportion of variation, that is, the estimated contribution of genetic factors in our prediction model, was very low when compared with the estimates from previous twin studies [18]. Finding additional risk variants and integrating them into the model is necessary to increase the power of the model.

To improve the predictability of the model, we integrated clinical information, which increased the AUC to 0.685. We further modified the model by adjusting the age distribution of the control group because it differed from that of the case group. After this adjustment, the AUC increased to 0.742. The effect of the age difference between the case and control groups was confirmed by a simulation in which the control group was drawn by random sampling that considered only the male-to-female ratio, with case subjects selected under the same conditions. The logistic regression model applied to these samples produced an AUC of 0.867. These results showed that even slight differences in age between the case and control groups could inflate the mean AUC. Thus, the distribution of age in the control group was influential for constructing a prediction model, underscoring the importance of age adjustment for the control group.

Among the three susceptibility genes, we found no evidence of interaction. We also found no evidence of interaction between the four clinical factors. We are uncertain whether the lack of interaction is real for the factors that we examined; it may instead reflect the limited power of our study resulting from the sample size. It was previously shown that GDF5 may contribute to the variation in height with an estimated additive effect [21]; however, we did not find such an association in our study. Both gene-gene and gene-environment interactions are expected to play roles in common diseases, and we may therefore be able to increase the predictive power by finding as yet unidentified variants that interact with clinical-environmental factors.

Our prediction model for knee OA was constructed based on data from a case-control association study. In type 2 diabetes, some prediction models that incorporate the number of risk alleles and clinical data have been built from case-control study data [17, 22, 23]. These approaches have achieved successful outcomes; however, unlike in a cohort study, data from control subjects may not reflect the true age distribution of the target population. When susceptibility to disease varies with age, it is necessary to consider the age distribution of the control group. To our knowledge, our approach, with its adjustment of clinical information, is the first to construct prediction models from case-control studies while considering this problem. Our approach can compare prediction models built from genetic data, from clinical data, and from both, and can estimate the contribution of genetic factors in the combined model. It would be useful in future preventive measures against not only OA, but also other common polygenic diseases.

Conclusions

To our knowledge, this study is the first report of an OA-prediction model that combines genetic and clinical information from a case-control association study. Our prediction model using genotype information from three susceptibility genes had poor predictability; however, predictability improved significantly when clinical data were incorporated and adjusted. In the current model, the contribution of genetic factors is small. More OA-risk polymorphisms need to be identified and integrated into the model to achieve better prediction.