Background

Problem 1 of Genetic Analysis Workshop 13 (GAW13) provided data from the Framingham Heart Study. We focused on the offspring cohort because of the rate of missing data in the parental cohort.

Because the history of medical intervention, including lifestyle adjustment and the use of anti-diabetic medications, was not available, we chose the highest fasting plasma glucose level over the course of follow-up as the target quantitative trait to indicate the potential risk of abnormal glucose disposal. As suggested by the American Diabetes Association, impaired fasting glucose (IFG, fasting plasma glucose between 110 and 125 mg/dl) is a risk factor for type 2 diabetes mellitus (T2DM) [1]. We therefore used the lower limit of IFG (≥110 mg/dl) as the cut-off to transform this quantitative trait into a dichotomous trait. In this way, subjects with one or more occurrences of elevated fasting plasma glucose (≥110 mg/dl) were included in the higher-glucose group. We then performed association analyses using regression trees for the quantitative trait and classification trees for the dichotomous trait. A marker was considered positive if at least one of its alleles showed association in both analyses.
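The construction of the two traits can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name, the list-of-values input, and the use of None for a missing measurement are assumptions made for the example. Only the 110 mg/dl cut-off comes from the text.

```python
IFG_LOWER_LIMIT = 110.0  # mg/dl, lower limit of IFG as stated in the text


def glucose_traits(fasting_glucose_by_exam):
    """Return (quantitative trait, dichotomous trait) for one subject.

    fasting_glucose_by_exam: fasting plasma glucose values in mg/dl, one per
    examination. None marks a missing measurement (an assumption about how
    missing values are coded, made only for this sketch).
    """
    observed = [g for g in fasting_glucose_by_exam if g is not None]
    if not observed:
        return None, None  # no usable measurement for this subject
    highest = max(observed)  # quantitative trait: highest level in follow-up
    return highest, int(highest >= IFG_LOWER_LIMIT)  # 1 if ever >= 110 mg/dl


# Made-up values for one hypothetical subject.
quantitative, dichotomous = glucose_traits([92.0, None, 118.0, 101.0])
```

Taking the maximum first and thresholding afterwards keeps the two traits consistent: a subject is in class 1 exactly when the quantitative trait is at or above the cut-off.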

Our purpose was to identify candidate genes related to the fasting glucose levels in the presence of covariates. We found a few interesting markers that are closely linked with some potential candidate genes biologically relevant to glucose metabolism.

Method

Data processing

For the phenotype measurements, the corresponding covariates were created using their cross-sectional means. The covariates entered in the analysis included sex, body mass index (BMI), and lipids (total plasma cholesterol, high-density lipoprotein cholesterol, and triglycerides) for each subject. To control for potential familial correlations, the cross-sectional means of the maternal and paternal phenotype measurements were also included as covariates.

For the genotypic data, an allele was chosen to enter the analyses if its allele frequency was at least 10%. Alleles of the same marker with frequencies below 10% were pooled into a single "incognito" allele. The allelic covariates were created using the technique proposed by Zhang and Bonney [2].
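The pooling rule can be illustrated with the sketch below. It is not the Zhang and Bonney procedure itself, which is described in [2]. As an assumption for this example, each allelic covariate is coded as the number of copies (0, 1, or 2) a subject carries. The function name, the input layout, and every allele label other than the 10% threshold are hypothetical.

```python
from collections import Counter

MIN_FREQ = 0.10       # frequency threshold stated in the text
POOLED = "incognito"  # label used here for the pooled rare alleles


def allelic_covariates(genotypes):
    """Build per-subject allele-count covariates for one marker.

    genotypes: dict mapping subject id -> (allele_a, allele_b).
    Returns a dict mapping subject id -> {allele label: copies carried}.
    The copy-count coding is an assumption of this sketch.
    """
    counts = Counter(a for pair in genotypes.values() for a in pair)
    total = sum(counts.values())
    common = {a for a, n in counts.items() if n / total >= MIN_FREQ}

    def relabel(allele):
        # Alleles below the threshold are merged into one pooled label.
        return allele if allele in common else POOLED

    labels = sorted({relabel(a) for a in counts}, key=str)
    covariates = {}
    for subject, pair in genotypes.items():
        carried = Counter(relabel(a) for a in pair)
        covariates[subject] = {label: carried[label] for label in labels}
    return covariates


# Made-up genotypes: alleles 274 and 278 each have frequency 1/20 = 5%.
example = {
    "s1": (266, 266), "s2": (266, 270), "s3": (266, 270), "s4": (270, 270),
    "s5": (266, 274), "s6": (270, 278), "s7": (266, 266), "s8": (266, 270),
    "s9": (266, 270), "s10": (266, 270),
}
covs = allelic_covariates(example)
```

In this toy marker, subjects s5 and s6 each carry one copy of the pooled allele, so the marker contributes three covariates (266, 270, and the pooled allele) instead of four.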

Association analysis using classification trees

The classification tree (CT) and regression tree (RT) methods are both built on the recursive partitioning technique; they can be used to partition a study population into homogeneous, disjoint subgroups. The optimal tree is created by growing and then pruning. The maximal tree is built by splitting each node into two child nodes until the terminal nodes are pure. In splitting, the best split of a node is the one at which the entropy impurity function reaches its minimum. In pruning, subtrees τ are processed for each binary class j until the unconditional misclassification rate is attained; here c(j|i) denotes the cost that a class j is classified as a class i, and IP denotes the entropy impurity function. In general, the choice of cost depends on the severity of the misclassification. In this study, equal cost was chosen for both misclassifications, i.e., c(1|0) = c(0|1), because it frequently gives the most satisfactory analyses [3]. The optimal tree in RT is obtained in the same way as in CT but with a different impurity function, i.e., the within-node variance in the tree τ. More details of CT, RT, and the corresponding splitting criteria are described elsewhere [3-5].
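The splitting step can be made concrete with a short sketch. It uses the standard textbook forms of the entropy impurity and the within-node variance, and it scores a split by the size-weighted impurity of the two child nodes. These are assumptions of the illustration, not formulas taken from the programs used in the study. Pruning and misclassification costs are not implemented, and all function names and data values are hypothetical.

```python
import math


def entropy_impurity(labels):
    """Entropy impurity of a node holding binary class labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    # Terms with zero probability contribute nothing to the entropy.
    return sum(-q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def within_node_variance(values):
    """Within-node variance, the impurity used here for a regression tree."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def best_split(x, y, impurity):
    """Find the cut c on covariate x that minimizes child-node impurity.

    Subjects with x < c go to the left child, the rest to the right child.
    Returns (c, weighted impurity), or None if x has a single distinct value.
    """
    n = len(x)
    cuts = sorted(set(x))
    best = None
    for lo, hi in zip(cuts, cuts[1:]):
        c = (lo + hi) / 2.0  # candidate cut midway between observed values
        left = [yi for xi, yi in zip(x, y) if xi < c]
        right = [yi for xi, yi in zip(x, y) if xi >= c]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (c, score)
    return best


# Made-up data: mean BMI, dichotomous trait, and highest fasting glucose.
bmi = [22.0, 24.0, 25.0, 27.0, 29.0, 31.0]
ifg = [0, 0, 0, 1, 1, 1]
glucose = [90.0, 95.0, 92.0, 115.0, 120.0, 118.0]

ct_split = best_split(bmi, ifg, entropy_impurity)          # classification
rt_split = best_split(bmi, glucose, within_node_variance)  # regression
```

The same search serves both tree types; only the impurity function passed in changes, which mirrors the statement that RT differs from CT in its impurity function.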

Tree-based association analysis was implemented by using genotype measurements, in the form of allelic covariates, together with related phenotype measurements to construct binary trees. An allele was considered to show association with the trait if its corresponding covariate was included in the optimal tree.

To illustrate the tree construction, a portion of an optimal tree created by CT is presented in Figure 1. First, a total of 1667 subjects (the offspring generation) were divided into two groups according to whether their averaged BMI was less than 26.35 or not (node 1 to nodes 2 and 8). Those with averaged BMI of 26.35 or higher were further subdivided according to their HBP status (node 8 to nodes 9 and 15). The 314 subjects in node 9 were further divided into node 10 (or 14) if their averaged maternal triglyceride level was lower (or higher) than 135.5 mg/dl. Finally, if the genotype lacked allele 266 of D16S2620, the subject was likely to have fasting glucose levels lower than 110 mg/dl. In summary, allele 266 of D16S2620 was associated with fasting glucose levels for those with higher BMI (>26.35), no HBP, and lower maternal triglyceride levels (<135.5 mg/dl).
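The path described above can be written as a single rule. The sketch encodes only the one branch reported in the text. The other branches of the tree are not described, so the function returns None for them instead of guessing. The function name and argument names are hypothetical.

```python
def figure1_path_class(bmi_mean, ever_hbp, maternal_tg_mean, carries_allele_266):
    """Predicted class along the single path of Figure 1 described in the text.

    Returns 0 (fasting glucose likely below 110 mg/dl) for a subject with
    mean BMI of at least 26.35, no HBP, mean maternal triglycerides below
    135.5 mg/dl, and no copy of allele 266 of D16S2620. Returns None for
    every other branch, because the text does not report those outcomes.
    """
    on_path = (
        bmi_mean >= 26.35
        and not ever_hbp
        and maternal_tg_mean < 135.5
    )
    if on_path and not carries_allele_266:
        return 0
    return None


# Made-up subject who follows the described path.
predicted = figure1_path_class(28.0, False, 120.0, False)
```

Reading the tree as a conjunction of conditions shows why the association is conditional: the allele matters only within the subgroup defined by the three earlier splits.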

Figure 1

A portion of an optimal tree from the classification tree analysis. Decision node criteria: 1. BMI: average BMI across cross-sections; 2. HBP: ever hypertensive in any cross-section; 3. MTG: average maternal triglycerides; 4. Allele 266 of D16S2620. Definition of classes: Class 0, fasting glucose levels <110 mg/dl in all cross-sections; Class 1, at least one observed fasting glucose level ≥110 mg/dl.

Genome-wide screen

In this study, we conducted a genome-wide screen to identify candidate genes in the presence of a set of specified covariates. We performed RT- and CT-based association analyses on the quantitative and dichotomous traits, respectively. A marker was interpreted as positive if at least one of its alleles showed association in both association analyses. The allelic covariates from the same chromosome were entered in the analyses simultaneously, so the genome-wide screen consisted of 22 such runs, one for each autosome. The computer programs QUEST [6] and RT [7] were used to construct the binary trees for the CT and RT analyses, respectively.
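The positivity rule amounts to a set intersection, sketched below. The sketch reads the rule at the marker level: a marker is positive if some allele of it appears in the optimal CT tree and some allele of it appears in the optimal RT tree. Requiring the same allele in both trees would be a stricter reading, and the text does not say which was used. The function name and the example pairs are hypothetical, apart from allele 266 of D16S2620 in the CT tree.

```python
def positive_markers(ct_alleles, rt_alleles):
    """Markers with at least one associated allele in both analyses.

    ct_alleles, rt_alleles: iterables of (marker, allele) pairs whose allelic
    covariates appear in the optimal CT tree and the optimal RT tree for one
    chromosome. Marker-level matching is an assumption of this sketch.
    """
    ct_markers = {marker for marker, _ in ct_alleles}
    rt_markers = {marker for marker, _ in rt_alleles}
    return sorted(ct_markers & rt_markers)


# Made-up tree contents for one chromosome; "markerA" and "markerB" and
# all RT alleles are placeholders invented for the example.
ct_tree = [("D16S2620", 266), ("markerA", 150)]
rt_tree = [("D16S2620", 270), ("markerB", 201)]
hits = positive_markers(ct_tree, rt_tree)
```

Applying the function to each of the 22 autosomal runs and concatenating the results would give the genome-wide list of positive markers.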

Web-searching for candidate genes

Map positions were defined using the Ensembl Genome Server at the Sanger Institute http://www.ensembl.org/Homo_sapiens/. For the candidate gene search, we used Online Mendelian Inheritance in Man at the National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM or euGene http://iubio.bio.indiana.edu:8089/man/.

Results

Table 1 shows the seven candidate regions, comprising nine markers that were positive in both analyses, on chromosomes 1p, 2p, 6q, 11p, 16q, 18p, and 19q. Four of these seven regions, covering the markers D1S1665, D6S474, D11S1981, and D19S254, were closely linked to genes previously reported to be relevant to glucose metabolism or diabetes mellitus (details listed in Table 1).

Table 1 Positive markers found in the analyses using classification and regression trees

Discussions and Conclusions

In this study, the intent of our screening method was to identify candidate markers rather than to pinpoint susceptibility alleles, although it can also be applied to detect allelic or non-allelic heterogeneity. The cut-off value used in CT in this analysis was chosen for a biological reason. However, the analysis was sensitive to the choice of cut-off because the subjects were largely clustered around the cut-off point (110 mg/dl). Only three regions, on 1p, 16q, and 18p, were consistently positive at neighboring cut-offs from 100 to 120 mg/dl.

Although covariates such as BMI and HBP, which are associated with fasting glucose level, were included in our analyses, the cut-offs of these covariates in our final 22 optimal trees were not the same. Further studies are needed to inspect the impact of the different cut-offs and the associated alleles.

From a different point of view, our method used the RT analysis on the quantitative trait to validate the results from CT such that the positive markers showed association in both analyses. Notably, four out of the seven candidate regions harbored previously reported genes that are related to glucose metabolism or diabetes mellitus. In conclusion, our screen method shows promise for searching candidate loci in genome scans for complex traits.