Background

With the recent technological advances, various genome-wide studies have uncovered an unprecedented number of structural variants throughout the human genome [13], mainly in the form of copy number variations (CNVs). The considerable number of genes and other regulatory elements that fall within these variable regions make CNVs very likely to have functional and, ultimately, phenotypic consequences [4, 5]. In fact, recent studies have reported a correlation between copy number of specific genes and degree of disease predisposition [68], indicating that identification of DNA copy number is important in understanding genesis and progression of human diseases.

Several techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as array-based comparative genomic hybridization (aCGH). The goal of this approach is to identify contiguous DNA segments where copy number changes are present. The ability of aCGH to distinguish between different numbers of copies is limited, so various quantitative techniques are required for more precise, targeted analysis of genomic regions. For known CNVs, real time PCR assays can be used to compare the copy number status of particular loci in cases and controls. Individuals are typically binned into copy number categories using pre-defined thresholds of probe signal intensity. Recently, Multiplex Ligation-dependent Probe Amplification (MLPA) [9] has also been used to quantify copy number classes. This method allows the analysis of several loci at the same time in a single assay. MLPA is usually used to identify gains or losses in test samples with respect to controls [10], but it can also be used in the context of association studies in a case-control or cohort settings [11, 12].

The statistical methods used in CNV-disease association studies are currently very simple. Quantitative methods give CNV probe signal intensity measurements for each individual as a continuous variable, from which copy number status is inferred, generally using pre-defined thresholds. Differences in copy number distribution between cases and controls are then assessed using χ2, Fisher or Mann-Whitney tests [6, 13, 14]. However, the distribution of CNV probe measurements is continuous and multimodal, meaning that signal intensity should be considered as a mixture of curves. In many instances, these curves overlap with various underlying distributions leading to uncertainty. Therefore, scoring copy number by binning and then assessing the association may lead to misclassification and unreliable results.

Ionita-Laza et al. (2009) pointed out that it is not inmediately clear how this uncertainty of CNV calling should be incorporated in the statistical analysis [15]. To overcome this difficulty in assessing association between CNVs and disease, we propose a latent class (LC) model that incorporates possible uncertainty that appear when CNV calling is performed. After inferring copy number using Gaussian finite mixture distributions, or any other calling algorithm, the model assesses the relationship between the trait and a CNV using a mixture of generalized linear models. Association is then tested using a likelihood ratio procedure. We validate and compare our method with existing methods through a simulation study. We then illustrate how to test association between CNVs and the trait by using two real examples. One of them corresponds to a case-control study using data from a MLPA experiment where the true copy number status is known. The second example belongs to a study where breast cancer cell lines are analyzed using aCGH.

Methods

Inference of copy number status

Let us assume that we observe I individuals from a given population, consisting of mutually exclusive latent classes c = 1, ..., (e.g. copy number status). Instead of observing these classes, we observe a surrogate variable, X, corresponding to a continuous variable arising from any quantitative method. For instance, in targeted studies using MLPA or real-time PCR, X corresponds to peak intensities for each CNV probe. In the context of a whole genome scan, one may have quantitative data from aCGH or any other platform such as Illumina or Affymetrix, where, for each probe, the variable X corresponds to a ratio of intensities. Figure 1 shows a number of possible distributions that signal intensities may have. Some variants clearly show different underlying copy number status with multimodal signal intensities distributions (CNV2, CNV4 and CNV6). In other cases, where the existence of different copy numbers is not clear, inferring copy number by binning the data may be difficult or unfeasible.

Figure 1
figure 1

CNV quantitative measurements. Examples of CNV data showing different clustering quality and copy number status.

For each CNV variant, we are interested in classifying the subjects into the classes using the surrogate variable X. We propose to model the unobserved latent classes using a finite mixture model with components of the form

(1)

where N(·|η c , ) is the Gaussian distribution with Θ denoting all model parameters (e.g., Θ = (η c , ), c = 1, ..., ), and x is the surrogate variable that corresponds to the quantitative measure of copy number status. For the component weights π c it holds that

The value of to be used is chosen by applying Bayesian Information Criteria (BIC) [16]. This mixture model approach for calling is similar to some used for the analysis of aCGH data [17, 18] where correlation among probes should be considered. When analyzing MLPA data, it should be pointed out that in some instances, especially when there are individuals with 0 copies, the intensity distributions (see CNV2 and CNV4 in Figure 1) for a null allele is meant to be equal to 0. However, due to experimental noise it is fact that in some cases this ratio shows values that slightly deviate from this theoretical value. After our experience with hundreds of home-made MLPA probes, the value for null alleles is typically below 0.1; nevertheless, we recommend this parameter to be determined experimentally for each of the probes used in the MLPA experiments using the appropriate control samples. For these cases, the procedure used to estimate the parameters in (1) fails because the underlying distribution of individuals with 0 copies is not normal. In these situations we propose to fit the following mixture model to determine the latent classes

(2)

where τ is given by the user, as previously indicated, , ℐ denotes an indicator function, and

The posterior probabilities are used to segment data by assigning each individual to a given copy number status corresponding to the class with maximum posterior probability (MAP). After fitting this finite mixture model, we can perform a goodness-of-fit test using χ2 test statistic. Finite mixture parameters can be estimated using the EM algorithm [19, 20] or Newton-type procedures [20]. Then, the posterior probability that individual i with an observed value x belongs to copy number class j is given by

(3)

Latent class model

Discrete traits

Let us suppose that copy number status is associated with a binary phenotype (case-control). The association is typically assessed using a χ2 test for the contingency table (Table 1). Misclassification in the table (due to uncertainty when inferring CNVs) is incorporated when we assign each individual to a given class c using maximum a-posteriori probability (MAP). Thus, this problem can be seen as an association study with misclas-sification ("measurement error") [21]. It is well known that misclassification of covariates has important implications for parameter estimates and statistical inference [22]. Some approaches account for such error [23, 24]. These are, however, based on performing validation studies in a subsample. In the present context, this is unfeasible because hundreds of genes are normally analyzed at a time, and the technology may have a different sensitivity and specificity for each of the inspected loci. Therefore, we propose to use the posterior probability of belonging to each latent class to model the degree of misclassification of copy number status. We then take this information into account in the association model.

Table 1 Contingency table of disease status and copy number category

Conditioning on cluster c, we have that

(4)

where β= (β1, ..., β c ), c = 1, ..., is our vector of parameters, and

Then, equation (4) can be rewritten as

Now, we consider that copy number status is measured with error (i.e., the latent class is not known). Therefore, we are modeling the probability of being an affected individual as a mixture of binomial variables, as follows:

where w ic is the posterior probability that individual i belongs to copy number class c, given in (3). Therefore, assuming conditional independence of case-control status, given latent class, the likelihood function for model parameters β can be written as

(5)

We can then simply compute the odds ratio (OR) of belonging to class c with respect to a given reference r as

(6)

Quantitative traits

We now consider the case where our phenotype, Y, is continuous. We assume that Y |c N(μ c , σ2). In this case, conditioning on cluster c

(7)

where

Similar to the case of discrete traits, the likelihood function for model parameters β is given by

(8)

In this case we are interested in evaluating the difference between the mean effect of individuals with c copies and r copies. This can simply be computed as

Covariate Adjustment

In some instances researchers are interested in assessing the effect of CNVs after adjusting for other covariates, Z1, ..., Z K (usually called confounding variables). In this case, the likelihood function can be written as

where

(9)

for discrete traits, and

(10)

for quantitative traits. In both cases

(11)

Parameter estimation

In this section we address parameter estimation for the general situation of having covariates and either discrete or quantitative traits. For brevity, let θ≡ (β, γ, σ) (notice that for discrete traits σ = 1). We consider that the weights, , are known and that they are given by the surrogate variable X from equation (3). Therefore, they can be used in the log-likelihood calculation, resulting in

(12)

Here P(y i | = c, Z, θ) is given by equations (9) and (10) for discrete and quantitative traits, respectively. The maximum likelihood estimators (MLE) of the model parameters maximize this log-likelihood function. We propose to use a Newton-Raphson procedure to find parameter estimates. The k-th component of the score, S, is given by

The k-th element of the Hessian, H, is

where

Formulae for the derivatives of h ic for covariates and for discrete and qualitative traits are given in the Appendix.

MLE can be used to estimate, under the multiplicative model, the OR between individuals with copy number status c with respect to a reference category (e.g., individuals with copy number status r) as

(13)

Similarly, when analyzing continuous traits, the estimated mean effect among individuals with c copies with respect to those with r copies is

(14)

The asymptotic variance-covariance matrix of maximum likelihood estimates of θ can be estimated using the observed information matrix, F, as

(15)

Therefore, we can compute a 95% confidence interval (CI95%) for ORc/rusing the expression

(16)

and for

(17)

where zα/2 denotes the (1 - α/2)-th quantile of a standard normal distribution, α is the desired type-I error, and subindex [·, ·] denotes the position in the inverse of Fisher's information matrix.

Hypothesis testing

We propose to use a likelihood ratio test to assess disease association, taking the model without the copy number variable as reference. Twice the increase in the log-likelihood provides the asymptotic χ2 statistic that tests H0: β1 = β2 = ... = . In many instances, we are interested in studying the trend in effect with respect to copy number status (e.g., additive model). This can be done by generalizing equation (11) in the form

(18)

where D is a I × M design matrix, and ζ is a vector of dimension M having the model parameters. M is the total number of variables included in the model, including copy number status and confounding variables (e.g., M = + K). For example, a trend test on copy number status without covariates D would have the form

and the trend hypothesis on copy number status is tested using a likelihood ratio test, comparing this model with the null model. Notice that this formulation allows us to accommodate different or common effects for each latent class. In this case, parameter estimates are obtained as shown above. Formulae for the derivatives obtained in the score and Hessian, where coefficients are not shared by each latent class, are shown in the Appendix. R language functions for the methods discussed in this paper are freely available at http://www.creal.cat/jrgonzalez/software.htm[25]

Results

Simulation study

We performed computer simulation studies to empirically examine the properties of the parameter estimators developed in the previous sections. The specific goals of these studies were: (i) to evaluate the performance of the proposed likelihood ratio trend test based on the latent class model for a number of CNV measurement distributions; (ii) to examine the effect of sample size (I) on the distributional properties of the estimators; (iii) to examine the bias and mean square error (MSE) of the estimators; (iv) to assess the accuracy whether of the variance and parameter estimates obtained using the observed information matrix. Simulations were performed as follows: To study (i), we simulated a binary trait using 300 cases and 300 controls. The unobserved copy number statuses (e.g. latent classes) were simulated depending on 3 different copy number status ( = 3), with the proportion of individuals in each category set as π = (0.5, 0.4, 0.1). The trend OR was set equal to 1.5. The observed signal intensity ratio (X variable) were simulated as a finite mixture of normal distributions using different means, η, and variances, σ2, to assess whether the separation of clusters and their variance affects power.

To study (ii)–(iv) we simulated binary and quantitative traits. For the binary trait, simulation was performed as above but simulating various scenarios of sample size (I), OR and proportion of individuals with each copy number status, π. Again, we simulated different CNV distributions by varying η and σ2. For quantitative traits, we used the same simulation procedure but copy number status was simulated depending on a fixed mean trait level for the reference copy number status and a desired mean difference with respect to other copy number statuses. Next, we describe the settings for the different simulation parameters. Sample size: We chose the values of I: I ∈ {50, 300}. Although current studies are analyzing thousands of individuals, these values were chosen to evaluate the performance of our proposed method in moderately large samples. Copy number status: Since we were interested in evaluating the performance of the parameter estimates, we only simulated two different copy number statuses = {1, 2}. Odds ratio: To assess the impact of the strength of association between the disease and CNV, we chose two values for OR: OR ∈ {1.3, 2} in order to consider a moderate association and a strong one. Proportion of cases with normal copy number status: To evaluate the impact of classes with different number of individuals we set π ∈ {(0.8, 0.2), (0.5, 0.5)}. Finite mixture: To asses the impact of distribution of intensity ratio, X, we simulated two normal distributions with the following parameters: η ∈ {1, 1.5}, which correspond to having 2 (considered as normal copy number status) and 3 copies, respectively, and σ ∈ {(0.15, 0.15), (0.15, 0.2), (0.2, 0.2)}. In this case, these scenarios also helped us to model different situations regarding misclassification or how latent classes were separated.

We compared three different approaches. The first (NAIVE) was based on assessing association between disease and copy number status obtained using MAP from the finite mixture model (2). That is, association was assessed using a χ2 test from Table 1. The second is the approach that has been used predominantly to date when analyzing this kind of data and is based on assigning CNV status using pre-defined thresholds (THRES). Association is then assessed using a χ2 test. As mentioned previously, we simulated data from two mixtures of normal distributions with means of 1 and 1.5. This is equivalent to simulating individuals with 2 and 3 copies, respectively. In this situation, it is considered that individuals with intensity (or intensity-ratio) greater than 1.33 correspond to individuals with 3 copies [10]. The third method is the one proposed in this paper, based on latent class (LC) using a χ2 test. In order to make the results comparable, the performance of LC based on likelihood ratio trend test was compared with that of the two other methods using a χ2 trend test (e.g. 1 degree of freedom). To evaluate bias and MSE of parameter estimates, χ2 of association was used for all three methods.

Simulation results for evaluating the performance of the likelihood ratio trend test in our proposed model are shown in Figure 2. The top figures show the power for all methods analyzed under two scenarios (other scenarios are given in Additional file 1).

Figure 2
figure 2

Empirical power for simulation studies. Empirical power for the three different approaches analyzed, varying the quality of clustering for underlying copy number status. Left panel is for fixed variance and varying means, while the right panel is for fixed mean and varying variances.

The left panel shows the power for each method, varying the CNV measurement distribution with regard to the mean of each latent class, η, while the right panel gives the same information but with fixed means and varying variances, σ2. Figure 2 also depicts the distribution of CNV signal intensities for various scenarios. We observe that our proposed latent class model performs better in all cases, even when distribution of copy number status are not very well separated (e.g. more uncertainty).

Simulation results to evaluate parameter estimates for discrete traits are presented in Table 2 and in Table S1 and Figures S3 and S4 (see Additional file 1). Similar results and conclusions are obtained for a quantitative trait. Table 2 and Figures S3 and S4 (see Additional file 1) summarize the OR obtained by comparing individuals with 3 copies to those with 2 copies (reference category) and give the MSE for two different sample sizes, I, two different proportions of individuals with 2 copies, π, and two different variances for each component of the mixture, σ. Table S1 (see Additional file 1) compares different methods to compute the standard error of the ORs for the various scenarios described above. The results compare asymptotic variance based on an observed information matrix (ASYM) with respect to empirical variance (EMP). Supplementary Table S1 also shows coverage and power of confidence intervals based on the three methods analyzed. As expected, when the sample size increased, the performance of the estimators of the finite-dimensional parameters improved (Table 2). In all cases, the LC method performs better than the others. LC has less bias than NAIVE and THRES in all cases, and also shows better MSE.

Table 2 Simulation study

Regarding variance estimates, the estimation based on ASYM showed good performance in all scenarios (see Additional file 1, Table S1). Despite slightly overestimating of EMP, the bias was less pronounced for I = 300, as expected. Confidence intervals based on the LC method outperform those obtained by other methods with regard to power.

Application to real data

MLPA example

The first data set used to analyze CNV and disease was generated and kindly provided by one of the coauthors of the current work. Although data is still unpublished, it has been made available in a blinded format for reproducing our findings using the approach presented herein, and for other validation studies. Some candidate genes were identified after performing a whole genome scan analysis using aCGH, where a pool of controls and cases were compared. In order to further investigate the relationship between the disease and altered the genes, a targeted study including several variants was designed using the MLPA technique. We obtained signal intensities of MLPA assays for 360 cases and 291 controls. Figures 3 and 4 show the intensities for cases and controls for two selected genes. In both cases, we observe 3 latent classes, corresponding to 0, 1, and 2 copies of the gene. We found that the finite mixture model fits very well (χ2 goodness-of-fit test, P = 0.6615 and P = 0.4888). The main difference between these two cases is that copy number status for gene 1 can be established using a threshold method, while for the second gene this classification seems more arbitrary. As a consequence, misclassification should be taken into account when analyzing gene 2. Table 3 shows the classification of individuals as having 0, 1, 2 copies, estimated using equation (2) and the true copy number obtained by breakpoint cloning and assessing allele presence by PCR, which unequivocally reports the exact number of copies. From the table, we can see that the finite mixture model gives a perfect classification for gene 1 and some misclassification for gene 2. Goodness-of-fit test revealed that the proposed mixture model to determine CNV status was appropriate (p = 0.6615 and p = 0.1586).

Figure 3
figure 3

Association between Gene 1 and disease. Graphical representation of peak intensities (CNV quantitative measurement) of individuals for Gene 1 analyzed in the example. The various colors indicate copy number status inferred using our proposed finite mixture model.

Figure 4
figure 4

Association between Gene 2 and disease. Graphical representation of peak intensities (CNV quantitative measurement) of individuals for Gene 2 analyzed in the example. The various colors indicate copy number status inferred using our proposed finite mixture model.

Table 3 Contingency table of estimated and true copy number status for the two genes examined in the real data example.

Table 4 shows the ORs and their 95%CI for the two genes analyzed. The first three columns show the results obtained in the laboratory using PCR, while the other columns show the results obtained after estimating the copy number status using our proposed finite mixture model and computing the ORs using a naïve approach (e.g. assuming that there is no misclassification) and the LC model that accounts for misclassification. As we can see, the results are the same for gene 1, since no misclassification is observed (see Figure 3 and Table 3). However, for gene 2, copy number status could not be determined as easily as for gene 1. Thus, we observe a different OR estimation and, more importantly, a different P-value for association. For instance, the order of magnitude of the association between the disease and gene 2 is better captured by the LC model than by the NAIVE approach. Regarding the OR estimates, the analysis using the true copy number status shows that individuals with one copy of gene 2 have a 63% decrease in disease risk with respect to individuals with 0 copies. As the 95%CI shows, this difference is statistically significant. We arrive at the same conclusion when we compare individuals with 2 copies with respect to those with 0 copies. Note that in both cases we observe that the naïve approach underestimates the OR, as shown by the simulation study.

Table 4 Association analysis of disease status and copy number category using the true copy number status and the estimated status obtained using the finite mixture proposed.

aCGH example

The analysis of aCGH data requires additional steps to take into account the dependency across probes. Table 5 shows four steps we recommend for the analysis of this kind of data. First, MAP should be obtained with an algorithm that considers probe correlation. We use, in particular, the CGHcall R program which includes a mixture model to infer CNV status [18]. Second, we build blocks/regions of consecutive clones with similar signatures. To perform this step the CGHregions R library was used [26]. Third, the association between the CNV status of blocks and the trait is assessed by incorporating the uncertainty probabilities in the LC model. And fourth, corrections for multiple comparisons must be performed. We use the Benjamini-Hochberg(BH) correction [27]. This is a heuristic method that is robust against positive dependence and increasingly conservative as correlation increases [28].

Table 5 Steps used to assess association between CNVs and traits when aCGH is used.

We applied the methodology to the breasts cancer data studied by Neve et al. [29], which is freely available from the bioconductor website http://www.bioconductor.org/[30]. The data consists on CGH arrays of 1 MB resolution [31]. The authors chose the 50 samples that could be matched to the name tokens of caArrayDB data (June 9th 2007).

In this example the association between strogen receptor positivity (dichotomous variable; 0: negative, 1: positive) and CNVs was tested. We contrasted the association as given by the LC and the NAIVE models. The original data set contained 2621 probes which were reduced to 459 blocks after the application of CGHcall and CGHregions functions. Table 6 shows the number of CNV blocks associated with strogen receptor positivity for different significance levels. We observe that incorporating classification uncertainty with the LC model substantially increased the level of association, as compared to the NAIVE approach. The number of positive association at 5% of significance after applying BH correction was 49 and 24 for LC and NAIVE approach, respectively.

Table 6 Number of CNV blocks (out of 459) associated with estrogen receptor positivity from 50 aCGH breast cancer cell lines.

Discussion

In this paper we have shown that the assessment of association between CNVs and disease using analysis methods that do no take into account uncertainty when inferring copy number status lead to larger p-values and underestimate the model parameters. This confounds the need to increase statistical power, which is reduced by the multiple comparison correction for the simultaneous testing of several loci. False positives are typically controlled by a dramatic reduction in the nominal p-value, such that very low values are required to reach statistical significance. Thus, a precise computation of these values is essential in genetic association studies.

Here we have proposed a latent class model (LC) that accounts for the uncertainty of assessing CNV status and also accommodates potential confounding factors. In the case of analyzing quantitative traits, we also provide formulae to further propagate call uncertainty, as other authors have proposed in another context [32]. By analyzing quantitative traits, we have assumed that the response variable follows a normal distribution, although this assumption does not hold in some instances. In this situation, one possibility is to analyze the log-transformed variable, although log transformation may not be not sufficient. The model could easily be extended to fit a response variable that has any exponential family distribution (e.g. normal, gamma, Poisson). However, we have not yet implemented this option in the functions reported here. The extension of our proposed latent-class model to assess survival time, possibly with right-censored data, is not trivial but could be a very interesting avenue for future investigation. The parameter estimation procedure proposed here, allows the estimation of confidence intervals. The LC model was remarkably consistent with simulated data. In particular, we found that the p-values obtained with the LC model were more similar to the expected values than those obtained by the threshold and naïve methods.

We maximize the likelihood function, assuming fixed weights for each copy number status, which accounts for possible misclassification. The main advantage of considering weights as known constants is that the Newton-Raphson procedure is much simpler, faster and feasible for obtaining the Hessian matrix analytically. We confirmed that the proposed model captures very well the nature of the synthetic data and variance estimates. Interestingly, we observed that the variance estimates using MLE were also reproduced when a bootstrap procedure was used (see Additional file 1, Table S2). In the interest of generalization, one can consider maximizing the likelihood function for both model parameters and weights. In that case, an EM algorithm should be used instead. However, one should bear in mind that EM does not allow for estimation of the variance of the model parameters and is computationally expensive, which may be particularly costly if this method is used in whole genome scan settings.

Conclusion

We have shown that the LC model can incorporate uncertainty of CNV calling in the analysis. We have also illustrated how to analyze quantitative traits as well as how to accomodate confounding variables. This is of particular importance in complex diseases studies where other clinical or biochemical factors need to be taken into account. The formulation can also be generalized to assess survival times or counts in longitudinal studies. The model has showed good performance when analyzing both targeted (MLPA data) and whole genome (aCGH data) studies.

Appendix

To obtain parameter estimates, we maximize the log-likelihood function

where P(y i | = c, Z, θ) is given by equations (9) and (10) for discrete and quantitative traits, respectively. As previously mentioned, the k-th component of the score, S, is given by

The k-th element of the Hessian, H, is

where

Herein we provide formulae for the derivatives of h ic for all cases discussed in this paper. Although the following expressions may appear complicated, they are straightforward to program and are included in the >R functions available at http://www.creal.cat/jrgonzalez/software.htm.

Binary Traits

Binary Traits without covariates

In this case, the h ic function takes the form

Therefore,

where

and

Binary Traits with covariates

In this case, the h ic function takes the form

Therefore,

where

and

For covariates:

Quantitative traits

Quantitative traits without covariates and shared variance

In this case, the h ic function takes the form

Therefore,

Quantitative traits with covariates and shared variance

In this case, the h ic function takes the form

Therefore,

Trend test

In this situation we can write the linear predictor of equation (18) as

In other words, β1 plays the role of an intercept and β2 is the slope. In this case, we consider that both β1 and beta 2 are shared for each latent class. In this situation, bearing in mind that , for the discrete traits, we have that

(19)

and

(20)

For quantitative traits, where , we have that

(21)

and

(22)

For the variance, we have that

(23)
(24)

and

(25)