Abstract
Background
MLPA method is a potentially useful semiquantitative method to detect copy number alterations in targeted regions. In this paper, we propose a method for the normalization procedure based on a nonlinear mixedmodel, as well as a new approach for determining the statistical significance of altered probes based on linear mixedmodel. This method establishes a threshold by using different tolerance intervals that accommodates the specific random error variability observed in each test sample.
Results
Through simulation studies we have shown that our proposed method outperforms two existing methods that are based on simple threshold rules or iterative regression. We have illustrated the method using a controlled MLPA assay in which targeted regions are variable in copy number in individuals suffering from different disorders such as PraderWilli, DiGeorge or Autism showing the best performace.
Conclusion
Using the proposed mixedmodel, we are able to determine thresholds to decide whether a region is altered. These threholds are specific for each individual, incorporating experimental variability, resulting in improved sensitivity and specificity as the examples with real data have revealed.
Similar content being viewed by others
Background
With the recent technological advances, different genomewide studies have uncovered an unprecedented number of structural variants in the human genome [1–7], mainly in the form of copy number variations (CNVs). As much as 12% of the human genome has been reported to be variable among different individuals [6]. The important number of genes and other regulatory elements encompassed by those variable regions, makes it very likely for them to have functional and, ultimately, phenotypical consequences [8, 9]. In fact, several publications have already correlated the number of copies of different genes with different degrees of disease predisposition [10–12]. Therefore, the identification of DNA copy number is important in understanding genesis and progression of human diseases.
Several techniques and platforms have been developed for genomewide analysisi of DNA copy number, such as arraybased comparative genomic hybridization (aCGH). The goal of the analysis of DNA copy number data using this approach is to partition the whole genome into segments where copy numbers changes between contiguous segments may be present. The ability of aCGH to discern between different number of copies is very limited, thus different kinds of techniques have been developed for targeted, and more precise, analyses of genomic regions. Multiplex Ligationdependent Probe Amplification (MLPA) [13] is one of the most used technologies among other existing ones such us Quantitative Multiplex PCR of Short Fluorescent (QMPSF) [14, 15] or Multiplex Amplifiable Probe Hybridization (MAPH) [16].
MLPA is a recently developed semiquantitative method that aims to detect copy number alterations at the genomic level (gains or loses) in a test DNA with respect to a control. Due to its low cost, reliability and ease of implementation it has become very popular both as a research and a diagnostic tool. After hybridization, ligation and multiplex amplification of specific probes targeting different genomic regions, the probes are electrophoresed and analyzed using a DNA analyzer [13]. Each specific probe migrates according to its molecular weight and the resulting pherograms show specific peaks that correspond to each probe (Figure 1). Relative dosage information can be obtained after the comparison of peak intensities (height or area) of the different probes between test and control samples.
Due to the variation of PCR efficiencies across probes (due to their different size and nature) and across samples, a normalization method is needed before comparing dosage quotients. This step is crucial because the variation in experimental conditions may lead to differences of measured values between samples, thus hampering the correct interpretation of results. After normalization, which eliminates possible differences introduced during the experimental process, we aim to find biological differences in gene dosage (copy numbers). That is, the normalization tries to minimize the amount of systematic nonbiological variation among samples. Different normalization methods have been used for analyzing MLPA data. The most common method divides each intensity by the sum of all intensities in each sample (see, for instance [17] and [18]). Other alternative approaches are based on regression methods that account for the amplification decay of larger probes. One of such regression methods uses internal control probes [19], while the another normalizes intensities based upon the statistically most probable median peak intensities using a median filter [20]. Other authors suggest to normalize peak intensities by using 4 separate peak groups according to increasing frament sizes ot the peaks (see [21] for further details). Finally, a similar approach using the mean intensity of control probes inside a normalization group as the dividing factor was considered in [22].
After an intensity normalization, the next key point is the determination of genes that are significantly altered. As an example, the peak intensity of exon 13 in Figure 1 seems to be lower in patient than in control indicating a deletion. Again, different approaches have been used to find probes that are altered. In this work we are considering ratio [18, 23] and regressionbased [19, 20] methods. In both approaches, the basis for the analysis is the comparison among normalized peak intensities from patient and control samples using a dosage ratio. In the ratiobased approach, deletions and duplications are given as outliers from the data set after defining a "biological" threshold (i.e. in a diploid genome, two copies exists of each gene). We assume that the simplest, most likely scenario, is an heterozygous gain or a loss of the material, under which ideal ratio values of 1.5 and 0.5 are expected, respectively. In such a simple scenario, and taking into account experimental noise, it is generally accepted in the literature that values below 0.7 and above 1.33 are indicative of loss and gain of a genetic material, respectively [13, 18]. The regressionbased approach is based on fitting several linear regression models, considering the normalized peak intensity of a given patient as the dependent variable and the normalized peak intensities of the mean control sample as independent variables. Another method based on taking into account the individual noise for each probe (i.e., standard deviation) for all control samples is considered in [24].
Both the ratio and regressionbased methods have general drawbacks. The ratiobased approach always considers the same gain/loss threshold without taking into account the different variability among experiments and among probes. On the other hand, although the regressionbased approaches solve this problem, fitting regression models without considering that the independent variable (normalized peak intensity in controls) is also subject to error may lead to wrong conclusions. In addition, the regression method uses confidence intervals for predictions which are built under the normal theory. The normality assumption may be untenable, particularly because of the limited amount of control probes that are typically used in routine experiments (< 10 probes). Finally, none of these procedures considers replicates for each individual, reducing information by averaging the replicate values.
To overcome these difficulties, we propose alternative methods to deal with the normalization and the discovery of alterations when analyzing MLPA data. Our approach is based on both nonlinear (normalization procedure) and linear (to determine alteration) mixed models. The main advantages of our methods are: (1) the variation among individuals is modelled explicitly using a random effect; and (2) the replicates may be easily incorporated in the model. The deletions and duplications will be determined using tolerance intervals. We validate and compare our methods with the existing methods using a controlled experiment. We evaluate the presence of gains and loses in a set of DNAs from normal individuals and individuals affected by known genetic disorders. The probemix we used included probes of single copy number regions located both inside and outside of the genomic disorder regions. A total of 30 samples, including 24 controls and 6 patients, were analyzed blindly.
Results
Simulation study
In this section, we investigate the performance of the two existing methods and our proposed approach through a number of simulations. We simulated an hypothetical assay with 45 probes. We also simulated internal control probes (i.e. nonaltered) to be used in the iterative regression approach by setting σ_{ γ }= 0. Two different scenarios were simulated in order to investigate the performance of REXMLPA approach given the number of control probes: 10% and 20% of the 45 probes. Finally, the ratios of the rest of probes were simulated as altered (gains or loses) depending on a different percentages: 5%, 10%, 20% and 50% of the probes. These values were generated from a normal distribution under different scenarios varying between probe variability (σ_{ β }∈ {0.2, 0.4, 0.6}), probetest variability (σ_{ γ }∈ {0.2, 0.4, 0.6, 1.0, 1.5, 4.0}) and withinprobe or random variability (σ_{ ε }∈ {0.05, 0.08}). We simulated 3 replicates of each peak intensity for test and control samples. Note that the average of these replicates were considered in the case of using threshold and REXMLPA approaches. We summarized our simulations by computing the mean number of altered probed simulated in each run, the empirical typeI error and the empirical power to detect gains and loses. These results are based on 1,000 simulations.
Tables 1 and 2 show the results for the case of having 10% of true altered probes when 10% and 20% of probes have been generated as a internal control probes, respectively. The results for the REXMLPA method were based on building confidence limits at 99% at each iteration. Regarding the empirical typeI error, the REXMLPA approach usually overestimate the expected 5%, while the threshold method clearly underestimates typeI error since in all simulations the simulated typeI error was closer to 0% (results omitted in the tables). On the other hand, the probespecific model shows a good performance. As expected, the power of the three methods increases when the probetest variability increases. That is, the power depends on the size effect of those altered probes. In general, the probespecific mixedmodel outperforms both REXMLPA and threshold approaches. The threshold approach is clearly underpowered for those cases in which the magnitude of the effect is not large enough. Finally, the same pattern is observed when the random variability is increased (e.g. withinprobe variability) but showing lower power. The simulation results for the case of having 5%, 20% and 50% of true altered probes showed similar behaviour and they can be found in the Additional file 1.
Validation study
In order to validate our proposed method, and compare its performace with those of the existing methods, we have designed a controlled MLPA probemix in which we have included 34 different oligonucleotides, corresponding to 16 different targeted regions with single copy number in normal individuals. Eight of the targeted regions are variable in copy number in individuals suffering from different genomic disorders (see Table 3). A total of 30 DNAs from different individuals were assayed in triplicate in the study. 24 of the DNAs came from unrelated HapMap individuals (i.e. normal general population), 2 DNAs from PraderWilli syndrome patients (15q11q13 deletion), 2 DNAs from DiGeorge syndrome patients (22q11 deletion) and 2 DNAs from autistic patients with a duplication in 10q region. After hybridization, ligation and PCR, probes corresponding to the different individuals were electrophoresed and peak intensities were recorded and blindly analyzed using the different methods. MLPA hybridization, ligation and PCR with universal MLPA primers was performed as described elsewhere [13]. PCR products were loaded on a capillar DNA analyzer and electrophoresed. Genescan software was used to analyze the runs and to retrieve peak intensities corresponding to each probe in the different samples.
Comparison of methods of copy number estimation
Before analyzing the data we illustrate that our proposed method based on a mixedmodel can be applied. Supplementary Figure S1 shows the residual error for each individual and probe indicating that the residuals are centered at zero and that the variability does not change with individuals. Table 4 shows the results obtained from the 9 test samples used in the validation study using the threshold, iterative regression and proposed probespecific mixedmodel method. For the two patients with PraderWilli syndrome, we observed that all methods were able to detect the three deletions in 15q11q13 region. The same conclusion was observed regarding patients diagnosed with DiGeorge syndrome. Considering patients with Autism, all methods detected gains in the 10q region. Only in one case threshold method had a false negative result and the REXMLPA method indicated a deletion in the UBE3A, a false positive finding. These results clearly agree with our simulation studies in which we found that iterative regression has a increased typeI error rate and the threshold approach a low power. For instance, the ratio between the second individual diagnosed with Autism and the mean controls was 1.27 at probe named ZWINT, so the threshold method did not indicate that this was a probe altered in this individual. This result is due to the fact that the ratio between the probetest variability and the random error variability was around 3 (3.09 = \frac{0.1423}{0.0461}) and the simulation study showed that in this case the power of threshold approach is very poor. Finally, regarding HapMap individuals who were considered as negative controls, we observed that any method found a false positive result.
Conclusion
The MLPA method is a potentially useful semiquantitative method to detect copy number alterations in targeted regions obtained after performing genomewide screening using comparative methods usually based on genomic hybridization. MLPA is based on comparing peak intensities of the different probes between test and control samples through its ratio. Before determining the statistical significance of dosage ratios, a normalization procedure is needed to control variation of PCR efficiencies across probes. We have proposed a nonlinear mixedmodel to perform such normalization, and two other existing methods have been discussed. One of them is based on considering the sum of peak intensities in each sample, while another uses internal control probes to fit a regression line which is used as a reference.
So far, the widely used method to detect statistical significance of dosage ratio is to consider as altered those probes that are outside a given threshold. In this paper, we have described a new approach to detect altered probes based on a mixedmodel. The threshold is then established by using a tolerance interval which accommodates the specific random error variability observed in each test sample. We have also discussed another approach based on an iterative regression. Through simulation studies and a controlled MLPA assay, we have shown that our proposed method outperforms the existing ones. Another important advantage by comparing our approach and the REXMLPA method is that our method can be used even when no internal control probes are available in the assay.
A novelty of our algorithm is that it can handle information from replicate experiments. Replicates of MLPA assays are typically not necessary for commercial kits since in most cases more than one probe is used to interrogate the same region and concordance between those probes is considered to provide enough reliability of the existence of a copy number variation. Nevertheless, in cases where a single probe per region is placed in the assay, it is definitely useful and desirable (if not necessary) to have replicates that help minimise possible variations arising during the long MLPA protocol and to ensure reliability of the findings. Thus, having a method that is able to handle such information is very useful and provides extra statistical robustness to the findings.
The use of control probes as references for data normalization is one of the methods recommended by the manufacturer. Commercial and homemade MLPA assays are clearly different in terms of quality of the resulting pherograms, mainly due to the process of producing the specific probes that target regions of interest. While commercial applications produce very high, sharp and neat peaks, homemade assays typically do not perform that well. Nevertheless, in both cases, the use of control probes is always desirable from a statistical point of view and it is so regardless of the number of probes that target a determined loci. In exploratory experiments with MLPA designs targeting extremely variable and polymorphic CNVs, adding control probes whose behaviour is proven unvariable and can help in assessing reliability of single experiments. In such cases, we understand that the inclusion of evenly spaced control probes is a good practice. One key point is to know the number of replicates that should be included in a MLPA assay. Our simulations indicates that 3 replicates was enough to reach a power of 90% keeping a value of alfa around 5% in those combinations where the probetest variability was over 8 times greater than that of random error (combinations where σ_{ γ }= 0.4 and σ_{ ε }= 0.05). Thus, if the probetest versus random error variability ratio was lower, a lesser power would be expected. In that case the researcher could improve the expected power by increasing the number of replicates.
Although our proposed method outperforms other existing ones, it has to be kept in mind that in the case of having probes targeting regions that are highly variable among the population (i.e. in an extreme case where half individuals are normal homozygous and half are homozygously deleted), the tolerance intervals calculated for those probes might collect all existing variability and become very broad, thus making it impossible to distinguish the existence of copy number alteration. Nevertheless, the other existing methods would also show inconsistent copy number calls when analysing such regions. In statistical terms, this means that we should have homogeneity of the random error variance through the probes. This is the reason why a visual inspection of the pherograms as well as manual curation of the results might be the most reliable way to proceed. However, we can easily accomodate cases where the homogeneity of random error variance assumption cannot be assumed by introducing an interaction proberandom error allowing a different random error variance for each probe. In this case, different tolerance bands would be therefore estimated for each probe.
In conclusion, we have proposed a mixedmodel which is able to determine thresholds to decide whether a region is altered. These threholds are specific for each individual, incorporating experimental variability, resulting in improved sensitivity and specificity as the examples with real data have revealed. An R language package for the three approaches discussed in this paper and data analyzed will be freely available at our web page [25].
Methods
To further illustrate the statistical methods described in this section, we are using real data provided by the United Kingdom National Genetics Reference Laboratory of Manchester [26]. In particular we are analyzing a dataset from a breast cancer study (called P002 BRCA1), in which copy number changes of different genes have previously been reported. In this example, 9 control probes and 25 analytical probes were analyzed in a total of 5 controls and 8 different test samples.
Normalization
Herein, we will focus on methods based on considering the total peak intensities and regressionbased approaches. The first and the simplest method of normalization, determines the normalized signal of each probe in a given sample by dividing peak intensity by the sum of all peak intensity of the sample. This method was initially proposed by [13] and is the recommended method by MRCHolland [27] which provides commercial probes for the copy number analysis of different genome regions for diagnostic purposes. Let us begin by giving some notations to illustrate how this method works. Let H_{ ipk }be the peak intensity for the ith individual, i = 1, ..., C, C + 1, ..., C + T (C is the number of control samples and T denotes the number of test samples), the pth probe, p = 1, ..., P, and the kth replicate, k = 1, ..., K. The existing simple method designed for normalizing data considers the average of these replicates, {\tilde{H}}_{ip}={\displaystyle {\sum}_{k=1}^{K}{H}_{ipk}/K}, and divides each measured peak intensity by the sum of all peaks, {S}_{i}={\displaystyle {\sum}_{p=1}^{P}{\tilde{H}}_{ip}} of that sample, i.e. {\tilde{Y}}_{ip}={\tilde{H}}_{ip}/{S}_{i}.
In addition to experimental variability, as Figure 1 shows, peak intensity exhibits a negative correlation with probe size. To account for this effect, regression methods may be employed. If internal control probes are provided in the MLPA experiment (dark vertical lines in Figure 2A), a regressionbased method using the control probes may also be employed to normalize the data. Different methods have been proposed. The first approach, known as "slope correction" [19], model the negative correlation between probe size and peak intensities using a linear regression model following the formula {\tilde{H}}_{{p}_{c}}=\alpha +\beta {X}_{{p}_{c}}, where X_{ pc }denotes the size of the p_{ c }th internal control probe for a given individual (Figure 2B solid blue line), or using a quadratic model {\tilde{H}}_{{p}_{c}}=\alpha +{\beta}_{1}{X}_{{p}_{c}}+{\beta}_{2}{X}_{{p}_{c}}^{2} (Figure 2B, dotted blue line).
Another possibility proposed by [20] is to fit an exponential decay model, {\tilde{H}}_{p}=\alpha {\text{e}}^{\beta {X}_{p}}. In that case, the authors propose to consider all probes p and call this method "population normalization". Notice that in that case β encodes the rate of descend. [20] argued that the trend of peak intensities varies greatly between samples regarding the control probes. This is the reason why they propose to consider all probes (both analytical and control ones) after applying a median filter approach to remove outliers. However this method has important inconveniences. The smoothed data, obtained after applying the median filter, violates some of the assumptions of nonlinear regression such as that residuals are no longer independent or that errors are not Gaussian, among others [28].
Neither slope correction nor population normalization approaches consider replicates, e.g. they use {\tilde{H}}_{ip} instead of H_{ ipk }. In addition, these methods do not model the variability among individuals either. That is, using these methods, the authors allow only β to vary among individuals fitting different models for each one. However, the pattern of decay may vary highly between subjects, which is not captured by varying β 's only. As an example, the height of the peak intensity for the first control probe (top parameter in Figure 2C) and the asymptotic value for larger probe sizes (bottom parameter in Figure 2C) are very different among test samples, as we observe in Figure 3.
To circumvent these problems, we propose the following nonlinear mixed model to normalize the data which includes replicates and variability among individuals:
where X_{ p }denotes the size for the pth probe (note that it only depends on the probe since it is the same for each individual and each replicate) φ_{i 1}is the maximum peak intensity for control probes, φ_{2i}, is the asymptotic peak intensity, and φ_{3i}is the reciprocal of the rate decay constant. The term off_{ i }provides a more stable parametrization for the data and corresponds to the average value of the peak size for the first control probe, off_{ i }= {\displaystyle {\sum}_{k=1}^{K}{H}_{i1k}}, i = 1, ..., C, C + 1, ..., C + T. The fixed effects, β represent the population average of subjectspecific parameters, φ_{ i }, and the random effects, b_{ i }, represents the deviations of the φ_{ i }'s from their population average. The random effects are assumed to be independent and the withingroup errors ε_{ ipk }are assumed to be independent for different i, p and to be independent of the random effects. The model parameters are fitted by maximizing the restricted loglikelihood (RMLE) of the data using the R library nlme [29].
After estimating this generalized exponential model, the normalization procedure is performed by dividing the peak intensities, H_{ ipk }, by the regression estimate, {\widehat{H}}_{ipk}, obtaining a normalized, sizeadjusted peak intensity for every probe, Y_{ ipk }= H_{ ipk }{\widehat{H}}_{ipk} (Figure 2D). For the control samples we consider a unique regression line corresponding to the average of all control samples. Nonetheless, for the test samples the normalization is performed individually for each subject.
Statistical significance of dosage ratios
Ratio approach
The normalized peak intensities may be directly compared between control and test samples peak by peak. The first method for determining genes that are altered (gains or loses) is based on a threshold rule. The probes whose dosage ratios are outside of the threshold lines are considered altered. The thresholds are normally 0.7 for deletions (note that 0.5 would correspond to a half reduction in the dosage but this value is considered too stringent by some researchers) and 1.33 for duplications, although other cutpoints have been used in other studies [18].
Iterative regression
The main disadvantage of defining a threshold is that the specific variability of each experiment is not considered. Some authors have proposed alternative methods using linear regression models. [19] stated that, after a squareroot transformation of the data, the plot of normalized peak intensities for test samples versus control samples should be around the diagonal if no dosage imbalances are present; otherwise, probes above or below would correspond to duplications or deletions, respectively. Their proposal was to fit a linear regression without intercept using only control probes and use the standard error of the regression to determine confidence limits for outlier detections. As in the case of normalization procedures, however, a small number of control probes is a limiting factor for adopting this approach since, as the author stated, the standard error of the regression underestimates the normal variability. Thus, [19] proposed an iterative regression algorithm for this problem. Starting with the control probes, the linear regression is fitted. Then, a prediction interval is computed for every probe. If the normalized peak intensity of a noncontrol probe falls into this interval, the probe is considered to belong to the nonaltered population and it is added to the control probes in the next step. After a few iterations, a final model is reached including only those probes that are considered as preserved. In practice, those probes that are not included in this final model are considered to be either deleted or duplicated. The authors call this method RegressionEnhanced MLPA analysis (REXMLPA). Figure 4 illustrates the confidence limits of the final regression model. [20] proposed a similar approach but starting with all probes and retaining and rejecting points at each iteration with a given level of confidence.
The REXMLPA approach assumes that the X variable (e.g. squaredroot of mean normalized peak intensities in controls) in the linear regression model is measured without error. This assumption is not tenable since this variable corresponds to the normalized peak intensity measured in multiple control samples. Violations of this assumption are a serious problem when trying to accurately predict Y (squaredroot of normalized peak intensities in test sample) from X.
The probespecific mixture model
The REXMLPA approach has the limitations discussed above, specifically, the regression models are fitted without considering that the independent variable (normalized peak intensity in controls) is also subject to error. In addition, confidence intervals for predictions are built assuming the normal theory. This assumption may not hold when only control probes are considered since we typically have no more than 10 probes. Replicates for each individual are also not considered.
Based on these limitations, we propose the following model to compare controls with a given test sample:
where μ is the mean normalized peak intensity across individuals and probes, α_{ t }is a fixed effect representing a different mean for all controls (t = 0) and a given test sample (t = 1), β_{ p }is a random effect for the p th probe, γ_{ tp }is is the probetest interaction random effect, and ε_{ tpk }is a random error variable. The index p denotes the probe, and K is the number of replicates. We assume that β_{ p }, γ_{ tp }, and ε_{ tpk }are independent, homocedastic, and normally distributed random variables with mean zero. The variances are denoted by {\sigma}_{\beta}^{2} for the, β_{ p }, or the "betweenprobe" variability, {\sigma}_{\gamma}^{2} for γ_{ tp }, or "probetest" variability, and {\sigma}_{\epsilon}^{2} for the ε_{ tpk }, or "withinprobe" variability. Considering the assumptions of the proposed model, we need to verify that the withingroups errors are independent and identically normally distributed, with mean zero and variance {\sigma}_{\epsilon}^{2}, and that they are also independent of the random effect. It can be easily checked using different plots. This assumption makes sense for biologists since probes that behave abnormally in replicate studies or when hybridised onto DNAs that have a different quality are typically removed or redesigned to ensure assay robutness and reliability. The model parameters are fitted by maximizing the restricted loglikelihood (RMLE) using the R library lme (Pinheiro and Bates, 2000).
The criterion
For the sake of simplicity, we take control samples as the reference, i.e. α_{0} = 0 and α_{1} = α, and γ_{0p}= 0. Therefore, α may be interpreted as an average deviation between test and control samples across all probes, and γ_{1p}as the deviation controltest for the pth probe.
The criterion to determine those probes that are altered for each test sample is based on the observed differences among controls and its differences with controls samples. Conditioned to pth probe, the difference between two control measurements is distributed as
Conversely, the difference between a test, i', and a control sample, conditioned to a given probe, is distributed as
where {\sigma}_{\epsilon}^{2} corresponds to the variance of the error defined in equation (2). Then, we consider that a probe is not altered if the mean of the difference between test and control samples is included within a probability interval defined by differences between control samples distribution. Hence, the criterion consists in checking whether the difference between test and control samples is greater than that expected between two control samples. That is, whether the difference is included in the interval
where Z_{1κ/2}is the 1  κ/2 percentile from the standard normal distribution. Hence, 1  κ is the proportion of difference controlcontrol that the difference testcontrol must exceed to declare that a probe is altered. Nonetheless, when using sample data to estimate such intervals, the uncertainty associated with the estimation estimating them has to be considered. Therefore, we propose to estimate the interval (5) through a tolerance interval over the controlcontrol differences distribution. To build such intervals, it is assumed that \frac{\nu {\widehat{\sigma}}_{e}^{2}}{{\sigma}_{e}^{2}}~{\chi}_{\nu}^{2}, where ν are the residual degrees of freedom, and {\widehat{\sigma}}_{e}^{2} is the REML estimator of {\sigma}_{e}^{2}. Therefore, the tolerance interval that contains (1  κ)% proportion of the controlcontrol differences estimated with a confidence of (1  α)% is
Once the limits of the criterion are established, the mean difference testcontrol, δ_{ p }for each probe, p, is estimated using the estimate of α + γ_{ i'p }, where γ_{p 1}is estimated through the best linear unbiased predictor (BLUP). Then a probe for a given test sample is considered as altered when δ_{ p }does not fall into (6). Figure 5 shows those probes that are imbalanced for each individual considering the threshold obtained using tolerance limits obtained using equation (6). In this example those limits are 0.78 for deletions and 1.29 for duplications.
Abbreviations
 CNV:

Copy Number Variant
 MLPA:

Multiplex Ligationdependent Probe Amplification
 DNA:

Deoxyribonucleic acid
 aCGH:

arraybased Comparative Genomic Hybridization
 QMPSF:

Quantitative Multiplex PCR of Short Fluorescent
 PCR:

Polymerase Chain Reaction
 MAPH:

Multiplex Amplifiable Probe Hybridization
 BRCA1:

Breast Cancer 1
 RMLE:

Restricted maximum loglikelihood
 REXMLPA:

RegressionEnhanced MLPA
 BLUP:

Best linear unbiased predictor
 PEMM:

probespecific mixture model
References
Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of largescale variation in the human genome. Nat Genet 2004, 36(9):949–51. 10.1038/ng1416
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Largescale copy number polymorphism in the human genome. Science 2004, 305(5683):525–8. 10.1126/science.1098918
Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, Olson MV, Eichler EE: Finescale structural variation of the human genome. Nat Genet 2005, 37(7):727–32. 10.1038/ng1562
Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, Oseroff VV, Albertson DG, Pinkel D, Eichler EE: Segmental duplications and copynumber variation in the human genome. Am J Hum Genet 2005, 77: 78–88. 10.1086/431652
Locke DP, Sharp AJ, McCarroll SA, McGrath SD, Newman TL, Cheng Z, Schwartz S, Albertson DG, Pinkel D, Altshuler DM, Eichler EE: Linkage disequilibrium and heritability of copynumber polymorphisms within duplicated regions of the human genome. Am J Hum Genet 2006, 79(2):275–90. 10.1086/505653
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF, Estivill X, TylerSmith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444(7118):444–54. 10.1038/nature05329
Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE, Lam WL: A comprehensive analysis of common copynumber variations in the human genome. Am J Hum Genet 2007, 80: 91–104. 10.1086/510560
Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet 2006, 7(2):85–97. 10.1038/nrg1767
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, TylerSmith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848–53. 10.1126/science.1136678
Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ, Freedman BI, Quinones MP, Bamshad MJ, Murthy KK, Rovin BH, Bradley W, Clark RA, Anderson SA, O'Connell RJ, Agan BK, Ahuja SS, Bologna R, Sen L, Dolan MJ, Ahuja SK: The influence of CCL3L1 genecontaining segmental duplications on HIV1/AIDS susceptibility. Science 2005, 307(5714):1434–40. 10.1126/science.1101160
RoveletLecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A, Dumanchin C, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D: APP locus duplication causes autosomal dominant earlyonset Alzheimer disease with cerebral amyloid angiopathy. Nat Genet 2006, 38: 24–6. 10.1038/ng1718
Le Marechal C, Masson E, Chen JM, Morel F, Ruszniewski P, Levy P, Ferec C: Hereditary pancreatitis caused by triplication of the trypsinogen locus. Nat Genet 2006, 38(12):1372–4. 10.1038/ng1904
Schouten JP, McElgunn CJ, Waaijer R, Zwijnenburg D, Diepvens F, G P: Relative quantification of 40 nucleic acid sequences by multiplex ligationdependent probe amplification. Nucleic Acids Res 2002, 30(12):e57. 10.1093/nar/gnf056
Charbonnier F, Raux G, Wang Q, Drouot N, Cordier F, Limacher JM, Saurin JC, Puisieux A, Olschwang S, Frebourg T: Detection of exon deletions and duplications of the mismatch repair genes in hereditary nonpolyposis colorectal cancer families using multiplex polymerase chain reaction of short fluorescent fragments. Cancer Res 2000, 60(11):2760–3.
Casilli F, Di Rocco ZC, Gad S, Tournier I, StoppaLyonnet D, Frebourg T, Tosi M: Rapid detection of novel BRCA1 rearrangements in highrisk breastovarian cancer families using multiplex PCR of short fluorescent fragments. Hum Mutat 2002, 20(3):218–26. 10.1002/humu.10108
Armour JA, Sismani C, Patsalis PC, Cross G: Measurement of locus copy number by hybridisation with amplifiable probes. Nucleic Acids Res 2000, 28(2):605–9. 10.1093/nar/28.2.605
Lai KK, Lo IF, Tong TM, Cheng LY, Lam ST: Detecting exon deletions and duplications of the DMD gene using Multiplex Ligationdependent Probe Amplification (MLPA). Clin Biochem 2006, 39(4):367–72. 10.1016/j.clinbiochem.2005.11.019
Palomares M, Delicado A, Lapunzina P, Arjona D, Aminoso C, Arcas J, Martinez Bermejo A, Fernandez L, Lopez Pajares I: MLPA vs multiprobe FISH: comparison of two methods for the screening of subtelomeric rearrangements in 50 patients with idiopathic mental retardation. Clin Genet 2006, 69(3):228–33. 10.1111/j.13990004.2006.00567.x
Mavrogiannis LA, Cockburn DJ: RegressionEnhanced Analysis of Multiplex LigationDependent Probe Amplification (REXMLPA). Tech rep., Yorkshire Regional DNA Laboratory 2004. [http://leedsdna.info/downloads.htm]
Kellander M, Riley M, Liu C: GeneMarker Software for Multiplex Ligationdependent Probe Amplification (MLPA). Tech rep., SoftGenetics LLC 2006. [http://www.softgenetics.com/GeneMarkerMLPA.html]
Gerdes T, Kirchhoff M, Bryndorf T: Automatic analysis of multiple ligationdependent probe amplification products (exemplified by a comercial kit for prenatal aneuploidy detection). Electrophoresis 2005, 26(22):4327–7332. 10.1002/elps.200500390
Huang C, Chang Y, Chen C, Kuo Y, Hwu W, Gerdes T, Ko T: Copy number analysis of survival motor neuron genes by multiplex ligationdependent probe amplification. Genet Med 2007, 9(4):241–248.
Pastrello C, Baglioni S, Tibiletti MG, Papi L, Fornasarig M, Morabito A, Agostini M, Genuardi M, Viel A: Stability of BAT26 in tumours of hereditary nonpolyposis colorectal cancer patients with MSH2 intragenic deletion. Eur J Hum Genet 2006, 14: 63–8.
Kirchhoff M, Gerdes T, Brunebjerg S, Bryndorf T: Investigation of patients with mental retardation and dysmorphic features using comparative genomic hybridization and subtelomeric multiplex ligation dependent probe amplification. Am J Med Genet A 2005, 139(3):231–233.
CREAL's webpage[http://www.creal.cat/jrgonzalez/software.htm]
NGRLManchester[http://www.ngrl.org.uk/Manchester/Technologypubs.htm]
MRCHolland[http://www.mlpa.com/pages/support_mlpa_analysis_normalisationpag.html]
Motulsky H, Christopoulos A: Fitting Models to Biological Data Using Linear and Nonlinear Regression: a practical guide to curve fitting. Oxford: Oxford Univesity Press; 2004.
Pinheiro JC, Bates DM: MixedEffects Models in S and Splus. New York: Springer Verlag; 2000.
Acknowledgements
We would like to thank the two reviewers for their constructive comments, which have led to improvements in the manuscript. The laboratory of X.E is supported by the Departament dEducació i Universitats and the Departament de Salut of the Catalan Autonomous Government ("Generalitat de Catalunya"); the Ministry of Health, and the Ministry of Education and Science of the Spanish Government; and the European Union Sixth Framework Programme. CIBERESP is supported by the Instituto de Salud Carlos III of the Ministry of Health of the Spanish Government.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
JRG and JLC developed the new statistical methods and performed the simulation studies. JRG wrote the R functions and the main text of the manuscript. LA and SV performed the MPLA experiment, interpreted the results and tested the programs functionality. LA, LJ and YY proposed abundant suggestions for improving the implementation of the models and participated in the design and discussion of this study. YY and XE reviewed the paper and revised its framework. All authors have read, and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
González, J.R., Carrasco, J.L., Armengol, L. et al. Probespecific mixedmodel approach to detect copy number differences using multiplex ligationdependent probe amplification (MLPA). BMC Bioinformatics 9, 261 (2008). https://doi.org/10.1186/147121059261
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121059261