Background

Many studies have been published over the past 20 years that involved examining pairs of tumors at the molecular level from a set of patients to determine if, for some patients, the tumors are clonal, i.e. one of the tumors is a metastasis of the other tumor. We focus in this article on the setting where the data comprise somatic mutations from a panel of genes. Various statistical methods have been proposed in the literature. One approach has been to characterize the evidence for clonality using an index of clonal relatedness (see [1] and [2]). However in constructing the index these authors have focused solely on mutations that are shared between the two tumors, ignoring the information from mutations that occur in one tumor but not the other, evidence that argues against clonal relatedness. Other authors have used the proportion of observed mutations that are shared as the index [3, 4], while Bao et al. [5] formalized this idea by assuming that the matched mutations follow a binomial distribution. All of these approaches analyze each case independently. To our knowledge, the approach we discuss in this article, improving upon Mauguen et al. [6], is the only available method that models the data from all cases collectively to obtain parametric estimates of the proportion of cases in the population that are clonal. Also our method relies heavily on the recognition of the fact that the probabilities of occurrence of the observed mutations are crucially informative,especially for shared mutations. Motivated by a study of contralateral breast cancer that will be described in more detail in the next section, we developed a random-effects model to simultaneously analyze each case for clonal relatedness and to obtain an estimate of how frequently this occurs [6]. The corresponding function mutation.rem has been added to the R package Clonality, originally described in Ostrovnaya et al. [7]. Overall, the properties of this model were demonstrated to be quite good, in the sense that the parameter estimation has generally low bias except in small samples, ie where only a few cases from the population are available [6]. Recently, in applying the model anecdotally, we noticed that in such small datasets, examples can arise where the maximum likelihood estimator of the proportion of clonal cases is zero, even when mutational matches have been observed in some cases. This tends to occur if the absolute number of cases with matches is small, either because the overall number of cases is small, or the proportion of cases that are clonal is small, or in clonal cases the proportion of mutations that are matches is small. This is problematic because it renders the probabilities of clonal relatedness to be exactly zero for all individual cases, an estimate that seems unreasonable, especially if matches on rare mutations have been observed. We thus became interested in alternate estimation methods. In this article we compare estimates obtained by the EM algorithm versus our first approach using a one-step estimate of the conditional likelihood.

Motivating example

We use data from a study that involved 49 women with presumed contralateral breast cancer [8]. That is, in all of these women the cancers in the opposite breasts were diagnosed clinically as independent primary breast cancers. The tumors were retrieved from the pathology archives at Memorial Sloan Kettering Cancer Center and subjected to sequencing using a panel of 254 genes known or suspected to be important in breast cancer. The key data, i.e. the numbers of mutations and matches for each case, as well as the probability of occurrence for the matched mutations, are reproduced in Table 1. The probabilities of occurrence of each specific mutation are considered known, but must actually be estimated from available sources, such as the Cancer Genome Atlas [9]. Six of the 49 cases had at least 1 mutational match, i.e. exactly the same mutation in both tumors. For 3 of these cases the match was observed at the common PIK3CA H1047R locus, known to occur in approximately 14% of all breast cancers. We note that common mutations like this one can vary by disease sub-type but we elect to use probabilities associated with breast cancer overall since the study has a mix of sub-types. Since it is plausible these common mutations could occur by chance in a pair of independent breast cancers, the evidence for clonal relatedness is much less strong than for the other 3 cases with matches at rarely occurring loci, something very unlikely to happen in independent tumors.

Table 1 Study of contralateral breast cancers

When we apply our random-effects analysis to these data, described in more detail in the “Methods” section, our estimate of the proportion of cases that are clonal (denoted henceforth by π) is 0.059, close to the proportion 3/49, reflecting the fact that the model appears to consider the 3 cases with rare matches as clonal and the 3 cases with the common matches as independent. Estimation problems can occur, however, in datasets very similar to this one. For example, when we eliminate from the analysis the two cases that are most clearly clonal, cases #36 and #48, the estimate of π is 0, despite the fact that case #8 possesses a very rare match pointing strongly to clonal relatedness. Thus, a different estimation method that reduces the frequency with which boundary estimates of π occur is advisable.

Results

Simulations were conducted for sample sizes of 25, 50 and 100, with the population proportion of clonal cases (π) ranging from 0.10 to 0.75. The distribution of the clonality signal is characterized by 3 different lognormal distributions plotted in Fig. 1. These three scenarios represent, respectively, settings where a small proportion of mutations in a clonal case will be matched (scenario 1), where most of these mutations will be matched (scenario 3), and an intermediate scenario. Note that scenario 1 is particularly problematic for estimation, especially when π is small, since in this setting few of the cases will be clonal and these few clonal cases will tend to have few, if any, matches.

Fig. 1
figure 1

Log-normal distributions of the clonality signal

Table 2 presents the simulation results for the estimates of π averaged over 500 simulations for each setting, along with the standard deviations and ranges of the estimates. Biases can be obtained by comparing these averages with the true value of π in the second column of the table. These biases are generally modest, though it is noteworthy that our original one-step approach tends to have positive biases while the approach using the full likelihood and the EM algorithm generally leads to negative bias. More importantly, Table 2 also reports the numbers of times the estimates were exactly on the boundary, i.e. 0 or 1. These occurrences are much less frequent using the EM algorithm and are mostly limited to the small case sample (N =25), low π (0.10) setting. The columns on the right-hand side of Table 2 summarize the results using the EM approach for those datasets in which the one-step maximization produced an estimate of π of either 0 or 1. These estimates are similar to the true π, showing the improved performance with the EM estimation strategy.

Table 2 Simulation results

The EM approach was used to re-analyze the breast cancer dataset described in the motivating example. When the full dataset of 49 cases is analyzed both methods lead to the same estimate, \(\hat {\pi } = 0.059\). However, when cases #36 and #48 are removed, the EM approach leads to \(\hat {\pi } = 0.050\) while the one-step method leads to the boundary value of \(\hat {\pi } = 0\). This is a reassuring result and is congruent with the simulations in that for the preponderance of datasets the use of EM does not affect the results. However, when we move closer to a boundary, by for example removing 2 of the 3 cases with strong evidence of clonal relatedness (cases 36 and 48), the new approach corrects the estimation where the old approach was failing.

Discussion

Our method provides a strategy for estimating, in a sample of cases with tumor pairs, the proportion of these cases that are clonally related, in addition to diagnostic probabilities for each case. As compared to other methods described in the introduction, the proposed model utilizes the information from a sample of patients, and includes all mutations observed in only one or in both tumors, in order to infer the probabilities of clonal relatedness. We now believe that an analysis of our proposed random-effects model should involve maximization of the likelihood using the EM algorithm rather than the one-step strategy based on conditioning on the latent clonality indicators that we had previously proposed. By doing so, we greatly reduce the chances that the estimator of the proportion of cases that are clonal will lead to an unsatisfactory boundary value. Of note, the increased performance comes at no cost regarding computation time. Our available R package Clonality [10] which includes the function to estimate the random-effects model, has been updated to adopt the EM strategy (version 1.32.0 and higher).

Conclusion

The EM algorithm is a preferable approach for our clonality random-effects model. It is now the method implemented in our R package Clonality, making available an easy and fast way to estimate this model on a range of applications.

Methods

The informative data Yj for case j of n cases encompasses a set of indicators for the presence of shared or private mutations in the tumor pair at genetic loci denoted by i. [Private mutations are those that occur in one tumor but not in its pair.] The sets Aj and Bj contain the shared and private mutations respectively. We denote Gj=AjBj. Each mutation i has a known probability of occurrence pi in a tumor. Let π denote the proportion of clonal cases in the population, and ξj the clonality signal for case j. The clonality signal represents the relative period of tumor evolution in which mutations accrued in the originating clonal cell, and thus represents the anticipated proportion of mutations observed in a case that are matches. The term Cj represents the true clonal status of the tumor pair, taking the value 1 when the case is clonal and 0 when the case is independent. Note that ξj=0 if Cj=0. In clonal cases, we assume that − log(1−ξj) has a lognormal density, with mean μ and standard-deviation σ. We use g(·) to denote density functions generically. As explained in Mauguen et al. [6], we previously used a conditional likelihood constructed in the following manner. Recognizing that

$$ {\begin{aligned} P\left(Y_{j} | \xi_{j}, C_{j} = 1 \right) = \prod_{i \in G_{j}} \!\left\{ \frac{\xi_{j} + (1-\xi_{j}) p_{i}}{\xi_{j} + (1-\xi_{j}) (2-p_{i})} \right\}^{I[i \in A_{j}]} \left\{ \frac{2(1-\xi_{j}) (1-p_{i})}{\xi_{j} + (1-\xi_{j}) (2-p_{i})} \right\}^{I[i \in B_{j}]} \end{aligned}} $$
(1)

and

$$ P\left(Y_{j} | C_{j}=0 \right) = \prod_{i \in G_{j}} \left(\frac{p_{i}}{2-p_{i}} \right)^{I[i \in A_{j}]} \left\{ \frac{2 (1-p_{i})}{2-p_{i}} \right\}^{I[i \in B_{j}]} $$
(2)

we elected to use case-specific likelihood contributions

$$L_{j}\left(\pi, \xi_{j} \right) = \pi P\left(Y_{j} | \xi_{j}, C_{j}=1 \right) + (1-\pi) P\left(Y_{j} | C_{j}=0 \right) $$

leading to

$$ L\left(\pi, \mu, \sigma \right) = \prod_{j=1}^{n} \int_{0}^{1} L_{j}\left(\pi, \xi_{j} \right) g(\xi_{j}) d\xi_{j}. $$
(3)

This allowed us to perform the maximization to estimate simultaneously the parameters π,μ, and σ using a one-step Box constrained quasi-Newton algorithm. However, although in simulations the properties of this process appear to indicate low bias, we found that it is not uncommon, especially in small datasets or those where π is close to a boundary of 0 or 1, for the parameter π to have an Maximum Likelihood estimate of 0 or 1, rendering the diagnostic probabilities for all cases to be either 0 or 1. This problem is caused by the fact that the simplified conditional likelihood in (3) above does not fully recognize the influences of the case-specific mutational profiles Yj on the case-specific clonality signals ξj and the individual levels of evidence regarding clonal relatedness Cj. In short we used the parameter representing the overall probability of clonality π in (3) rather than the case-specific probabilities of clonality, P(Cj=1|ξj,π,μ,σ). To address this problem we employ a likelihood structure that permits a more specific use of these data from individual cases and have constructed a strategy involving the EM algorithm to estimate the parameters.

This approach recognizes the fact that the terms Cj and ξj are latent variables and that our goal is to maximize the likelihood that is not conditioned on these latent variables, i.e.

$$ L = \prod_{j=1}^{n} P\left(Y_{j} | \pi, \mu, \sigma \right). $$
(4)

To perform the estimation we first recognize the following:

$$\begin{array}{*{20}l} P\left(Y_{j}, \xi_{j}, C_{j} | \pi, \mu, \sigma \right) = P\left(Y_{j} | \xi_{j}, C_{j} \right) \times g\left(\xi_{j}, C_{j} | \pi, \mu, \sigma \right) \end{array} $$
(5)
$$\begin{array}{*{20}l} = g\left(\xi_{j}, C_{j} | Y_{j}, \pi, \mu, \sigma \right) \!\times\! P\left(Y_{j} | \pi, \mu, \sigma \right). \end{array} $$
(6)

Note that the likelihood contribution of case j to (4) is a component of the right-hand side of (6). The EM algorithm permits us to instead maximize (iteratively) the expectation of the logarithm of this full likelihood, averaged over the latent variables conditioned on the data. That is, the expected likelihood is given by

$$ {\begin{aligned} E = \prod_{j=1}^{n} \int_{0}^{1} \log \left\{ P\left(Y_{j}, \xi_{j}, C_{j} | \pi, \mu, \sigma \right) \right\} g\left(\xi_{j}, C_{j} | Y_{j}, \tilde{\pi}, \tilde{\mu}, \tilde{\sigma} \right) d (\xi_{j}, C_{j}) \end{aligned}} $$
(7)

where \(\tilde {\pi }\), \(\tilde {\mu }\), and \(\tilde {\sigma }\) are the current estimates of the parameters. After choosing starting values for these parameters the expectation and maximization steps proceed iteratively until convergence. To calculate E we recognize that \(P(Y_{j}, \xi _{j}, C_{j} | \tilde {\pi }, \tilde {\mu }, \tilde {\sigma })\) is obtained easily from the defined terms on the right-hand side of (5), represented by (1) and (2) and the parametric model used for the distribution of ξj. Further, \(g(\xi _{j}, C_{j} | Y_{j}, \tilde {\pi }, \tilde {\mu }, \tilde {\sigma })\) can be obtained from Bayes Theorem, i.e.

$${\begin{aligned} g\left(\xi_{j}, C_{j} | Y_{j}, \tilde{\pi}, \tilde{\mu}, \tilde{\sigma} \right) = \frac{g\left(\xi_{j}, C_{j} | \tilde{\pi}, \tilde{\mu}, \tilde{\sigma} \right) P\left(Y_{j} | \xi_{j}, C_{j} \right)} {\int_{0}^{1} g\left(\xi_{j}, C_{j} | \tilde{\pi}, \tilde{\mu}, \tilde{\sigma} \right) P\left(Y_{j} | \xi_{j}, C_{j} \right) d(\xi_{j}, C_{j})}. \end{aligned}} $$