How should we measure proportionality on relative gene expression data?

Correlation is ubiquitously used in gene expression analysis although its validity as an objective criterion is often questionable. If no normalization reflecting the original mRNA counts in the cells is available, correlation between genes becomes spurious. Yet the need for normalization can be bypassed using a relative analysis approach called log-ratio analysis. This approach can be used to identify proportional gene pairs, i.e. a subset of pairs whose correlation can be inferred correctly from unnormalized data due to their vanishing log-ratio variance. To interpret the size of non-zero log-ratio variances, a proposal for a scaling with respect to the variance of one member of the gene pair was recently made by Lovell et al. Here we derive analytically how spurious proportionality is introduced when using a scaling. We base our analysis on a symmetric proportionality coefficient (briefly mentioned in Lovell et al.) that has a number of advantages over their statistic. We show in detail how the choice of reference needed for the scaling determines which gene pairs are identified as proportional. We demonstrate that using an unchanged gene as a reference has huge advantages in terms of sensitivity. We also explore the link between proportionality and partial correlation and derive expressions for a partial proportionality coefficient. A brief data-analysis part puts the discussed concepts into practice.


Introduction
The frequently compositional nature of biological data and its methodological implications (a.k.a. analysis of ''closed'' data) have not been widely acknowledged yet Lovell et al. (2011). One prominent example is RNA sequencing data, where the final readouts (total number of sequenced reads per library) need to be multiplied by factors depending on the library to recover the original total amounts of mRNA. This exact information about absolute quantities of mRNA can be unavailable or hard to precisely estimate -depending on the chosen protocols. As a result, assumptions need to be made with respect to parameter stability across experiments that will necessarily influence the final interpretation. For instance, most protocols assume equality of total mRNA amounts (in the cells) across samples, even though this (rather strong) assumption can well be violated Lovén et al. (2012) and lead to inaccurate conclusions. If total mRNA amounts of origin are not the same between samples, the dominance of certain transcripts in one condition can lead to other transcripts yielding lower read percentages even though their totals remain unchanged between conditions. This problem has been addressed by ''effective library size'' normalization Robinson and Oshlack (2012), a method that needs the weaker (and most parsimonious) assumption that most genes between samples remain unchanged and thus retain comparable expression levels. Also here we can think of many situations where this is not true, and there is simply no ''magic powder that can be sprinkled on closed data to make them open' . In data constrained to constant sums (e.g. RNA-seq data constrained to a fixed number of reads) individual readouts are not fully independent, and their comparison can easily be confounded, especially when drawing correlations. In the worst-case scenario this will result in wrong conclusions.
Statistics that were developed for unconstrained data (the most prominent being correlation), can lead to spurious results when applied to relative data Pearson (1897). Interestingly, in geochemistry, where data often are percentages of chemical compounds in rock samples, this kind of problem was reported decades ago and eventually addressed using log-ratio analysis . Logratio analysis uses log-ratio transformations to take the data from the simplex to real space, thus avoiding many of the problems associated with constrained data. Genomic research is a prime target for such methods. Their nonreliance on semi-arbitrary normalization procedures makes it possible to bypass problems when comparing data produced across a wide-range of conditions. Such an approach has the potential to significantly broaden analysis prospects by allowing the systematic re-analysis and consolidation of existing data sets. It will be especially useful in situations where the large number of experimental conditions or the involvement of different laboratories The ENCODE Project Consortium (2011), Lonsdale (2013) make it virtually impossible to perform all experiments under the exact same conditions.
Let us now consider an n Â d gene-expression data matrix where d genes correspond to the columns and the n (multivariate) observations are displayed in the rows. (Usually, this is a ''fat'' matrix in the sense that d is one or two orders of magnitude greater than n.) The observations are made under different experimental conditions or for a variety of genotypes, and for each such library, the expression values sum to a constant that is unrelated with the absolute amount of mRNA in the cells of origin. Each row in our matrix is thus considered a composition (to make each row a composition in the formal sense, we can divide all entries by the respective row sum, but this is unnecessary for the analysis proposed here). A row is denoted by a vector x whose elements x i are the gene expressions for genes i ¼ 1; . . .; d in the given condition. Such data are usually counts of sequencing reads mapped to the genomic locations in question, and their precise nature is not of interest to us here. The only condition they have to fulfill is that ratios between values from the same condition are maintained from the original data (which is the case after multiplication of a normalization factor but not after applying a quantile normalization). Additionally, due to the need for applying logarithms, we may want to consider the application of pseudocounts to the expression values or, alternatively, restrict the analysis to submatrices that do not contain zeroes. To make our treatment sufficiently general, we will consider our matrix an n-sample of a d-part random composition ðX 1 ; . . .; X d Þ. In this setting, a gene corresponds to a (compositional) random variable. On the other hand, a gene j can be visualized as the column vector of its n observations ðx 1j ; . . .; x nj Þ T (see Fig. 1 a).
To motivate our interest in proportionality, let us consider the underlying absolute data for a moment. Let x i ¼ a i =s, with a i denoting the absolute mRNA amount from gene i in the given condition and s the total mRNA amount in this condition: s ¼ P d j¼1 a j . We neither know s nor the a i ; but we have x i =x j ¼ a i =a j ; so the only information maintained from the original absolute amounts is in the gene expression ratios. Gene pairs for which these ratios stay constant across conditions can thus be correctly inferred even on relative data. Taking the log of these ratios makes them symmetric with their reciprocal values. While correlations between the columns of our compositional matrix cannot be defined coherently, the covariance structure of a compositional data matrix can be summarized considering, for all pairs i, j (i\j), the (sample) variances of their log ratios log x i x j . These will be close to zero if genes i and j maintain an approximately proportional relationship x i ' mx j across observations for some real value m.
In this contribution, we will interpret log-ratio transformations as an attempt to back transform relative data into absolute data. This point of view is not usually adopted, but it makes a connection with data normalization, a well-established field in genome research. In the first section of this paper, where we profit greatly from the treatment in Lovell et al. (2015), we show how proportionality can be measured as a kind of scaled log-ratio variance on absolute data. We then show that doing a log- Fig. 1 a Relative gene expression data matrix. Each row is considered a composition, the data can be formalized as sampled from a random composition consisting of random variables X j . b Data matrix after alr transformation using the u-th column of the original matrix as a reference. ratio transformation with a very specific (''unchanged'') reference, on relative data we can detect all the proportional pairs defined on absolute data (''absolutely proportional set'') previously. This transformation, however, in fact is a normalization, and the information we need to perform it is usually not available. We show that small deviations from this unchanged reference will result in a small adjustment to the cut-off on our measure of proportionality to obtain a subset of the absolutely proportional set. The following section deals with the more common case of references that deviate greatly from the unchanged one. In this case it is hard to approximate the set of absolutely proportional pairs, and it is more difficult to avoid pairs that are called proportional although they were not proportional on the absolute data (''spurious'' proportionality). We will give an exact result about the conditions under which prediction on relative data will coincide with the one on absolute data. In the following section, a slight generalization of proportionality leads to the concept of ''partial'' proportionality, a definition adopted from partial correlations. Finally, in the last section of the paper, we will apply the discussed concepts to a brief re-analysis of the data set used in Lovell et al. (2015). Here, in good agreement with our analytical results, the approach taken by Lovell et al. leads to a much lower overlap of prediction between absolute and relative data compared with the application of an approximately unchanged reference.

Methods and results
Scaling log-ratio variance on absolute data In this section we show how we can identify a set of proportional gene pairs on (usually unavailable) absolute data. We will later investigate to what extent we can identify this set on the (available) relative data. Let us start with defining the ''absolute'' random variables A i ¼ logðsX i Þ; where s was defined in the introduction as the original absolute mRNA amount in a given condition. As each draw of the random variables A i ði ¼ 1. . .dÞ corresponds to a condition, we now have to redefine s as a random variable s :¼ P d i¼1 e A i . Note also that we want to find proportionality between sX i and sX j , but it is more convenient to work with the logs.
Unlike correlation, but similar to covariance, the logratio variance var(A i À A j ) has no intrinsic scale that makes its size intuitive to the analyst. Aitchison's own proposal  to use 1 À expðÀ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi varðA i À A j p Þ just achieves to limit the range to values between zero and one. With the stochasticity of gene expression data in mind, however, it is much more interesting to put log-ratio variance in relation to the size of the single variances involved: In the case of high variances, we are likely to consider higher values of var(A i À A j Þ still relevant and are inclined to apply a less stringent cut-off on it. This can be seen as the idea behind the scaling used in a recent work by Lovell et al. Lovell et al. (2015), where the following statistic is proposed: Here , and r is the correlation coefficient between A i and A j . (Like Lovell et al. we drop the indices i, j from b and r. For convenience, we reproduce their derivation in the ''Appendix''.) Interestingly, b happens to be the absolute value 1 of the estimated slope when plotting A i and A j against each other. This estimate of the slope is known as standardized major axis estimate Taskinen and Warton (2011). / thus establishes a direct connection with line fitting for the scatter plot A i vs. A j , where both b and r have to be one for full proportionality (see Fig. 2a). Note that the centred n-sample vectors in our data matrix have a squared length corresponding to their variance. If logarithms are taken, log-ratio variances correspond to the squared lengths of difference vectors, which allows for intuitive representations such as the ones in Fig. 2.
A better alternative to / is also mentioned in Lovell et al. but is not used there. It is defined by This is a special case of a coefficient that was proposed in the context of determining reproducibility of measurements Lin (1989). The slightly more general ''concordance correlation coefficient'' also takes into account the shift due to the intercept, something irrelevant to detecting proportionality, as it only affects the size of the proportionality factor. The coefficient given in (2) can be understood as a scaling of the covariance that is similar to the one used when evaluating the correlation coefficient (with the geometric mean of the variances replaced by their arithmetic mean). The measure has the advantage of being symmetric in its arguments, of having a range from -1 to 1 (where -1 is attained for sX i reciprocal to sX j ), and also of having a simpler relationship with b and r. All in all, it seems a more generic measure that can be generalized more easily (as will be done in ''Partial proportionality''). We collect some (straight-forward) identities involving this ''proportionality coefficient'' in the following Proposition 1 where for (iii) we definedb¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi ðvarðA i ÀA j Þ=ðvarðA i þA j Þ p : (See ''Appendix'' for the proofs of all propositions.) The first statement gives a direct relationship with logratio variance and shows the similarity of q with / [more explicitly, we have / ¼ ð1 À qÞð1 þ b 2 Þ: Given that var(A i À A j ) can reach a maximum of twice the sum of both variances (in the case of reciprocal sX i , sX j ), it makes sense to use this sum for scaling it. The second of the identities above shows that we can obtain q from the correlation coefficient by multiplying by a factor involving b. This factor is the geometric mean divided by the arithmetic mean of the variances, a function that attains one in case of equality of the variances and otherwise is smaller one. Statement (iii) shows that we can even express q as a function of a single parameterb. This parameter equals b in its functional form but depends on transformed variables that are the sum and the difference of the original ones. These transformed variables can be understood geometrically as the diagonals of the parallelogram spanned by the original vectors. Their variances can be understood in terms of a decomposition of total variance into group variance and within-group variance. Interestingly, their ratio is all we need, so the scaling of proportionality can be interpreted as relating within-group variance with group variance. Statement (iv) finally reveals an interesting duality involving the proportionality coefficients of the transformed variables and the original variables.
Detecting proportional pairs on relative data using an unchanged reference The problem with a scaling or a cut-off on log-ratio variance that depends on the individual variances is that these variances cannot be defined in a useful way on compositional data. Only variances of ratios can be defined coherently with respect to subcompositions, so we have to put X i and X j in relationship with a third variable or a term involving other variables. A first step in compositional data analysis is thus to do a log-ratio transformation of the data matrix and by this removing its constant-sum constraint. For this, a number of options are available. Perhaps the most simple and intuitive is to divide each entry by one of the components (which is thereby chosen as the reference). This additive log-ratio transformation is applied to each row in our data matrix by It results in vectors with dimension reduced by one compared with the original rows, so the reference is ''sacrificed''. The j-th column of our alr-transformed data matrix will now be The column vectors c j we can consider an n-sample of (transformed) scalar random variables Y j (see Fig. 1b). While log-ratio variances varðY i À Y j Þ ¼ var(logðX i =X j ÞÞ remain unaffected by the transformation, individual variances varðY i Þ ¼ var(logðX i =X d ÞÞ are log-ratio variances themselves and thus depend on the choice of the reference. Given a set of all proportional gene pairs inferred on the absolute data (for a given cut-off on q), are we able to detect them on the compositional data applying the coefficient discussed in the previous section? For a resounding yes, we would need a very specific reference that effectively back transforms the data to absolute data (see Fig. 2 a Two vectors of observations when their goodness-of-fit parameters are r ¼ 1 (same direction) and b ¼ 1 (same length), respectively. Their log-ratio variance is the squared length of the link connecting them (here, roughly of the same size in both cases). b The relative data logX i , logX j (black vectors) transformed to (logged) absolute data A i , A j by the normalization -logs (green vectors) and log-ratio transformed using the reference X d to Y i , Y j (red vectors).
Note that the log ratio between X i and X j (dashed line) remains the same under both transformations (colour figure online) Fig. 2b). To see this, let us spell out the terms that contribute to the transformed compositional data: If sX d is constant, we recover the results from the absolute data. Genes X d fulfilling this would have a variance just reproducing the shift s needed to normalize the data, i.e. they would be unchanged across conditions in the absolute data (i.e. varðA d Þ ¼ 0Þ: (Note that A d in Fig. 2b is given by the link between the vectors corresponding to-logs and logX d , pointing towards the latter.) Such a reference could be a housekeeping gene that is known to be unchanged under the conditions considered. This is of course an idealization, and we need to know what happens when this reference gene reproduces the shift up to a small error. The following expansion links the proportionality coefficient obtained on the absolute data with the one after transformation of the relative data using such an approximately unchanged reference: where for the reference we have logX d ¼ Àlogs þ : Then u 2 ðÞ þ Oðu 3 ðÞÞ: As can be seen from the expansion, the direction of the reference with respect to the pair (as given by the correlation coefficient) will decide if the proportionality increases or decreases. Without this information, the coefficients of the expansion can still be easily bounded. They become increasingly small for qðA i ; A j Þ close to 1. Considering pairs with a cut-off qðA i ; A j Þ ! 0:98 proportional, by how much do we have to raise the cut-off on the relative data to avoid all pairs with qðA i ; A j Þ\0:98 but qðY i ; Y j Þ ! 0:98 (i.e. false positives)? We have approximately For easier interpretation, let us introduce a parameter C defined as the ratio of the average variance of the gene pair with the variance of our reference: Using C ¼ 1=ð2u 2 ðÞÞ, we find that Increasing the cut-off to 0.99, we would still avoid pairs that have a qðA i ; A j Þ of almost 0.98 and whose members have an average variance at least C ¼ 5 times higher than the reference (see Fig. 3a). This is usually sufficient, as pairs with variances close to do not achieve high proportionality coefficients (see also Fig. 7a in the last section) and can be considered to belong to the set of unchanged genes. Note however that this is an expansion in u, not in , so strictly speaking it applies only if C is sufficiently big (we will give an exact result for all C in the next section). Qualitatively, we can see already that taking a sufficiently high cut-off should lead to a set of pairs that were also proportional on the absolute data. A similar argument applies for false negatives (where the required adjustment should be small, as pairs that are above the cut-off on relative data in practice have higher C). Lovell et al. Lovell et al. (2015) propose to use the mean over the log X i as a reference. This is known as the centred log-ratio transformation: with gðxÞ the geometric mean over the genes for the given condition. The problem with this transformation is that it is sub-compositionally incoherent , so results will change to some extent when using subsets of genes for the analysis. In some cases, gðxÞ can approximate an unchanged reference. This applies whenever the majority of genes remains unchanged across conditions, so the unchanged genes will dominate the behaviour of the reference. Note that this is also the condition needed for a normalization by effective library size Robinson and Oshlack (2012).
Measuring proportionality on relative data using a changing reference What if the reference itself is changing on the absolute data? Depending on (size and direction of) the variance of the reference, only a part of the pairs that are proportional on the absolute data can now be identified, and a certain number of pairs that were not proportional on the absolute data will be declared proportional on the relative data. More precisely, we have the following We then have If qðY i ; Y j Þ increases with respect to qðA i ; A j Þ, we maintain pairs that were absolutely proportional, but we also introduce false positives.
We see that qðY i ; Y j Þ increases for negative correlations with the reference, as well as for references having a variance that exceeds the group variance varððA i þ A j Þ=2Þ of the pair by a factor 4.
Corollary 2 Let the set of true positives S TP ; false positives S FP ; true negatives S TN and false negatives S FN for a given cut-off K ð1 [ K [ 0Þ on the proportionality coefficient be defined by Then for a given pair of genes i, j we have To determine the fate of a gene pair analytically, we thus have to solve the quadratic equation for FðA d Þ ÀK: For easier interpretation, we can again replace u by the parameter C ¼ varðA i Þ þ varðA j Þ À Á =2varðA d Þ as we did in (7). In Fig. 3b it is shown, for a gene pair with qðA i ; A j Þ ¼ 0:98; how F changes with C in case of positive and negative correlation with the reference. Two boundariesK are shown for cutoffs K ¼ 0:975 and K ¼ 0:985; and the resulting regimes of FP, TN, TP and FN are indicated. An important subset of false positive pairs are the ones from the set of unchanged genes: Corollary 3 Let be an error term characterizing the stochasticity of unchanged genes, i.e. for such genes i we have varðA i Þ varðÞ: Let K ð0\K\1Þ be the cut-off on q above which genes are proportional on the absolute data. Then pairs of unchanged genes with qðA i ; A j Þ\K will be proportional on the relative data, i.e. qðY i ; Y j Þ ! K for a reference fulfilling varðA d Þ ! 8varðÞ 1 À K : Finally, we can again ask if there is a sufficiently high cut-off on qðY i ; Y j Þ to avoid all false positives. Choosing as an example again qðA i ; A j Þ ¼ 0:98, we get an upper bound on FðA d Þ by In the worst case, the gene pair has a low variance almost of the order of the unchanged set but reaches a qðA i ; A j Þ close to 0.98. Let us assume our reference has a variance that is five times higher than the average variance of such a pair. With C ¼ 1=5 (i.e. F 7) we get So even if the reference has a relatively low variance, the cut-off will get quite close to 1 when trying to avoid pairs that are not proportional on the absolute data. The situation can be much worse for false negatives: The function qðY i ; Y j Þ À qðA i ; A j Þ is bounded below by Àð1 þ qðA i ; A j ÞÞ at F ¼ Àð1 þ qðA i ; A j ÞÞ, which in turn is attained for f. Lemma 1 in the ''Appendix''). In our example, C is pretty close to this minimum, and a lower bound on qðY i ; Y j Þ reaches 0.2, a cut-off that would give a sensitivity of 100 % but would be useless in terms of specificity.
Although raising the cut-off will generally lead to sets of pairs that are more likely to be proportional on the absolute data, we usually do not have the information to know what is a good cut-off (and it might be so high that no pairs remain). We saw that a sufficiently high absolute variance of our reference introduces proportional pairs that are spurious (in the sense that on the absolute data they are not proportional). However, usually we do not have information about the size of the variance of the reference (on the absolute data). We ended up in a situation similar to the one described in the classical work of Pearson Pearson (1897), where it is shown how spurious correlation is introduced between two variables due to the common division by a third variable 2 . Given that log-ratio analysis sets out to solve exactly the problem of spurious correlation, the fact that scaling the log-ratio variance re-introduces a similar problem (spurious proportionality) appears rather unsatisfactory.

Partial proportionality
In the previous sections, we were using the fact that the size of the log-ratios (their variance) is identical on absolute and relative data, and we investigated the effect of a scaling. Also the direction of the log-ratios is identical on both types of data. Correlation coefficients between logratios are thus identical between absolute and relative data, and log-ratio scatter plots are one of the available tools when analyzing compositional data sets Greenacre (2010), van den Boogart and Tolosana-Delgado (2013). In this section, we will show that such correlations can be interpreted in terms of a slight generalization of proportionality.
The simple functional relationship with the correlation coefficient given in Proposition 1 (ii) suggests a straightforward extension of our definition along the lines of a related measure, namely partial correlation. Partial correlations have been used extensively in the construction of gene networks because, in theory, they allow for identification of direct pairwise interactions that are not mediated by other genes, and techniques have been developed for inverting the (regularized) correlation matrix Schäfer and Strimmer (2005) to obtain them. Let us restrict the problem to partialling on a single gene k here. Geometrically speaking, for our transformed data matrix, a partial correlation between genes i and j wrt. gene k is obtained from projecting the vectors c i and c j onto the plane perpendicular to the vector c k . In these projections, the part of the correlation between c i and c j that was due to the correlation with c k is removed. In more general terms of multiple regression, projecting along Y k yields the linear least squares predictorsŶ i andŶ j wrt. Y k ; and partial correlations are obtained from correlating the projections onto the orthogonal plane, i.e. correlating the residuals Y i ÀŶ i and Y j ÀŶ j : Note that if we talk about projections of scalar random variables, we have the n-sample vectors in mind. This is also the way in which Fig. 4 should be understood.
What is the relationship of partial correlation with proportionality? Replacing the correlation coefficient by the partial correlation coefficient in Proposition 1 (ii), and adjusting b accordingly, we obtain a natural definition for a partial proportionality coefficient. More precisely, we have the following equivalents to the first three identities of Proposition 1: Proposition 4 Let Y i ; Y j ; andY k be scalar random variables, and letŶ i andŶ j be the linear least-squares predictors of Y i and Y j wrt. Y k :Then 2 Note that there is debate about whether such correlations should be called spurious because they can be interpreted correctly avoiding an over-simplified interpretation based on the correlation coefficient and considering the correct regression model, see the discussion in Firebaugh and Gibbs (1985) . 4 a Direction of the vector corresponding to Y k when it lies in the same plane as Y i and Y j and the partial proportionality coefficient is one (green) and minus one (red), repectively. If Y k is parallel to the difference (log ratio) between Y i and Y j ; the linear least squares predictorsŶ i andŶ j point in opposite directions (green labeled vectors), and the residuals Y i ÀŶ i ; Y j ÀŶ j coincide in the green, unlabeled vector. If Y k is parallel to the sum of Y i and Y j , the linear least squares predictorsŶ i andŶ j point in the same direction (red labeled vectors), and the residuals Y i ÀŶ i , Y j ÀŶ j have the same length and point in opposite directions (unlabeled red vectors). b Log ratios between X i and X j and between X k and X d are parallel whenever Y i and Y j are partially proportional wrt. Y k (colour figure online) (The proof follows immediately from some well-known identities involving variances of the least squares predictor, see ''Appendix''.) Clearly, if the direction we are partialling on is perpendicular to the plane of the gene pair, the partial coefficient coincides with the proportionality coefficient. The interesting cases occur for directions within the plane of the gene pair. From (i) it follows immediately that we have partial proportionality between i and j if Y k falls parallel to their log ratio (see the green vectors in Fig. 4a). From (ii) it follows that the partial coefficient will vanish if Y k is parallel with either Y i or Y j . From (iii) it follows that we have partial reciprocality between i and j if Y k falls parallel to the sum of Y i and Y j (red vectors in Fig. 4a). However, only the first of the described cases has a simple relationship with the absolute data: Partial proportionality between i and j wrt. k when their common reference is d implies that the log ratio between A i and A j is parallel to the one between A k and A d (see Fig. 4b).

Data analysis
In this section we will put into practice our theoretical considerations of the previous sections. We will use the data set provided in Marguerat et al. (2012) and re-analyzed by Lovell et al. Lovell et al. (2015), where we profit from the excellent documentation including the R-code made available in the latter. The data are from fission yeast cells entering quiescence after time point zero. The data at time point zero are counts obtained by RNA-seq that are supposed to be roughly proportional to the original absolute mRNA amounts. The data from the 15 subsequent time points are abundances relative to time point zero and were obtained by microarray. For biological and technical details we refer the reader to the original publication. The data set is not typical in the sense that we have both the relative data and an approximation of the absolute data at our disposal (Fig. 5). It is thus well suited to validate our theoretical considerations, especially regarding the overlap of predicted proportionality on absolute and relative data obtained by alternative analysis approaches. Note that the approach we propose differs from the one in Lovell et al. in two respects: in the use of the proportionality coefficient instead of the / statistic, and in the application of an alternative log-ratio transformation. While the coefficient has some clear advantages over / in that it is symmetric, has a limited range, can also detect reciprocality and allows for the definition of a partial coefficient, we will see that Fig. 5 Absolute and relative gene expression data box plots. Each box summarizes the distribution of logged gene expressions in a given condition (time point) and, in the case of the relative data, corresponds to one row of the data matrix. Left panel: Overall absolute expression is going down with time, with (approximately) unchanged and geometric mean references behaving accordingly on the absolute data (the green horizontal line denotes the mean log expression of the unchanged gene). Right panel On the relative data, the unchanged reference appears to be going up, while the geometric mean reference appears to remain unchanged. To recover the absolute data, each box has to be shifted by an amount approximated by the distance of the green dots from the value indicated by the green horizontal line in the left panel (colour figure online) the main difference in outcome of our approach comes from the proposed log-ratio transformation.
As log-ratio transformations need non-zero entries throughout the data matrix, we follow Lovell et al. in using a subset of 3031 genes fulfilling this condition in all 16 time points. We start with studying the goodness of fit with the expected behaviour for proportional (and reciprocal) genes on the absolute data. Scatter plots for a variety of values of qðA i ; A j Þ (all for the same A i ) are shown in Fig. 6. As proportionality between genes is a property of the original data (where no logarithm is taken), we are not showing logarithmic scatter plots here, although the ''goodness of fit'' measures r and b are used for fitting logarithms. To nevertheless give an impression of the quality of the fit, red lines with a slope corresponding to the variance ratio between the genes are drawn through the origin. This way both effects, the shift with respect to a zero intercept and how well the expected slope is reproduced can be studied independently. (For negative coefficients we divide the slope by the values of the gene on the horizontal axis instead of multiplying them.) We conclude that for absolute values of the coefficient increasingly close to one, the data seem to reproduce better and better the desired behaviour outlined by the red curves.
We next study the effect of the scaling of log ratios which the proportionality coefficient should achieve. In Fig. 7a, we show a scatter of proportionality coefficient vs. log ratio for gene pairs all containing one of three genes with different variances. It can be seen clearly that the gene with highest variance achieves higher values of the proportionality coefficient for the same log ratios. In fact, pairs involving the gene with the lowest variance do not make the cut-off of 0.98, although they reach log ratios that are closer to zero than those of pairs involving the high-variance gene. In practice, high cut-offs should make sure that pairs of unchanged genes achieving small log-ratios by chance due to their random fluctuations will not be called proportional. As desired, the gene with intermediate variance shows an intermediate slope. It attains, however, the smallest log ratios and highest coefficients due to its great number of partners that have a similar variance.
We then set out to study the relative data, which is truly compositional in the sense that it is normalized to one following Lovell et al. (2015). To do an alr transformation of this relative data matrix, we looked for a good candidate for an unchanged reference. We inspected the genes with lowest coefficient of variation on the absolute data by eye and chose the one with lowest bias towards a particular condition (green dots in Fig. 5). Interestingly, although the resulting gene Rev7 (SPBC12D12.09) can possibly be considered housekeeping (it is a subunit of DNA polymerase zeta), it is none of the usual candidates of unchanged genes (like Act1, Srb4, Rip1, Wbp1 used in the Affymetrix Yeast Genome Array 2.0), which are all changing greatly on this data set. The data is special in the sense that the set of unchanged genes is extremely small. This gives us, however, the opportunity to compare our reference with the geometric mean reference used by the clr transformation. The latter rather reflects the general downward trend observed by the majority of genes (see the red dots in Fig. 5).
The unchanged reference has a variance var(A d Þ ¼ 0:014 on the (logged) absolute data. Its variance on the relative data corresponds roughly to the shift log(s) due to the normalization, its value is var(logX d ÞÞ ¼ 0:49 (while the true var(log(sÞÞ ¼ 0:45). Clearly, the situation is rather the opposite for the geometric mean reference: its variance on the relative data var(logðgÞÞ ¼ 0:005, while on the absolute data it achieves a variance of 0.51. Once the two log-ratio transformations are obtained, we can study again scatter plots corresponding to certain values of the proportionality coefficient, this time on the transformed data and with the red lines depicting the ideal slope of b ¼ 1. In the upper two panels of Fig. 7 b , data for a gene pair under the two transformations is shown (it is the same pair as in the upper left panel of Fig. 6). While under the alr transformation the value of the coefficient remains almost unchanged, it drops to 0.97 under the clr transformation. Taking the cut-off at 0.98, this results in a false negative. Another situation is shown in the lower panels. While the gene pair shown has a proportionality coefficient of 0.82 on the absolute data, under the alr transformation it slightly drops to a value of 0.79, while under the clr transformation it goes up to 0.98, thus resulting in a false positive.
To obtain a general idea about the accuracy of our conclusions on relative data regarding proportionality on absolute data, we plotted values on transformed data versus values on absolute data for three measures: the proportionality coefficient under the two transformations as well as the statistic / under the clr transformation considered by Lovell et al. (see Fig. 8). Venn diagrams in the insets of all panels show the precise numbers of FP, TP, and FN. The breadth of the scatters under clr transformations indicates the problems we face for references of high (absolute) variance and is in good agreement with our theoretical conclusions in the previous sections. While specificity for the clr transformation is rather good for the cut-offs considered (positive predictive value (PPV) is 96 % for / and 95 % for q), the sensitivity (0.17 % for / and 0.16 % for q) is very low. The situation is quite different under the alr transformation: The scatter shows a tight correlation. A PPV of 85 % is obtained at 100 % sensitivity. As we showed analytically, the absence of false negatives could be caused by a negative correlation of our reference gene with all proportional gene pairs. The sub-optimal PPV can be raised to 100 % by increasing the cut-off slightly (to Fig. 7 a Scaling of log-ratio variance using the proportionality coefficient. Shown are coefficients of gene pairs involving three genes with different variances versus their log-ratio variances (absolute data). The three genes have variances of the logged absolute data of 2.1 (blue), 0.39 (green) and 0.076 (red), respectively. Gene pairs with higher variance can have higher log-ratio variance to attain the same value of the coefficient. b Log-ratio (alr and clr) transformed gene expression scatter plots. The red lines are fits with a slope b ¼ 1 (the intercept is estimated by the mean over Y j À Y i ). Upper panels A gene pair correctly identified as proportional (left) and incorrectly discarded (right). The same pair is shown in the first panel of Fig. 6. Lower panels A gene pair correctly discarded (left) and incorrectly identified as proportional (right) (colour figure online) 0.983) while loosing sensitivity (which goes down to 79 %). Of course, the comparison with the clr transformation is not fair in the sense that for the alr transformation additional information in form of an unchanged gene is used. The idea here is of course that it will often be possible to guess such genes correctly.
Finally, we also calculated partial proportionality coefficients with respect to each of the genes on the absolute data. While our definition of partial proportionality was for the transformed data, where we see its main application, the fact that we do not need a reference on absolute data reduces the number of possible combinations drastically. As an exhaustive study of these coefficients would go beyond the scope of this work, we here limit ourselves to a demonstration that triples of genes exist that lead to partial proportionality. This seems to be the case whenever the gene we are partialling on is behaving ''against the trend'' of the other genes. In Fig. 9 we show histograms of all partial proportionality coefficients with respect to two genes. The left panel is for a gene that is highly correlated with most genes (see inset) and its partial proportionality coefficients maintain mostly low values. The right panel shows the opposite: a gene with low correlation with most other genes reaches high values of partial proportionality coefficients.

Discussion
In this contribution we considered log-ratio analysis from the point of view of the absolute data. From this perspective, normalizations can be seen as specific log-ratio transformations allowing to back transform compositional data to absolute data. With this in mind, proportionality coefficients can be applied to traditionally normalized gene expression data and the same conclusions apply without the need for additional data transformations. This holds whenever the normalization procedure preserves ratios between gene expressions, which excludes more invasive procedures like quantile normalizations. Our view point is consistent with an analysis of accuracy of prediction of proportionality comparing results obtained on both types of data. It can also be argued that proportionality is a welldefined concept on relative data in its own right, without the comparison with outcomes on the absolute data. Since the transformed data have no constraints leading to ambiguous results when applying measures like correlation or proportionality, this is a valid point of view that avoids the notion of spurious results. The interpretation of the results is, however, less straightforward and would require the study of the resulting power-law relationships between all the variables including the reference, as can be done for log-ratio correlations Greenacre and Aitchison (2002).
The notion of proportionality was introduced by Lovell et al. to put correlational analysis of relative gene expression data on a more rigorous footing. As shown here, the proposed scaling is not without caveats due to its dependence on the chosen reference. We argue that an educated guess of an approximately unchanged gene or gene set can lead to an alr transformation with great advantages in terms of accuracy. Unfortunately, we will not generally be in a situation where we can evaluate the size and direction of the reference explicitly. Because of this, it is certainly a good idea to simply apply high cut-offs for maximum consistency with absolute outcomes, as suggested by our analytical results. The proportionality coefficient (a special case of the concordance correlation coefficient) allows for a straightforward generalization to a notion of partial proportionality, a simple analogy to partial correlations. Here we just derived some expressions for this measure and gave some heuristic interpretations of it. It could be the topic of another contribution to apply all the discussed relationships among genes to build co-expressions networks from pairs (using proportionality and reciprocality), triples (using logratio correlations involving a reference), and quadruples of genes (using log-ratio correlations involving the reference and an additional gene, or partial proportionality). This seems an interesting alternative to current approaches based on ''crude'' correlation and partial correlation.
It is to be wished that the concept of proportionality and log-ratio analysis in general will have a growing impact on established methodology in genome research. Applications range from the calculation of correlation networks Friedman and Alm (2012) over applying log-ratio transformations before doing principal component analysis Greenacre and Aitchison (2002) to alternative normalization procedures of genomic data. Other approaches to correlation on compositional data have been employed, see Filzmoser and Hron (2009) for one involving the isometric log-ratio transformation. Knowledge about the uniformity of expression of particular genes make gene expression data suitable for an alr transformation as argued in the present work. The characteristics of genomic data in general make them a new object of investigation not yet fully explored within the framework of log-ratio analysis. Among these peculiarities are the facts that there are usually many more components than observations and that the data are often counts with various sources of stochasticity. We thus see the necessity for more mathematical and bioinformatic research to fully exploit the strength of the approach.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix
Derivation of equation (1): Proof of Proposition 1 (i): (iv): We start from Thus with the definition of q we find h Proof of Proposition 2 From Proposition 1 (iii), we infer qðA i ; A j Þ ¼ ðvarðA i þ A j Þ À varðA i À A j ÞÞ=ðvarðA i þ A j Þ þvarðA i À A j ÞÞ: Combining with Eq. (5), we find The variance involving we can now decompose using the first equality in (13): where termsðÞ ¼ varð2Þ À 2covðA i þ A j ; 2Þ. Multiplying numerator and denominator by varðA i þ A j Þ þ var ðA i À A j Þ, we get Let us use the shorthand gðÞ ¼ termsðÞ varðA i þA j ÞþvarðA i ÀA j Þ . We can expand the term in the second pair of brackets: Let us now evaluate gðÞ. We have The covariance term can be expressed in terms of correlation: where we used the fact that varðA i þ A j Þ=ðvarðA i Þþ varðA j ÞÞ ¼ 1 þ qðA i ; A j Þ. Plugging this into (23) and using the notation uðÞ defined in the proposition, we find Let us now return to (22). With the new expression for gðÞ we find h Proof of Proposition 3 Using the definition ofb of Proposition 1 (iii), for the relative data we definẽ where the equality comes from Eq. (5). We now find where we used the shorthand termsðA d Þ ¼varð2A d ÞÀ where the numerator and denominator were divided by varðA i Þ þ varðA j Þ and the definition of gðA d Þ (c.f. Eq. (23)) was used. From Proposition 1 (iii) it now follows that Filling in (29), we obtain qðY i ; Y j Þ À qðA i ; A j Þ ¼ 2b 2 1 À 1þq 1þqþ2gðA d Þ ð1 þb 2 Þ 1 þb 2 1þq The prefactor involvingb 2 evaluates to 1 À q, and 1 þb 2 ¼ 2=ð1 þ qÞ. With this we finally find It remains to show that gðA d Þ ¼ FðA d Þ=2 defined in the proposition. This follows from (25). h

Lemma 1
(i) FðA d Þ as a function of u takes its minimum value F min ¼ Àð1 þ qðA i ; A j ÞÞcorr 2 ðA i þ A j ; A d Þ at u min ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 þ qðA i ; A j Þ p corrðA i þ A j ; A d Þ. (ii) Let Dq :¼ qðY i ; Y j Þ À qðA i ; A j Þ. Then FðA d Þ ! 0 () Dq ! 0. (iii) Dq as a function of F is monotonically increasing and thus takes its minimum value at F min .
Proof of Lemma 1 (i): With the definition of FðA d Þ in Proposition 3, we find As the second derivative evaluates to 2 [ 0, setting (33) to 0 implies a minimum at u min ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 þ qðA i ; A j Þ p corrðA i þ A j ; A d Þ. Putting this into F yields the value of F min .
(iii): We have implies FðA d Þ ! 0. We start with FðA d Þ ! 0: From u 2 À 2 ffiffiffiffiffiffiffiffiffiffiffi 1 þ q p corrðA i þ A j ; A d Þ u ! 0 it follows that Now we use where the last equality comes from the fact that varðA i þ A j Þ= varðA i Þ þ varðA j Þ where the first identity follows from linearity of the predictor, the second one from (46), and the third one from the definitions ofb and G. h