Background

Shannon entropy (SE) and coefficient of variation (CV) are used to measure the variability or dispersion of numerical data. Such variability has potential utility in numerous application domains, perhaps most notably in the analysis of high throughput biological data. Variability has been applied, for example, to study gene expression data in the context of human disease [1]. Increased entropy in particular, in both gene expression and protein interaction data, has been observed to be a characteristic of cancer [2]. Numerous other examples typify the utility of entropy [38] and coefficient of variation [912].

Shannon entropy is famously rooted in information theory [13]. To avoid confusion, we emphasize that we use the term “differential entropy” to denote a difference between two Shannon entropy values. This is distinct from information-theoretic terminology, in which “differential entropy” often means the entropy of a continuous, rather than a discrete, random variable [14].

We are particularly interested in differential analysis. In [15], we studied differential Shannon entropy (DSE) and differential coefficient of variation (DCV), and found them highly effective in identifying genes of potential interest not found by differential expression (DE) alone. DSE and DCV are applicable to other types of biological data as well, such as that produced by RNA-Seq technologies, although the usual caveats about careful interpretation apply. The usefulness of DSE and DCV is of course not limited to biological data. They may be applied to any numerical data for which normalized measures of differential variability are relevant.

Implementation

EntropyExplorer is implemented in R [16]. All features are wrapped into a single function call, which takes as input up to eight arguments. Two of these arguments are numerical matrices, with identical labels for each row. The output is a matrix with two, three or five columns that contains in each row two SE, CV or mean values; a DSE, DCV or DE value; and/or two p values, one raw and one adjusted. Output rows can be sorted by value, raw p value or adjusted p value, and can be filtered to show only the top-ranked rows.

Permutation testing for DSE is accomplished with the help of the R function sample.int. The default number of tests to be employed is set to 1000, which the user can override. The p value for DCV is calculated by applying the Fligner-Killeen test for homogeneity of variances, implemented via the R function fligner.test, to the log-transform of the input data. The R function t.test is used to find a p value for DE. Adjusted p values are calculated using the p.adjust function in R, which provides false discovery rate and multiple testing corrections. A more thorough explanation of p value calculations is provided in the discussion section.

EntropyExplorer checks that all matrix entries are positive. This is because calculations of a DSE value/p value and a DCV p value involve taking logarithms, which are undefined on data containing zeros or negative values. Also, CV becomes less meaningful when means approach zero or are negative. Experimental data may be noisy, however, and so EntropyExplorer provides mechanisms to handle non-positive values. An optional two-value argument permits the user to add a positive bias to all elements of one or both matrices prior to performing any other calculations. The argument can also be set to make this adjustment automatically, based on the least non-positive value in each matrix.

Metrics

Let \(x_{1} ,x_{2} , \ldots ,x_{n}\) represent a list of n positive numbers, and let \(x = \sum\nolimits_{i = 1}^{n} {x_{i} }\) denote their sum. The Shannon entropy of this list is

$$SE = \frac{{ - \sum\nolimits_{i = 1}^{n} {\frac{{x_{i} }}{x}\log_{2} \frac{{x_{i} }}{x}} }}{{\log_{2} n}}.$$

The coefficient of variation is

$$CV = \frac{s}{{|\bar{x}|}}$$

where \(\bar{x} = x/n\) is the sample mean and \(s = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {(x_{i} - \bar{x})^{2} } }}{n - 1}}\) is the sample standard deviation. Given two such lists of positive numbers with Shannon entropies \(SE_{1}\) and \(SE_{2}\), coefficients of variation \(CV_{1}\) and \(CV_{2}\), and means \(\bar{x}_{1}\) and \(\bar{x}_{2}\), \(DSE = |SE_{1} - SE_{2} |\), \(DCV = \left| {CV_{1} - CV_{2} } \right|\), and \(DE = \left| {\bar{x}_{1} - \bar{x}_{2} } \right|\).

Shannon entropy falls in the range [0, 1]; DSE therefore also falls in the range [0, 1]. Lower (higher) SE corresponds to more (less) variability. CV falls in the range [0, ∞); DCV therefore also has a range of [0, ∞).

Application

EntropyExplorer is invoked as follows:

EntropyExplorer(expm1, expm2, dmetric, otype, ntop, nperm, shift, padjustmethod)

We refer the reader to the reference manual, included as Additional file 1 and available on the project webpage, for a detailed description of all arguments and options. Included with the package is a sample mRNA microarray dataset, consisting of a few rows from a dataset obtained from the Gene Expression Omnibus (GEO) [17]. This dataset, GSE10810, contains case/control data on breast cancer [18]. Figures 1 and 2 provide example uses of EntropyExplorer on the full data.

Fig. 1
figure 1

The output of EntropyExplorer on breast cancer data. The numerical matrices m1 and m2 have been read into R. The function call has specified “dse” for differential Shannon entropy, “v” for value, and 10 to return the top 10 values

Fig. 2
figure 2

Another use of EntropyExplorer on breast cancer data. The function call has specified “dcv” for differential coefficient of variation, “bv” to specify both value and p value, and to sort by value, and 12 to return the top 12 rows

Discussion

In addition to calculating DSE, DCV and DE, EntropyExplorer can calculate both raw and adjusted p values for each. ANOVA-based tests are the standard way to obtain differential expression p values. We therefore use a t-test for this purpose. Certainly more sophisticated methods exist. See, for example, [19, 20]. Thus, we emphasize that EntropyExplorer includes DE only as a simple, convenient and straightforward point of comparison with the other two metrics. For DCV p values, we observe that 11 tests of equal relative variation were compared in [21], with the conclusion that the Fligner-Killeen test [22] is usually the most appropriate. It strikes a balance between type I and type II errors, and is robust to non-normal distributions.

Obtaining reliable p values for DSE proved much more challenging. We found no known method in the literature specific to DSE p values. We therefore investigated the extent to which SE is correlated to variance. A high correlation would suggest that they may be proxies for each other, in which case the p value of an F-test or some derivation thereof might serve as suitable estimate of the DSE p value. Unfortunately, correlations between SE and variance, or between SE and a function of variance, were not high enough to justify using one as a surrogate for the other. Table 1 shows the correlation between SE and variance V, and between SE and the function \(\frac{1}{2}\ln \left( {2\pi eV} \right)\) as an attempt to linearize the relationship, using the 16 datasets from [15]. The only notably high correlation is found in the obesity dataset. The obesity data, however, contains a large number of missing values, rendering the high correlation less reliable. We conclude that standard statistical tests related to variance do not appear suitable for testing DSE.

Table 1 Correlations between SE and variance, and between SE and \( \frac{1}{2}\ln \left( {2\pi eV} \right) \), on 16 microarray gene expression datasets

We also examined the distribution of DSE on the 16 datasets, with the goal of empirically determining a suitable reference distribution for DSE. From this, we could then estimate p values analytically. We applied the Kolmogorov–Smirnov (KS) test to compare the DSE distribution of each dataset to some of the more common reference distributions, such as normal, F, t, and Chi square. When performing a KS test, p values can be overly sensitive to deviations from the reference distribution [23], so a D-statistic value below 0.1 was used to identify matching distributions. In our experiments, only the Parkinson’s dataset produced a D-statistic below 0.1 when tested against a normal or standardized t distribution (Table 2). Figure 3 shows a sample distribution of DSE, in this case using prostate cancer data.

Table 2 KS test D-statistic results comparing the DSE distribution against several common distributions
Fig. 3
figure 3

The distribution of differential Shannon entropy. The observed distribution of differential Shannon entropy in sample prostate cancer data is shown. Similar patterns were seen in all 16 data sets. None of the standard distributions tested matched the observed distributions closely enough to be considered as a reference distribution for obtaining p values

We conclude from this that none of the distributions tested are close enough approximations to the observed DSE distribution to be used as a proxy for obtaining p values. Thus, without a known distribution function or suitable surrogate, we resort to resampling in order to obtain reliable DSE p values. While computationally demanding, the following permutation test makes no assumptions about the underlying distribution of the data. Given two lists of numbers, containing n 1 and n 2 numerical elements respectively, we first calculate their DSE and then create a new list A containing all \(n_{1} + n_{2}\) numbers from the two lists. Next we randomly permute the elements of A, then recalculate DSE, treating the first \(n_{1}\) elements of A as one list and the last \(n_{2}\) elements of A as a second list. The resultant p value is simply the proportion of all recalculated DSEs that are at least as extreme as the original DSE.

In addition to raw p values, EntropyExplorer also calculates p values adjusted for multiple testing. A user can choose to adjust based on FDR, Holm or another multiple-testing adjustment.

Conclusions

We have produced EntropyExplorer, an R package for calculating differential Shannon entropy, differential coefficient of variation and differential expression. This package also calculates raw and adjusted p values for each metric. These measures have been shown to complement one another [15], making this package an effective tool for users in search of more expansive suites of differential analysis methods.

Availability and requirements

Project name: EntropyExplorer.

Project home page: http://cran.r-project.org/web/packages/EntropyExplorer/index.html.

Operating system(s): Platform independent.

Programming language: R.

Other requirements: R version 3.0 or later is recommended.

License: GNU General Public License version 3.0 (GPLv3).

Any restrictions to use by non-academics: None.

Additional availability: EntropyExplorer is integrated into the GrAPPA toolkit at http://grappa.eecs.utk.edu/.