1 Introduction

Most biomolecules are composed of hydrogen, carbon, nitrogen, oxygen, and sulfur. The natural isotopes of these elements occur with different probabilities [1, 2], and in some experiments the relative abundances of an element’s isotopes can be manipulated by a technique known as stable isotopic labeling [3, 4]. The relative abundances of isotopes determine a molecule’s isotopic distribution (ID), which can be measured experimentally using a mass spectrometer. The measured ID constrains the elemental composition when compared with the in-silico computed ID and, hence, helps in identifying the underlying molecule. The realization of this goal, however, demands accurate in-silico ID prediction [5–11].

The information content in an experimentally measured ID depends on the resolution of the mass spectrometer. An ID generated by a low-resolution instrument contains less information than one generated by an ultra-high-resolution instrument [12–15]. Based on the instrument resolution, three different types of IDs are commonly mentioned in the literature: the aggregated, the fine structure, and the hyper-fine structure IDs [16]. The aggregated ID is computed by merging isotopic variants that have the same nucleon number into one aggregated isotopic variant [17, 18], whose molecular mass (MM) and occurrence probability are computed, respectively, from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. The fine and hyper-fine structure IDs are computed similarly to the aggregated ID, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy.

To make practical use of experimentally measured IDs, it is imperative to have methods that can compute in-silico IDs when given molecular formulas. Rockwood et al. [19] mentioned several criteria for a sound ID-computing method (IDCM): an IDCM must accurately compute the masses and intensities in a very short time without consuming many computational resources. We propose a few additional criteria by which to assess an IDCM’s application value: to handle experimentally generated IDs from both low-resolution and high-resolution instruments, an IDCM should allow adjustable mass accuracy; given that customized isotopic labeling has become a common experimental technique for quantitative analyses, an IDCM should be able to handle customized (or user-specified) isotopic abundances (or occurrence probabilities) of all chemical elements considered; finally, an IDCM should be able to compute IDs for a wide mass range and be user-friendly. Although there are several available methods [16] that can compute an aggregated ID [17, 18, 20–23], fine structure ID [19], and hyper-fine structure ID [23–26], few methods satisfy all the requirements mentioned above.

In this manuscript, we present MIDAs, a software tool satisfying all the requirements above. MIDAs provides users with two accurate and efficient algorithms to compute IDs: the first algorithm belongs to the class of polynomial methods [27, 28], whereas the second belongs to the class of Fourier transform methods [29, 30]. The latter consists mainly of changes made to the existing Fourier transform method [19], and these changes are shown to improve significantly the accuracy of the computed ID. Both algorithms can compute low- and high-resolution IDs, referred to as the coarse-grained isotopic distribution (CGID) and the fine-grained isotopic distribution (FGID), respectively, for the remainder of this manuscript. Both algorithms implemented in MIDAs are also capable of computing CGIDs and FGIDs with adjustable mass accuracy.

To evaluate the performance of MIDAs, we have benchmarked it against eight methods. Four of these methods, Mercury [19], NeutronCluster (NC) [17], Emass [21], and BRAIN [18, 31], are the four best-performing methods taken from a recent publication by Claesen et al. [18]. The four other methods included are Mercury5 (a new version of Mercury2) [32], Qmass [20], Isotope Calculator (IC) [33], and a recently published Fourier-transform-based method [34], which we refer to as JFC. JFC is an improved version of Isotopica [35], which incorporates BRAIN’s generating function. The program of JFC was downloaded from http://bioinformatica.cigb.edu.cu/isotopica/centermass.html. The BRAIN code was downloaded from http://www.bioconductor.org/packages/release/bioc/html/BRAIN.html. The program IC was downloaded from http://agarlabs.com/. The rest of the programs were provided by the code authors, whom we acknowledge in the Acknowledgment section.

The performance evaluation was conducted using 25 molecules. Ten of these molecules are benchmark proteins previously used to evaluate the accuracy of computed CGIDs [17, 18]. Another 10 are hydrocarbon molecules whose CGIDs and FGIDs can be exactly computed, making them ideal for evaluating the accuracy of computed IDs. The remaining five molecules, made of a combination of sulfur, mercury, carbon and hydrogen, are used together with some of the other 20 molecules to evaluate the computational time of MIDAs’s algorithms. Results from our investigation show that MIDAs [both the polynomial-based algorithm (MIDAsa) and the Fourier-transform-based algorithm (MIDAsb)], Emass, and JFC compute CGIDs with equivalent accuracy and are more accurate than the other methods evaluated. When computing the FGIDs, IC and MIDAsa yield FGIDs that are closest to the exact FGIDs. The results also show that MIDAsa and MIDAsb satisfy all aforementioned requirements to be considered a valuable tool, providing the community with two new options for computing accurate IDs.

2 Methods

In the subsections below we explain in detail the two algorithms implemented in MIDAs. The first subsection explains MIDAsa, a polynomial-based algorithm. The second subsection describes MIDAsb, a fast Fourier transform (FFT) based algorithm. Both algorithms can be used to compute CGIDs and FGIDs.

2.1 MIDAs Polynomial Multiplication Algorithm (MIDAsa)

It is well known that the ID of a molecule can be obtained by expanding the corresponding product of polynomials: each expanded term corresponds to an isotopic composition of the molecule’s elements. For example, the ID of a molecule having molecular formula (MF) \( x_N y_M \) is given by expanding

$$ {\left[p\left({x}_1\right){I}^{m\left({x}_1\right)}+\cdots +p\left({x}_p\right){I}^{m\left({x}_p\right)}\right]}^N{\left[p\left({y}_1\right){I}^{m\left({y}_1\right)}+\cdots +p\left({y}_q\right){I}^{m\left({y}_q\right)}\right]}^M, $$
(1)

where I is an indicator variable, \( x_i \) and \( y_i \) are the isotopes of elements x and y, respectively, \( p(x_i) \) and \( p(y_i) \) are normalized probabilities of occurrence, and \( m(x_i) \) and \( m(y_i) \) are the exact atomic masses.
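As a minimal sketch of the expansion in Equation (1), the product can be formed by repeated polynomial multiplication, storing each polynomial as a mapping from mass (the exponent of I) to probability (the coefficient). The isotope table, the function names, and the 9-decimal rounding of mass keys are assumptions made for illustration only; they are not taken from MIDAs.

```python
from collections import defaultdict

# Illustrative isotope table: (mass in Da, probability). The values are
# rounded assumptions for this sketch, not the entries of Table 1.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548, 0.0107)],
    "H": [(1.0078250, 0.999885), (2.0141018, 0.000115)],
}

def multiply(poly_a, poly_b):
    """Multiply two polynomials stored as {mass: probability} dicts."""
    out = defaultdict(float)
    for mass_a, prob_a in poly_a.items():
        for mass_b, prob_b in poly_b.items():
            # Exponents (masses) add, coefficients (probabilities) multiply.
            # Rounding the key collects terms of identical composition.
            out[round(mass_a + mass_b, 9)] += prob_a * prob_b
    return dict(out)

def expand(formula):
    """Expand Equation (1) for a formula such as {"C": 1, "H": 4}."""
    result = {0.0: 1.0}
    for element, count in formula.items():
        efp = dict(ISOTOPES[element])  # one bracket of Equation (1)
        for _ in range(count):
            result = multiply(result, efp)
    return result

methane = expand({"C": 1, "H": 4})
for mass in sorted(methane):
    print(f"{mass:.7f}  {methane[mass]:.6e}")
```

This brute-force form is practical only for small molecules, because the number of retained terms grows quickly with the number of atoms.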

There are several polynomial-based methods designed to compute an ID from the MF. Methods such as the stepwise procedure and its improvement [36, 37], symbolic expansion [4], and multinomial expansion [28, 38] have been proposed to compute the expansion of the above polynomial. Although these methods have been shown to perform well for small molecules, they fail to handle large molecules, yielding inaccurate IDs, requiring a significant amount of computer memory, and taking a considerable amount of computational time [16].

Here we present MIDAsa, a polynomial-based algorithm that is simple and easy to understand. Our algorithm computes the molecule’s CGID by directly performing polynomial multiplication. To simplify the explanation, define the polynomial between the brackets in Equation (1), containing the probabilities and atomic masses of an element’s isotopes, as the element fundamental polynomial (EFP). Let us represent the EFP of element x by \( {\mathbf{P}}_x \), and also define the following recursion operation, which multiplies together polynomials \( {\mathbf{Q}}_x \) and \( {\mathbf{P}}_x \) and assigns the resulting polynomial back to \( {\mathbf{Q}}_x \): \( {\mathbf{Q}}_x\leftarrow \left({\mathbf{Q}}_x\times {\mathbf{P}}_x\right) \). Substituting these definitions in Equation (1), with \( {\mathbf{Q}}_x \) initialized to one, gives

$$ \begin{array}{c}\hfill {\left[{\mathbf{P}}_x\right]}^N{\left[{\mathbf{P}}_y\right]}^M={\left\{{\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor}\right\}}^{10}\times {\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor }{\left\{{\left[{\mathbf{P}}_y\right]}^{\left\lfloor \frac{M}{10}\right\rfloor}\right\}}^{10}\times {\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor}\hfill \\ {}\hfill ={\left\{\left[\left({\mathbf{Q}}_x\times {\mathbf{P}}_x\right)\times \cdots \times {\mathbf{P}}_x\right]\right\}}^{10}{\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor}\hfill \\ {}\hfill \times {\left\{\left[\left({\mathbf{Q}}_y\times {\mathbf{P}}_y\right)\times \cdots \times {\mathbf{P}}_y\right]\right\}}^{10}{\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor },\hfill \end{array} $$
(2)

where ⌊z⌋ represents the integer part of z for any positive number z. Using the recursion operation mentioned earlier, all the x-element related polynomials finally merge into \( {\mathbf{Q}}_x \) and all the y-element related polynomials finally merge into \( {\mathbf{Q}}_y \), as shown in algorithms 1 and 2. By first computing \( {\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor } \) in Equation (2), one considerably reduces the computational time needed to obtain the polynomial expansion of an EFP. The logic in computing \( {\left({\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor } \) (or \( {\left({\left[{\mathbf{P}}_y\right]}^{\left\lfloor \frac{M}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor } \)) and not \( {\left[{\mathbf{P}}_x\right]}^N \) (or \( {\left[{\mathbf{P}}_y\right]}^M \)) is that the former requires a smaller number of arithmetic operations. This is due to two heuristic procedures of MIDAsa, prune and merge, which reduce the number of retained terms in the expanded polynomial \( {\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor } \). These heuristics are similar to the hyperatom concept [37] and the superatom concept [10], except that the number of atoms ⌊N/10⌋ in a superstructure is not fixed in our case. The choice of 10 in Equation (2) was somewhat arbitrary but seemed to generate an accurate ID for each molecule used in our investigation in a short amount of computational time. Evidently, one may use a number other than 10. Choosing a smaller number, however, means that we need a larger memory to hold Q. Choosing a larger number, on the other hand, results in a longer computation time. We find that the number 10 seems to provide a good balance between the two.
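The exponent split in Equation (2) can be checked numerically with a short sketch. Here a polynomial is a plain coefficient list indexed by nucleon offset, which is a simplification chosen for brevity; the function names and the carbon abundances are assumptions. No pruning or merging is applied, so the sketch only confirms that the blocked form reproduces the direct power; the savings described above arise once the inner power is pruned and merged.

```python
def poly_mul(a, b):
    """Multiply polynomials given as coefficient lists (index = nucleon offset)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out

def poly_pow(p, n):
    """Raise p to the n-th power by the recursion Q <- Q x P, with Q = 1 initially."""
    q = [1.0]
    for _ in range(n):
        q = poly_mul(q, p)
    return q

def poly_pow_blocked(p, n, block=10):
    """Compute p**n as (p**(n // block))**block * p**(n % block), as in Equation (2)."""
    inner = poly_pow(p, n // block)
    return poly_mul(poly_pow(inner, block), poly_pow(p, n % block))

carbon = [0.9893, 0.0107]  # assumed 12C / 13C abundances
direct = poly_pow(carbon, 47)
blocked = poly_pow_blocked(carbon, 47)
print(max(abs(d - b) for d, b in zip(direct, blocked)))
```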

The first heuristic employed by the MIDAsa algorithm prunes terms from the polynomial Q that have probability smaller than a pre-set probability value (η). The second heuristic procedure merges polynomial terms from Q that are within some user-specified mass accuracy (\( \in \)) of each other into a new polynomial term. The new polynomial term is assigned a new mass (\( \overline{m} \)) that is equal to the probability-weighted average of the masses of the merged terms

$$ \overline{m}=\frac{{\displaystyle {\sum}_i{p}_i{m}_i}}{{\displaystyle {\sum}_i{p}_i}} $$
(3)

where \( m_i \) and \( p_i \) stand for the mass and probability of the merged terms, respectively. This new term associated with \( \overline{m} \) is then assigned a probability equal to the sum of the probabilities of the merged terms. The pseudo-code for computing a CGID is given by algorithm 1, which is used by MIDAsa.

Algorithm 1. Computes Coarse-Grained Isotopic Distribution
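The prune and merge heuristics, together with Equation (3), can be sketched as follows. This is an illustration under stated assumptions, not the MIDAs implementation: terms are (mass, probability) pairs, and a group is taken to be a run of sorted terms lying within the mass accuracy of the first term in the group, a grouping rule the text does not specify.

```python
def prune(terms, eta):
    """Drop terms whose probability is smaller than the threshold eta."""
    return [(m, p) for m, p in terms if p >= eta]

def collapse(group):
    """Replace a group by one term: Equation (3) for the mass, summed probability."""
    total = sum(p for _, p in group)
    mean_mass = sum(m * p for m, p in group) / total
    return (mean_mass, total)

def merge(terms, eps):
    """Merge terms lying within eps of the first term of their group (assumed rule)."""
    merged, group = [], []
    for m, p in sorted(terms):
        if group and m - group[0][0] > eps:
            merged.append(collapse(group))
            group = []
        group.append((m, p))
    if group:
        merged.append(collapse(group))
    return merged

terms = [(100.000, 0.5), (100.004, 0.3), (101.003, 0.2), (102.000, 1e-20)]
result = merge(prune(terms, eta=1e-15), eps=0.01)
print(result)
```

In this example the two terms near 100 Da collapse into a single term, and the term with probability 1e-20 is pruned.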

To compute the FGID for an MF, for every element x, MIDAsa first computes the expected number of occurrences \( \mu \left[{x}_i\right] \) for each isotope \( x_i \) of x. MIDAsa then computes \( {\sigma}^2\left[{x}_i\right] \), the variance of the number of occurrences. As an example, for the molecular formula \( x_N y_M \), the expectation and variance of the number of atoms of a given isotope of element x are given by

$$ \mu \left[{x}_i\right]= Np\left({x}_i\right) $$
(4)

and

$$ {\sigma}^2\left[{x}_i\right]= Np\left({x}_i\right)\left(1-p\left({x}_i\right)\right). $$
(5)

Using the computed expectation and variance values, we denote the range \( \left[\mathcal{B}\left({x}_i\right),\mathcal{U}\left({x}_i\right)\right] \) as allowable for \( \mathcal{N}\left({x}_i\right) \), the number of atoms of isotope \( x_i \). The upper bound \( \mathcal{U}\left({x}_i\right) \) and the lower bound \( \mathcal{B}\left({x}_i\right) \) are given by

$$ \begin{array}{l}\mathcal{U}\left({x}_i\right)=\mu \left[{x}_i\right]+10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)}\hfill \\ {}\mathcal{B}\left({x}_i\right)=\begin{array}{ll}\mu \left[{x}_i\right]-10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)},\hfill & \mu \left[{x}_i\right]-10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)}>0\hfill \\ {}0,\hfill & \mu \left[{x}_i\right]-10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)}\le 0\hfill \end{array}\hfill \end{array} $$
(6)
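Equations (4) to (6) translate into a few lines of code. The sketch below returns the raw real-valued bounds; how they are converted to integer summation limits (rounding, capping at N) is not stated in the text and is therefore left out. The function name and the example abundance are assumptions.

```python
import math

def isotope_bounds(n_atoms, prob):
    """Return (lower, upper) bounds on the atom count of one isotope."""
    mu = n_atoms * prob                    # Equation (4)
    var = n_atoms * prob * (1.0 - prob)    # Equation (5)
    span = 10.0 * math.sqrt(1.0 + var)     # half-width used in Equation (6)
    lower = max(0.0, mu - span)            # clipped at zero
    upper = mu + span
    return lower, upper

# Example: 1000 carbon atoms with an assumed 13C abundance of 0.0107.
lower, upper = isotope_bounds(1000, 0.0107)
print(lower, upper)
```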

For isotope \( x_i \), we choose \( 10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)} \) to be the span of the sum, as this quantity is guaranteed to be simultaneously larger than \( 10\sigma \left[{x}_i\right] \) and 10 daltons. For each element x, the \( \mathcal{U}\left({x}_i\right)\mathrm{s} \) and \( \mathcal{B}\left({x}_i\right)\mathrm{s} \) are used to construct a polynomial, \( {\tilde{\mathbf{P}}}_x \), by means of the multinomial expansion formula

$$ \begin{array}{c}\hfill {\tilde{\mathbf{P}}}_x=\underset{k_1+{k}_2+{k}_3+\cdots +{k}_p=N}{\underbrace{{\displaystyle \sum_{k_1=\mathcal{B}\left({x}_1\right)}^{\mathcal{U}\left({x}_1\right)}{\displaystyle \sum_{k_2=\mathcal{B}\left({x}_2\right)}^{\mathcal{U}\left({x}_2\right)}{\displaystyle \sum_{k_3=\mathcal{B}\left({x}_3\right)}^{\mathcal{U}\left({x}_3\right)}\cdots {\displaystyle \sum_{k_p=\mathcal{B}\left({x}_p\right)}^{\mathcal{U}\left({x}_p\right)}}}}}}}\frac{N!}{k_1!{k}_2!{k}_3!\cdots {k}_p!}{\left[p\left({x}_1\right){I}^{m\left({x}_1\right)}\right]}^{k_1}\hfill \\ {}\hfill \times {\left[p\left({x}_2\right){I}^{m\left({x}_2\right)}\right]}^{k_2}\times {\left[p\left({x}_3\right){I}^{m\left({x}_3\right)}\right]}^{k_3}\times \cdots \times {\left[p\left({x}_p\right){I}^{m\left({x}_p\right)}\right]}^{k_p}.\hfill \end{array} $$
(7)

By summing only the contributions bounded by \( \mathcal{B} \) and \( \mathcal{U} \), we direct the calculations to the relevant part of the ID. This has a counterpart in FT-based methods, namely the heterodyning described in [24].

For numerical accuracy and efficiency we employ the following simple identity

$$ \begin{array}{c}\hfill \frac{N!}{k_1!\cdots {k}_p!}p{\left({x}_1\right)}^{k_1}\cdots p{\left({x}_p\right)}^{k_p}= \exp \Big( \ln \left(N!\right)- \ln \left({k}_1!\right)-\cdots - \ln \left({k}_p!\right)\hfill \\ {}\hfill +{k}_1 \ln \left(p\left({x}_1\right)\right)+\cdots +{k}_p \ln \left(p\left({x}_p\right)\right)\Big),\hfill \end{array} $$
(8)

and \( \ln \left(n!\right)={\sum}_{k=1}^n \ln k \). This representation reduces the computational time of Equation (7) because, by tabulation, one evaluates each logarithmic term in Equation (8) only once. Once all \( {\tilde{\mathbf{P}}}_x\mathrm{s} \) have been computed, they are used together with a user-specified mass accuracy \( \in \) to compute an FGID using algorithm 2.
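The identity in Equation (8) and the tabulation of ln(n!) can be illustrated with a short sketch. The function names are hypothetical; the table is built once and then reused for every multinomial term.

```python
import math

def log_factorial_table(n_max):
    """Tabulate ln(n!) for n = 0..n_max using ln(n!) = ln((n-1)!) + ln(n)."""
    table = [0.0] * (n_max + 1)
    for k in range(1, n_max + 1):
        table[k] = table[k - 1] + math.log(k)
    return table

def multinomial_probability(counts, probs, table):
    """Evaluate one multinomial term of Equation (7) through Equation (8)."""
    log_term = table[sum(counts)]
    for k, p in zip(counts, probs):
        log_term -= table[k]
        if k > 0:  # skip ln(p) when the isotope does not occur
            log_term += k * math.log(p)
    return math.exp(log_term)

table = log_factorial_table(100)
# 100 carbon atoms, two of them 13C (abundances are assumed values).
term = multinomial_probability((98, 2), (0.9893, 0.0107), table)
print(term)
```

Working in logarithms avoids forming N! explicitly, which would overflow double precision for large N.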

Algorithm 2. Computes Fine-Grained Isotopic Distribution

2.2 MIDAs Fast Fourier Transform Algorithm (MIDAsb)

The MIDAsb algorithm is similar to an early FFT algorithm by Rockwood et al. [19], which was implemented in a computer program called Mercury. These two algorithms differ, however, in a few aspects. First, by using the exact isotopic masses in the discrete FFT (DFFT) [39, 40], Mercury produces IDs with leakage (nonzero probabilities assigned to masses where exactly zero probability is expected) and employs an apodization function to minimize it [41]. In contrast, by assigning each isotope mass to a point on a fixed grid, MIDAsb avoids the leakage problem. Using discrete masses to avoid leakage is not new: Rockwood and Van Orden [32] have written a computer program, whose latest version is called Mercury5, to compute IDs based on the nucleon numbers (that is, roughly on a one-dalton mass grid). The improvement we made was to allow users to specify a mass accuracy other than 1 Da. Second, Mercury uses a fixed number of sample points with the DFFT, whereas in MIDAsb the number of sample points depends on the mass accuracy, which is a parameter adjustable by the user.

Every FFT based method relies on the convolution theorem, which states that a convolution can be performed as multiplication in the Fourier domain.

As we shall discuss in the Appendix, two key conditions must hold for the convolution theorem to be used in the discrete case while computing IDs. The first is that the masses of each isotope must lie on grid points. Using a mass that is not on the grid causes the “leakage” phenomenon [41]. If the masses considered all reside on grid points, the leakage problem no longer exists. The second condition is that the mass domain must be large enough so that the “folded-back” phenomenon (also known as “aliasing”, “fold over”, or “wrap around” in the signal processing community) near the tail of the distribution is negligible (see Appendix).

Before delving into the detailed construction of MIDAsb, let us first describe how one may compute the theoretical molecular mass variance \( {\sigma}_{MM}^2 \). Using our example molecule \( x_N y_M \), one notes that the molecular mass variance of this molecule can be rigorously written as \( {\sigma}_{MM}^2=N{\sigma}^2\left[{m}_x\right]+M{\sigma}^2\left[{m}_y\right] \), where \( {\sigma}^2\left[{m}_x\right] \) is the mass variance associated with a single atom of element x. Explicitly, one may calculate \( {\sigma}^2\left[{m}_x\right] \) as follows

$$ {\sigma}^2\left[{m}_x\right]=\left[{\displaystyle \sum_ip\left({x}_i\right){m}^2\left({x}_i\right)}\right]-{\left[{\displaystyle \sum_ip\left({x}_i\right)m\left({x}_i\right)}\right]}^2, $$

where the index i runs over all isotopes of element x and \( p(x_i) \) again represents the occurrence probability of isotope \( x_i \).
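Because the atoms of a molecule are treated as independent draws, the per-atom variances add, so the molecular mass variance is the atom-count-weighted sum of the elemental variances. A minimal sketch follows; the function names and the two-isotope example element are assumptions for illustration.

```python
def element_mass_variance(isotopes):
    """Per-atom mass variance: sum(p * m^2) - (sum(p * m))^2."""
    mean = sum(p * m for m, p in isotopes)
    mean_sq = sum(p * m * m for m, p in isotopes)
    return mean_sq - mean * mean

def molecular_mass_variance(formula, isotope_table):
    """Sum the per-atom variances, weighted by the number of atoms of each element."""
    return sum(count * element_mass_variance(isotope_table[element])
               for element, count in formula.items())

# Hypothetical element X with isotopes at 12 and 13 Da (probabilities 0.99 / 0.01).
isotope_table = {"X": [(12.0, 0.99), (13.0, 0.01)]}
var_atom = element_mass_variance(isotope_table["X"])
var_mol = molecular_mass_variance({"X": 10}, isotope_table)
print(var_atom, var_mol)
```

For a two-isotope element the per-atom variance reduces to p(1 − p)Δm², which gives 0.0099 Da² in this example.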

A key constraint of DFFT-based ID methods is that the total number of sample points, denoted by S, must be an integral power of two [42]. For a given molecule and specified mass accuracy \( \in \), the total number of sample points S used in MIDAsb’s DFFT is given by

$$ S={2}^{\alpha },\kern1em \mathrm{where}\kern0.5em \alpha =\left\lceil \ln \left(\frac{15\sqrt{1+{\sigma}_{MM}^2}}{\in}\right)/ \ln 2\right\rceil, $$
(9)

and \( {\sigma}_{MM}^2 \) is the theoretical variance in MM due to the elements’ isotopes [32]. The quantity ⌈z⌉ represents the smallest integer that is not smaller than z for any positive number z. Again, the quantity \( 15\sqrt{1+{\sigma}_{MM}^2}> \max \left(15,15{\sigma}_{MM}\right) \) is chosen so that the S sample points cover more than 7.5 standard deviations on each side of the mean molecular mass, which prevents folded-back mass regions from having significant probabilities.
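Equation (9) amounts to rounding the required number of grid points up to the next power of two. A short sketch, with a hypothetical function name and illustrative inputs:

```python
import math

def sample_count(sigma_mm_sq, eps):
    """Number of DFFT sample points S = 2**alpha from Equation (9)."""
    width = 15.0 * math.sqrt(1.0 + sigma_mm_sq)  # mass window in Da
    alpha = math.ceil(math.log2(width / eps))    # ln(.)/ln 2 is the base-2 logarithm
    return 2 ** alpha

# A variance of 3 Da^2 gives a 30 Da window; at 0.01 Da accuracy this
# needs 3000 grid points, which rounds up to 4096.
print(sample_count(3.0, 0.01))
```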

In order to avoid the problem of leakage, instead of using exact masses of isotopes and then applying filtering windows, we pin all isotopic masses to grid points. For each isotope mass \( m(x_i) \), we first find a corresponding grid index \( n(x_i) \) by the following formula

$$ n\left({x}_i\right)=\left\lfloor \frac{m\left({x}_i\right)}{\mathit{\in}}+0.5\right\rfloor . $$
(10)

Using this discrete approach, the probability function of the mass of element x becomes

$$ \mathrm{Prob}\ \left({m}_x=n\mathit{\in}\right)={\displaystyle \sum_ip\left({x}_i\right){\delta}_{n,n\left({x}_i\right)}} $$

where \( n\left({x}_i\right)\in \) is the grid approximation to the exact mass \( m(x_i) \), and the Kronecker delta \( {\delta}_{n,n\left({x}_i\right)} \) takes the value one if its two indices coincide and zero otherwise.
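Equation (10) and the resulting discrete probability function can be sketched as below. The function names and the carbon isotope values are assumptions; isotopes that round to the same grid index have their probabilities summed, as the Kronecker delta in the formula implies.

```python
import math
from collections import defaultdict

def grid_index(mass, eps):
    """Equation (10): index of the grid point nearest to mass / eps."""
    return math.floor(mass / eps + 0.5)

def element_grid_distribution(isotopes, eps):
    """Probability of each grid index for a single atom of an element."""
    dist = defaultdict(float)
    for mass, prob in isotopes:
        dist[grid_index(mass, eps)] += prob
    return dict(dist)

carbon = [(12.0, 0.9893), (13.0033548, 0.0107)]  # assumed values
fine = element_grid_distribution(carbon, eps=0.01)
coarse = element_grid_distribution(carbon, eps=1.0)
print(fine, coarse)
```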

Consider the mass distribution of our example molecule \( x_N y_M \). By the convolution theorem, the Fourier transform of the mass distribution, denoted by Ψ(v), can be written as

$$ \begin{array}{c}\hfill \varPsi (v)={\left[p\left({x}_1\right){e}^{2\pi in\left({x}_1\right)v/S}+\cdots +p\left({x}_p\right){e}^{2\pi in\left({x}_p\right)v/S}\right]}^N\hfill \\ {}\hfill \times {\left[p\left({y}_1\right){e}^{2\pi in\left({y}_1\right)v/S}+\cdots +p\left({y}_q\right){e}^{2\pi in\left({y}_q\right)v/S}\right]}^M,\hfill \end{array} $$

where v takes S discrete values: 0, 1, …, S − 1. The sample function Ψ(v) is heterodyned to have zero average mass by multiplying it by \( {e}^{-2\pi i{n}_ov/S} \), where \( n_o \) is equal to \( \overline{n} \) (the molecule’s probability-weighted average grid index computed using \( n(x_i) \) and \( p(x_i) \)) rounded to the nearest integer.
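The construction of the heterodyned sample function and its inversion can be illustrated on a toy case. The sketch below is not the MIDAsb implementation: it replaces the FFT by a naive O(S²) inverse discrete Fourier transform so that it stays dependency-free, and all names and isotope values are assumptions. After inversion, entry k of the result holds the probability at grid index n_o + k (modulo S).

```python
import cmath

def fourier_id(formula, grid_isotopes, n_samples):
    """Return (n_o, phi), where phi[k] is the probability at grid index n_o + k (mod S)."""
    # Probability-weighted average grid index, rounded (heterodyne offset n_o).
    mean_index = sum(count * sum(p * n for n, p in grid_isotopes[element])
                     for element, count in formula.items())
    n_o = round(mean_index)

    psi = []
    for v in range(n_samples):
        phase = 2j * cmath.pi * v / n_samples
        value = cmath.exp(-phase * n_o)  # heterodyning factor
        for element, count in formula.items():
            efp = sum(p * cmath.exp(phase * n) for n, p in grid_isotopes[element])
            value *= efp ** count        # one bracket of Psi(v), raised to N
        psi.append(value)

    # Naive inverse DFT (stand-in for the inverse FFT).
    phi = []
    for k in range(n_samples):
        total = sum(psi[v] * cmath.exp(-2j * cmath.pi * k * v / n_samples)
                    for v in range(n_samples))
        phi.append(total.real / n_samples)
    return n_o, phi

# Ten carbon atoms on a 1 Da grid (assumed abundances), S = 16 sample points.
grid_isotopes = {"C": [(12, 0.9893), (13, 0.0107)]}
n_o, phi = fourier_id({"C": 10}, grid_isotopes, 16)
print(n_o, phi[:3])
```

For this example n_o is 120, so phi[0] is the probability of the all-12C variant and phi[1] that of the variant with one 13C atom.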

Once the function Ψ(v) has been calculated, three other operations are performed in order to generate the final FGID and CGID. The first operation is the inverse discrete fast Fourier transform (IDFFT), which transforms the sample function Ψ(v) to Φ(n) on the mass grid. Second, we apply a denoising procedure to remove small amplitudes due to rounding errors that occur during the IDFFT. The rounding errors are expected to create small positive and negative amplitudes of equal amounts in the mass domain. MIDAsb thus removes all amplitudes whose absolute magnitudes are smaller than that of the most negative amplitude. In fact, to be more conservative, MIDAsb uses an amplitude cutoff value that is twice the absolute value of the most negative amplitude. This means that only terms having amplitude greater than the cutoff value are reported in a computed FGID and CGID, with the amplitude values renormalized to sum to one. Figure 1 shows an example of the overlap between the positive amplitude histogram and the negative amplitude histogram. Right below the cutoff absolute amplitude, we see that the two histograms resemble each other, reflecting the fact that rounding errors have equal probability of being positive and negative. Third, following Rockwood and Van Orden [32], MIDAsb applies a linear transformation to rescale the masses associated with the IDs, to ensure good agreement between the theoretically calculated and the numerically computed mean molecular mass as well as the standard deviation of the molecular mass. The procedure described above is summarized in the pseudo-code given by algorithm 3.
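The denoising step described above can be sketched in a few lines. The function name, the `factor` parameter, and the example amplitudes are assumptions made for illustration; the cutoff is twice the absolute value of the most negative amplitude, and the surviving amplitudes are renormalized to sum to one.

```python
def denoise(amplitudes, factor=2.0):
    """Keep amplitudes above factor * |most negative amplitude| and renormalize."""
    most_negative = min(min(amplitudes), 0.0)  # 0.0 if no negative amplitude occurs
    cutoff = factor * abs(most_negative)
    kept = {i: a for i, a in enumerate(amplitudes) if a > cutoff}
    total = sum(kept.values())
    return {i: a / total for i, a in kept.items()}

# Three genuine peaks followed by rounding noise of either sign.
amplitudes = [0.6, 0.3, 0.1, 3e-17, -2e-17, 1e-17]
cleaned = denoise(amplitudes)
print(cleaned)
```

Here the cutoff is 4e-17, so the three noise terms are discarded and only the first three grid points are reported.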

Figure 1 Example of rounding errors. The curves are histograms of the logarithm of the absolute value of the positive (green solid line) and negative (blue long-dashed line) amplitudes obtained after applying the discrete Fourier transform to compute an isotopic distribution for molecule C2023H3208N524O619S20 using a mass accuracy of 0.01 Da (≈0.22 ppm). In the absence of leakage, negative amplitudes can only come from rounding errors, which are expected to produce equal amounts of small positive and small negative amplitudes. The two histograms overlap for terms whose amplitude magnitude is less than 4.2e–10, marked by a dashed black line; this value is the rounding error cutoff used by MIDAsb

Algorithm 3. Computes Fine-Grained and Coarse-Grained Isotopic Distribution

3 Results and Discussion

All methods used in our investigation were evaluated using their default parameter settings, except for a few parameter changes made to ensure that the atomic masses and abundances of elements’ isotopes were the same for all methods (see Table 1). To conduct the evaluation, we used the 22 biomolecules and three inorganic compounds, all of which are listed in Table 2. Ten of these biomolecules, (1)–(10), are benchmark proteins previously used to evaluate the computed CGID [17, 18]; 10 biomolecules, (11)–(20), are hydrocarbon molecules whose FGID can be exactly computed and were employed to evaluate the FGID; the remaining five molecules, (21)–(25), made of a combination of sulfur, mercury, carbon, and hydrogen, are used together with some of the other 20 biomolecules to evaluate the computational time of MIDAs’s algorithms.

Table 1 Atomic Masses and Abundances used for Benchmark Test in this Paper
Table 2 Molecules for which the Isotopic Distribution was Computed by Various Methods

3.1 Overview of Methods Benchmarked

MIDAs’s performance was evaluated against eight published methods: Mercury [19], Mercury5 [32], JFC [34], Isotope Calculator (IC) [33], Qmass [20], BRAIN [18, 31, 43], NeutronCluster (NC) [17], and Emass [21]. The first three published methods are Fourier-transform-based; IC utilizes a divide-and-recursively-combine algorithm; Qmass has its core based on FFT; BRAIN and NeutronCluster are polynomial-based; and Emass is based on a direct convolution approach related to the stepwise procedure and its improvement [36, 37]. BRAIN, Qmass, NC, Emass, JFC, and Mercury5 all use nucleon numbers to classify a molecule’s isotopic variants, and all but the last assign to a given nucleon number the average isotopic mass of all variants of that nucleon number.

IC is suitable for computing FGIDs, not CGIDs. Qmass, BRAIN, NeutronCluster, and Emass are suitable for computing CGIDs, not FGIDs. The remaining three Fourier-transform-based methods are also suitable for computing CGIDs, although Mercury is the only one that has FGID computing capacity. Benchmarking the FGIDs computed by MIDAs against those of Mercury, however, would require post-processing of Mercury data files, such as removing noise from leakage and rounding errors, as well as compiling output from different specified molecular masses. All of these steps may be done differently, which would make the benchmark test less meaningful. For these reasons, we evaluated MIDAs’s FGIDs only against those of IC, not those of Mercury.

3.2 Benchmarking of Computed CGIDs

Following previous publications [18, 19, 24], the accuracy of a method is gauged by how accurately it yields the ID mean, the ID standard deviation, and the lightest and heaviest molecular masses when computing a CGID. In our evaluation, the lightest and heaviest molecular masses are defined as a molecule’s molecular mass computed using the masses of the lightest and heaviest isotopes, respectively.

Lightest-mass comparisons for biomolecules, numbered (1)–(10) in Table 2, with elements having their naturally occurring isotopic abundances taken from Table 1, are displayed in Table 3. Unexpectedly, the lightest masses for the first six molecules, reported by Mercury, Mercury5, and Qmass, are even lighter than their exact lightest masses, which should be the lightest masses possible in these six IDs. This observation was also described by Claesen et al. [18]. For Mercury5 (and Mercury), this is caused by the rounding errors (and the leakage) when applying the DFFT. (In principle, both methods can avoid this problem by not reporting any terms in the computed ID that have masses lighter than the exact lightest mass.) For Qmass, this seems to arise from computing ID terms that are outside of the allowed mass range imposed by the biomolecule’s MF: in the Qmass output file, the reported masses lighter than the exact lightest mass are associated with elemental compositions that differ from the biomolecule’s MF used in the evaluation.

Table 3 Coarse-Grained Isotopic Distribution Results using Naturally Occurring Isotopes

The software NC reports correct lightest masses for nine out of the 10 molecules. For biomolecule number four, NC reports a mass that is 360 Da heavier. This same result has also been observed independently by others [17, 18].

For MIDAsa, BRAIN, and Emass, the differences between exact and computed lightest masses, for small and medium size biomolecules [numbered (1)–(6)], are smaller than 1.0e–08 Da. As for JFC and MIDAsb, although they do not perform as well as the polynomial-based methods above, they are not inferior to other Fourier-transform-based methods such as Mercury and Mercury5. When the biomolecules become heavier [say molecules numbered (7)–(10)], the chance of experimentally observing the exact lightest masses rapidly decreases, and the computed difference between exact and computed lightest masses becomes less important.

The evaluation of getting the correct heaviest mass is not as important under natural conditions. This is because heavy isotopes typically carry very low natural occurrence probabilities so that it is impossible to observe the exact heaviest isotopic variant of the molecule. Of course, when artificial isotopic abundances are enforced, obtaining the correct heaviest masses can become important, while obtaining the correct lightest masses can become unimportant. Since the current evaluation is using the natural isotopic abundances, we do not expect any method to provide correct heaviest masses. Indeed, because most methods are computing terms of an ID that are concentrated around a molecule’s average molecular mass, which is closer to the exact lightest mass under natural isotopic abundances, the mass range used for computing IDs usually will not include the heaviest masses. For biomolecules numbered (1)–(10), the differences between the exact heaviest masses and the heaviest masses computed by all methods considered are all of the same order of magnitude.

Displayed in the upper (lower) half of Table 4 are the relative differences between the molecular mass averages (standard deviations) derived from the computed CGIDs and their theoretical values. Molecules numbered (1)–(10) in Table 2 are used with elements assuming isotopic abundances shown in Table 1. In terms of the average masses, MIDAsa,b, JFC, and Emass have comparable errors and have slightly smaller errors than the other methods. In terms of mass standard deviations, MIDAsa,b, JFC, and Emass have slightly smaller errors than the other methods. In principle, the accuracy of BRAIN might be improved by increasing the number of aggregated isotopic variants computed for each computed CGID. However, to accomplish this would require changing its default option. As mentioned earlier, to keep the benchmarking test simple, we only use the default option for each method considered. From Table 4, one can also infer that Qmass yields small errors for small and medium size molecules, but the error increases as the molecular mass increases.

Table 4 Coarse-Grained Isotopic Distribution Results using Naturally Occurring Isotopes

We have also considered the possibility of deviations from the natural frequencies of occurrence of an element’s isotopes. Such customized modifications can be accomplished experimentally by a technique known as isotopic labeling [3], which is frequently employed in quantitative proteomics [44]. To mimic such a situation, we have computed CGIDs for various molecules assuming different carbon isotopic abundances: 99% 13C and 1% 12C as listed in Table 1. We then derive from the computed CGIDs the average molecular masses and standard deviations, and compare them to the corresponding theoretical values that can be analytically calculated.

The results of using the above-mentioned customized carbon abundances are shown in Table 5. The differences displayed in Table 5 show that MIDAsa,b, JFC, and Emass have the smallest errors. However, in terms of the ID’s standard deviations, Mercury and Mercury5 yield errors comparable to those of MIDAsa,b, JFC, and Emass. The results for Qmass are similar to the ones obtained in Table 4: in terms of average masses and standard deviations, it yields small errors for small- to medium-size biomolecules. Table 5 also shows that the current versions of BRAIN and NC are not able to compute IDs using the modified isotopic abundances for carbon. However, the developers of NC have mentioned how NC could be modified to handle stable isotope enrichment by partitioning the elements of enriched isotopes away from the equatransneutronic isotope groups [17]. This option is not currently available in NC. Also, the proposed solution reduces NC back to a polynomial method algorithm, which, if not efficiently implemented, can significantly influence the overall computation time. In BRAIN’s case, no reasonable IDs are reported, and it is difficult to speculate what might have happened.

Table 5 Coarse-Grained Isotopic Distribution Evaluation Using Abundances for Carbon’s Isotopes of 99% 13C and 1% 12C

3.3 Assessing Fidelity of Computed CGIDs and FGIDs

To evaluate the fidelity of the CGIDs and FGIDs reported, we used 10 hydrocarbon molecules [numbered (11)–(20) in Table 2] because the “exact” CGIDs and FGIDs can be calculated for these molecules. The exact CGID is defined as follows. First, one merges isotopic variants that have the same nucleon number into one aggregated isotopic variant, whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. Then, only aggregated isotopic variants having probability greater than 5e–12 are retained for accuracy evaluation. The exact FGIDs are defined similarly to the exact CGIDs, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy, here set to 0.01 Da. For typical sample loads, the probability cutoff of 5e–12 probably already surpasses the detection capability of current mass spectrometers. It is also small enough that ignoring terms below the cutoff has a negligible effect on the ID profile.
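For a hydrocarbon, every isotopic variant is characterized by its numbers of 13C and 2H atoms, so the exact distributions can be enumerated directly. The following is a minimal sketch of that construction, not the code used in this study. The isotope values, the function names, and in particular the greedy rule for grouping masses within the accuracy window are assumptions; the merged mass is taken as the probability-weighted mean of the merged variants.

```python
from collections import defaultdict
from math import comb

# Illustrative values (assumptions, not necessarily those of Table 1).
M12C, M13C, P13C = 12.0, 13.0033548378, 0.0107
M1H, M2H, P2H = 1.00782503207, 2.0141017778, 0.000115


def isotopic_variants(n_c, n_h):
    """Yield (extra neutrons, mass, probability) for each variant of CnHm."""
    for k in range(n_c + 1):  # k = number of 13C atoms
        pc = comb(n_c, k) * P13C**k * (1 - P13C) ** (n_c - k)
        for l in range(n_h + 1):  # l = number of 2H atoms
            ph = comb(n_h, l) * P2H**l * (1 - P2H) ** (n_h - l)
            mass = (n_c - k) * M12C + k * M13C + (n_h - l) * M1H + l * M2H
            yield k + l, mass, pc * ph


def merge(groups, cutoff):
    """Collapse each group into one (mass, probability) term above cutoff."""
    terms = []
    for members in groups:
        prob = sum(p for _, p in members)
        if prob > cutoff:
            mass = sum(m * p for m, p in members) / prob
            terms.append((mass, prob))
    return sorted(terms)


def exact_cgid(n_c, n_h, cutoff=5e-12):
    """Merge variants that share the same nucleon number."""
    by_nucleon = defaultdict(list)
    for extra, mass, prob in isotopic_variants(n_c, n_h):
        by_nucleon[extra].append((mass, prob))
    return merge(by_nucleon.values(), cutoff)


def exact_fgid(n_c, n_h, accuracy=0.01, cutoff=5e-12):
    """Merge variants whose masses lie within `accuracy` of a group's
    lightest member (a greedy grouping rule assumed for this sketch)."""
    variants = sorted((m, p) for _, m, p in isotopic_variants(n_c, n_h))
    groups = []
    for mass, prob in variants:
        if groups and mass - groups[-1][0][0] <= accuracy:
            groups[-1].append((mass, prob))
        else:
            groups.append([(mass, prob)])
    return merge(groups, cutoff)


cgid = exact_cgid(1, 4)  # methane, CH4
fgid = exact_fgid(1, 4, accuracy=0.001)
```

The direct enumeration costs on the order of n × m terms for CnHm and relies on floating-point binomial terms, so it is only meant for molecules of modest size.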

Four quantities are then used to evaluate the fidelity of computed IDs. The first quantity is the difference in the numbers of terms (Δτ) kept by a computed ID and by its corresponding exact ID, be it the exact CGID or the exact FGID. The second quantity is the difference in the probability sums (Δχ), one from the computed ID and the other from the exact FGID (or the exact CGID). The third quantity is the root-mean-square difference of masses (σ_m) between the computed and the exact CGIDs (or the exact FGIDs):

$$ {\sigma}_m^2\equiv \frac{1}{N}{\displaystyle \sum_{i=1}^N\underset{j\in \operatorname{exact}}{ \min }{\left|{m}_i-{m}_j\right|}^2.} $$
(11)

In the equation above, m_i represents a mass term in the computed ID, each m_j represents a mass term in the exact FGID (or the exact CGID), and N is the number of terms retained in the computed ID. That is, for every mass term in a computed ID, the closest mass term within the exact FGID (or the exact CGID) is found, and the squared differences are summed. The average of this sum of squares constitutes σ_m^2. The fourth quantity is the weighted correlation (ρ) between computed and exact IDs, defined as follows. Let p(m_i) and p(m_j) be the terms of a computed ID and the corresponding exact FGID (or exact CGID), respectively. We first introduce the weight (w_ij) between a computed ID term (index i) and an exact FGID (or exact CGID) term (index j) as

$$ {w}_{ij}=\left\{\begin{array}{ll}{e}^{-\zeta }, & \zeta ={\min}_j\left|{m}_i-{m}_j\right|\le 2\epsilon \\ 0, & {\min}_j\left|{m}_i-{m}_j\right|>2\epsilon ,\end{array}\right. $$
(12)

where min_j |m_i − m_j| is the minimum mass difference between a term (m_i) from the computed ID and the terms (m_j) from the exact ID. The computed weights (w_ij) are then normalized by the normalization factor W_j = ∑_i w_ij, obtained by summing over all terms i from the computed ID that are close to a common term j in the exact FGID (or the exact CGID). The weighted correlation using the above definitions is given by

$$ \rho =\frac{{\displaystyle {\sum}_{j=1}^M{\sum}_{i=1}^Np\left({m}_i\right)p\left({m}_j\right){w}_{ij}/{W}_j}}{\sqrt{{\displaystyle {\sum}_{i=1}^Np{\left({m}_i\right)}^2}}\sqrt{{\displaystyle {\sum}_{j=1}^Mp{\left({m}_j\right)}^2}}}. $$
(13)

Here M denotes the number of terms in the exact ID. For CGIDs, ε is set to 1 Da, while for FGIDs, ε is set to 0.01 Da.
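The four quantities can be sketched in a few lines. The code below is one possible reading of Equations (11)–(13), not the evaluation code used for Tables 6 and 7: it assumes that each computed term i is matched only to its nearest exact term j, so that w_ij is nonzero for that j alone, and it assumes an "exact minus computed" sign convention for Δτ and Δχ. All function names are hypothetical. IDs are lists of (mass, probability) pairs.

```python
import math


def delta_tau(computed, exact):
    """Difference in the number of terms (sign convention assumed)."""
    return len(exact) - len(computed)


def delta_chi(computed, exact):
    """Difference in the probability sums (sign convention assumed)."""
    return sum(p for _, p in exact) - sum(p for _, p in computed)


def sigma_m(computed, exact):
    """Root-mean-square distance to the nearest exact mass, Equation (11)."""
    total = sum(min((mi - mj) ** 2 for mj, _ in exact) for mi, _ in computed)
    return math.sqrt(total / len(computed))


def weighted_correlation(computed, exact, eps):
    """Weighted correlation of Equation (13) with weights of Equation (12)."""
    weights = {}                 # (i, j) -> w_ij for the nearest exact term j
    norm = [0.0] * len(exact)    # W_j = sum over i of w_ij
    for i, (mi, _) in enumerate(computed):
        j = min(range(len(exact)), key=lambda k: abs(mi - exact[k][0]))
        zeta = abs(mi - exact[j][0])
        if zeta <= 2 * eps:      # terms farther than 2*eps get zero weight
            weights[i, j] = math.exp(-zeta)
            norm[j] += weights[i, j]
    numerator = sum(
        computed[i][1] * exact[j][1] * w / norm[j]
        for (i, j), w in weights.items()
    )
    denominator = math.sqrt(sum(p * p for _, p in computed)) * math.sqrt(
        sum(p * p for _, p in exact)
    )
    return numerator / denominator


# Toy example with made-up terms, for illustration only.
exact_id = [(16.0, 0.9), (17.0, 0.1)]
shifted_id = [(16.5, 0.9), (17.5, 0.1)]
rho_same = weighted_correlation(exact_id, exact_id, eps=0.01)
rho_shifted = weighted_correlation(shifted_id, exact_id, eps=0.01)
```

In the toy example, comparing an ID with itself gives ρ equal to one up to rounding, whereas shifting every mass by 0.5 Da with ε = 0.01 Da leaves no term within 2ε of an exact term and ρ drops to zero.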

Using molecules numbered (11)–(20) in Table 2, we document the analysis results of the four quantities mentioned above in Table 6 (for CGIDs) and Table 7 (for FGIDs). For CGIDs, we include in Table 6 only four methods that largely satisfy the criteria for being a sound IDCM of application value. For FGIDs, only IC, MIDAsa and MIDAsb are included in Table 7 since they are the only methods that can do FGID-computing reasonably fast and without additional post processing.

Table 6 Coarse-Grained Isotopic Distribution (CGID) Fidelity Assessment Results. τ is the number of terms in the exact CGID having probability greater than 5e–12; Δτ is the difference between τ and the number of terms of a computed CGID; Δχ is the difference between the sum of probability terms from the exact CGID and the sum of probability terms from the computed CGID; σ_m is the root-mean-square difference of masses between exact and computed CGID, see Equation (11); U is the number of terms from the computed CGID that are not within ±2ε (ε = 1 Da) of any term in the exact CGID; E is the number of terms in the exact CGID that have at least one corresponding term in the computed CGID within ±2ε; ρ is the weighted correlation between computed and exact CGID

Table 7 Fine-Grained Isotopic Distribution (FGID) Fidelity Assessment Results. τ is the number of terms in the exact FGID having probability greater than 5e–12; Δτ is the difference between τ and the number of terms of a computed FGID; Δχ is the difference between the sum of probability terms from the exact FGID and the sum of probability terms from the computed FGID; σ_m is the root-mean-square difference of masses between exact and computed FGID, see Equation (11); U is the number of terms from the computed FGID that are not within ±2ε (ε = 0.01 Da) of any term in the exact FGID; E is the number of terms in the exact FGID that have at least one corresponding term in the computed FGID within ±2ε; ρ is the weighted correlation between computed and exact FGID

For the fidelity assessment of CGIDs, all four methods shown in Table 6 yield small Δτ and ρ values close to one. In terms of σ_m and Δχ, more differences are revealed. Emass always yields a small σ_m, reflecting good fidelity in terms of mass locations, but seems to give a larger |Δχ|, reflecting less accuracy in amplitudes. JFC and MIDAsb seem to yield less precise mass locations, evidenced by a larger σ_m, but seem to provide more accurate amplitudes, evidenced by a smaller |Δχ|. MIDAsa yields both accurate mass locations and accurate amplitudes.

The values of Δχ and σ_m in Table 7 indicate that IC, MIDAsa, and MIDAsb report FGID terms with similar mass accuracy and with probability sums that are close to the expected value. For small to medium molecules, numbered (11)–(15), IC, MIDAsa, and MIDAsb have equivalently accurate results. For molecules numbered (16)–(20), IC and MIDAsa have comparable performances, both slightly better than MIDAsb. The values of Δτ indicate that MIDAsb reports many more terms than expected in its computed FGID. Since no leakage is expected, MIDAsb gains these extra terms mainly through rounding errors associated with the DFFT numerical procedure.

The difference observed in Δτ for MIDAsa is caused by the pruning and merging procedures employed by the algorithm. All the FGID terms computed by IC and MIDAsa are within 2ε of the exact FGID terms, as shown by the number of unexplained terms (U) being zero in Table 7. Most of the terms computed by MIDAsb are also within 2ε of the exact FGID terms, with the exception of molecules (17)–(20), where U ranges from 1 to 47. The computed weighted correlation also shows that for the heavier molecules, (18)–(20), both IC and MIDAsa produce FGIDs that are more similar to the exact FGIDs than those produced by MIDAsb.

The weaker performance of MIDAsb here might be related to the fact that pinning the elemental masses to grid points may introduce appreciable mass errors when computing IDs for larger molecules. In the worst-case scenario, the mass error introduced is apparently proportional to the number of atoms contained in the molecule. Even though MIDAsb employs a mass rescaling [32] to bring the computed average masses and standard deviations close to their theoretical values, the linear mass rescaling is not sufficient to guarantee full profile resemblance between the computed ID and the exact ID. The non-negligible discrepancy between the computed FGID and the exact FGID for molecules (18)–(20), indicated by a weighted correlation ρ that is not very close to one, reflects this problem.
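The linear growth of the worst-case error can be illustrated with a simple bound. Suppose, purely as an assumption for this illustration (the actual gridding used by MIDAsb may differ), that each isotope mass is moved to the nearest point of a uniform grid of spacing δ, so that the error Δ_a incurred for atom a satisfies |Δ_a| ≤ δ/2. For an isotopic variant containing n atoms, the triangle inequality then gives

$$ \left|\Delta M\right|=\left|\sum_{a=1}^{n}{\Delta}_a\right|\le \sum_{a=1}^{n}\left|{\Delta}_a\right|\le \frac{n\delta }{2}, $$

so the bound on the mass error of a variant grows linearly with the number of atoms n, with equality only when all per-atom errors have the same sign and maximal size.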

3.4 MIDAs Web Interface

The MIDAs web interface (http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html) is user-friendly, but at the same time offers considerable flexibility. For example, the user may enter the input molecule as an elemental composition, a molecular formula, a peptide, or even a protein sequence. The program recognizes the input molecule in all of these formats and extracts the corresponding elemental composition for computing CGIDs and FGIDs. The isotopic abundances and the elements’ masses can also be customized within the web interface: the user simply clicks on the “change” button to edit the abundance table of all elements. Other fields that can be specified by the user are the charge of the input molecule and the cutoff probability. MIDAs displays the CGID and the FGID together, each computed with its own user-specified accuracy. The “algorithm” drop-down box allows the user to select either the FFT or the polynomial algorithm. The output, including the lightest mass, theoretical average mass, theoretical mass standard deviation, computed average mass, computed mass standard deviation, FGID peak list, and CGID peak list, can be exported to a flat file by clicking on the “download output” button on the result page. Contextual help is also available for every functional button.

4 Conclusion and Outlook

For the 25 molecules tested, the two algorithms introduced here, MIDAsa and MIDAsb, seem to be able to compute IDs quickly and accurately. Between the two, MIDAsa seems slightly more accurate. For CGIDs, MIDAsb appears to be faster (see Table 8), whereas for FGIDs the two are of comparable speed. Both algorithms benchmark well against existing methods and stand out because of their ability to compute CGIDs and FGIDs using a user-specified accuracy. These two algorithms were also shown to accurately compute IDs for molecules labeled with stable isotopes, which was not the case for some of the methods evaluated. In summary, in terms of both the average masses and the standard deviations derived from CGIDs, MIDAsa, MIDAsb, JFC, and Emass yield smaller errors than the other methods. When computing the FGID, MIDAsa computes an FGID that better resembles the exact FGID than does MIDAsb according to our evaluation gauges. Both algorithms described here were coded in the C++ programming language in a computer program called MIDAs that is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/MIDAs.html. To make these algorithms widely accessible, we have made them available through a user-friendly web interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.

Table 8 Computation Time in Seconds (s) and Number of Terms Reported with MIDAs’s Computed Coarse-Grained (CG) and Fine-Grained (FG) Isotopic Distributions (ID) Using 1.0 Da and 0.01 Da Mass Accuracy, Respectively