# Molecular Isotopic Distribution Analysis (MIDAs) with Adjustable Mass Accuracy

DOI: 10.1007/s13361-013-0733-7

- Cite this article as:
- Alves, G., Ogurtsov, A.Y. & Yu, Y. J. Am. Soc. Mass Spectrom. (2014) 25: 57. doi:10.1007/s13361-013-0733-7


## Abstract

In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs with that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs’s algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs’s polynomial algorithm, and MIDAs’s Fourier transform algorithm. Among the three, IC and MIDAs’s polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html

### Key words

Isotopic Distribution, Accurate mass, Mass spectrometry, Proteomics

## 1 Introduction

Most biomolecules are composed of hydrogen, carbon, nitrogen, oxygen, and sulphur. It is known that the natural isotopes of these elements occur with different probabilities [1, 2], and in some experiments the relative abundances of an element’s isotopes can be manipulated by using a technique known as stable isotopic labeling [3, 4]. The relative abundances of isotopes determine a molecule’s isotopic distribution (ID), which can be measured experimentally using a mass spectrometer. The measured ID constrains the elemental composition when compared with the in-silico computed ID and, hence, helps in identifying the underlying molecule. The realization of this goal, however, demands accurate in-silico ID prediction [5–11].

The information content in an experimentally measured ID depends on the resolution of the mass spectrometer. An ID generated by a low resolution instrument contains less information than that by an ultra-high resolution instrument [12–15]. Based on the instrument resolution, three different types of IDs are commonly mentioned in the literature: the aggregated, the fine structure, and the hyper-fine structure IDs [16]. The aggregated ID is computed by merging isotopic variants that have the same nucleon number into one aggregated isotopic variant [17, 18] whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. The fine and hyper-fine structure IDs are computed similarly to the aggregated ID, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy.

To make practical use of experimentally measured IDs, it is imperative to have methods that can compute in-silico IDs when given molecular formulas. Rockwood et al. [19] mentioned several criteria for a sound ID-computing method (IDCM): an IDCM must accurately compute in a very short time the masses and intensities without consuming much computational resource. We propose a few additional criteria by which to assess an IDCM’s application value: to handle experimentally generated IDs from both low-resolution and high-resolution instruments, an IDCM should allow adjustable mass accuracy; given that customized isotopic labeling has become a common experimental technique for quantitative analyses, an IDCM should be able to handle customized (or user-specified) isotopic abundances (or occurrence probabilities) of all chemical elements considered; finally, an IDCM should be able to compute IDs for a wide mass range and be user-friendly. Although there are several available methods [16] that can compute an aggregated ID [17, 18, 20–23], fine structure ID [19], and hyper-fine structure ID [23–26], there are not many methods that can satisfy all the requirements mentioned above.

In this manuscript, we present MIDAs, a software tool satisfying all the requirements above. MIDAs provides users with two accurate and efficient algorithms to compute IDs: the first algorithm belongs to the class of polynomial methods [27, 28], whereas the second belongs to the class of Fourier transform methods [29, 30]. The latter consists mainly of changes made to the existing Fourier transform method [19], and these changes are shown to improve significantly the accuracy of the computed ID. Both algorithms can compute low and high resolution IDs, referred to as the coarse-grained isotopic distribution (CGID) and the fine-grained isotopic distribution (FGID), respectively, for the remainder of this manuscript. Both algorithms implemented in MIDAs are also capable of computing CGIDs and FGIDs with adjustable mass accuracy.

To evaluate the performance of MIDAs, we have benchmarked it against eight methods. Four of these methods, namely Mercury [19], NeutronCluster (NC) [17], Emass [21], and BRAIN [18, 31], are the four best-performing methods taken from a recent publication by Claesen et al. [18]. The four other methods included are Mercury5 (a new version of Mercury2) [32], Qmass [20], Isotope Calculator (IC) [33], and a recently published Fourier-transform-based method [34], which we refer to as JFC. JFC is an improved version of Isotopica [35], which incorporates BRAIN’s generating function. The program of JFC was downloaded from http://bioinformatica.cigb.edu.cu/isotopica/centermass.html. The BRAIN code was downloaded from http://www.bioconductor.org/packages/release/bioc/html/BRAIN.html. The program IC was downloaded from http://agarlabs.com/. The rest of the programs were provided by the code authors, whom we acknowledge in the Acknowledgment section.

The performance evaluation was conducted using 25 molecules. Ten of these molecules are benchmark proteins previously used to evaluate the accuracy of computed CGIDs [17, 18]. Another 10 are hydrocarbon molecules whose CGIDs and FGIDs can be exactly computed, making them ideal for evaluating the accuracy of computed IDs. The remaining five molecules, made of a combination of sulfur, mercury, carbon and hydrogen, are used together with some of the other 20 molecules to evaluate the computational time of MIDAs’s algorithms. Results from our investigation show that MIDAs [both the polynomial-based algorithm (MIDAs^{a}) and the Fourier-transform-based algorithm (MIDAs^{b})], Emass, and JFC compute CGIDs with equivalent accuracy and are more accurate than the other methods evaluated. When computing the FGIDs, IC and MIDAs^{a} yield FGIDs that are closest to the exact FGIDs. The results also show that MIDAs^{a} and MIDAs^{b} satisfy all aforementioned requirements to be considered a valuable tool, providing the community with two new options for computing accurate IDs.

## 2 Methods

In the subsections below we explain in detail the two algorithms implemented in MIDAs. The first subsection explains MIDAs^{a}, a polynomial-based algorithm. The second subsection describes MIDAs^{b}, a fast Fourier transform (FFT) based algorithm. Both algorithms can be used to compute CGIDs and FGIDs.

### 2.1 MIDAs Polynomial Multiplication Algorithm (MIDAs^{a})

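In principle, the ID of a molecule with molecular formula (MF) \( x_N y_M \) is given by expanding

\[ \left[\sum_i p(x_i)\, I^{m(x_i)}\right]^{N} \left[\sum_i p(y_i)\, I^{m(y_i)}\right]^{M} \qquad (1) \]

where \( I \) is an indicator variable, \( x_i \) and \( y_i \) are the isotopes of elements \( x \) and \( y \), respectively, \( p(x_i) \) and \( p(y_i) \) are normalized probabilities of occurrence, and \( m(x_i) \) and \( m(y_i) \) are the exact atomic masses.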

There are several polynomial-based methods designed to compute an ID from the MF. Methods such as the stepwise procedure and its improvement [36, 37], symbolic expansion [4], and multinomial expansion [28, 38] have been proposed to compute the expansion of the above polynomial. Although these methods have been shown to perform well for small molecules, they fail to handle large molecules, yielding inaccurate IDs, requiring a significant amount of computer memory, and taking a considerable amount of computational time [16].
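To make the expansion of Equation (1) concrete, the following minimal sketch multiplies the element polynomials term by term for a hypothetical toy molecule with two carbon and four hydrogen atoms. It is the naive approach whose cost grows quickly with molecule size, not the MIDAs algorithm. The function names and the choice to store masses as integer multiples of 1e-10 Da (so that identical isotopic variants collapse exactly) are assumptions made for illustration; the isotope data are the Table 1 values.

```python
from collections import defaultdict

SCALE = 10**10  # masses are stored as integer multiples of 1e-10 Da

# (mass in units of 1e-10 Da, probability), using the Table 1 values
CARBON = [(120000000000, 0.9893), (130033548378, 0.0107)]
HYDROGEN = [(10078250321, 0.999885), (20141017780, 0.000115)]


def multiply(poly_a, poly_b):
    """Multiply two polynomials stored as {mass: probability} dictionaries."""
    product = defaultdict(float)
    for mass_a, prob_a in poly_a.items():
        for mass_b, prob_b in poly_b.items():
            # Exponents (masses) add; coefficients (probabilities) multiply.
            product[mass_a + mass_b] += prob_a * prob_b
    return dict(product)


def expand(formula):
    """Expand Equation (1) for a formula given as [(isotopes, atom_count), ...]."""
    result = {0: 1.0}
    for isotopes, count in formula:
        element_poly = dict(isotopes)
        for _ in range(count):
            result = multiply(result, element_poly)
    return result


# Hypothetical toy molecule with two carbon and four hydrogen atoms.
distribution = expand([(CARBON, 2), (HYDROGEN, 4)])
for mass in sorted(distribution):
    print(f"{mass / SCALE:.10f} Da  p = {distribution[mass]:.3e}")
```

Every distinct isotopic variant is kept as its own term here, which is why the number of terms, and hence memory and time, becomes prohibitive for large molecules.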

Here we present MIDAs^{a}, a polynomial-based algorithm that is simple and easy to understand. Our algorithm computes the molecule’s CGID by directly performing polynomial multiplication. To simplify the explanation, define the polynomial between the brackets in Equation (1), which contains the probabilities and atomic masses of an element’s isotopes, as the element fundamental polynomial (EFP). Let us represent the EFP of element \( x \) by \( \mathbf{P}_x \), and also define the following recursion operation, which multiplies together polynomials \( \mathbf{Q}_x \) and \( \mathbf{P}_x \) and assigns the resulting polynomial back to \( \mathbf{Q}_x \): \( \mathbf{Q}_x \leftarrow \left(\mathbf{Q}_x \times \mathbf{P}_x\right) \). Substituting these definitions in Equation (1) with \( \mathbf{Q}_x \) initialized to one gives

\[ {\left({\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor } \times {\left({\left[{\mathbf{P}}_y\right]}^{\left\lfloor \frac{M}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor } \qquad (2) \]

where \( \lfloor z \rfloor \) represents the largest integer not exceeding \( z \) for any positive number \( z \). Using the recursion operation mentioned earlier, all the \( x \)-element related polynomials finally merge into \( \mathbf{Q}_x \) and all the \( y \)-element related polynomials finally merge into \( \mathbf{Q}_y \), as shown in algorithms 1 and 2. By first computing \( {\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor } \) in Equation (2), one considerably reduces the computational time needed to obtain the polynomial expansion of an EFP. The logic in computing \( {\left({\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor } \) (or \( {\left({\left[{\mathbf{P}}_y\right]}^{\left\lfloor \frac{M}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor } \)) and not \( {\left[\mathbf{P}_x\right]}^{N} \) (or \( {\left[\mathbf{P}_y\right]}^{M} \)) is that the former requires a smaller number of arithmetic operations. This is due to two heuristic procedures of MIDAs^{a}, prune and merge, which reduce the number of retained terms in the expanded polynomial \( {\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor } \). These heuristics are similar to the hyperatom concept [37] and the superatom concept [10], except that the number of atoms \( \lfloor N/10 \rfloor \) in a superstructure is not fixed in our case. The choice of using 10 in Equation (2) was somewhat arbitrary but seemed to generate an accurate ID for each molecule used in our investigation in a short amount of computational time. Evidently, one may use a number other than 10. Choosing a smaller number, however, means that we need a larger memory to hold **Q**. Choosing a larger number, on the other hand, results in a longer computation time. We find that using the number 10 seems to provide a good balance between the two.
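As a worked instance of the exponent split used in Equation (2), take a hypothetical atom count of \( N = 254 \) (a value chosen only for illustration):

```latex
\[
\left\lfloor \tfrac{254}{10} \right\rfloor = 25, \qquad
254 - 10 \cdot 25 = 4, \qquad
{\left[\mathbf{P}_x\right]}^{254}
  = {\left({\left[\mathbf{P}_x\right]}^{25}\right)}^{10} \times {\left[\mathbf{P}_x\right]}^{4}.
\]
```

The inner block \( {\left[\mathbf{P}_x\right]}^{25} \) is expanded once and then enters the product ten times, followed by the four remaining factors of \( \mathbf{P}_x \).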

The first heuristic procedure of the MIDAs^{a} algorithm prunes terms from the polynomial **Q** that have probability smaller than a pre-set probability value (\( \eta \)). The second heuristic procedure merges polynomial terms from **Q** that are within some user-specified mass accuracy (\( \epsilon \)) of each other into a new polynomial term. The new polynomial term is assigned a new mass (\( \overline{m} \)) that is equal to the probability-weighted average of the masses of the merged terms,

\[ \overline{m} = \frac{\sum_i m_i\, p_i}{\sum_i p_i} \]

where \( m_i \) and \( p_i \) stand for the mass and probability of the merged terms, respectively. This new term associated with \( \overline{m} \) is then assigned a probability equal to the sum of the probabilities of the merged terms. The pseudo-code for computing a CGID is given by algorithm 1, which is used by MIDAs^{a}.
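The prune and merge heuristics, combined with the exponent split of Equation (2), can be sketched as follows. This is a minimal illustration and not the authors' pseudo-code: the function names, the default values of `eta` and `epsilon`, the choice to merge before pruning, and the rule of grouping terms relative to the lightest mass in each group are all assumptions.

```python
def collapse(group):
    """Replace a group by one term: weighted-mean mass, summed probability."""
    total = sum(prob for _, prob in group)
    mean_mass = sum(mass * prob for mass, prob in group) / total
    return (mean_mass, total)


def merge(terms, epsilon):
    """Merge terms whose masses lie within epsilon of the first term of a group."""
    merged, group = [], []
    for mass, prob in sorted(terms):
        if group and mass - group[0][0] > epsilon:
            merged.append(collapse(group))
            group = []
        group.append((mass, prob))
    if group:
        merged.append(collapse(group))
    return merged


def prune(terms, eta):
    """Drop terms whose probability is smaller than eta."""
    return [(mass, prob) for mass, prob in terms if prob >= eta]


def multiply(terms_a, terms_b, eta, epsilon):
    """Polynomial product followed by the two heuristics (assumed order)."""
    product = [(ma + mb, pa * pb) for ma, pa in terms_a for mb, pb in terms_b]
    return prune(merge(product, epsilon), eta)


def power(terms, exponent, eta, epsilon):
    """Repeated multiplication Q <- Q x P, starting from Q = 1."""
    result = [(0.0, 1.0)]
    for _ in range(exponent):
        result = multiply(result, terms, eta, epsilon)
    return result


def element_power(efp, n_atoms, eta=1e-12, epsilon=0.5):
    """([P]^floor(N/10))^10 x [P]^(N - 10 floor(N/10)), as in Equation (2)."""
    block = power(efp, n_atoms // 10, eta, epsilon)
    tenfold = power(block, 10, eta, epsilon)
    remainder = power(efp, n_atoms % 10, eta, epsilon)
    return multiply(tenfold, remainder, eta, epsilon)


# Element fundamental polynomial of carbon as (mass in Da, probability).
CARBON = [(12.0, 0.9893), (13.0033548378, 0.0107)]
cgid = element_power(CARBON, 60)  # hypothetical molecule of 60 carbon atoms
for mass, prob in cgid[:5]:
    print(f"{mass:.6f} Da  p = {prob:.6e}")
```

With `epsilon` set to 0.5 Da the merged terms correspond to nucleon-number groups, which is the coarse-grained case; a smaller `epsilon` keeps more structure at the cost of more retained terms.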

**Algorithm 1.** Computes Coarse-Grained Isotopic Distribution

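To compute a FGID, for each element \( x \), MIDAs^{a} first computes the expected number of occurrences \( \mu[x_i] \) for each isotope \( x_i \) of \( x \). MIDAs^{a} then computes \( \sigma^2[x_i] \), the variance of the number of occurrences. As an example, for the molecular formula \( x_N y_M \), the expectation and variance in the number of atoms for a given isotope of element \( x \) are given by

\[ \mu[x_i] = N\, p(x_i), \qquad \sigma^2[x_i] = N\, p(x_i)\left(1 - p(x_i)\right) \]

where \( p(x_i) \) is the occurrence probability of isotope \( x_i \). The upper bound \( \mathcal{U}\left({x}_i\right) \) and the lower bound \( \mathcal{B}\left({x}_i\right) \) on the number of atoms of isotope \( x_i \) delimit the range of the sum. For each isotope \( x_i \), we choose \( 10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)} \) to be the span of the sum, as this quantity is guaranteed to be simultaneously larger than \( 10\sigma[x_i] \) and 10 daltons. For each element \( x \), the \( \mathcal{U}\left({x}_i\right)\mathrm{s} \) and \( \mathcal{B}\left({x}_i\right)\mathrm{s} \) are used to construct a polynomial, \( {\tilde{\mathbf{P}}}_x \), by means of the multinomial expansion formula

\[ {\tilde{\mathbf{P}}}_x = \sum_{\substack{k_1 + \cdots + k_r = N \\ \mathcal{B}(x_i) \le k_i \le \mathcal{U}(x_i)}} \frac{N!}{k_1! \cdots k_r!} \prod_{i=1}^{r} {\left[p(x_i)\right]}^{k_i}\, I^{k_i m(x_i)} \qquad (7) \]

where \( k_i \) is the number of atoms of isotope \( x_i \) and \( r \) is the number of isotopes of element \( x \).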

By summing only the contributions bounded by \( \mathcal{B} \) and \( \mathcal{U} \), we direct the calculations to the relevant part of the ID. This has a counterpart in FT-based methods, namely the heterodyning used in [24].
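The quantities that set the width of this restricted sum are easy to evaluate numerically. The sketch below computes the expected count, the variance, and the span \( 10\sqrt{1+\sigma^2} \) for one isotope, using the binomial mean and variance of the isotope count; the function name and the example of 1000 carbon atoms are assumptions made for illustration, and the bounds themselves are not computed here.

```python
import math


def isotope_count_statistics(n_atoms, probability):
    """Mean, variance, and summation span for the count of one isotope."""
    mean = n_atoms * probability
    variance = n_atoms * probability * (1.0 - probability)
    span = 10.0 * math.sqrt(1.0 + variance)
    return mean, variance, span


# Hypothetical example: 13C in a molecule with 1000 carbon atoms.
mean, variance, span = isotope_count_statistics(1000, 0.0107)
print(f"mean = {mean:.4f}, variance = {variance:.4f}, span = {span:.2f}")
```

Because \( \sqrt{1+\sigma^2} \) exceeds both 1 and \( \sigma \), the span stays above 10 even for isotopes whose count barely fluctuates, and above ten standard deviations otherwise.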

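In practice, writing \( k_i \) for the number of atoms of isotope \( x_i \) in a term of Equation (7), the multinomial coefficient is evaluated through logarithms,

\[ \ln \frac{N!}{k_1! \cdots k_r!} = \ln\left(N!\right) - \sum_{i=1}^{r} \ln\left(k_i!\right) \qquad (8) \]

with \( \ln\left(n!\right) = \sum_{k=1}^{n} \ln k \). This representation reduces the computational time of Equation (7) since, by tabulation, one only enumerates all the logarithmic terms in Equation (8) once. Once all \( {\tilde{\mathbf{P}}}_x\mathrm{s} \) have been computed, they are used together with a user-specified \( \epsilon \) to compute a FGID using algorithm 2.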
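The tabulation idea can be sketched as below: a table of \( \ln(n!) \) is filled once by accumulating \( \ln k \), and each multinomial term is then obtained from table lookups. The function names are hypothetical, and the binomial case used as an example is only a convenient check, not data from the paper.

```python
import math


def log_factorial_table(n_max):
    """Return a list whose entry n holds ln(n!) = sum of ln k for k = 1..n."""
    table = [0.0] * (n_max + 1)
    for k in range(1, n_max + 1):
        table[k] = table[k - 1] + math.log(k)
    return table


def multinomial_term_probability(counts, probs, table):
    """Probability of one multinomial term, evaluated in log space."""
    n_atoms = sum(counts)
    log_value = table[n_atoms] - sum(table[k] for k in counts)
    # Skip isotopes with zero count so that ln(p) is never needed for them.
    log_value += sum(k * math.log(p) for k, p in zip(counts, probs) if k > 0)
    return math.exp(log_value)


table = log_factorial_table(100)
# Hypothetical example: 98 atoms of 12C and 2 atoms of 13C among 100 carbons.
value = multinomial_term_probability((98, 2), (0.9893, 0.0107), table)
print(f"{value:.6e}")
```

Working in log space also avoids overflow of the factorials for large atom counts.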

**Algorithm 2.** Computes Fine-Grained Isotopic Distribution 2

### 2.2 MIDAs Fast Fourier Transform Algorithm (MIDAs^{b})

The MIDAs^{b} algorithm is similar to an early FFT algorithm by Rockwood et al. [19], which was implemented in a computer program called Mercury. These two algorithms differ, however, in a few aspects. First, using the exact isotopic masses in discrete FFT (DFFT) [39, 40], Mercury produces IDs with leakages (assigning nonzero probabilities to masses where exactly zero probability is expected) and employs an apodization function to minimize leakage [41]. On the other hand, by assigning each isotope mass to a point on a fixed grid, MIDAs^{b} avoids the leakage problem. Using discrete masses to avoid leakage is not new: Rockwood and Van Orden [32] have written a computer program, whose latest version is called Mercury5, to compute IDs based on the nucleon numbers (or roughly using one dalton mass grid). The improvement we made was to allow the users to specify the mass accuracy other than 1 Da. Second, Mercury uses a fixed number of sample points with the DFFT, whereas in MIDAs^{b} the number of sample points used depends on the mass accuracy, which is a parameter adjustable by the user.

Every FFT based method relies on the convolution theorem, which states that a convolution can be performed as multiplication in the Fourier domain.
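A small numerical check of the convolution theorem is given below. It uses a naive discrete Fourier transform written with the standard library so that the example is self-contained; a practical implementation would call an FFT routine instead. The function names and the four-point grid are assumptions made for illustration.

```python
import cmath


def dft(values, sign=-1):
    """Naive discrete Fourier transform; sign=+1 gives the unnormalised inverse."""
    size = len(values)
    return [
        sum(values[n] * cmath.exp(sign * 2j * cmath.pi * n * v / size)
            for n in range(size))
        for v in range(size)
    ]


def circular_convolution(a, b):
    """Direct circular convolution of two equally long sequences."""
    size = len(a)
    return [sum(a[j] * b[(n - j) % size] for j in range(size))
            for n in range(size)]


# One carbon atom on a 1 Da grid, offset so that index 0 is nucleon number 12.
carbon = [0.9893, 0.0107, 0.0, 0.0]

# Two carbon atoms: convolve directly, then via a product in the Fourier domain.
direct = circular_convolution(carbon, carbon)
spectrum = [x * y for x, y in zip(dft(carbon), dft(carbon))]
via_fourier = [value.real / len(carbon) for value in dft(spectrum, sign=+1)]

for d, f in zip(direct, via_fourier):
    print(f"{d:.10f}  {f:.10f}")
```

The two columns agree to rounding error, and raising the transform to the power N replaces N successive convolutions.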

As we shall discuss in the Appendix, there are two key conditions in order for the convolution theorem to be used in the discrete case while computing IDs. The first one is that the masses of each isotope must lie on grid points. Using a mass that is not on the grid causes the “leakage” phenomenon [41]. If the masses considered all reside on grid points, the leakage problem no longer exists. The second important condition is that the mass domain must be large enough so that the “folded-back” phenomenon (which is also known as “aliasing”, “fold over”, or “wrap around” in the signal processing community) near the tail of the distribution is negligible (see Appendix).

Before describing MIDAs^{b}, let us first describe how one may compute the theoretical molecular mass variance \( \sigma_{MM}^2 \). Using our example molecule \( x_N y_M \), one notes that the molecular mass variance of this molecule can be rigorously written as \( \sigma_{MM}^2 = N\sigma^2[m_x] + M\sigma^2[m_y] \), where \( \sigma^2[m_x] \) is the molecular mass variance associated with element \( x \). Explicitly, one may calculate \( \sigma^2[m_x] \) as follows

\[ \sigma^2[m_x] = \sum_i p(x_i)\, {m(x_i)}^2 - {\left(\sum_i p(x_i)\, m(x_i)\right)}^2 \]

where \( i \) runs over all isotopes of element \( x \) and \( p(x_i) \) again represents the occurrence probability of isotope \( x_i \).
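The variance calculation can be sketched as follows, using the Table 1 values for carbon and hydrogen and a hypothetical molecule with 100 atoms of each. The function names are assumptions; the per-element variance is computed in the centred form \( \sum_i p(x_i)(m(x_i) - \overline{m}_x)^2 \), which is algebraically identical to the mean of the squares minus the square of the mean but less prone to cancellation.

```python
CARBON = [(12.0000000000, 0.9893), (13.0033548378, 0.0107)]
HYDROGEN = [(1.0078250321, 0.999885), (2.0141017780, 0.000115)]


def element_mass_variance(isotopes):
    """Mass variance of a single atom of an element, from (mass, probability) pairs."""
    mean = sum(prob * mass for mass, prob in isotopes)
    return sum(prob * (mass - mean) ** 2 for mass, prob in isotopes)


def molecular_mass_variance(formula):
    """Sum of per-atom variances, weighted by the number of atoms of each element."""
    return sum(count * element_mass_variance(isotopes)
               for isotopes, count in formula)


# Hypothetical molecule with 100 carbon and 100 hydrogen atoms.
sigma_mm_squared = molecular_mass_variance([(CARBON, 100), (HYDROGEN, 100)])
print(f"variance = {sigma_mm_squared:.6f} Da^2")
```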

The number of sample points used in the DFFT, \( S \), must be an integral power of two [42]. For a given molecule and specified mass accuracy \( \epsilon \), the total number of sample points \( S \) used in MIDAs^{b}’s DFFT is given by

\[ S = 2^{\left\lceil \log_2\left( 15\sqrt{1+{\sigma}_{MM}^2}\, /\, \epsilon \right) \right\rceil} \]

where \( \sigma_{MM}^2 \) is the theoretical variance in MM due to the elements’ isotopes [32]. The quantity \( \lceil z \rceil \) represents the smallest integer that is not smaller than \( z \) for any positive number \( z \). Again the quantity \( 15\sqrt{1+{\sigma}_{MM}^2}> \max \left(15,15{\sigma}_{MM}\right) \) is chosen so that \( S \) covers on both ends more than 7.5 standard deviations from the mean molecular mass, which prevents *folded-back* mass regions from having significant probabilities.
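The text requires \( S \) to be a power of two whose mass range \( S\epsilon \) covers \( 15\sqrt{1+\sigma_{MM}^2} \). A minimal sketch, assuming \( S \) is taken as the smallest such power of two, is given below; the function name and the example inputs are hypothetical.

```python
import math


def sample_points(sigma_mm_squared, epsilon):
    """Smallest power of two S with S * epsilon >= 15 * sqrt(1 + variance)."""
    span = 15.0 * math.sqrt(1.0 + sigma_mm_squared)
    return 2 ** math.ceil(math.log2(span / epsilon))


# Hypothetical inputs: a variance of 3 Da^2 at two different mass accuracies.
print(sample_points(3.0, 1.0))
print(sample_points(3.0, 0.01))
```

Halving the mass accuracy roughly doubles the number of sample points, which is the price of a finer grid.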

To avoid *leakage*, instead of using exact masses of isotopes and then applying filtering windows, we pin all isotopic masses to grid points. For each isotope mass \( m(x_i) \), we first find a corresponding grid index \( n(x_i) \), an integer chosen so that \( n(x_i)\,\epsilon \) approximates \( m(x_i) \). The mass distribution of element \( x \) on the grid then becomes

\[ \sum_i p(x_i)\, \delta_{n,\, n(x_i)} \]

where \( n(x_i)\,\epsilon \) is the approximate expression for the exact mass \( m(x_i) \), and the Kronecker delta function takes value one if its two indices coincide and zero otherwise.
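Pinning masses to a grid can be sketched as below. The rule used here, rounding \( m(x_i)/\epsilon \) to the nearest integer, is an assumption made for illustration, as are the function name and the example grid spacings; isotopes that fall on the same grid point have their probabilities added.

```python
def pin_to_grid(isotopes, epsilon):
    """Map (mass, probability) pairs to {grid_index: probability}."""
    grid = {}
    for mass, prob in isotopes:
        index = round(mass / epsilon)  # assumed rule: nearest grid point
        grid[index] = grid.get(index, 0.0) + prob
    return grid


CARBON = [(12.0000000000, 0.9893), (13.0033548378, 0.0107)]

fine_grid = pin_to_grid(CARBON, 0.01)   # 0.01 Da spacing
coarse_grid = pin_to_grid(CARBON, 1.0)  # roughly one grid point per nucleon number
print(fine_grid)
print(coarse_grid)
```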

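Consider again our example molecule \( x_N y_M \). By the convolution theorem, the Fourier transform of the mass distribution, denoted by \( \Psi(v) \), can be written as

\[ \Psi(v) = {\left[\sum_k p(x_k)\, e^{2\pi i\, n(x_k)\, v/S}\right]}^{N} {\left[\sum_k p(y_k)\, e^{2\pi i\, n(y_k)\, v/S}\right]}^{M} \]

where the sums run over the isotopes of each element (indexed by \( k \) to avoid confusion with the imaginary unit \( i \)) and \( v \) takes \( S \) discrete values: 0, 1, …, \( S - 1 \). The sample function \( \Psi(v) \) is heterodyned to have zero average mass by multiplying it by \( {e}^{-2\pi i{n}_ov/S} \), where \( n_o \) is equal to \( \overline{n} \) (the molecule’s probability-weighted average grid index computed using \( n(x_i) \) and \( p(x_i) \)) rounded to the nearest integer.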
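The construction of the heterodyned sample function and its inversion can be sketched end to end on a small grid. The sketch uses a naive transform from the standard library rather than an FFT, a positive exponent in the forward transform so that multiplying by \( e^{-2\pi i n_o v/S} \) shifts the distribution by \( -n_o \), and a hand-picked \( S = 16 \) on a 1 Da grid; these choices, the function name, and the hypothetical molecule with ten carbon and ten hydrogen atoms are assumptions made for illustration.

```python
import cmath


def fourier_distribution(elements, n_samples):
    """Return (n_o, phi), where phi[k] is the probability of grid index n_o + k.

    elements is a list of (grid, atom_count) pairs and grid maps an isotope's
    grid index to its probability. Entries with k above n_samples / 2
    correspond to negative offsets because the discrete transform is periodic.
    """
    mean_index = sum(
        count * sum(index * prob for index, prob in grid.items())
        for grid, count in elements
    )
    n_o = round(mean_index)
    psi = []
    for v in range(n_samples):
        phase = 2j * cmath.pi * v / n_samples
        value = cmath.exp(-phase * n_o)  # heterodyne to zero average mass
        for grid, count in elements:
            element_sum = sum(prob * cmath.exp(phase * index)
                              for index, prob in grid.items())
            value *= element_sum ** count
        psi.append(value)
    # Inverse transform back to the mass grid.
    phi = [
        sum(psi[v] * cmath.exp(-2j * cmath.pi * k * v / n_samples)
            for v in range(n_samples)).real / n_samples
        for k in range(n_samples)
    ]
    return n_o, phi


carbon = {12: 0.9893, 13: 0.0107}
hydrogen = {1: 0.999885, 2: 0.000115}
n_o, phi = fourier_distribution([(carbon, 10), (hydrogen, 10)], n_samples=16)
for k in range(4):
    print(f"grid index {n_o + k}: p = {phi[k]:.6e}")
```

Because of the heterodyning, the populated part of the distribution sits near index 0 of the output, so a window of only 16 points is enough even though the absolute grid indices are near 130.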

Once \( \Psi(v) \) has been calculated, three other operations are performed in order to generate the final FGID and CGID. The first operation performed is the inverse discrete fast Fourier transform (IDFFT), which transforms the sample function \( \Psi(v) \) to \( \Phi(n) \) on the mass grid. Second, we apply a denoising procedure to remove small amplitudes due to rounding errors that occur during the IDFFT. The rounding errors are expected to create small positive and negative amplitudes of equal amounts in the mass domain. MIDAs^{b} thus removes all amplitudes whose absolute magnitudes are smaller than that of the most negative amplitude. As a matter of fact, to be more conservative, MIDAs^{b} uses an amplitude cutoff value that is twice the absolute value of the most negative amplitude. This means that only terms having amplitude greater than the cutoff value are reported in a computed FGID and CGID, with the amplitude values renormalized to sum to one. Figure 1 shows an example of the overlap between the positive amplitude histogram and the negative amplitude histogram. Right below the cutoff absolute amplitude, we see that the two histograms resemble each other, reflecting the fact that rounding errors have equal probability to be positive and negative. Following Rockwood and Van Orden [32], in the third step, MIDAs^{b} applies a linear transformation to rescale the masses associated with the IDs to ensure a good agreement between the theoretically calculated and the numerically computed mean molecular mass as well as standard deviation of the molecular mass. The procedure described above is summarized in the pseudo-code given by algorithm 3.
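The denoising and rescaling steps can be sketched as follows. This is an illustration under stated assumptions and not the authors' pseudo-code: the function names are hypothetical, the noisy amplitudes are made-up values, and the rescaling shown is one possible linear map that reproduces a target mean and standard deviation.

```python
import math


def denoise(amplitudes, factor=2.0):
    """Keep amplitudes above factor * |most negative amplitude| and renormalise."""
    most_negative = min(0.0, min(amplitudes))
    cutoff = factor * abs(most_negative)
    kept = {n: a for n, a in enumerate(amplitudes) if a > cutoff}
    total = sum(kept.values())
    return {n: a / total for n, a in kept.items()}


def rescale_masses(masses, probs, target_mean, target_std):
    """Linearly map masses so their mean and standard deviation hit the targets."""
    mean = sum(m * p for m, p in zip(masses, probs))
    std = math.sqrt(sum(p * (m - mean) ** 2 for m, p in zip(masses, probs)))
    scale = target_std / std
    return [target_mean + scale * (m - mean) for m in masses]


# Made-up amplitudes: three genuine peaks plus rounding noise of both signs.
noisy = [0.6, 0.3, 0.1, 1.5e-17, -1.0e-17, 5.0e-18, -3.0e-18]
clean = denoise(noisy)
print(clean)

# Made-up grid masses rescaled to a target mean of 10 and standard deviation of 1.
probs = [0.25, 0.5, 0.25]
new_masses = rescale_masses([0.0, 1.0, 2.0], probs, 10.0, 1.0)
print(new_masses)
```

Using the most negative amplitude as a noise gauge works because a true probability can never be negative, so any negative value must come from rounding.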

**Algorithm 3.** Computes Fine-Grained and Coarse-Grained Isotopic Distribution

## 3 Results and Discussion

Table 1. Atomic Masses and Abundances used for Benchmark Test in this Paper

| Isotope | Atomic mass (Da) | Abundance (%) |
|---|---|---|
| Atomic masses and naturally occurring isotopic abundances [1] | | |
| ^{12}C | 12.0000000000 | 98.9300 |
| ^{13}C | 13.0033548378 | 1.0700 |
| ^{1}H | 1.0078250321 | 99.9885 |
| ^{2}H | 2.0141017780 | 0.0115 |
| ^{14}N | 14.0030740052 | 99.6320 |
| ^{15}N | 15.0001088984 | 0.3680 |
| ^{16}O | 15.9949146 | 99.7570 |
| ^{17}O | 16.9991312 | 0.0380 |
| ^{18}O | 17.9991603 | 0.2050 |
| ^{32}S | 31.97207070 | 94.9300 |
| ^{33}S | 32.97145843 | 0.7600 |
| ^{34}S | 33.96786665 | 4.2900 |
| ^{36}S | 35.96708062 | 0.0200 |
| ^{196}Hg | 195.965833 | 0.1500 |
| ^{198}Hg | 197.966769 | 9.9700 |
| ^{199}Hg | 198.968279 | 16.8700 |
| ^{200}Hg | 199.968326 | 23.1000 |
| ^{201}Hg | 200.970302 | 13.1800 |
| ^{202}Hg | 201.970643 | 29.8600 |
| ^{204}Hg | 203.973493 | 6.8700 |
| Atomic masses and enriched carbon isotopic abundances | | |
| ^{12}C | 12.0000000000 | 1.0000 |
| ^{13}C | 13.0033548378 | 99.0000 |

Table 2. Molecules for which the Isotopic Distribution was Computed by Various Methods

| No. | Molecular formula | Lightest Mass (Da) | Average Mass (Da) |
|---|---|---|---|
| (1) | C | 1045.5345145467 | 1046.1811074558 |
| (2) | C | 5729.6008666397 | 5733.5107592120 |
| (3) | C | 11616.8493497485 | 11624.4487510271 |
| (4) | C | 16812.9547750824 | 16823.3213522608 |
| (5) | C | 45387.0070331016 | 45415.6793695079 |
| (6) | C | 66389.8624747027 | 66432.4555603617 |
| (7) | C | 112823.8795468070 | 112895.1259319964 |
| (8) | C | 186386.7992654122 | 186506.0525933526 |
| (9) | C | 398470.3669960258 | 398722.9724824960 |
| (10) | C | 533403.4750914392 | 533735.2146493989 |
| (11) | C | 65.0391251605 | 65.0933832534 |
| (12) | C | 130.0782503209 | 130.1867665069 |
| (13) | C | 650.3912516049 | 650.9338325345 |
| (14) | C | 1300.7825032099 | 1301.8676650690 |
| (15) | C | 13007.8250320999 | 13018.6766506902 |
| (16) | C | 130078.2503209999 | 130186.7665069023 |
| (17) | C | 260156.5006419999 | 260373.5330138047 |
| (18) | C | 390234.7509629999 | 390560.2995207072 |
| (19) | C | 520313.0012839999 | 520747.0660276095 |
| (20) | C | 650391.2516049999 | 650933.8325345119 |
| (21) | S | 639441.4139999999 | 641321.6938997399 |
| (22) | Hg | 159860.3534999999 | 160330.4234749349 |
| (23) | Hg | 227937.9037000000 | 232665.2510595869 |
| (24) | S | 44979.8957320999 | 45084.7613456772 |
| (25) | Hg | 208973.6580321000 | 213617.8430152902 |

### 3.1 Overview of Methods Benchmarked

MIDAs’s performance was evaluated against eight published methods: Mercury [19], Mercury5 [32], JFC [34], Isotope Calculator (IC) [33], Qmass [20], BRAIN [18, 31, 43], NeutronCluster (NC) [17], and Emass [21]. The first three published methods are Fourier-transform-based, IC utilizes a divide-and-recursively-combine algorithm, Qmass has its core based on FFT, BRAIN and NeutronCluster are polynomial-based, whereas Emass is based on a direct convolution approach related to the stepwise procedure and its improvement [36, 37]. BRAIN, Qmass, NC, Emass, JFC, and Mercury5 all use nucleon numbers to classify a molecule’s isotopic variants, and all but the last assign to a given nucleon number the average isotopic mass of all variants of that nucleon number.

IC is suitable for computing FGIDs, not CGIDs. Qmass, BRAIN, NeutronCluster, and Emass are suitable for computing CGIDs, not FGIDs. The remaining three Fourier-transform-based methods are also suitable for computing CGIDs, although Mercury is the only one that has FGID computing capacity. To benchmark the FGIDs computed by MIDAs against those of Mercury, however, would require post-processing of Mercury data files, such as removing noise from leakage and rounding errors, as well as compiling output from different specified molecular masses. All of these steps may be done differently and would make the benchmark test less meaningful. For these reasons, we only evaluated MIDAs’s FGIDs against those of IC, not those of Mercury.

### 3.2 Benchmarking of Computed CGIDs

Following previous publications [18, 19, 24], the accuracy of a method is gauged by how accurately it yields ID mean, ID standard deviation, lightest and heaviest molecular masses, while computing a CGID. In our evaluation, the lightest mass and heaviest molecular mass are defined as a molecule’s molecular mass computed using the masses of the lightest and heaviest isotopes, respectively.
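The exact reference values for these four quantities follow directly from the isotope table. The sketch below computes them for a hypothetical hydrocarbon with five carbon and five hydrogen atoms, using the natural abundances of Table 1; the function name and the dictionary layout of the result are assumptions made for illustration.

```python
import math

CARBON = [(12.0000000000, 0.9893), (13.0033548378, 0.0107)]
HYDROGEN = [(1.0078250321, 0.999885), (2.0141017780, 0.000115)]


def reference_values(formula):
    """Exact lightest, heaviest, mean mass and standard deviation of a formula."""
    lightest = sum(count * min(m for m, _ in iso) for iso, count in formula)
    heaviest = sum(count * max(m for m, _ in iso) for iso, count in formula)
    mean = sum(count * sum(m * p for m, p in iso) for iso, count in formula)
    variance = 0.0
    for iso, count in formula:
        element_mean = sum(m * p for m, p in iso)
        variance += count * sum(p * (m - element_mean) ** 2 for m, p in iso)
    return {
        "lightest": lightest,
        "heaviest": heaviest,
        "mean": mean,
        "std": math.sqrt(variance),
    }


# Hypothetical hydrocarbon with five carbon and five hydrogen atoms.
values = reference_values([(CARBON, 5), (HYDROGEN, 5)])
for name, value in values.items():
    print(f"{name}: {value:.10f} Da")
```

A computed CGID can then be scored by subtracting its own lightest mass, heaviest mass, mean, and standard deviation from these reference values.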

*exact* lightest mass.) For Qmass, this seems to arise from computing ID terms that are outside of the allowed mass range imposed by the biomolecule’s MF. This is because, in the Qmass output file, the reported masses lighter than the *exact* lightest mass are associated with elemental compositions that differ from the biomolecule’s MF used in the evaluation.

Table 3. Coarse-Grained Isotopic Distribution Results using Naturally Occurring Isotopes

Difference in lightest mass (Da)

| No. | MIDAs^{a} | MIDAs^{b} | BRAIN | Emass | Mercury | Mercury5 | NC | Qmass | JFC |
|---|---|---|---|---|---|---|---|---|---|
| (1) | 0 | -3.4e-05 | -2.6e-10 | 2.2e-13 | 7.0 | 7.3 | 0 | 12.2 | 0 |
| (2) | 0 | -1.7e-03 | -1.3e-09 | 0 | 12.0 | 12.1 | 0 | 15.9 | 0 |
| (3) | 0 | -2.6e-03 | -2.8e-09 | 0 | 8.0 | 8.4 | 0 | 18.2 | -1.0e-10 |
| (4) | 0 | -2.1e-03 | -4.2e-09 | 0 | 22.0 | 21.6 | -360 | 39.1 | 8.0e-10 |
| (5) | 7.2e-12 | -7.4e-03 | 1.0e-08 | 1.4e-11 | 2.9 | 3.3 | 0 | 0.045 | -1.6e-01 |
| (6) | -1.4e-11 | -5.0 | -1.6e-08 | 0 | 22.0 | 21.3 | 0 | 65.2 | -4.6 |
| (7) | 1.5e-11 | -19.1 | -2.7e-08 | -8.0 | -8.0 | -7.2 | 0 | -69.7 | -18.1 |
| (8) | 0 | -49.1 | -4.1e-08 | -31.1 | -55.2 | -55.3 | 0 | -90.7 | -48.8 |
| (9) | -5.8e-11 | -147.4 | -1.1e-07 | -114.3 | -124.3 | -124.7 | 0 | -188.3 | -118.9 |
| (10) | 0 | -210.6 | -1.2e-07 | -172.5 | -203.7 | -203.7 | 0 | -355.6 | -146.4 |

Difference in heaviest isotopic mass (Da)

| No. | MIDAs^{a} | MIDAs^{b} | BRAIN | Emass | Mercury | Mercury5 | NC | Qmass | JFC |
|---|---|---|---|---|---|---|---|---|---|
| (1) | 7.4e+01 | 1.4e+02 | 1.5e+02 | 1.4e+02 | 1.5e+02 | 1.5e+02 | 1.5e+02 | 1.4e+02 | 1.4e+02 |
| (2) | 7.2e+02 | 8.4e+02 | 8.7e+02 | 8.4e+02 | 8.5e+02 | 8.5e+02 | 8.6e+02 | 8.4e+02 | 8.4e+02 |
| (3) | 1.6e+03 | 1.8e+03 | 1.8e+03 | 1.7e+02 | 1.8e+03 | 1.8e+03 | 1.8e+03 | 1.8e+03 | 1.8e+03 |
| (4) | 2.5e+03 | 2.6e+03 | 2.6e+03 | 2.6e+03 | 2.6e+03 | 2.6e+03 | 2.3e+03 | 2.6e+03 | 2.6e+03 |
| (5) | 6.8e+03 | 7.0e+03 | 7.0e+03 | 7.0e+03 | 7.0e+03 | 7.0e+04 | 7.0e+03 | 6.9e+03 | 7.0e+03 |
| (6) | 9.9e+03 | 1.0e+04 | 1.0e+04 | 1.0e+04 | 1.0e+04 | 1.0e+04 | 1.0e+04 | 1.0e+04 | 1.0e+04 |
| (7) | 1.7e+04 | 1.7e+04 | 1.7e+04 | 1.7e+04 | 1.7e+04 | 1.7e+04 | 1.7e+04 | 1.8e+04 | 1.7e+04 |
| (8) | 2.9e+04 | 2.9e+04 | 2.9e+04 | 2.9e+04 | 2.9e+04 | 2.9e+04 | 2.9e+04 | 2.9e+04 | 2.9e+04 |
| (9) | 6.0e+04 | 6.0e+04 | 6.0e+04 | 6.0e+04 | 6.0e+04 | 6.0e+04 | 6.0e+04 | 6.0e+04 | 6.0e+04 |
| (10) | 8.2e+04 | 8.3e+04 | 8.3e+04 | 8.3e+04 | 8.3e+04 | 8.3e+04 | 8.3e+04 | 8.2e+04 | 8.2e+04 |

The software NC reports correct lightest masses for nine out of the 10 molecules. For biomolecule number four, NC reports a mass that is 360 Da heavier. This same result has also been observed independently by others [17, 18].

For MIDAs^{a}, BRAIN, and Emass, the differences between exact and computed lightest masses, for small and medium size biomolecules [numbered (1)–(6)], are smaller than 1.0e–08 Da. As for JFC and MIDAs^{b}, although they do not perform as well as the polynomial-based methods above, they are not inferior to other Fourier-transform-based methods such as Mercury and Mercury5. When the biomolecules become heavier [say molecules numbered (7)–(10)], the chance of experimentally observing the *exact* lightest masses rapidly decreases, and the computed difference between exact and computed lightest masses becomes less important.

The evaluation of getting the correct heaviest mass is not as important under natural conditions. This is because heavy isotopes typically carry very low natural occurrence probabilities so that it is impossible to observe the *exact* heaviest isotopic variant of the molecule. Of course, when artificial isotopic abundances are enforced, obtaining the correct heaviest masses can become important, while obtaining the correct lightest masses can become unimportant. Since the current evaluation is using the natural isotopic abundances, we do not expect any method to provide *correct* heaviest masses. Indeed, because most methods are computing terms of an ID that are concentrated around a molecule’s average molecular mass, which is closer to the *exact* lightest mass under natural isotopic abundances, the mass range used for computing IDs usually will not include the heaviest masses. For biomolecules numbered (1)–(10), the differences between the exact heaviest masses and the heaviest masses computed by all methods considered are all of the same order of magnitude.

Table 4 shows that, in terms of average masses, MIDAs^{a,b}, JFC, and Emass have comparable errors and have slightly smaller errors than the other methods. In terms of mass standard deviations, MIDAs^{a,b}, JFC, and Emass have slightly smaller errors than the other methods. In principle, the accuracy of BRAIN might be improved by increasing the number of aggregated isotopic variants computed for each computed CGID. However, to accomplish this would require changing its default option. As mentioned earlier, to keep the benchmarking test simple, we only use the default option for each method considered. From Table 4, one can also infer that Qmass yields small errors for small and medium size molecules, but the error increases as the molecular mass increases.

Table 4. Coarse-Grained Isotopic Distribution Results using Naturally Occurring Isotopes

Difference in average mass (Da)

| No. | MIDAs^{a} | MIDAs^{b} | BRAIN | Emass | Mercury | Mercury5 | NC | Qmass | JFC |
|---|---|---|---|---|---|---|---|---|---|
| (1) | 2.3e-13 | 6.8e-13 | 4.4e-03 | -4.5e-13 | 9.1e-05 | -2.9e-05 | 6.51e-05 | -4.6e-13 | 2.3e-12 |
| (2) | 1.8e-12 | 3.5e-11 | 3.2e-01 | -1.8e-12 | 8.1e-04 | 7.6e-05 | 3.7e-03 | -7.3e-12 | 3.6e-12 |
| (3) | -5.4e-12 | -5.4e-12 | 8.2e-02 | -3.6e-12 | 2.2e-04 | -5.1e-05 | 5.9e-03 | 5.4e-12 | 0 |
| (4) | -7.3e-12 | 2.0e-10 | 4.6e-02 | 0 | 2.9e-03 | -7.3e-04 | -360 | 5.8e-11 | 7.3e-12 |
| (5) | 4.3e-11 | 2.6e-10 | 1.4e-04 | 7.3e-12 | -3.1e-03 | 1.8e-04 | 3.7e-03 | -7.3e-12 | -2.9e-11 |
| (6) | 0 | 1.3e-10 | 1.7e-06 | -5.8e-11 | -4.1e-03 | 3.1e-03 | -8.5e-04 | 4.2e-09 | 1.2e-10 |
| (7) | 4.3e-11 | -2.4e-09 | -2.7e-08 | -1.4e-11 | -4.1e-03 | 2.1e-03 | -3.9e-03 | -1.3e-10 | 5.8e-11 |
| (8) | -2.9e-11 | 1.6e-09 | -4.1e-08 | 0 | -5.1e-03 | -7.9e-03 | -1.0e-02 | -7.5e-01 | 2.6e-10 |
| (9) | -3.5e-10 | 7.2e-09 | -1.1e-07 | -3.5e-10 | 1.7e-02 | 7.7e-03 | -4.0e-02 | -9.7e-03 | 7.6e-10 |
| (10) | -1.2e-10 | 7.6e-09 | -1.2e-07 | -1.2e-10 | -1.1e-01 | 3.3e-02 | -3.8e-02 | -4.4e+02 | -4.6e-10 |

Difference in standard deviation (Da)

| No. | MIDAs^{a} | MIDAs^{b} | BRAIN | Emass | Mercury | Mercury5 | NC | Qmass | JFC |
|---|---|---|---|---|---|---|---|---|---|
| (1) | 1.1e-06 | -4.2e-10 | 1.2e-02 | 1.1e-06 | 1.6e-06 | -1.2e-04 | -3.6e-04 | 1.1e-06 | 1.1e-06 |
| (2) | 6.5e-06 | -5.0e-09 | 3.6e-01 | 6.5e-06 | 1.2e-06 | -1.8e-04 | -1.2e-03 | 6.5e-06 | 6.5e-06 |
| (3) | 8.0e-06 | 1.5e-08 | 1.2e-01 | 8.0e-06 | 9.8e-05 | -3.3e-05 | 2.2e-03 | 7.9e-06 | 8.0e-06 |
| (4) | 7.2e-06 | 2.5e-08 | 7.3e-02 | 7.2e-06 | -3.0e-07 | -4.6e-04 | -4.5e-02 | 7.0e-06 | 7.1e-06 |
| (5) | 1.3e-05 | 1.8e-07 | 3.7e-04 | 1.3e-05 | 9.7e-06 | -2.7e-04 | -1.8e-03 | 1.3e-05 | 1.3e-05 |
| (6) | 1.8e-05 | -1.9e-07 | 2.3e-05 | 1.8e-05 | -3.9e-07 | -9.0e-04 | -8.4e-03 | -2.7e-06 | 1.7e-05 |
| (7) | 2.0e-05 | -8.0e-07 | 2.1e-05 | 2.0e-05 | -2.7e-07 | -7.1e-04 | -7.5e-03 | 2.2e-05 | 2.0e-05 |
| (8) | 2.5e-05 | 2.1e-06 | 2.5e-05 | 2.5e-05 | 4.4e-06 | -5.4e-04 | -8.7e-03 | -4.8e+00 | 2.6e-05 |
| (9) | 4.2e-05 | -7.8e-06 | 4.1e-05 | 4.5e-05 | -5.9e-07 | -1.5e-03 | -5.2e-03 | -9.9e-02 | 3.9e-05 |
| (10) | 4.8e-04 | -1.0e-05 | 5.0e-05 | 5.4e-05 | -1.2e-05 | -1.3e-03 | 9.6e-03 | -1.4e+02 | 3.8e-05 |

We have also considered the possibility of deviations from the natural frequencies of occurrence of an element’s isotopes. Such customized modifications can be accomplished experimentally by a technique known as isotopic labeling [3], which is frequently employed in quantitative proteomics [44]. To mimic such a situation, we have computed CGIDs for various molecules assuming different carbon isotopic abundances: 99% ^{13}C and 1% ^{12}C as listed in Table 1. We then derive from the computed CGIDs the average molecular masses and standard deviations, and compare them to the corresponding theoretical values that can be analytically calculated.

Table 5 shows that, in terms of average masses, MIDAs^{a,b}, JFC, and Emass have the smallest errors. However, in terms of the IDs’ standard deviations, Mercury and Mercury5 yield errors comparable to those of MIDAs^{a,b}, JFC, and Emass. The results for Qmass are similar to those obtained in Table 4: in terms of average masses and standard deviations, it yields small errors for small to medium-sized biomolecules. Table 5 also shows that the current versions of BRAIN and NC are not able to compute IDs using the modified isotopic abundances for carbon. However, the developers of NC have described how NC could be modified to handle stable isotope enrichment by partitioning the elements with enriched isotopes away from the equatransneutronic isotope groups [17]. This option is not currently available in NC. Moreover, the proposed solution reduces NC to a polynomial-method algorithm, which, if not efficiently implemented, can significantly increase the overall computation time. In BRAIN’s case, no reasonable IDs are reported, and it is difficult to speculate about what might have happened.

Table 5. Coarse-Grained Isotopic Distribution Evaluation Using Abundances for Carbon’s Isotopes of 99% ^{13}C and 1% ^{12}C

Difference in average mass | |||||||||

No. | MIDAs^{a} | MIDAs^{b} | BRAIN | Emass | Mercury | Mercury5 | NC | Qmass | JFC |

(1) | -6.8e - 13 | 3.9e - 12 | -1.7e + 01 | 2.3e - 13 | 5.0e - 05 | -4.0e - 05 | 3.3e - 02 | -2.3e - 13 | 1.6e - 12 |

(2) | 0 | 4.5e - 11 | NR | 3.6e - 12 | 3.3e - 04 | 7.3e - 05 | 1.7e - 01 | -1.8e - 12 | 4.5e - 12 |

(3) | 1.8e - 12 | 2.2e - 10 | NR | -1.8e - 12 | -6.3e - 04 | -3.1e - 04 | 3.1e - 01 | -6.4e - 11 | -2.4e - 11 |

(4) | -7.3e - 12 | 4.4e - 11 | NR | 0 | 3.9e - 03 | -2.5e - 04 | NR | -1.1e - 11 | -1.1e - 11 |

(5) | -2.9e - 11 | -5.8e - 11 | NR | 7.3e - 12 | -5.2e - 03 | 6.4e - 04 | 1.8e + 03 | -4.1e - 07 | -7.3e - 12 |

(6) | 5.8e - 11 | -1.0e - 10 | NR | 2.9e - 11 | -5.4e - 03 | 2.1e - 03 | 2.7e + 03 | -2.9e - 11 | 2.9e - 11 |

(7) | 1.4e - 11 | 6.8e - 10 | NR | -7.3e - 11 | -8.7e - 04 | 6.6e - 04 | 4.8e + 03 | -4.9e - 07 | 1.4e - 10 |

(8) | 3.2e - 11 | -3.5e - 10 | NR | 8.7e - 11 | -8.1e - 03 | 6.4e - 03 | 8.4e + 03 | -1.1e + 02 | 1.2e - 10 |

(9) | 0 | 4.2e - 09 | NR | 0 | -3.7e - 03 | 8.7e - 03 | 1.7e + 04 | -1.5e + 02 | -4.1e - 10 |

(10) | 2.3e - 10 | 7.7e - 09 | NR | 8.15e - 10 | -1.2e - 01 | 5.4e - 03 | 2.4e + 04 | -5.1e + 02 | -1.2e - 10 |

Difference in standard deviation | |||||||||

No. | MIDAs^{a} | MIDAs^{b} | BRAIN | Emass | Mercury | Mercury5 | NC | Qmass | JFC |

(1) | 7.9e - 07 | -3.3e - 10 | -1.4e + 01 | 7.9e - 07 | -2.6e - 08 | -1.2e - 04 | -4.9e + 00 | 7.4e - 07 | 7.9e - 07 |

(2) | 5.9e - 06 | 5.9e - 09 | NR | 5.9e - 06 | 2.2e - 07 | -1.9e - 04 | -1.1e + 01 | 5.9e - 06 | 5.9e - 06 |

(3) | 7.6e - 06 | 1.4e - 08 | NR | 7.6e - 06 | 8.2e - 06 | -1.3e - 04 | -1.6e + 01 | 7.6e - 06 | 7.6e - 06 |

(4) | 7.1e - 06 | 1.4e - 08 | NR | 7.1e - 06 | -1.7e - 07 | -4.7e - 04 | NR | 7.1e - 06 | 7.1e - 06 |

(5) | 1.3e - 05 | -2.0e - 07 | NR | 1.3e - 05 | 3.6e - 07 | -2.8e - 04 | 5.2e + 00 | 1.2e - 05 | 1.3e - 05 |

(6) | 1.7e - 05 | -5.7e - 07 | NR | 1.7e - 05 | -1.2e - 07 | -9.2e - 04 | 6.6e + 00 | 1.8e - 05 | 1.8e - 05 |

(7) | 2.1e - 05 | -1.1e - 06 | NR | 2.1e - 05 | -6.9e - 09 | -7.2e - 04 | 8.6e + 00 | 1.9e - 05 | 2.0e - 05 |

(8) | 2.2e - 05 | 6.3e - 07 | NR | 2.4e - 05 | -1.7e - 06 | -5.5e - 04 | 1.1e + 01 | -1.1e + 02 | 2.5e - 05 |

(9) | 4.0e - 05 | 1.0e - 06 | NR | 4.4e - 05 | -3.5e - 06 | -1.5e - 03 | 1.6e + 01 | -2.0e + 02 | 5.1e - 05 |

(10) | 3.3e - 05 | 1.0e - 05 | NR | 3.5e - 05 | -1.5e - 05 | -1.4e - 03 | 1.9e + 01 | -2.3e - 04 | 6.8e - 05 |

### 3.3 Assessing Fidelity of Computed CGIDs and FGIDs

To evaluate the fidelity of the CGIDs and FGIDs reported, we used 10 hydrocarbon molecules [numbered (11)–(20) in Table 2] because the “exact” CGIDs and FGIDs can be calculated for these molecules. An *exact* CGID is defined as follows. First, one merges isotopic variants that have the same nucleon number into one aggregated isotopic variant, whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. Only aggregated isotopic variants having probability greater than 5e–12 were retained for accuracy evaluation. The *exact* FGIDs were defined similarly to the exact CGIDs, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy, here set to 0.01 Da. The probability cutoff of 5e–12, for typical sample loads, probably already surpasses the detection capability of current mass spectrometers. Furthermore, it is a small enough cutoff that ignoring terms below it has a negligible effect on the ID profile.
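For a hydrocarbon, every isotopic variant is fixed by the number of ^{13}C and ^{2}H atoms it contains, so the exact CGID can be enumerated directly. The sketch below illustrates the merging-by-nucleon-number step under stated assumptions: the isotope masses and abundances are assumed values, the aggregated mass is taken to be the probability-weighted mean of the merged variants, and `exact_cgid` is a hypothetical helper rather than the procedure used to produce the tables. It uses plain floating-point binomials, so it is only suitable for small molecules.

```python
from math import comb

# Assumed isotope masses (Da) and natural abundances of the heavy isotopes.
M_C12, M_C13 = 12.0, 13.0033548378
M_H1, M_H2 = 1.00782503207, 2.0141017778
P_C13, P_H2 = 0.0107, 0.000115

def binom_pmf(n, k, p):
    """Probability of exactly k heavy isotopes among n atoms."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def exact_cgid(n_c, n_h, cutoff=5e-12):
    """Exact CGID of C_{n_c}H_{n_h} as (nucleon number, mass, probability)."""
    prob = {}
    weighted_mass = {}
    for k in range(n_c + 1):          # k = number of 13C atoms
        p_c = binom_pmf(n_c, k, P_C13)
        for l in range(n_h + 1):      # l = number of 2H atoms
            p = p_c * binom_pmf(n_h, l, P_H2)
            mass = ((n_c - k) * M_C12 + k * M_C13
                    + (n_h - l) * M_H1 + l * M_H2)
            nucleons = 12 * n_c + n_h + k + l
            # Merge variants sharing the same nucleon number.
            prob[nucleons] = prob.get(nucleons, 0.0) + p
            weighted_mass[nucleons] = weighted_mass.get(nucleons, 0.0) + p * mass
    # Keep only aggregated variants above the probability cutoff.
    return [(a, weighted_mass[a] / prob[a], prob[a])
            for a in sorted(prob) if prob[a] > cutoff]

cgid = exact_cgid(n_c=1, n_h=4)  # methane, chosen only as a small example
```

An exact FGID could be sketched the same way by grouping variants whose masses lie within 0.01 Da of each other instead of grouping by nucleon number.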

The first quantity computed was the difference in the number of terms (Δ*τ*) kept by a computed ID and by its corresponding exact ID, be it the exact CGID or the exact FGID. The second quantity was the difference in the probability sums (Δ*χ*), one from the computed ID and the other from the exact FGID (or the exact CGID). The third quantity was the root-mean-square difference of masses (*σ*_{m}) between the computed and the exact CGIDs (or the exact FGIDs):

$$\sigma_m^2 = \frac{1}{N}\sum_{i=1}^{N} \min_j \left(m_i - m_j\right)^2 \qquad (11)$$

where each *m*_{i} represents a computed mass term, each *m*_{j} represents a mass term in the exact FGID (or the exact CGID), and *N* is the number of terms retained in the computed ID. That is, for every mass term in a computed ID, the closest mass term within the exact FGID (or the exact CGID) is found, and the squared differences are summed. The average of this sum of squares constitutes *σ*_{m}^{2}.

The fourth quantity computed was the weighted correlation (*ρ*) between computed and exact IDs, which is defined as follows. Let *p*(*m*_{i}) and *p*(*m*_{j}) be the terms of a computed ID and of the corresponding exact FGID (or exact CGID), respectively. We first introduce the weight (*w*_{ij}) between a computed ID term (index *i*) and an exact FGID (or exact CGID) term (index *j*). The weight is defined in terms of min_{j}|*m*_{i} − *m*_{j}|, the minimum mass difference between a term (*m*_{i}) from the computed ID and the terms (*m*_{j}) from the exact ID. The computed weights (*w*_{ij}) are then normalized by the normalization factor *W*_{j} = ∑_{i} *w*_{ij}, obtained by summing over all terms *i* from the computed ID that are close to a common term *j* in the exact FGID (or the exact CGID). The weighted correlation is then computed from the normalized weights and the probabilities *p*(*m*_{i}) and *p*(*m*_{j}).

The mass accuracy *ε* used in these definitions is set to 1 Da for CGIDs and to 0.01 Da for FGIDs.
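The first three quantities described above (Δ*τ*, Δ*χ*, and *σ*_{m}) can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' evaluation script: `fidelity_metrics` is a hypothetical name, both IDs are assumed to be lists of (mass, probability) pairs, and the sign conventions (computed minus exact for Δ*τ*, exact minus computed for Δ*χ*) are assumptions, since the text does not fix them unambiguously. The weighted correlation *ρ* is left out because its defining formula is not reproduced here.

```python
import math

def fidelity_metrics(computed, exact):
    """Return (delta_tau, delta_chi, sigma_m) for two isotopic distributions.

    computed, exact: non-empty lists of (mass, probability) pairs.
    """
    exact_masses = [m for m, _ in exact]

    # Difference in the number of retained terms (sign convention assumed).
    delta_tau = len(computed) - len(exact)

    # Difference in the probability sums (sign convention assumed).
    delta_chi = sum(p for _, p in exact) - sum(p for _, p in computed)

    # Equation (11): for each computed mass, take the squared distance to the
    # closest exact mass, average over the computed terms, then take the root.
    squared = [min((m_i - m_j) ** 2 for m_j in exact_masses)
               for m_i, _ in computed]
    sigma_m = math.sqrt(sum(squared) / len(computed))

    return delta_tau, delta_chi, sigma_m

# Made-up numbers, for illustration only.
computed_id = [(16.0313, 0.9888), (17.0347, 0.0112)]
exact_id = [(16.0313, 0.9888), (17.0346, 0.0111), (18.0400, 0.0001)]
delta_tau, delta_chi, sigma_m = fidelity_metrics(computed_id, exact_id)
```

The nearest-neighbour search here is quadratic in the number of terms; for distributions with thousands of terms, sorting the exact masses and using bisection would be the obvious refinement.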

Only IC, MIDAs^{a}, and MIDAs^{b} are included in Table 7 since they are the only methods that can compute FGIDs reasonably fast and without additional post-processing.

Table 6. Coarse-Grained Isotopic Distribution (CGID) Fidelity Assessment Results. *τ* is the number of terms in the exact CGID having probability greater than 5e-12; Δ*τ* is the difference between *τ* and the number of terms of a computed CGID; Δ*χ* is the difference between the sum of probability terms from the exact CGID and the sum of probability terms from the computed CGID; *σ*_{m} is the root-mean-square difference of masses between exact and computed CGID, see Equation (11); U is the number of terms from the computed CGID that are not within ±2*ε* (*ε* = 1 Da) of any term in the exact CGID; E is the number of terms in the exact CGID that have at least one corresponding term in the computed CGID within ±2*ε*; *ρ* is the weighted correlation between computed and exact CGID

No. | *τ* | Δ*τ* | Δ*χ* | *σ*_{m} | U | E | *ρ* | Method |
---|---|---|---|---|---|---|---|---|
(11) | 6 | 0 | -5.6e-16 | 5.1e-13 | 0 | 6 | 1.0 | MIDAs^{a} |
(11) | 6 | 0 | -7.4e-15 | 2.2e-04 | 0 | 6 | 1.0 | MIDAs^{b} |
(11) | 6 | 0 | -4.7e-04 | 1.2e-14 | 0 | 6 | 0.99999988 | Emass |
(11) | 6 | 0 | -4.4e-16 | 2.1e-05 | 0 | 6 | 1.0 | JFC |
(12) | 7 | 0 | -8.9e-16 | 7.7e-12 | 0 | 7 | 1.0 | MIDAs^{a} |
(12) | 7 | 0 | 7.8e-16 | 7.5e-05 | 0 | 7 | 1.0 | MIDAs^{b} |
(12) | 7 | 0 | -8.3e-04 | 2.6e-14 | 0 | 7 | 0.99999963 | Emass |
(12) | 7 | 0 | -5.6e-16 | 1.3e-05 | 0 | 7 | 1.0 | JFC |
(13) | 12 | 0 | -5.0e-15 | 1.5e-11 | 0 | 12 | 1.0 | MIDAs^{a} |
(13) | 12 | 0 | -2.2e-14 | 3.3e-05 | 0 | 12 | 1.0 | MIDAs^{b} |
(13) | 12 | 0 | -1.6e-03 | 7.3e-14 | 0 | 12 | 0.99999885 | Emass |
(13) | 12 | 0 | -5.0e-15 | 8.3e-04 | 0 | 12 | 1.0 | JFC |
(14) | 15 | 0 | -1.0e-14 | 1.2e-11 | 0 | 15 | 1.0 | MIDAs^{a} |
(14) | 15 | 0 | 1.5e-14 | 2.3e-05 | 0 | 15 | 1.0 | MIDAs^{b} |
(14) | 15 | 0 | -1.3e-03 | 1.9e-13 | 0 | 15 | 0.99999957 | Emass |
(14) | 15 | 0 | -1.2e-14 | 3.6e-03 | 0 | 15 | 1.0 | JFC |
(15) | 40 | 0 | -9.8e-14 | 1.0e-11 | 0 | 40 | 1.0 | MIDAs^{a} |
(15) | 40 | 0 | 1.3e-12 | 6.3e-06 | 0 | 40 | 1.0 | MIDAs^{b} |
(15) | 40 | 0 | -5.6e-04 | 3.2e-12 | 0 | 40 | 0.99999996 | Emass |
(15) | 40 | 0 | -9.7e-14 | 2.5e-03 | 0 | 40 | 1.0 | JFC |
(16) | 139 | 0 | -9.6e-13 | 4.4e-10 | 0 | 139 | 1.0 | MIDAs^{a} |
(16) | 139 | 0 | 3.8e-12 | 6.3e-06 | 0 | 139 | 1.0 | MIDAs^{b} |
(16) | 139 | 0 | -1.9e-04 | 5.2e-11 | 0 | 139 | 1.0 | Emass |
(16) | 139 | 0 | -6.6e-13 | 4.1e-02 | 0 | 139 | 1.0 | JFC |
(17) | 195 | 0 | -2.0e-12 | 5.5e-10 | 0 | 195 | 1.0 | MIDAs^{a} |
(17) | 195 | 0 | 1.3e-12 | 6.3e-06 | 0 | 195 | 1.0 | MIDAs^{b} |
(17) | 195 | 0 | -1.3e-04 | 1.3e-10 | 0 | 195 | 1.0 | Emass |
(17) | 195 | 0 | -1.9e-12 | 6.1e-02 | 0 | 195 | 1.0 | JFC |
(18) | 238 | 0 | -3.0e-12 | 9.4e-10 | 0 | 238 | 1.0 | MIDAs^{a} |
(18) | 238 | 0 | 2.5e-11 | 6.4e-06 | 0 | 238 | 1.0 | MIDAs^{b} |
(18) | 238 | 0 | -1.1e-04 | 2.3e-10 | 0 | 238 | 1.0 | Emass |
(18) | 238 | 0 | -2.6e-12 | 6.0e-02 | 0 | 238 | 1.0 | JFC |
(19) | 274 | 0 | -4.1e-12 | 1.2e-09 | 0 | 274 | 1.0 | MIDAs^{a} |
(19) | 274 | 1 | 2.5e-11 | 6.1e-02 | 0 | 274 | 1.0 | MIDAs^{b} |
(19) | 274 | 0 | -9.5e-05 | 3.0e-10 | 0 | 274 | 1.0 | Emass |
(19) | 274 | 0 | -5.4e-12 | 5.9e-02 | 0 | 274 | 1.0 | JFC |
(20) | 306 | 0 | -4.8e-12 | 1.6e-09 | 0 | 306 | 1.0 | MIDAs^{a} |
(20) | 306 | 0 | 2.6e-11 | 6.8e-06 | 0 | 306 | 1.0 | MIDAs^{b} |
(20) | 306 | 0 | -8.5e-05 | 4.2e-10 | 0 | 306 | 1.0 | Emass |
(20) | 306 | 0 | -4.6e-12 | 5.6e-02 | 0 | 306 | 1.0 | JFC |

Table 7. Fine-Grained Isotopic Distribution (FGID) Fidelity Assessment Results. *τ* is the number of terms in the exact FGID having probability greater than 5e-12; Δ*τ* is the difference between *τ* and the number of terms of a computed FGID; Δ*χ* is the difference between the sum of probability terms from the exact FGID and the sum of probability terms from the computed FGID; *σ*_{m} is the root-mean-square difference of masses between exact and computed FGID, see Equation (11); U is the number of terms from the computed FGID that are not within ±2*ε* (*ε* = 0.01 Da) of any term in the exact FGID; E is the number of terms in the exact FGID that have at least one corresponding term in the computed FGID within ±2*ε*; *ρ* is the weighted correlation between computed and exact FGID

No. | *ε* (ppm) | *τ* | Δ*τ* | Δ*χ* | *σ*_{m} | U | E | *ρ* | Method |
---|---|---|---|---|---|---|---|---|---|
(11) | 307.25 | 6 | 0 | -5.6e-16 | 1.1e-09 | 0 | 6 | 1.0 | MIDAs^{a} |
(11) | 307.25 | 6 | 0 | 8.9e-14 | 7.3e-05 | 0 | 6 | 1.0 | MIDAs^{b} |
(11) | 307.25 | 6 | 0 | 1.5e-08 | 1.8e-09 | 0 | 6 | 1.0 | IC |
(12) | 153.63 | 7 | 0 | -2.7e-15 | 4.2e-10 | 0 | 7 | 1.0 | MIDAs^{a} |
(12) | 153.63 | 7 | 0 | 2.3e-12 | 1.6e-05 | 0 | 7 | 1.0 | MIDAs^{b} |
(12) | 153.63 | 7 | 0 | 3.7e-07 | 2.3e-09 | 0 | 7 | 1.0 | IC |
(13) | 30.72 | 13 | 1 | -5.7e-13 | 3.1e-03 | 0 | 13 | 0.99999591 | MIDAs^{a} |
(13) | 30.72 | 13 | 0 | 3.0e-13 | 4.4e-03 | 0 | 12 | 0.99999592 | MIDAs^{b} |
(13) | 30.72 | 13 | 1 | -2.9e-07 | 3.1e-03 | 0 | 13 | 0.99999591 | IC |
(14) | 15.36 | 16 | 3 | 2.7e-12 | 5.3e-03 | 0 | 15 | 0.99937104 | MIDAs^{a} |
(14) | 15.36 | 16 | 3 | 3.6e-12 | 7.1e-03 | 0 | 15 | 0.99937227 | MIDAs^{b} |
(14) | 15.36 | 16 | 4 | -3.5e-07 | 5.1e-03 | 0 | 16 | 0.99937103 | IC |
(15) | 1.53 | 65 | -6 | -4.6e-11 | 1.7e-03 | 0 | 58 | 0.99927870 | MIDAs^{a} |
(15) | 1.53 | 65 | 3 | 1.2e-11 | 4.0e-03 | 0 | 64 | 0.98806083 | MIDAs^{b} |
(15) | 1.53 | 65 | 5 | 2.4e-05 | 3.6e-03 | 0 | 65 | 0.98803755 | IC |
(16) | 0.15 | 291 | -5 | 2.6e-11 | 5.2e-03 | 0 | 257 | 0.99999001 | MIDAs^{a} |
(16) | 0.15 | 291 | 53 | 6.9e-11 | 5.6e-03 | 1 | 282 | 0.99958599 | MIDAs^{b} |
(16) | 0.15 | 291 | 82 | 1.4e-07 | 5.2e-03 | 0 | 280 | 0.99998237 | IC |
(17) | 0.077 | 500 | -18 | 1.8e-10 | 4.0e-03 | 0 | 453 | 0.99785805 | MIDAs^{a} |
(17) | 0.077 | 500 | 126 | 1.3e-10 | 7.3e-03 | 13 | 488 | 0.95950051 | MIDAs^{b} |
(17) | 0.077 | 500 | 124 | -1.6e-08 | 4.5e-03 | 0 | 496 | 0.99951891 | IC |
(18) | 0.051 | 715 | -16 | -5.4e-10 | 4.5e-03 | 0 | 636 | 0.99466182 | MIDAs^{a} |
(18) | 0.051 | 715 | 242 | 1.5e-10 | 7.0e-03 | 10 | 681 | 0.71069880 | MIDAs^{b} |
(18) | 0.051 | 715 | 19 | -5.0e-08 | 4.3e-03 | 0 | 690 | 0.99735244 | IC |
(19) | 0.038 | 881 | 57 | 7.6e-11 | 4.8e-03 | 0 | 824 | 0.95007936 | MIDAs^{a} |
(19) | 0.038 | 881 | 437 | 1.9e-10 | 7.8e-03 | 33 | 866 | 0.59270671 | MIDAs^{b} |
(19) | 0.038 | 881 | -26 | -1.7e-06 | 5.9e-03 | 0 | 713 | 0.97935224 | IC |
(20) | 0.031 | 1143 | 93 | -1.4e-10 | 5.7e-03 | 0 | 1011 | 0.85638390 | MIDAs^{a} |
(20) | 0.031 | 1143 | 498 | 2.2e-10 | 8.4e-03 | 47 | 1092 | 0.63960000 | MIDAs^{b} |
(20) | 0.031 | 1143 | -173 | -1.7e-05 | 4.6e-03 | 0 | 838 | 0.98564325 | IC |

For fidelity assessment of CGIDs, all four methods shown in Table 6 yield small Δ*τ* and *ρ* values close to one. In terms of *σ*_{m} and Δ*χ*, more differences are revealed. Emass always yields small *σ*_{m}, reflecting good fidelity in terms of mass locations, but seems to give a larger |Δ*χ*|, reflecting less accuracy in amplitudes. JFC and MIDAs^{b} seem to yield less precise mass locations, evidenced by a larger *σ*_{m}, but seem to provide more accurate amplitudes, evidenced by a smaller |Δ*χ*|. MIDAs^{a} yields both accurate mass locations and accurate amplitudes.

The values of Δ*χ* and *σ*_{m} in Table 7 indicate that IC, MIDAs^{a}, and MIDAs^{b} report FGID terms with similar mass accuracy and with probability sums that are close to the expected value. For small to medium molecules, numbered (11)–(15), IC, MIDAs^{a}, and MIDAs^{b} give equivalently accurate results. For molecules numbered (16)–(20), IC and MIDAs^{a} perform comparably, both slightly better than MIDAs^{b}. The values of Δ*τ* indicate that MIDAs^{b} reports many more terms than expected in its computed FGID. Because no leakage is expected, MIDAs^{b} gains these extra terms mainly through rounding errors associated with the DFFT numerical procedure.

The difference observed in Δ*τ* for MIDAs^{a} is caused by the pruning and merging procedures employed by the algorithm. All the FGID terms computed by IC and MIDAs^{a} are within 2*ε* of the exact FGID terms, as shown by the number of unexplained terms (U) being zero in Table 7. Most of the terms computed by MIDAs^{b} are also within 2*ε* of the exact FGID terms, with the exception of molecules (16)–(20), where U ranges from 1 to 47. The computed weighted correlation also shows that for heavier molecules, (18)–(20), both IC and MIDAs^{a} produce FGIDs that are more similar to the exact FGIDs than those of MIDAs^{b}.

The weaker performance of MIDAs^{b} here may arise because pinning the elemental masses to grid points can introduce appreciable mass errors when computing IDs for larger molecules. In the worst-case scenario, the mass error introduced is apparently proportional to the number of atoms contained in the molecule. Even though MIDAs^{b} employs a mass rescaling [32] to bring the computed average masses and standard deviations close to their theoretical values, the linear mass rescaling is not sufficient to guarantee full profile resemblance between the computed ID and the exact ID. The non-negligible discrepancy between the computed FGID and the exact FGID for molecules (18)–(20), indicated by a weighted correlation *ρ* that is not very close to one, reflects this problem.
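The growth of the pinning error with molecule size can be illustrated with simple arithmetic. The sketch below is a generic illustration of rounding a mass to a grid, not the MIDAs^{b} implementation: the grid spacing of 0.01 Da, the ^{13}C mass, and the helper names are assumptions, and the mass rescaling step mentioned above is deliberately ignored.

```python
GRID = 0.01  # Da; hypothetical grid spacing, equal to the FGID mass accuracy

def pin_to_grid(mass, grid=GRID):
    """Round a mass to the nearest multiple of the grid spacing."""
    return round(mass / grid) * grid

M_C13 = 13.0033548378  # Da; assumed mass of the 13C isotope

# Error introduced for a single 13C atom once its mass is pinned to the grid.
per_atom_error = pin_to_grid(M_C13) - M_C13

def accumulated_error(n_atoms):
    """Worst-case shift when all n_atoms carry the same pinning error."""
    return n_atoms * per_atom_error

# The shift grows linearly and exceeds the grid spacing after a few atoms.
errors = {n: accumulated_error(n) for n in (1, 10, 100)}
```

With these assumed numbers the single-atom error is roughly a third of the grid spacing, so a variant with only a handful of ^{13}C atoms can already land in a neighbouring grid cell.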

### 3.4 MIDAs Web Interface

The MIDAs web interface (http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html) is user-friendly while offering considerable flexibility. For example, the user may type an elemental composition, a molecular formula, a peptide, or even a protein sequence into the input box. The program recognizes the input molecule in all of these formats and extracts the corresponding elemental composition for computing CGIDs and FGIDs. The isotopic abundances and the elements’ masses can also be customized within the web interface: the user simply clicks on the “change” button to edit the abundance table of all elements. Other fields that the user can easily customize are the charge of the input molecule and the cutoff probability. MIDAs displays both the CGID and the FGID together, using user-specified accuracies, one for each. The “algorithm” drop-down box allows the user to select either the FFT or the polynomial algorithm. The output, including the lightest mass, theoretical average mass, theoretical mass standard deviation, computed average mass, computed mass standard deviation, FGID peak list, and CGID peak list, can be exported to a flat file by clicking on the “download output” button on the result page. Contextual help is also available for every functional button.
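As an illustration of the first step such an interface has to perform, the snippet below turns a flat molecular formula into an elemental composition. It is a hypothetical sketch, not the parser used by MIDAs: `parse_formula` is an assumed name, and only simple formulas without parentheses, charges, or isotope labels are handled.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into {element: count}."""
    # An element symbol is one uppercase letter, optionally followed by one
    # lowercase letter; a missing count means a single atom.
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # Reject anything the pattern did not consume (e.g. parentheses).
    if "".join(symbol + number for symbol, number in tokens) != formula:
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    counts = Counter()
    for symbol, number in tokens:
        counts[symbol] += int(number) if number else 1
    return dict(counts)

composition = parse_formula("C6H12O6")
```

Peptide and protein inputs would need an additional lookup of per-residue compositions before the same downstream computation could be applied.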

## 4 Conclusion and Outlook

MIDAs^{a} and MIDAs^{b}, for the 25 molecules tested, seem to be able to compute IDs quickly and accurately. Between the two, MIDAs^{a} seems slightly more accurate. For CGIDs, MIDAs^{b} appears to be faster (see Table 8), whereas for FGIDs the two are of comparable speed. Both algorithms benchmark well against existing methods and stand out because of their ability to compute CGIDs and FGIDs using a user-specified accuracy. These two algorithms were also shown to accurately compute IDs for molecules labeled with stable isotopes, which was not the case for some of the methods evaluated. In summary, in terms of CGID-derived average masses, MIDAs^{a}, MIDAs^{b}, JFC, and Emass yield smaller errors than other methods. In terms of CGID-derived standard deviations, our investigation shows that MIDAs^{a}, MIDAs^{b}, JFC, and Emass yield smaller errors than other methods. When computing the FGID, MIDAs^{a} computes an FGID that better resembles the exact FGID than MIDAs^{b} does under our evaluation gauges. Both algorithms described here were coded in the C++ programming language in a computer program called MIDAs that is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/MIDAs.html. To make these algorithms widely accessible, we have made them available through a user-friendly web interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.

Table 8. Computation Time in Seconds (s) and Number of Terms Reported with MIDAs’s Computed Coarse-Grained (CG) and Fine-Grained (FG) Isotopic Distributions (ID) Using 1.0 Da and 0.01 Da Mass Accuracy, Respectively

MIDAs^{a} | ||||

No. | Number of terms CGID | CGID time(s) | Number of terms FGID | FGID time(s) |

(1) | 85 | 0.006 | 42 | 0.001 |

(2) | 154 | 0.01 | 151 | 0.002 |

(3) | 186 | 0.02 | 246 | 0.001 |

(4) | 203 | 0.02 | 301 | 0.001 |

(5) | 290 | 0.04 | 809 | 0.006 |

(6) | 341 | 0.05 | 1269 | 0.01 |

(7) | 423 | 0.1 | 1945 | 0.05 |

(8) | 540 | 0.13 | 3145 | 0.06 |

(9) | 820 | 0.14 | 6579 | 0.3 |

(10) | 956 | 0.2 | 7850 | 0.4 |

(21) | 3022 | 0.35 | 13834 | 0.8 |

(22) | 5908 | 1.0 | 74994 | 1.0 |

(23) | 2706 | 0.23 | 28508 | 2.1 |

(24) | 617 | 0.05 | 2805 | 0.01 |

(25) | 2623 | 0.21 | 18261 | 0.5 |

MIDAs^{b} | ||||

No. | Number of terms CGID | CGID time(s) | Number of terms FGID | FGID time(s) |

(1) | 15 | 0.0006 | 29 | 0.025 |

(2) | 29 | 0.001 | 114 | 0.041 |

(3) | 38 | 0.001 | 193 | 0.043 |

(4) | 42 | 0.001 | 241 | 0.08 |

(5) | 78 | 0.002 | 740 | 0.08 |

(6) | 95 | 0.002 | 1166 | 0.14 |

(7) | 123 | 0.004 | 1784 | 0.14 |

(8) | 157 | 0.004 | 2953 | 0.14 |

(9) | 230 | 0.004 | 6527 | 0.3 |

(10) | 257 | 0.01 | 7818 | 0.3 |

(21) | 794 | 0.005 | 12405 | 0.4 |

(22) | 1500 | 0.01 | 74994 | 1.0 |

(23) | 706 | 0.01 | 23367 | 0.7 |

(24) | 188 | 0.003 | 2209 | 0.2 |

(25) | 698 | 0.01 | 16384 | 0.8 |

## Acknowledgments

The authors thank Alfred Yergey for sending them the NeutronCluster code, and Alan Rockwood for providing them with codes of Mercury, Emass, Qmass, and Mercury5. The authors thank the administrative group of the National Institutes of Health Biowulf Clusters, where all the computational tasks were carried out. They also thank the National Institutes of Health Fellows Editorial Board for editorial assistance. This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health. Funding for Open Access publication charges for this article was provided by the National Institutes of Health.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.