Molecular Isotopic Distribution Analysis (MIDAs) with Adjustable Mass Accuracy

In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs with that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs’s algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs’s polynomial algorithm, and MIDAs’s Fourier transform algorithm. Among the three, IC and MIDAs’s polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html .Graphical abstract ᅟ


Introduction
M ost biomolecules are composed of hydrogen, carbon, nitrogen, oxygen, and sulphur. It is known that the natural isotopes of these elements occur with different probabilities [1,2], and in some experiments the relative abundances of an element's isotopes can be manipulated by using a technique known as stable isotopic labeling [3,4]. The relative abundances of isotopes determine a molecule's isotopic distribution (ID), which can be measured experimentally using a mass spectrometer. The measured ID constrains the elemental composition when compared with the in-silico computed ID and, hence, helps in identifying the underlying molecule. The realization of this goal, however, demands accurate in-silico ID prediction [5][6][7][8][9][10][11].
The information content in an experimentally measured ID depends on the resolution of the mass spectrometer. An ID generated by a low resolution instrument contains less information than that by an ultra-high resolution instrument [12][13][14][15]. Based on the instrument resolution, three different types of IDs are commonly mentioned in the literature: the aggregated, the fine structure, and the hyper-fine structure IDs [16]. The aggregated ID is computed by merging isotopic variants that have the same nucleon number into one aggregated isotopic variant [17,18] whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. The fine and hyper-fine structure IDs are computed similarly to the aggregated ID, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy.
To make practical use of experimentally measured IDs, it is imperative to have methods that can compute in-silico IDs when given molecular formulas. Rockwood et al. [19] mentioned several criteria for a sound ID-computing method (IDCM): an IDCM must accurately compute in a very short time the masses and intensities without consuming much computational resource. We propose a few additional criteria by which to assess an IDCM's application value: to handle experimentally generated IDs from both low-resolution and high-resolution instruments, an IDCM should allow adjustable mass accuracy; given that customized isotopic labeling has become a common experimental technique for quantitative analyses, an IDCM should be able to handle customized (or user-specified) isotopic abundances (or occurrence probabilities) of all chemical elements considered; finally, an IDCM should be able to compute IDs for a wide mass range and be user-friendly. Although there are several available methods [16] that can compute an aggregated ID [17,18,[20][21][22][23], fine structure ID [19], and hyper-fine structure ID [23][24][25][26], there are not many methods that can satisfy all the requirements mentioned above.
In this manuscript, we present MIDAs, a software tool satisfying all the requirements above. MIDAs provides users with two accurate and efficient algorithms to compute IDs: the first algorithms belongs to the class of polynomial methods [27,28], whereas the other algorithm belongs to the class of Fourier transform methods [29,30]. The latter consists mainly of changes made to the existing Fourier transform method [19], and the changes made are shown to improve significantly the accuracy of the computed ID. Both algorithms can compute low and high resolution IDs, referred to as the coarse-grained isotopic distribution (CGID) and the fine-grained isotopic distribution (FGID), respectively, for the remainder of this manuscript. Also both algorithms implemented in MIDAs are capable of computing CGID and FGID with adjustable mass accuracy.
To evaluate the performance of MIDAs, we have benchmarked it against eight methods: four of these methods-Mercury [19], NeutronCluster (NC) [17], Emass [21], and BRAIN [18,31]-are the four best performing methods taking from a recent publication by Claesen et al. [18]; four other methods included are Mercury5 (a new version of Mercury2) [32], Qmass [20], Isotope Calculator (IC) [33], and a Fouriertransform-based method recently published [34], which we refer to as JFC. JFC is an improved version of Isotopica [35], which incorporates BRAIN's generating function. The program of JFC was downloaded from http://bioinformatica.cigb.edu.cu/ isotopica/centermass.html. The BRAIN code was downloaded from http://www.bioconductor.org/packages/release/bioc/html/ BRAIN.html. The program IC was downloaded from http:// agarlabs.com/. The rest of the programs were provided by the code authors, whom we acknowledge in the Acknowledgment section.
The performance evaluation was conducted using 25 molecules. Ten of these molecules are benchmark proteins previously used to evaluate the accuracy of computed CGIDs [17,18]. Another 10 are hydrocarbon molecules whose CGIDs and FGIDs can be exactly computed, making them ideal for evaluating the accuracy of computed IDs. The remaining five molecules, made of a combination of sulfur, mercury, carbon and hydrogen, are used together with some of the other 20 molecules to evaluate the computational time of MIDAs's algorithms. Results from our investigation show that MIDAs [both the polynomial-based algorithm (MIDAs a ) and the Fourier-transform-based algorithm (MIDAs b )], Emass, and JFC compute CGIDs with equivalent accuracy and are more accurate than the other methods evaluated. When computing the FGIDs, IC and MIDAs a yield FGIDs that are closest to the exact FGIDs. The results also show that MIDAs a and MIDAs b satisfy all aforementioned requirements to be considered a valuable tool, providing the community with two new options for computing accurate IDs.

Methods
In the subsections below we explain in detail the two algorithms implemented in MIDAs. The first subsection explains MIDAs a , a polynomial-based algorithm. The second subsection describes MIDAs b , a fast Fourier transform (FFT) based algorithm. Both algorithms can be used to compute CGIDs and FGIDs.

MIDAs Polynomial Multiplication Algorithm (MIDAs a )
It is well known that the ID of a molecule can be obtained by expanding the corresponding product of polynomials: each expanded term corresponds to an isotopic composition of the molecule's elements. For example, the ID of a molecule having molecular formula (MF) x N y M is given by expanding where I is an indicator variable, x i and y i are the isotopes of elements x and y, respectively, p(x i ) and p(y i ) are normalized probabilities of occurrence, and m(x i ) and m(y i ) are the exact atomic masses. There are several polynomial-based methods designed to compute an ID from the MF. Methods such as the stepwise procedure and its improvement [36,37], symbolic expansion [4], and multinomial expansion [28,38] have been proposed to compute the expansion of the above polynomial. Although these methods have been shown to perform well for small molecules, they fail to handle large molecules, yielding inaccurate IDs, requiring a significant amount of computer memory, and taking a considerable amount of computational time [16].
Here we present MIDAs a , a polynomial-based algorithm that is simple and easy to understand. Our algorithm computes the molecule's CGID by directly performing polynomial multiplication. To simplify the explanation, define the polynomial between the brackets in Equation (1) containing the probabilities and atomic masses of an element's isotopes as the element fundamental polynomial (EFP). Let us represent the EFP of element x by P x , and also define the following recursion operation that multiplies together polynomials Q x and P x and assigns the resulting polynomial back to Q x as Q x ←(Q x ×P x ). (1) with Q x initialized to one gives

Substituting these definitions in Equation
where ⌊z⌋ represents the integer part of z for any positive number z. Using the recursion operation mentioned earlier, all the x-element related polynomials finally merge into Q x and all the y-element related polynomials finally merge into Q y as shown in algorithms 1 and 2. By first computing P x ½ Nx 10 b c in Equation (2), one considerably reduces the computational time needed to obtain the polynomial expansion of an EFP. The logic in computing P that the former requires a smaller number of arithmetic operations. This is due to two heuristic procedures of MIDAS a , prune and merge, which reduce the number of retained terms in the expanded polynomial P x ½ Nx 10 b c . These heuristics are similar to the hyperatom concept [37] and the superatom concept [10], except that the number of atoms ⌊N/10⌋ in a superstructure is not fixed in our case. The choice of using 10 in Equation (2) was somewhat arbitrary but seemed to generate an accurate ID for each molecule used in our investigation in a short amount of computational time. Evidently, one may use a number other than 10. Choosing a smaller number, however, means that we need a larger memory to hold Q. Choosing a larger number, on the other hand, results in a longer computation time. We find using the number 10 seems to provide a good balance between the two. The first heuristic employed by the MIDAs a algorithm prunes terms from the polynomial Q that have probability smaller than a pre-set probability value (η). The second heuristic procedure merges polynomial terms from Q that are within some user specified mass accuracy (ϵ) of each other into a new polynomial term. The new polynomial term is assigned a new mass (m ) that is equal to the probabilityweighted sum of the merged terms where m i and p i stand for the mass and probability of the merged terms, respectively. This new term associated with m is then assigned a probability equal to the sum of the probabilities of the merged terms. The pseudo-code for computing a CGID is given by algorithm 1, which is used by MIDAs a .

Algorithm 1. Computes Coarse-Grained Isotopic Distribution
To compute the FGID for an MF, for every element x, MIDAs a first computes the expected number of occurrences μ[x i ] for each isotope x i of x. MIDAs a then computes σ 2 [x i ], the variance of the number of occurrences. As an example, for the molecular formula x N y M , the expectation and variance in the number of atoms for a given isotope of element x is given by and Using the computed expectation and variance values, we denote the range B x i ð Þ; U x i ð Þ ½ as allowable for N x i ð Þ , the number of atoms of isotope x i . The upper bound U x i ð Þ and the lower bound B x i ð Þ are given by For isotope x i , we choose 10 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi to be the span of sum as this quantity is guaranteed to be simultaneously larger then 10σ [x i ] and 10 daltons. For each element x, the U x i ð Þs and B x i ð Þs are used to construct a polynomial, e P x , by means of the multinomial expansion formula By summing only the contributions bounded by B and U , we direct the calculations to the relevant part of the ID. It has counter-part in FT based method, namely the heterodyning of in [24].
For numerical accuracy and efficiency we employ the following simple identity and ln(n!) = ∑ k = 1 n lnk. This representation reduces computational time of Equation (7) since by tabulation one only enumerates all the logarithmic terms in Equation (8) once. Once all e P x s have been computed, they are used together with a user specified ϵ to compute a FGID using algorithm 2.

MIDAs Fast Fourier Transform Algorithm (MIDAs b )
The MIDAs b algorithm is similar to an early FFT algorithm by Rockwood et al. [19], which was implemented in a computer program called Mercury. These two algorithms differ, however, in a few aspects. First, using the exact isotopic masses in discrete FFT (DFFT) [39,40], Mercury produces IDs with leakages (assigning nonzero probabilities to masses where exactly zero probability is expected) and employs an apodization function to minimize leakage [41]. On the other hand, by assigning each isotope mass to a point on a fixed grid, MIDAs b avoids the leakage problem. Using discrete masses to avoid leakage is not new: Rockwood and Van Orden [32] have written a computer program, whose latest version is called Mercury5, to compute IDs based on the nucleon numbers (or roughly using one dalton mass grid). The improvement we made was to allow the users to specify the mass accuracy other than 1 Da. Second, Mercury uses a fixed number of sample points with the DFFT, whereas in MIDAs b the number of sample points used depends on the mass accuracy, which is a parameter adjustable by the user.
Every FFT based method relies on the convolution theorem, which states that a convolution can be performed as multiplication in the Fourier domain.
As we shall discuss in the Appendix, there are two key conditions in order for the convolution theorem to be used in the discrete case while computing IDs. The first one is that the masses of each isotope must lie on grid points. Using a mass that is not on the grid causes the "leakage" phenomenon [41]. If the masses considered all reside on grid points, the leakage problem no longer exists. The second important condition is that the mass domain must be large enough so that the "folded-back" phenomenon (which is also known as "aliasing", "fold over", or "wrap around" in the signal processing community) near the tail of the distribution is negligible (see Appendix).
Prior to delving into detail constructs of MIDAs b , let us first describe how one may compute the theoretical molecular mass variance σ MM 2 . Using our example molecule x N y M , one note that the molecular mass variance of this molecule can be rigorously written as σ MM is the molecular mass variance associated with element x. Explicitly, one may calculate σ 2 [m x ] as follows where the index i runs over all isotopes of element x and p(x i ) again represents the occurrence probability of isotope x i .
A key constraint of DFFT based ID method is that the total number of sample points, denoted by S, must be an integral power of two [42]. For a given molecule and specified mass accuracy ϵ, the total number of sample points S used in MIDAs b 's DFFT is given by and σ MM 2 is the theoretical variance in MM due to the elements' isotopes [32]. The quantity ⌈z⌉ represents the smallest integer that is larger than z for any positive number z. Again the quantity 15 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 þ σ 2 MM p 9 max 15; 15σ MM ð Þ is chosen so that S covers on both ends more than 7.5 standard deviations from the mean molecular mass, which prevents folded-back mass regions from having significant probabilities.
In order to avoid the problem of leakage, instead of using exact masses of isotopes and then applying filtering windows, we pin all isotopic masses to grid points. For each isotope mass m(x i ), we first find a corresponding grid index n(x i ) by the following formula Using this discrete approach, the probability function of the mass of element x becomes where n(x i )ϵ is the approximate expression for the exact mass m(x i ), and the Kronecker delta function takes value one if its two indices coincide and zero otherwise.
Consider the mass distribution of our example molecule x N y M . By the convolution theorem, the Fourier transform of the mass distribution, denoted by Ψ(v), can be written as where v takes S discrete values: 0,1,…, S − 1. The sample function Ψ(v) is heterodyned to have zero average mass by multiplying it by e −2πin o v=S , where n o is equal to n (the molecule's probability-weighted average grid index computed using n(x i ) and p(x i )) rounded to the nearest integer.
Once the function Ψ(v) has been calculated, three other operations are performed in order to generate the final FGID and CGID. The first operation performed is the inverse discrete fast Fourier transform (IDFFT), which transforms the sample function Ψ(v) to Φ(n) on the mass grid. Second, we apply a denoising procedure to remove small amplitudes due to rounding errors that occur during IDFFT. The rounding errors are expected to create small positive and negative amplitudes of equal amounts in the mass domain. MIDAs b thus removes all amplitudes whose absolute magnitude are smaller than that of the most negative amplitude. As a matter of fact, to be more conservative, MIDAs b uses an amplitude cutoff value that is twice the absolute value of the most negative amplitude. This means that only terms having amplitude greater than the cutoff value are reported in a computed FGID and CGID, with the amplitude values renormalized to sum to one. Figure 1 shows an example of the overlap between the positive amplitude histogram and the negative amplitude histogram. Right below the cutoff absolute amplitude, we see that the two histograms resemble each other, reflecting the fact that rounding errors have equal probability to be positive and negative. Following Rockwood and Van Orden [32], in the third step, MIDAs b applies a linear transformation to rescale the masses associated with the IDs to ensure a good agreement between the theoretically calculated and the numerically computed mean molecular mass as well as standard deviation of the molecular mass. The procedure described above is summarized in the pseudo code give by algorithm 3.

Results and Discussion
All methods used in our investigation were evaluated using their default parameter settings, except for a few parameter changes made to ensure that the atomic masses and abundances of elements' isotopes were the same for all methods (see Table 1). To conduct the evaluation, we used the 22 biomolecules and three inorganic compounds, all of which are listed in Table 2. Ten of these biomolecules, (1)-(10), are benchmark proteins previously used to evaluate the computed CGID [17,18]; 10 biomolecules, (11)- (20), are hydrocarbon molecules whose FGID can be exactly computed and were employed to evaluate the FGID; the remaining five molecules, (21)-(25), made of a combination of sulfur, mercury, carbon, and hydrogen, are used together with the some of the other 20 biomolecules to evaluate the computational time of MIDAs's algorithms.
IC is suitable for computing FGIDs, not CGIDs. Qmass, BRAIN, NeutronCluster, and Emass are suitable for computing CGIDs, not FGIDs. The remaining three Fouriertransform-based methods are also suitable for computing CGIDs, although Mercury is the only one that has FGID computing capacity. To benchmark the FGIDs computed by MIDAs against those of Mercury, however, would require post-processing of Mercury data files such as removing noise from leakage and rounding errors, as well as compiling output from different specified molecular masses. All of these steps may be done differently and make the benchmark test less meaningful. For these reasons, we only evaluated MIDAs's FGIDs against that of IC, not that of Mercury.

Benchmarking of Computed CGIDs
Following previous publications [18,19,24], the accuracy of a method is gauged by how accurately it yields ID mean, ID standard deviation, lightest and heaviest molecular masses, while computing a CGID. In our evaluation, the lightest mass and heaviest molecular mass are defined as a molecule's molecular mass computed using the masses of the lightest and heaviest isotopes, respectively.
Lightest masses comparisons for biomolecules, numbered (1)- (10) in Table 2, with elements having their naturally occurring isotopic abundances taken from Table 1, are displayed in Table 3. Unexpectedly, the lightest masses for the first six Reference number associated with a molecular formula (biomolecule or inorganic compound). 2 The unified atomic mass unit dalton (Da). molecules, reported by Mercury, Mercury5, and Qmass, are even lighter than their exact lightest masses, which should be the lightest masses possible in these six IDs. This observation was also described by [18]. For Mercury5 (and Mercury), this is caused by the rounding errors (and the leakage) when applying the DFFT. (In principle, both methods can avoid this problem by not reporting any terms in the computed ID that have masses lighter than the exact lightest mass.) For Qmass, this seems to arise from computing ID terms that are outside of the allowed mass range imposed by the biomolecule's MF. This is because in the Qmass output file the reported masses lighter than the exact lightest mass are associated with elemental compositions that differ from the biomolecule's MF used in the evaluation. The software NC reports correct lightest masses for nine out of the 10 molecules. For biomolecule number four, NC reports a mass that is 360 Da heavier. This same result has also been observed independently by others [17,18].
For MIDAs a , BRAIN, and Emass, the differences between exact and computed lightest masses, for small and medium size biomolecules [numbered (1)-(6)], are smaller than 1.0e-08 Da. As for JFC and MIDAs b , although they do not perform as well as the polynomial-based methods above, they are not inferior to other Fourier-transform-based methods such as Mercury and Mercury5. When the biomolecules become heavier [say molecules numbered (7)-(10)], the chance of experimentally observing the exact lightest masses rapidly decreases, and the computed difference between exact and computed lightest masses becomes less important.
The evaluation of getting the correct heaviest mass is not as important under natural conditions. This is because heavy isotopes typically carry very low natural occurrence proba-bilities so that it is impossible to observe the exact heaviest isotopic variant of the molecule. Of course, when artificial isotopic abundances are enforced, obtaining the correct heaviest masses can become important, while obtaining the correct lightest masses can become unimportant. Since the current evaluation is using the natural isotopic abundances, we do not expect any method to provide correct heaviest masses. Indeed, because most methods are computing terms of an ID that are concentrated around a molecule's average molecular mass, which is closer to the exact lightest mass under natural isotopic abundances, the mass range used for computing IDs usually will not include the heaviest masses. For biomolecules numbered (1)-(10), the differences between the exact heaviest masses and the heaviest masses computed by all methods considered are all of the same order of magnitude.
Displayed in the upper (lower) half of Table 4 are the relative differences of computed CGIDs derived molecular mass averages (standard deviations) to their theoretical values. Molecules numbered (1)-(10) in Table 2 are used with elements assuming isotopic abundances shown in Table 1. In terms of the average masses, MIDAs a,b , JFC, and Emass have comparable errors and have slightly smaller errors than the other methods. In terms of mass standard deviations, MIDAs a,b , JFC, and Emass have slightly smaller errors than the other methods. In principle, the accuracy of BRAIN might be improved by increasing the number of aggregated isotopic variants computed for each computed CGID. However, to accomplish this would require changing its default option. As mentioned earlier, to keep the benchmarking test simple, we only use the default option for each method considered. From Table 4, one can also infer that Qmass yields small errors for small and medium size molecules, but the error increases as the molecular mass increases. We have also considered the possibility of deviations from the natural frequencies of occurrence of an element's isotopes. Such customized modifications can be accomplished experi-mentally by a technique known as isotopic labeling [3], which is frequently employed in quantitative proteomics [44]. To mimic such a situation, we have computed CGIDs for various molecules assuming different carbon isotopic abundances: 99% 13 C and 1% 12 C as listed in Table 1. We then derive Reference number associated with a molecular formula (biomolecule). 2 Nothing reasonable reported (NR). from the computed CGIDs the average molecular masses and standard deviations, and compare them to the corresponding theoretical values that can be analytically calculated. The results of using the above mentioned customized carbon abundances are shown in Table 5. The differences displayed in Table 5 show that MIDAs a,b , JFC, and Emass have the smallest errors. However, in terms of ID's standard deviations, Mercury and Mercury5 yield comparable errors to MIDAs a,b , JFC, and Emass. The results for Qmass are similar to the ones obtained in Table 4: in terms of average masses and standard deviations, it yields small errors for small to medium sizes biomolecules. Table 5 also shows that the current versions of BRAIN and NC are not able to compute IDs using the modified isotopic abundances for carbon. However, the developers of NC have mentioned how NC could be modified to handle stable isotope enrichment by partition of the elements of enriched isotopes away from the equatransneutronic isotopes groups [17]. This option is not currently available in NC. Also, the proposed Table 6. Coarse-Grained Isotopic Distribution (CGID) Fidelity Assessment Results τ is the Number of Terms in the Exact CGID Having Probability Greater than 5e -12. Δτ is the Difference Between τ and the Number of Terms of a Computed CGID. Δχ is the Difference Between the Sum of Probability Terms from the Exact CGID and the Sum of Probability terms from the Computed CGID; σ m is the Root-Mean-Square Differences of Masses Between Exact and Computed CGID, see Equation (11)

Assessing Fidelity of Computed CGIDs and FGIDs
To evaluate the fidelity of CGIDs and FGIDs reported, we used 10 hydrocarbon molecules [numbered (11)- (20) in Table 2] because the "exact" CGIDs and FGIDs can be calculated for these molecules. Exact CGID is defined as follows. First, one merges isotopic variants that have the same nucleon number into one aggregated isotopic variant, whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. However, only aggregated isotopic variants having probability greater than 5e-12 were retained for accuracy evaluation. The exact FGIDs were obtained/defined similarly to the exact CGIDs, except that one merges only isotopic variants whose molecular mass differences are within some prespecified mass accuracy, here set to 0.01 Da. The probability cutoff of 5e-12, for typical sample loads, probably already surpasses the detection capability of current mass spectrometer. Furthermore, it is also a small enough cutoff that ignoring terms below the cutoff has negligible effect in the ID profile. Four quantities were then utilized to evaluate the fidelity of computed IDs. The first quantity is the difference in the numbers of terms (Δτ) kept by a computed ID and by its corresponding exact ID, be it the exact CGID or the exact FGID. The second quantity was the difference in the Table 7. Fine -Grained Isotopic Distribution (FGID) Fidelity Assessment Results τ is the number of terms in the exact FGID having probability greater than 5e -12; Δτ is the difference between τ and the number of terms of a computed FGID; Δχ is the difference between the sum of probability terms from the exact FGID and the sum of probability terms from the computed FGID; σ m is the root-mean-square differences of masses between exact and computed FGID, see Equation (11); U is the number of terms from the computed FGID that are not with ±2ϵ (ϵ=0.01 Da) from any terms in the exact FGID; E is the number of terms in the exact FGID that have at least one corresponding term in computed FGID that are with ±2ϵ; ρ is the weighted correlation between computed and exact FGID Reference number associated with a molecular formula (biomolecule). 2 Using a mass accuracy of 2ϵ the equivalent resolution in parts per million.
probability sums (Δχ), one from the computed ID and the other from the exact FGID (or the exact CGID). The third quantity was the root-mean-square differences of masses (σ m ) between computed and the exact CGIDs (or the exact FGIDs).
In the equation above, m i represents a computed mass term while each m j represents a mass term in the exact FGID (or the exact CGID). N is the number of terms retained in the computed ID. That is, for every mass term in a computed ID, the closest mass term within the exact FGID (or the exact CGID) is found and their difference square is summed. The average of such sum of squares constitutes σ m 2 . The fourth quantity computed was the weighted correlation (ρ) between computed and exact IDs. The weighted correlation (ρ) is defined as follows. Let p(m i ) and p(m j ) be the terms of a computed ID and the corresponding exact FGID (or exact CGID), respectively. We first introduce the weight (w ij ) between a computed ID term (index i) and exact FGID (or exact CGID) term (index j) as where, in the above equation, min j |m i −m j |, is the minimum mass difference between a term (m i ) from the computed ID and terms (m j ) from the exact ID. The computed weights (w ij ) are then normalized by the normalization factor, W j = ∑ i w ij , by summing over all i terms from the computed ID that are close to a common term j in the exact FGID (or the exact CGID). The weighted correlation using the above definitions is given by For CGIDs, ϵ is set to one Da, while for FGIDs, ϵ is set to 0.01 Da.
Using molecules numbered (11)- (20) in Table 2, we document the analysis results of the four quantities mentioned above in Table 6 (for CGIDs) and Table 7 (for FGIDs). For CGIDs, we include in Table 6 only four methods that largely satisfy the  Table 7 since they are the only methods that can do FGID-computing reasonably fast and without additional post processing. For fidelity assessment of CGIDs, all four methods shown in Table 6 yield small Δτ and ρ values close to one. In terms of σ m and Δχ, more differences are revealed. Emass always yields small σ m , reflecting good fidelity in terms of mass locations, but seems to give a larger |Δχ|, reflecting less accuracy in amplitudes. JFC and MIDAs b seem to yield less precise mass locations, evidenced by a larger σ m , but seem to provide more accurate amplitudes, evidenced by a smaller |Δχ|. MIDAs a yields both accurate mass locations and accurate amplitudes.
The values of Δχ and σ m in Table 7 indicate that IC, MIDAs a , and MIDAs b report FGID terms with similar mass accuracy and with probability sums that are close to the expected value. For small to medium molecules, numbered (11)-(15), IC, MIDAs a , and MIDAs b have equivalently accurate results. For molecules numbered (16)- (20), IC and MIDAs a have comparable performances, both slightly better than MIDAs b . The values for Δτ indicates that MIDAs b reports many more terms than expected in its computed FGID. Not expecting any leakage, MIDAs b gains these extra terms mainly due to rounding errors associated with the DFFT numerical procedure.
The difference observed in Δτ for MIDAs a is caused by the pruning and merging procedures employed by the algorithm. All the FGID terms computed by IC and MIDAs a are within 2ϵ from the exact FGID terms, which is shown by the number of unexplained term (U) being zero in Table 7. It is also true that most of the terms computed from MIDAs b are within 2ϵ from the exact FGID terms with the exception of molecules (17)- (20) where the number U ranges from 1 to 47. The computed weighted correlation also shows that for heavier molecules, (18)- (20), both IC and MIDAs a produce FGIDs that are more similar to the exact FGIDs than MIDAs b .
What causes MIDAs b to perform worse here might be related to the fact that pinning the elemental masses to grid points may introduce appreciable mass errors while computing IDs for larger molecules. In the worst case scenario, the mass error introduced is apparently proportional to the number of atoms contained in the molecule. Even though MIDAs b employs a mass rescaling [32] to bring the computed average masses and standard deviations close to their theoretical values, the linear mass rescaling is not sufficient to guarantee the full profile resemblance between the computed ID and the exact ID. The non-negligible discrepancy (indicated by the weighted correlation ρ not very close to one) between the computed FGID and the exact FGID for molecules (18)- (20) is reflecting this problem.

MIDAs Web Interface
MIDAs web interface http://www.ncbi.nlm.nih.gov/CBBresearch/ Yu/midas/index.html is user-friendly, but at the same time offers considerable flexibility. For example, in terms of the input molecule, the user may type in the box an elemental composition, a molecular formula, a peptide, or even a protein sequence. The program recognizes the input molecule in all formats above and extracts the corresponding elemental compositions for computing CGIDs and FGIDs. The isotopic abundances and elements' masses can also be customized within the web interface. The user simply clicks on the "change" button to edit the abundance table of all elements. Other fields that can be easily customized and specified by the user are the charge of the input molecule and the cutoff probability. MIDAs displays both CGID and FGID together using user-specified accuracies, one for each. The "algorithm" drop down box allows the user to select either the FFT or the polynomial algorithms. The output, including the lightest mass, theoretical average mass, theoretical mass standard deviation, computed average mass, computed mass standard deviation, FGID peak list, and CGID peak list can be exported to a flat file by clicking on the "download output" button on the result page. There is also a contextual help for every functional button.

Conclusion and Outlook
The two algorithms introduced here, MIDAs a and MIDAs b , for the 25 molecules tested, seem to be able to compute IDs quickly and accurately. Between the two, MIDAs a seems slightly more accurate. For CGIDs MIDAs b appears to be faster (see Table 8), whereas for FGIDs they are of comparable speed. Both algorithms benchmark well with existing methods and stand out because of their ability to compute CGIDs and FGIDs using a user-specified accuracy. These two algorithms were also shown to accurately compute IDs for molecules labeled with stable isotopes, which was not the case for some of the methods evaluated. In summary, in terms of CGIDs derived average masses, MIDAs a , MIDAs b , JFC, and Emass yield smaller errors than other methods. In terms of CGIDs derived standard deviation, our investigation shows that MIDAs a , MIDAs b , JFC, and Emass yield smaller errors than other methods. When computing the FGID, MIDAs a computes a FGID that better resembles the exact FGID than MIDAs b using our evaluation gauges. Both algorithms described here were coded using the C++ programming language in a computer program called MIDAs that is available for download at http:// www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ MIDAs.html. To make these algorithms widely accessible, we have made them available through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/ index.html.
Funding for Open Access publication charges for this article was provided by the National Institutes of Health.

Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Using Convolution theorem in Discrete Fourier Transform
For completeness, we first review a few important properties of the DFT. Consider a function H sampled at L equally-spaced values. We shall denote H(x = nϵ) by H n with n = 0, 1, 2, …, L−1. One may then consider the DFT of this function by constructing with k = 0, 1, 2, …, L−1. With h k given, one can also invert the expression above to yield where the identity (when d is an integer) X LÀ1 n¼0 e 2πidn=L ¼ Lδ d;0 is used. Evidently, ϵ is the spacing between each pair of sampled values along the variable x. Although we only specify H on L points, using the Fourier expression of H n , we can easily see that H n+L = H n . That is, the DFT effectively makes the function considered, say H, periodic with period L. Given two periodic functions H and G of period L, we may consider the following convolution We can now compute where we have employed the periodic property of H and is the Fourier transform of G. The inverse transform of w k of course leads to W n without any leakage issue involved. When applying the DFT to compute an ID, however, one also need to pay attention to the issue of folded-back. To illustrate this problem, let us consider a toy example where an element E has two isotopes with masses ϵ and 2ϵ. For simplicity, let us also assume that both isotopes occur with equal probability 1/2. If one chooses to use grid size L=4 and compute the ID of the molecule E 2 using DFT, one starts with the following EFP and raise it to the second power to yield w k E 2 ð Þ ¼ 1 2 2 e À4πik=4 þ 2e À6πik=4 þ e À8πik=4 ¼ 1 2 2 e À4πik=4 þ 2e À6πik=4 þ 1 which yields, upon inverting back to the mass domain, probability 1/4 for masses 2ϵ and zero and probability 1/2 for mass 3ϵ. The reason for a zero mass is due to the periodicity. By identifying zero with 4ϵ, one gains the results expected: a probability of 1/4 for both masses 2ϵ and 4ϵ. Absent the molecular masses of 1ϵ and 5ϵ and so on, we see that the mass distribution of the molecule E 2 is resolved and appears correctly within the mass range from ϵ to 4ϵ. However, if one continues to keep the grid size L = 4 while considering the molecule E 4 , the Fourier transform of the molecule's mass distribution becomes e À8πik=4 2 4 1 þ 4e À2πik=4 þ 6e À4πik=4 þ 4e −6πik=4 þ e À8πik=4 ¼ 1 2 4 1 þ 4e À2πik=4 þ 6e À4πik=4 þ 4e À6πik=4 þ 1 ; which upon inversion yields masses zero, ϵ, 2ϵ, and 3ϵ respectively with probabilities 2/2 4 , 4/2 4 , 6/2 4 , and 4/2 4 . Remembering the periodicity, one may recognize that it is the set of masses 4ϵ, 5ϵ, 6ϵ, and 7ϵ (instead of zero, ϵ, 2ϵ, and 3ϵ) that acquires the set of probabilities 2/2 4 , 4/2 4 , 6/2 4 , and 4/2 4 . However, a simple calculation yields the possible masses to be 4ϵ, 5ϵ, 6ϵ, 7ϵ, and 8ϵ with respective probabilities 1/2 4 , 4/2 4 , 6/2 4 , 4/2 4 and 1/2 4 . What has happened is that the mass 8ϵ is now foldedback to 4ϵ due to the inherent periodicity caused by DFT with L=4. With this illustrative example, one can see that in order to avoid the folded-back artifact, one needs to have enough sample points so that the mass range used for the DFT is larger than the mass span of the molecule considered. However, if the tails of the mass distribution have very small probabilities, one might be able to use a smaller number of sample points with only a weak folded-back effect that only causes negligible distortion on the ID profile.
In general, when the number L is fixed, the folded-back problem should be less severe for the CGID when compare to its FGID counter-part. This is because if one keeps L fixed but decreases the mass difference between adjacent points, the effective mass range shrinks and there exists the possibility when regions with significant probabilities are now folded back to a particular mass window, where much smaller probabilities are assumed if no folded-back occurs. It is for this reason that MIDAs does not fix the number of sampled points, but rather increases it in proportion to 1/ϵ.