Abstract
In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs against that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs’s algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs’s polynomial algorithm, and MIDAs’s Fourier transform algorithm. Among the three, IC and MIDAs’s polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.
![](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs13361-013-0733-7/MediaObjects/13361_2013_733_Figa_HTML.gif)
1 Introduction
Most biomolecules are composed of hydrogen, carbon, nitrogen, oxygen, and sulphur. It is known that the natural isotopes of these elements occur with different probabilities [1, 2], and in some experiments the relative abundances of an element’s isotopes can be manipulated by using a technique known as stable isotopic labeling [3, 4]. The relative abundances of isotopes determine a molecule’s isotopic distribution (ID), which can be measured experimentally using a mass spectrometer. The measured ID constrains the elemental composition when compared with the in-silico computed ID and, hence, helps in identifying the underlying molecule. The realization of this goal, however, demands accurate in-silico ID prediction [5–11].
The information content in an experimentally measured ID depends on the resolution of the mass spectrometer. An ID generated by a low resolution instrument contains less information than that by an ultra-high resolution instrument [12–15]. Based on the instrument resolution, three different types of IDs are commonly mentioned in the literature: the aggregated, the fine structure, and the hyper-fine structure IDs [16]. The aggregated ID is computed by merging isotopic variants that have the same nucleon number into one aggregated isotopic variant [17, 18] whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. The fine and hyper-fine structure IDs are computed similarly to the aggregated ID, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy.
To make practical use of experimentally measured IDs, it is imperative to have methods that can compute in-silico IDs when given molecular formulas. Rockwood et al. [19] mentioned several criteria for a sound ID-computing method (IDCM): an IDCM must accurately compute in a very short time the masses and intensities without consuming much computational resource. We propose a few additional criteria by which to assess an IDCM’s application value: to handle experimentally generated IDs from both low-resolution and high-resolution instruments, an IDCM should allow adjustable mass accuracy; given that customized isotopic labeling has become a common experimental technique for quantitative analyses, an IDCM should be able to handle customized (or user-specified) isotopic abundances (or occurrence probabilities) of all chemical elements considered; finally, an IDCM should be able to compute IDs for a wide mass range and be user-friendly. Although there are several available methods [16] that can compute an aggregated ID [17, 18, 20–23], fine structure ID [19], and hyper-fine structure ID [23–26], there are not many methods that can satisfy all the requirements mentioned above.
In this manuscript, we present MIDAs, a software tool satisfying all the requirements above. MIDAs provides users with two accurate and efficient algorithms to compute IDs: the first algorithm belongs to the class of polynomial methods [27, 28], whereas the other algorithm belongs to the class of Fourier transform methods [29, 30]. The latter consists mainly of changes made to the existing Fourier transform method [19], and the changes made are shown to improve significantly the accuracy of the computed ID. Both algorithms can compute low and high resolution IDs, referred to as the coarse-grained isotopic distribution (CGID) and the fine-grained isotopic distribution (FGID), respectively, for the remainder of this manuscript. Both algorithms implemented in MIDAs are also capable of computing CGIDs and FGIDs with adjustable mass accuracy.
To evaluate the performance of MIDAs, we have benchmarked it against eight methods. Four of these methods, Mercury [19], NeutronCluster (NC) [17], Emass [21], and BRAIN [18, 31], are the four best performing methods taken from a recent publication by Claesen et al. [18]; the four other methods included are Mercury5 (a new version of Mercury2) [32], Qmass [20], Isotope Calculator (IC) [33], and a recently published Fourier-transform-based method [34], which we refer to as JFC. JFC is an improved version of Isotopica [35], which incorporates BRAIN’s generating function. The program of JFC was downloaded from http://bioinformatica.cigb.edu.cu/isotopica/centermass.html. The BRAIN code was downloaded from http://www.bioconductor.org/packages/release/bioc/html/BRAIN.html. The program IC was downloaded from http://agarlabs.com/. The rest of the programs were provided by the code authors, whom we acknowledge in the Acknowledgment section.
The performance evaluation was conducted using 25 molecules. Ten of these molecules are benchmark proteins previously used to evaluate the accuracy of computed CGIDs [17, 18]. Another 10 are hydrocarbon molecules whose CGIDs and FGIDs can be exactly computed, making them ideal for evaluating the accuracy of computed IDs. The remaining five molecules, made of a combination of sulfur, mercury, carbon and hydrogen, are used together with some of the other 20 molecules to evaluate the computational time of MIDAs’s algorithms. Results from our investigation show that MIDAs [both the polynomial-based algorithm (MIDAsa) and the Fourier-transform-based algorithm (MIDAsb)], Emass, and JFC compute CGIDs with equivalent accuracy and are more accurate than the other methods evaluated. When computing the FGIDs, IC and MIDAsa yield FGIDs that are closest to the exact FGIDs. The results also show that MIDAsa and MIDAsb satisfy all aforementioned requirements to be considered a valuable tool, providing the community with two new options for computing accurate IDs.
2 Methods
In the subsections below we explain in detail the two algorithms implemented in MIDAs. The first subsection explains MIDAsa, a polynomial-based algorithm. The second subsection describes MIDAsb, a fast Fourier transform (FFT) based algorithm. Both algorithms can be used to compute CGIDs and FGIDs.
2.1 MIDAs Polynomial Multiplication Algorithm (MIDAsa)
It is well known that the ID of a molecule can be obtained by expanding the corresponding product of polynomials: each expanded term corresponds to an isotopic composition of the molecule’s elements. For example, the ID of a molecule having molecular formula (MF) \( x_N y_M \) is given by expanding
$$ {\left[\sum_i p\left({x}_i\right){I}^{m\left({x}_i\right)}\right]}^N{\left[\sum_i p\left({y}_i\right){I}^{m\left({y}_i\right)}\right]}^M \qquad (1) $$
where \( I \) is an indicator variable, \( x_i \) and \( y_i \) are the isotopes of elements x and y, respectively, \( p(x_i) \) and \( p(y_i) \) are normalized probabilities of occurrence, and \( m(x_i) \) and \( m(y_i) \) are the exact atomic masses.
There are several polynomial-based methods designed to compute an ID from the MF. Methods such as the stepwise procedure and its improvement [36, 37], symbolic expansion [4], and multinomial expansion [28, 38] have been proposed to compute the expansion of the above polynomial. Although these methods have been shown to perform well for small molecules, they fail to handle large molecules, yielding inaccurate IDs, requiring a significant amount of computer memory, and taking a considerable amount of computational time [16].
Here we present MIDAsa, a polynomial-based algorithm that is simple and easy to understand. Our algorithm computes the molecule’s CGID by directly performing polynomial multiplication. To simplify the explanation, define the polynomial between the brackets in Equation (1) containing the probabilities and atomic masses of an element’s isotopes as the element fundamental polynomial (EFP). Let us represent the EFP of element x by \( {\mathbf{P}}_x \), and also define the following recursion operation that multiplies together polynomials \( {\mathbf{Q}}_x \) and \( {\mathbf{P}}_x \) and assigns the resulting polynomial back to \( {\mathbf{Q}}_x \) as \( {\mathbf{Q}}_x\leftarrow \left({\mathbf{Q}}_x\times {\mathbf{P}}_x\right) \). Substituting these definitions in Equation (1) with \( {\mathbf{Q}}_x \) initialized to one gives
$$ {\left[{\mathbf{P}}_x\right]}^N{\left[{\mathbf{P}}_y\right]}^M={\left({\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor}\times {\left({\left[{\mathbf{P}}_y\right]}^{\left\lfloor \frac{M}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor } \qquad (2) $$
where \( \lfloor z\rfloor \) represents the integer part of \( z \) for any positive number \( z \). Using the recursion operation mentioned earlier, all the x-element related polynomials finally merge into \( {\mathbf{Q}}_x \) and all the y-element related polynomials finally merge into \( {\mathbf{Q}}_y \), as shown in algorithms 1 and 2. By first computing \( {\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor } \) in Equation (2), one considerably reduces the computational time needed to obtain the polynomial expansion of an EFP. The reason for computing \( {\left({\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_x\right]}^{N-10\left\lfloor \frac{N}{10}\right\rfloor } \) (or \( {\left({\left[{\mathbf{P}}_y\right]}^{\left\lfloor \frac{M}{10}\right\rfloor}\right)}^{10}\times {\left[{\mathbf{P}}_y\right]}^{M-10\left\lfloor \frac{M}{10}\right\rfloor } \)) rather than \( {\left[{\mathbf{P}}_x\right]}^N \) (or \( {\left[{\mathbf{P}}_y\right]}^M \)) is that the former requires fewer arithmetic operations. This is due to two heuristic procedures of MIDAsa, prune and merge, which reduce the number of retained terms in the expanded polynomial \( {\left[{\mathbf{P}}_x\right]}^{\left\lfloor \frac{N}{10}\right\rfloor } \). These heuristics are similar to the hyperatom concept [37] and the superatom concept [10], except that the number of atoms \( \lfloor N/10\rfloor \) in a superstructure is not fixed in our case. The choice of 10 in Equation (2) was somewhat arbitrary but generated an accurate ID, in a short amount of computational time, for each molecule used in our investigation. Evidently, one may use a number other than 10. Choosing a smaller number, however, means that we need a larger memory to hold \( \mathbf{Q} \). Choosing a larger number, on the other hand, results in a longer computation time. We find that the number 10 seems to provide a good balance between the two.
The first heuristic employed by the MIDAsa algorithm prunes terms from the polynomial \( \mathbf{Q} \) that have probability smaller than a pre-set probability value (\( \eta \)). The second heuristic procedure merges polynomial terms from \( \mathbf{Q} \) that are within some user-specified mass accuracy (\( \epsilon \)) of each other into a new polynomial term. The new polynomial term is assigned a new mass (\( \overline{m} \)) that is equal to the probability-weighted mean mass of the merged terms
$$ \overline{m}=\frac{\sum_i{m}_i{p}_i}{\sum_i{p}_i} $$
where \( m_i \) and \( p_i \) stand for the mass and probability of the merged terms, respectively. This new term associated with \( \overline{m} \) is then assigned a probability equal to the sum of the probabilities of the merged terms. The pseudo-code for computing a CGID is given by algorithm 1, which is used by MIDAsa.
![figure b](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs13361-013-0733-7/MediaObjects/13361_2013_733_Figb_HTML.gif)
Algorithm 1. Computes Coarse-Grained Isotopic Distribution
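The prune-and-merge multiplication described above can be sketched in a few lines of Python. This is a minimal illustration, not the MIDAs source: the function names, the isotope values, the default thresholds, and the rule of grouping terms relative to the first mass of each group are all assumptions made for the example.

```python
# Illustrative isotope data as (mass in Da, probability). These values are
# assumptions for the example, not necessarily those of Table 1.
CARBON = [(12.0, 0.9893), (13.0033548378, 0.0107)]


def multiply(q, p):
    """Expand the product of two polynomials stored as (mass, probability) terms."""
    return [(mq + mp, pq * pp) for mq, pq in q for mp, pp in p]


def prune(q, eta):
    """Drop terms whose probability is below the threshold eta."""
    return [(m, p) for m, p in q if p >= eta]


def collapse(group):
    """Replace a group of terms by one term: weighted mean mass, summed probability."""
    total = sum(p for _, p in group)
    return (sum(m * p for m, p in group) / total, total)


def merge(q, eps):
    """Merge terms lying within eps of the first mass of the current group
    (an assumed grouping rule; other rules are possible)."""
    merged, group = [], []
    for m, p in sorted(q):
        if group and m - group[0][0] > eps:
            merged.append(collapse(group))
            group = []
        group.append((m, p))
    if group:
        merged.append(collapse(group))
    return merged


def power(p, n, eta, eps):
    """Compute P**n by repeated multiplication, pruning and merging each time."""
    q = [(0.0, 1.0)]  # the polynomial "one"
    for _ in range(n):
        q = merge(prune(multiply(q, p), eta), eps)
    return q


def element_polynomial(p, n, eta=1e-30, eps=0.5):
    """Evaluate (P**floor(n/10))**10 * P**(n - 10*floor(n/10))."""
    block = power(p, n // 10, eta, eps)
    q = power(block, 10, eta, eps)
    rest = power(p, n - 10 * (n // 10), eta, eps)
    return merge(prune(multiply(q, rest), eta), eps)


if __name__ == "__main__":
    for mass, prob in element_polynomial(CARBON, 20)[:4]:
        print(f"{mass:.6f}  {prob:.6e}")
```

Because merged terms carry the probability-weighted mean mass, the mean of the distribution is preserved by merging; only pruning removes probability, which is why a very small `eta` is used here.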
To compute the FGID for an MF, for every element x, MIDAsa first computes the expected number of occurrences \( \mu \left[{x}_i\right] \) for each isotope \( x_i \) of x. MIDAsa then computes \( {\sigma}^2\left[{x}_i\right] \), the variance of the number of occurrences. As an example, for the molecular formula \( x_N y_M \), the expectation and variance of the number of atoms for a given isotope of element x are given by
$$ \mu \left[{x}_i\right]=N\,p\left({x}_i\right) $$
and
$$ {\sigma}^2\left[{x}_i\right]=N\,p\left({x}_i\right)\left(1-p\left({x}_i\right)\right) $$
Using the computed expectation and variance values, we denote the range \( \left[\mathcal{B}\left({x}_i\right),\mathcal{U}\left({x}_i\right)\right] \) as allowable for \( \mathcal{N}\left({x}_i\right) \), the number of atoms of isotope \( x_i \). The upper bound \( \mathcal{U}\left({x}_i\right) \) and the lower bound \( \mathcal{B}\left({x}_i\right) \) are given by
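For isotope \( x_i \), we choose \( 10\sqrt{\left(1+{\sigma}^2\left[{x}_i\right]\right)} \) to be the span of the sum, as this quantity is guaranteed to be simultaneously larger than \( 10\sigma \left[{x}_i\right] \) and 10 daltons. For each element x, the \( \mathcal{U}\left({x}_i\right)\mathrm{s} \) and \( \mathcal{B}\left({x}_i\right)\mathrm{s} \) are used to construct a polynomial, \( {\tilde{\mathbf{P}}}_x \), by means of the multinomial expansion formula
$$ {\tilde{\mathbf{P}}}_x=\sum_{\substack{\mathcal{B}\left({x}_i\right)\le \mathcal{N}\left({x}_i\right)\le \mathcal{U}\left({x}_i\right)\\ \sum_i\mathcal{N}\left({x}_i\right)=N}}\frac{N!}{\prod_i\mathcal{N}\left({x}_i\right)!}\prod_i{\left[p\left({x}_i\right){I}^{m\left({x}_i\right)}\right]}^{\mathcal{N}\left({x}_i\right)} \qquad (7) $$
By summing only the contributions bounded by \( \mathcal{B} \) and \( \mathcal{U} \), we direct the calculations to the relevant part of the ID. This has a counterpart in FT-based methods, namely the heterodyning described in [24].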
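For numerical accuracy and efficiency we employ the following simple identity
$$ \frac{N!}{\prod_i\mathcal{N}\left({x}_i\right)!}\prod_ip{\left({x}_i\right)}^{\mathcal{N}\left({x}_i\right)}=\exp \left[\ln \left(N!\right)-\sum_i\ln \left(\mathcal{N}\left({x}_i\right)!\right)+\sum_i\mathcal{N}\left({x}_i\right)\ln p\left({x}_i\right)\right] \qquad (8) $$
and \( \ln \left(n!\right)={\sum}_{k=1}^n\ln k \). This representation reduces the computational time of Equation (7) since, by tabulation, one enumerates all the logarithmic terms in Equation (8) only once. Once all \( {\tilde{\mathbf{P}}}_x\mathrm{s} \) have been computed, they are used together with a user-specified \( \epsilon \) to compute an FGID using algorithm 2.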
![figure c](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs13361-013-0733-7/MediaObjects/13361_2013_733_Figc_HTML.gif)
Algorithm 2. Computes Fine-Grained Isotopic Distribution
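The tabulated-logarithm evaluation of a single multinomial term, together with the isotope-count expectation and variance, can be illustrated as follows. This is a hedged sketch: the function names and the carbon values are assumptions, and the windowing by \( \mathcal{B} \) and \( \mathcal{U} \) is deliberately left out because the bound formula is not reproduced here.

```python
import math

# Illustrative isotope data as (mass in Da, probability); assumed values.
CARBON = [(12.0, 0.9893), (13.0033548378, 0.0107)]


def log_factorial_table(n_max):
    """Tabulate ln(k!) for k = 0..n_max using ln(n!) = sum of ln k."""
    table = [0.0] * (n_max + 1)
    for k in range(1, n_max + 1):
        table[k] = table[k - 1] + math.log(k)
    return table


def isotope_count_stats(n_atoms, prob):
    """Expectation and variance of the number of atoms of one isotope."""
    return n_atoms * prob, n_atoms * prob * (1.0 - prob)


def multinomial_term(counts, isotopes, ln_fact):
    """Mass and probability of one isotopic composition of a single element.

    counts[i] is the number of atoms of isotope i; the probability is
    evaluated in log space to avoid overflow in the factorials.
    """
    n_atoms = sum(counts)
    log_prob = ln_fact[n_atoms]
    mass = 0.0
    for count, (m, p) in zip(counts, isotopes):
        log_prob -= ln_fact[count]
        if count:
            log_prob += count * math.log(p)
        mass += count * m
    return mass, math.exp(log_prob)


if __name__ == "__main__":
    ln_fact = log_factorial_table(100)
    print(isotope_count_stats(100, 0.0107))
    print(multinomial_term((98, 2), CARBON, ln_fact))
```

The table is built once and reused for every term, which is the saving the text attributes to tabulation.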
2.2 MIDAs Fast Fourier Transform Algorithm (MIDAsb)
The MIDAsb algorithm is similar to an early FFT algorithm by Rockwood et al. [19], which was implemented in a computer program called Mercury. These two algorithms differ, however, in a few aspects. First, using the exact isotopic masses in the discrete FFT (DFFT) [39, 40], Mercury produces IDs with leakage (assigning nonzero probabilities to masses where exactly zero probability is expected) and employs an apodization function to minimize it [41]. On the other hand, by assigning each isotope mass to a point on a fixed grid, MIDAsb avoids the leakage problem. Using discrete masses to avoid leakage is not new: Rockwood and Van Orden [32] have written a computer program, whose latest version is called Mercury5, to compute IDs based on the nucleon numbers (or roughly using a one-dalton mass grid). The improvement we made was to allow users to specify a mass accuracy other than 1 Da. Second, Mercury uses a fixed number of sample points with the DFFT, whereas in MIDAsb the number of sample points used depends on the mass accuracy, which is a parameter adjustable by the user.
Every FFT based method relies on the convolution theorem, which states that a convolution can be performed as multiplication in the Fourier domain.
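As a small illustration of the convolution theorem on a discrete, periodic grid, the sketch below compares a direct circular convolution with one carried out through the Fourier domain. A naive \( O(S^2) \) discrete Fourier transform stands in for an FFT to keep the example dependency-free; the function names and the two input distributions are assumptions for the example.

```python
import cmath


def dft(values, sign=-1):
    """Naive discrete Fourier transform (sign=-1) or unnormalized inverse (sign=+1)."""
    size = len(values)
    return [
        sum(values[k] * cmath.exp(sign * 2j * cmath.pi * k * v / size)
            for k in range(size))
        for v in range(size)
    ]


def convolve_via_dft(a, b):
    """Circular convolution computed as a product in the Fourier domain."""
    size = len(a)
    product = [x * y for x, y in zip(dft(a), dft(b))]
    return [(z / size).real for z in dft(product, sign=+1)]


def convolve_direct(a, b):
    """Circular convolution computed directly from its definition."""
    size = len(a)
    return [sum(a[k] * b[(i - k) % size] for k in range(size)) for i in range(size)]


if __name__ == "__main__":
    carbon = [0.9893, 0.0107, 0.0, 0.0]        # probabilities on a 1 Da grid
    hydrogen = [0.999885, 0.000115, 0.0, 0.0]
    print(convolve_direct(carbon, hydrogen))
    print(convolve_via_dft(carbon, hydrogen))
```

Both routes give the same grid of probabilities up to rounding error, and the convolution is circular, which is why the grid must be long enough for wrap-around to be negligible.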
As we shall discuss in the Appendix, there are two key conditions for the convolution theorem to be used in the discrete case while computing IDs. The first one is that the masses of each isotope must lie on grid points. Using a mass that is not on the grid causes the “leakage” phenomenon [41]. If the masses considered all reside on grid points, the leakage problem no longer exists. The second important condition is that the mass domain must be large enough so that the “folded-back” phenomenon (which is also known as “aliasing”, “fold over”, or “wrap around” in the signal processing community) near the tail of the distribution is negligible (see Appendix).
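Prior to delving into the detailed construction of MIDAsb, let us first describe how one may compute the theoretical molecular mass variance \( {\sigma}_{MM}^2 \). Using our example molecule \( x_N y_M \), one notes that the molecular mass variance of this molecule can be rigorously written as \( {\sigma}_{MM}^2=N{\sigma}^2\left[{m}_x\right]+M{\sigma}^2\left[{m}_y\right] \), where \( {\sigma}^2\left[{m}_x\right] \) is the mass variance associated with element x. Explicitly, one may calculate \( {\sigma}^2\left[{m}_x\right] \) as follows
$$ {\sigma}^2\left[{m}_x\right]=\sum_ip\left({x}_i\right)m{\left({x}_i\right)}^2-{\left[\sum_ip\left({x}_i\right)m\left({x}_i\right)\right]}^2 $$
where the index i runs over all isotopes of element x and \( p\left({x}_i\right) \) again represents the occurrence probability of isotope \( x_i \).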
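The additivity of the variance over atoms makes the theoretical molecular mass variance easy to evaluate. The following sketch assumes a hypothetical dictionary-based formula representation and illustrative isotope values; it computes the per-element variance in the centered form \( \sum_i p(x_i)\,(m(x_i)-\overline{m}_x)^2 \), which equals the usual second-moment expression but is less prone to cancellation.

```python
# Illustrative isotope data as (mass in Da, probability); assumed values.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548378, 0.0107)],
    "H": [(1.00782503207, 0.999885), (2.0141017778, 0.000115)],
}


def element_mass_variance(isotopes):
    """Variance of the mass of a single atom of one element."""
    mean = sum(p * m for m, p in isotopes)
    return sum(p * (m - mean) ** 2 for m, p in isotopes)


def molecular_mass_variance(formula, table):
    """Sum of per-atom variances, e.g. N*var[m_x] + M*var[m_y]."""
    return sum(n * element_mass_variance(table[el]) for el, n in formula.items())


if __name__ == "__main__":
    print(molecular_mass_variance({"C": 2, "H": 6}, ISOTOPES))
```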
A key constraint of DFFT-based ID methods is that the total number of sample points, denoted by \( S \), must be an integral power of two [42]. For a given molecule and specified mass accuracy \( \epsilon \), the total number of sample points \( S \) used in MIDAsb’s DFFT is given by
$$ S={2}^{\left\lceil {\log}_2\left(15\sqrt{1+{\sigma}_{MM}^2}/\epsilon \right)\right\rceil } $$
where \( {\sigma}_{MM}^2 \) is the theoretical variance in MM due to the elements’ isotopes [32]. The quantity \( \lceil z\rceil \) represents the smallest integer that is larger than \( z \) for any positive number \( z \). Again the quantity \( 15\sqrt{1+{\sigma}_{MM}^2}> \max \left(15,15{\sigma}_{MM}\right) \) is chosen so that \( S \) covers on both ends more than 7.5 standard deviations from the mean molecular mass, which prevents folded-back mass regions from having significant probabilities.
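A short numerical illustration of how the grid size grows with the requested mass accuracy is given below. It assumes that \( S \) is the smallest power of two whose grid, at spacing \( \epsilon \), spans a mass range of \( 15\sqrt{1+\sigma_{MM}^2} \) Da; the function name is hypothetical and the handling of exact powers of two is an assumption.

```python
import math


def sample_points(var_mm, eps):
    """Smallest power of two covering a span of 15*sqrt(1 + var_mm) at spacing eps."""
    span = 15.0 * math.sqrt(1.0 + var_mm)
    return 2 ** math.ceil(math.log2(span / eps))


if __name__ == "__main__":
    for eps in (1.0, 0.1, 0.01):
        print(eps, sample_points(0.106, eps))
```

Each tenfold refinement of the mass accuracy therefore increases the number of sample points by a factor of eight or sixteen, depending on where the span falls between powers of two.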
In order to avoid the problem of leakage, instead of using exact masses of isotopes and then applying filtering windows, we pin all isotopic masses to grid points. For each isotope mass \( m\left({x}_i\right) \), we first find a corresponding grid index \( n\left({x}_i\right) \) by the following formula
Using this discrete approach, the probability function of the mass of element x becomes
where \( n\left({x}_i\right)\epsilon \) is the approximate expression for the exact mass \( m\left({x}_i\right) \), and the Kronecker delta function takes value one if its two indices coincide and zero otherwise.
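Consider the mass distribution of our example molecule \( x_N y_M \). By the convolution theorem, the Fourier transform of the mass distribution, denoted by \( \Psi (v) \), can be written as
$$ \Psi (v)={\left[\sum_kp\left({x}_k\right){e}^{2\pi in\left({x}_k\right)v/S}\right]}^N{\left[\sum_kp\left({y}_k\right){e}^{2\pi in\left({y}_k\right)v/S}\right]}^M $$
where the index k runs over the isotopes of each element and \( v \) takes \( S \) discrete values: \( 0,1,\dots, S-1 \). The sample function \( \Psi (v) \) is heterodyned to have zero average mass by multiplying it by \( {e}^{-2\pi i{n}_ov/S} \), where \( n_o \) is equal to \( \overline{n} \) (the molecule’s probability-weighted average grid index computed using \( n\left({x}_i\right) \) and \( p\left({x}_i\right) \)) rounded to the nearest integer.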
Once the function \( \Psi (v) \) has been calculated, three other operations are performed in order to generate the final FGID and CGID. The first operation performed is the inverse discrete fast Fourier transform (IDFFT), which transforms the sample function \( \Psi (v) \) to \( \Phi (n) \) on the mass grid. Second, we apply a denoising procedure to remove small amplitudes due to rounding errors that occur during the IDFFT. The rounding errors are expected to create small positive and negative amplitudes of equal amounts in the mass domain. MIDAsb thus removes all amplitudes whose absolute magnitude is smaller than that of the most negative amplitude. As a matter of fact, to be more conservative, MIDAsb uses an amplitude cutoff value that is twice the absolute value of the most negative amplitude. This means that only terms having amplitude greater than the cutoff value are reported in a computed FGID and CGID, with the amplitude values renormalized to sum to one. Figure 1 shows an example of the overlap between the positive amplitude histogram and the negative amplitude histogram. Right below the cutoff absolute amplitude, we see that the two histograms resemble each other, reflecting the fact that rounding errors have equal probability to be positive and negative. Following Rockwood and Van Orden [32], in the third step, MIDAsb applies a linear transformation to rescale the masses associated with the IDs to ensure a good agreement between the theoretically calculated and the numerically computed mean molecular mass as well as standard deviation of the molecular mass. The procedure described above is summarized in the pseudo-code given by algorithm 3.
Figure 1. Example of rounding errors. The curves plotted are the histograms of the logarithm of the absolute value of the positive (green solid line histogram) and negative (blue long-dashed line histogram) amplitudes obtained after applying the discrete Fourier transform to compute an isotopic distribution for molecule C2023H3208N524O619S20 using a mass accuracy of 0.01 Da (≈0.22 ppm). Absent leakage, negative amplitudes can only come from rounding errors, among which equal amounts of small positive and negative amplitudes are expected. The histograms overlap for terms whose amplitude magnitude is less than 4.2e–10, marked by a dashed black line; this value is the rounding error cutoff used by MIDAsb
![figure d](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs13361-013-0733-7/MediaObjects/13361_2013_733_Figd_HTML.gif)
Algorithm 3. Computes Fine-Grained and Coarse-Grained Isotopic Distribution
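The grid-based Fourier procedure can be sketched end to end for a small molecule. The following is an illustration under several stated assumptions, not the MIDAs implementation: isotope masses are pinned by rounding \( m/\epsilon \) to the nearest integer, the grid size is the smallest power of two spanning \( 15\sqrt{1+\sigma_{MM}^2} \) Da, a naive \( O(S^2) \) inverse transform replaces the FFT, and the final linear rescaling of masses is omitted. The function name and isotope values are hypothetical.

```python
import cmath
import math

# Illustrative isotope data as (mass in Da, probability); assumed values.
ISOTOPES = {"C": [(12.0, 0.9893), (13.0033548378, 0.0107)]}


def midasb_sketch(formula, table, eps):
    """Return [(grid mass, probability)] for a formula such as {"C": 4}."""
    # Pin every isotope mass to a grid index (nearest-point rounding is assumed).
    grid = {el: [(round(m / eps), p) for m, p in table[el]] for el in formula}
    var_mm, mean_index = 0.0, 0.0
    for el, n_atoms in formula.items():
        mean = sum(p * m for m, p in table[el])
        var_mm += n_atoms * sum(p * (m - mean) ** 2 for m, p in table[el])
        mean_index += n_atoms * sum(p * n for n, p in grid[el])
    size = 2 ** math.ceil(math.log2(15.0 * math.sqrt(1.0 + var_mm) / eps))
    n_o = round(mean_index)
    # Fourier-domain product of the element factors, heterodyned by n_o.
    psi = []
    for v in range(size):
        value = cmath.exp(-2j * cmath.pi * n_o * v / size)
        for el, n_atoms in formula.items():
            factor = sum(p * cmath.exp(2j * cmath.pi * n * v / size)
                         for n, p in grid[el])
            value *= factor ** n_atoms
        psi.append(value)
    # Naive inverse transform back to the mass grid.
    phi = [
        sum(psi[v] * cmath.exp(-2j * cmath.pi * n * v / size)
            for v in range(size)).real / size
        for n in range(size)
    ]
    # Denoise: the cutoff is twice the magnitude of the most negative amplitude.
    most_negative = min(phi)
    cutoff = 2.0 * abs(most_negative) if most_negative < 0 else 0.0
    kept = [(n, a) for n, a in enumerate(phi) if a > cutoff]
    total = sum(a for _, a in kept)
    result = []
    for n, a in kept:
        offset = n if n < size // 2 else n - size  # undo the wrap-around
        result.append(((n_o + offset) * eps, a / total))
    return sorted(result)


if __name__ == "__main__":
    for mass, prob in midasb_sketch({"C": 4}, ISOTOPES, eps=1.0):
        print(f"{mass:8.2f}  {prob:.6e}")
```

With \( \epsilon = 1 \) Da the reported masses are nominal grid masses; the rescaling step described in the text is what would bring them back toward the exact mean and standard deviation.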
3 Results and Discussion
All methods used in our investigation were evaluated using their default parameter settings, except for a few parameter changes made to ensure that the atomic masses and abundances of elements’ isotopes were the same for all methods (see Table 1). To conduct the evaluation, we used the 22 biomolecules and three inorganic compounds, all of which are listed in Table 2. Ten of these biomolecules, (1)–(10), are benchmark proteins previously used to evaluate the computed CGID [17, 18]; 10 biomolecules, (11)–(20), are hydrocarbon molecules whose FGID can be exactly computed and were employed to evaluate the FGID; the remaining five molecules, (21)–(25), made of a combination of sulfur, mercury, carbon, and hydrogen, are used together with some of the other 20 biomolecules to evaluate the computational time of MIDAs’s algorithms.
3.1 Overview of Methods Benchmarked
MIDAs’s performance was evaluated against eight published methods: Mercury [19], Mercury5 [32], JFC [34], Isotope Calculator (IC) [33], Qmass [20], BRAIN [18, 31, 43], NeutronCluster (NC) [17], and Emass [21]. The first three published methods are Fourier-transform-based, IC utilizes a divide-and-recursively-combine algorithm, Qmass has its core based on FFT, BRAIN and NeutronCluster are polynomial-based, whereas Emass is based on a direct convolution approach related to the stepwise procedure and its improvement [36, 37]. BRAIN, Qmass, NC, Emass, JFC, and Mercury5 all use nucleon numbers to classify a molecule’s isotopic variants, and all but the last assign to a given nucleon number the average isotopic mass of all variants of that nucleon number.
IC is suitable for computing FGIDs, not CGIDs. Qmass, BRAIN, NeutronCluster, and Emass are suitable for computing CGIDs, not FGIDs. The remaining three Fourier-transform-based methods are also suitable for computing CGIDs, although Mercury is the only one that has FGID computing capacity. To benchmark the FGIDs computed by MIDAs against those of Mercury, however, would require post-processing of Mercury data files such as removing noise from leakage and rounding errors, as well as compiling output from different specified molecular masses. All of these steps may be done differently and make the benchmark test less meaningful. For these reasons, we only evaluated MIDAs’s FGIDs against that of IC, not that of Mercury.
3.2 Benchmarking of Computed CGIDs
Following previous publications [18, 19, 24], the accuracy of a method is gauged by how accurately it yields ID mean, ID standard deviation, lightest and heaviest molecular masses, while computing a CGID. In our evaluation, the lightest mass and heaviest molecular mass are defined as a molecule’s molecular mass computed using the masses of the lightest and heaviest isotopes, respectively.
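The four reference quantities used in this comparison are straightforward to obtain. The sketch below, with hypothetical function names and illustrative isotope values, computes the exact lightest and heaviest masses from a formula and the mean and standard deviation from any computed distribution given as (mass, probability) pairs.

```python
import math

# Illustrative isotope data as (mass in Da, probability); assumed values.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548378, 0.0107)],
    "H": [(1.00782503207, 0.999885), (2.0141017778, 0.000115)],
}


def extreme_masses(formula, table):
    """Lightest and heaviest molecular masses from the extreme isotopes."""
    lightest = sum(n * min(m for m, _ in table[el]) for el, n in formula.items())
    heaviest = sum(n * max(m for m, _ in table[el]) for el, n in formula.items())
    return lightest, heaviest


def mean_and_std(distribution):
    """Probability-weighted mean and standard deviation of a computed ID."""
    total = sum(p for _, p in distribution)
    mean = sum(m * p for m, p in distribution) / total
    var = sum(p * (m - mean) ** 2 for m, p in distribution) / total
    return mean, math.sqrt(var)


if __name__ == "__main__":
    print(extreme_masses({"C": 2, "H": 6}, ISOTOPES))
    print(mean_and_std([(12.0, 0.9893), (13.0033548378, 0.0107)]))
```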
Lightest-mass comparisons for biomolecules, numbered (1)–(10) in Table 2, with elements having their naturally occurring isotopic abundances taken from Table 1, are displayed in Table 3. Unexpectedly, the lightest masses for the first six molecules, reported by Mercury, Mercury5, and Qmass, are even lighter than their exact lightest masses, which should be the lightest masses possible in these six IDs. This observation was also described by Claesen et al. [18]. For Mercury5 (and Mercury), this is caused by the rounding errors (and the leakage) when applying the DFFT. (In principle, both methods can avoid this problem by not reporting any terms in the computed ID that have masses lighter than the exact lightest mass.) For Qmass, this seems to arise from computing ID terms that are outside of the allowed mass range imposed by the biomolecule’s MF: in the Qmass output file, the reported masses lighter than the exact lightest mass are associated with elemental compositions that differ from the biomolecule’s MF used in the evaluation.
The software NC reports correct lightest masses for nine out of the 10 molecules. For biomolecule number four, NC reports a mass that is 360 Da heavier. This same result has also been observed independently by others [17, 18].
For MIDAsa, BRAIN, and Emass, the differences between exact and computed lightest masses, for small and medium size biomolecules [numbered (1)–(6)], are smaller than 1.0e–08 Da. As for JFC and MIDAsb, although they do not perform as well as the polynomial-based methods above, they are not inferior to other Fourier-transform-based methods such as Mercury and Mercury5. When the biomolecules become heavier [say molecules numbered (7)–(10)], the chance of experimentally observing the exact lightest masses rapidly decreases, and the computed difference between exact and computed lightest masses becomes less important.
The evaluation of getting the correct heaviest mass is not as important under natural conditions. This is because heavy isotopes typically carry very low natural occurrence probabilities so that it is impossible to observe the exact heaviest isotopic variant of the molecule. Of course, when artificial isotopic abundances are enforced, obtaining the correct heaviest masses can become important, while obtaining the correct lightest masses can become unimportant. Since the current evaluation is using the natural isotopic abundances, we do not expect any method to provide correct heaviest masses. Indeed, because most methods are computing terms of an ID that are concentrated around a molecule’s average molecular mass, which is closer to the exact lightest mass under natural isotopic abundances, the mass range used for computing IDs usually will not include the heaviest masses. For biomolecules numbered (1)–(10), the differences between the exact heaviest masses and the heaviest masses computed by all methods considered are all of the same order of magnitude.
Displayed in the upper (lower) half of Table 4 are the relative differences of computed CGIDs derived molecular mass averages (standard deviations) to their theoretical values. Molecules numbered (1)–(10) in Table 2 are used with elements assuming isotopic abundances shown in Table 1. In terms of the average masses, MIDAsa,b, JFC, and Emass have comparable errors and have slightly smaller errors than the other methods. In terms of mass standard deviations, MIDAsa,b, JFC, and Emass have slightly smaller errors than the other methods. In principle, the accuracy of BRAIN might be improved by increasing the number of aggregated isotopic variants computed for each computed CGID. However, to accomplish this would require changing its default option. As mentioned earlier, to keep the benchmarking test simple, we only use the default option for each method considered. From Table 4, one can also infer that Qmass yields small errors for small and medium size molecules, but the error increases as the molecular mass increases.
We have also considered the possibility of deviations from the natural frequencies of occurrence of an element’s isotopes. Such customized modifications can be accomplished experimentally by a technique known as isotopic labeling [3], which is frequently employed in quantitative proteomics [44]. To mimic such a situation, we have computed CGIDs for various molecules assuming different carbon isotopic abundances: 99% 13C and 1% 12C as listed in Table 1. We then derive from the computed CGIDs the average molecular masses and standard deviations, and compare them to the corresponding theoretical values that can be analytically calculated.
The results of using the above-mentioned customized carbon abundances are shown in Table 5. The differences displayed in Table 5 show that MIDAsa,b, JFC, and Emass have the smallest errors. However, in terms of the ID’s standard deviations, Mercury and Mercury5 yield errors comparable to those of MIDAsa,b, JFC, and Emass. The results for Qmass are similar to the ones obtained in Table 4: in terms of average masses and standard deviations, it yields small errors for small to medium size biomolecules. Table 5 also shows that the current versions of BRAIN and NC are not able to compute IDs using the modified isotopic abundances for carbon. However, the developers of NC have mentioned how NC could be modified to handle stable isotope enrichment by partitioning the elements of enriched isotopes away from the equatransneutronic isotope groups [17]. This option is not currently available in NC. Also, the proposed solution reduces NC back to a polynomial method algorithm, which, if not efficiently implemented, can significantly influence the overall computation time. In BRAIN’s case, there are no reasonable IDs reported and it is difficult to speculate what might have happened.
3.3 Assessing Fidelity of Computed CGIDs and FGIDs
To evaluate the fidelity of the CGIDs and FGIDs reported, we used 10 hydrocarbon molecules [numbered (11)–(20) in Table 2] because the “exact” CGIDs and FGIDs can be calculated for these molecules. The exact CGID is defined as follows. First, one merges isotopic variants that have the same nucleon number into one aggregated isotopic variant, whose corresponding molecular mass (MM) and occurrence probability are computed respectively from the probability-weighted sum of masses and from the sum of the probabilities of the isotopic variants merged. However, only aggregated isotopic variants having probability greater than 5e–12 were retained for accuracy evaluation. The exact FGIDs were defined similarly to the exact CGIDs, except that one merges only isotopic variants whose molecular mass differences are within some pre-specified mass accuracy, here set to 0.01 Da. The probability cutoff of 5e–12, for typical sample loads, probably already surpasses the detection capability of current mass spectrometers. Furthermore, it is also a small enough cutoff that ignoring terms below it has a negligible effect on the ID profile.
Four quantities were then utilized to evaluate the fidelity of computed IDs. The first quantity is the difference in the numbers of terms (Δτ) kept by a computed ID and by its corresponding exact ID, be it the exact CGID or the exact FGID. The second quantity is the difference in the probability sums (Δχ), one from the computed ID and the other from the exact FGID (or the exact CGID). The third quantity is the root-mean-square difference of masses (\( \sigma_m \)) between the computed and the exact CGIDs (or the exact FGIDs):
$$ {\sigma}_m=\sqrt{\frac{1}{N}\sum_{i=1}^N\underset{j}{\min }{\left({m}_i-{m}_j\right)}^2} $$
In the equation above, \( m_i \) represents a computed mass term while each \( m_j \) represents a mass term in the exact FGID (or the exact CGID). \( N \) is the number of terms retained in the computed ID. That is, for every mass term in a computed ID, the closest mass term within the exact FGID (or the exact CGID) is found and the squares of their differences are summed. The average of this sum of squares constitutes \( {\sigma}_m^2 \). The fourth quantity computed is the weighted correlation (ρ) between computed and exact IDs, defined as follows. Let \( p\left({m}_i\right) \) and \( p\left({m}_j\right) \) be the terms of a computed ID and the corresponding exact FGID (or exact CGID), respectively. We first introduce the weight (\( w_{ij} \)) between a computed ID term (index i) and exact FGID (or exact CGID) term (index j) as
where, in the above equation, min j |m i − m j |, is the minimum mass difference between a term (m i ) from the computed ID and terms (m j ) from the exact ID. The computed weights (w ij ) are then normalized by the normalization factor, W j = ∑ i w ij , by summing over all i terms from the computed ID that are close to a common term j in the exact FGID (or the exact CGID). The weighted correlation using the above definitions is given by
For CGIDs, ∈ is set to one Da, while for FGIDs, ∈ is set to 0.01 Da.
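The first three quantities follow directly from their verbal definitions and can be sketched as below. The function name `fidelity`, the representation of an ID as a list of (mass, probability) pairs, and the sign convention (computed minus exact) for Δτ and Δχ are assumptions made for this illustration; the weighted correlation ρ is not included.

```python
from math import sqrt


def fidelity(computed, exact):
    """Return (delta_tau, delta_chi, sigma_m) for two IDs.

    Each ID is a list of (mass, probability) terms.
    Assumed sign convention: computed minus exact.
    """
    # Difference in the number of retained terms.
    delta_tau = len(computed) - len(exact)
    # Difference in the probability sums.
    delta_chi = sum(p for _, p in computed) - sum(p for _, p in exact)
    # For every computed mass, squared distance to the closest exact mass.
    squared = [min((m - m_exact) ** 2 for m_exact, _ in exact)
               for m, _ in computed]
    sigma_m = sqrt(sum(squared) / len(computed))
    return delta_tau, delta_chi, sigma_m


if __name__ == "__main__":
    computed = [(1.0, 0.5), (2.1, 0.5)]
    exact = [(1.0, 0.5), (2.0, 0.4)]
    print(fidelity(computed, exact))
```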
Using molecules numbered (11)–(20) in Table 2, we document the analysis results of the four quantities mentioned above in Table 6 (for CGIDs) and Table 7 (for FGIDs). For CGIDs, we include in Table 6 only the four methods that largely satisfy the criteria for being a sound IDCM of application value. For FGIDs, only IC, MIDAsa, and MIDAsb are included in Table 7, since they are the only methods that can compute FGIDs reasonably fast and without additional post-processing.
For fidelity assessment of CGIDs, all four methods shown in Table 6 yield small Δτ values and ρ values close to one. In terms of σ_m and Δχ, more differences are revealed. Emass always yields a small σ_m, reflecting good fidelity in mass locations, but seems to give a larger |Δχ|, reflecting less accurate amplitudes. JFC and MIDAsb seem to yield less precise mass locations, evidenced by a larger σ_m, but seem to provide more accurate amplitudes, evidenced by a smaller |Δχ|. MIDAsa yields both accurate mass locations and accurate amplitudes.
The values of Δχ and σ_m in Table 7 indicate that IC, MIDAsa, and MIDAsb report FGID terms with similar mass accuracy and with probability sums that are close to the expected value. For small to medium molecules, numbered (11)–(15), IC, MIDAsa, and MIDAsb give equally accurate results. For molecules numbered (16)–(20), IC and MIDAsa have comparable performance, both slightly better than MIDAsb. The values of Δτ indicate that MIDAsb reports many more terms than expected in its computed FGID. Because no leakage is expected, these extra terms arise mainly from rounding errors associated with the DFFT numerical procedure.
The difference observed in Δτ for MIDAsa is caused by the pruning and merging procedures employed by the algorithm. All the FGID terms computed by IC and MIDAsa are within 2∈ of the exact FGID terms, as shown by the number of unexplained terms (U) being zero in Table 7. Most of the terms computed by MIDAsb are also within 2∈ of the exact FGID terms, with the exception of molecules (17)–(20), where U ranges from 1 to 47. The computed weighted correlation also shows that for heavier molecules, (18)–(20), both IC and MIDAsa produce FGIDs that are more similar to the exact FGIDs than those of MIDAsb.
The poorer performance of MIDAsb here might be related to the fact that pinning the elemental masses to grid points may introduce appreciable mass errors when computing IDs for larger molecules. In the worst-case scenario, the mass error introduced is apparently proportional to the number of atoms contained in the molecule. Even though MIDAsb employs a mass rescaling [32] to bring the computed average masses and standard deviations close to their theoretical values, the linear mass rescaling is not sufficient to guarantee full profile resemblance between the computed ID and the exact ID. The non-negligible discrepancy between the computed FGID and the exact FGID for molecules (18)–(20), indicated by a weighted correlation ρ not very close to one, reflects this problem.
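The growth of the pinning error with molecule size can be illustrated with a small sketch. It assumes, purely for illustration, that each isotope mass is rounded to the nearest multiple of the grid spacing; how MIDAs actually places masses on its grid, and the subsequent mass rescaling, are not modeled here. The function name `pinned_mass_error` and the isotope masses are likewise assumptions.

```python
H1_MASS = 1.00782503207   # assumed mass of 1H in Da
C12_MASS = 12.0           # mass of 12C in Da


def pinned_mass_error(composition, eps):
    """Total mass shift when every isotope mass is rounded to a multiple of eps.

    composition: list of (isotope mass, number of atoms).
    """
    error = 0.0
    for mass, count in composition:
        pinned = round(mass / eps) * eps   # nearest grid point (assumption)
        error += count * (pinned - mass)   # the shift accumulates per atom
    return error


if __name__ == "__main__":
    # Monoisotopic alkanes C(n)H(2n+2) on a 0.01 Da grid.
    for n in (10, 100, 1000):
        composition = [(C12_MASS, n), (H1_MASS, 2 * n + 2)]
        print(n, pinned_mass_error(composition, eps=0.01))
```

Under this assumption each hydrogen atom contributes the same shift of about 0.002 Da, so the total error scales linearly with the number of atoms, in line with the worst-case argument above.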
3.4 MIDAs Web Interface
The MIDAs web interface (http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html) is user-friendly yet offers considerable flexibility. For example, the user may type into the input box an elemental composition, a molecular formula, a peptide, or even a protein sequence. The program recognizes the input molecule in all of these formats and extracts the corresponding elemental composition for computing CGIDs and FGIDs. The isotopic abundances and the elements’ masses can also be customized within the web interface: the user simply clicks on the “change” button to edit the abundance table of all elements. Other fields that the user can specify are the charge of the input molecule and the cutoff probability. MIDAs displays the CGID and the FGID together, each with its own user-specified accuracy. The “algorithm” drop-down box allows the user to select either the FFT or the polynomial algorithm. The output, including the lightest mass, theoretical average mass, theoretical mass standard deviation, computed average mass, computed mass standard deviation, FGID peak list, and CGID peak list, can be exported to a flat file by clicking on the “download output” button on the result page. Contextual help is also available for every functional button.
4 Conclusion and Outlook
For the 25 molecules tested, the two algorithms introduced here, MIDAsa and MIDAsb, seem able to compute IDs quickly and accurately. Between the two, MIDAsa seems slightly more accurate. For CGIDs, MIDAsb appears to be faster (see Table 8), whereas for FGIDs the two are of comparable speed. Both algorithms benchmark well against existing methods and stand out because of their ability to compute CGIDs and FGIDs with a user-specified accuracy. These two algorithms were also shown to compute IDs accurately for molecules labeled with stable isotopes, which was not the case for some of the methods evaluated. In summary, in terms of both CGID-derived average masses and CGID-derived standard deviations, MIDAsa, MIDAsb, JFC, and Emass yield smaller errors than the other methods. When computing the FGID, MIDAsa produces an FGID that resembles the exact FGID better than that of MIDAsb under our evaluation gauges. Both algorithms were coded in C++ in a computer program called MIDAs, which is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/MIDAs.html. To make these algorithms widely accessible, we have also made them available through a user-friendly web interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.
References
Rosman, K., Taylor, P.: Isotopic compositions of elements 1997. Pure Appl. Chem. 70, 217–235 (1998)
Michener, R., Lajtha, K.: Stable Isotopes in Ecology and Environmental Science. Wiley-Blackwell, Massachusetts (2007)
Becker, G.W.: Stable isotopic labeling of proteins for quantitative proteomic applications. Brief. Funct. Genom. Proteom. 7(5), 371–382 (2008)
Yamamoto, H., McCloskey, J.A.: Calculations of isotopic distribution in molecules extensively labeled with heavy isotopes. Anal. Chem. 49(2), 281–283 (1977)
Rockwood, A.L.: Deconvoluting isotopic distributions to evaluate parent/fragment ion relationships. Rapid Commun. Mass Spectrom. 11(3), 241–248 (1997)
Gay, S., Binz, P.A., Hochstrasser, D.F., Appel, R.D.: Modeling peptide mass fingerprinting data using the atomic composition of peptides. Electrophoresis 20(18), 3527–3534 (1999)
Blank, P., Sjomeling, C., Backlund, P., Yergey, A.: Use of cumulative distribution functions to characterize mass spectra of intact proteins. J. Am. Soc. Mass Spectrom. 13, 40–46 (2002)
Goodlett, D.R., Bruce, J.E., Anderson, G.A., Rist, B., Pasa-Tolic, L., Fiehn, O., Smith, R.D., Aebersold, R.: Protein identification with a single accurate mass of a cysteine-containing peptide and constrained database searching. Anal. Chem. 72(6), 1112–1118 (2000)
Polacco, B.J., Purvine, S.O., Zink, E.M., Lavoie, S.P., Lipton, M.S., Summers, A.O., Miller, S.M.: Discovering mercury protein modifications in whole proteomes using natural isotope distributions observed in liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteom. (2011). doi:10.1074/mcp.M110.004853
Roussis, S.G., Proulx, R.: Reduction of chemical formulas from the isotopic peak distributions of high-resolution mass spectra. Anal. Chem. 75(6), 1470–1482 (2003)
Nikolaev, E.N., Jertz, R., Grigoryev, A., Baykut, G.: Fine structure in isotopic peak distributions measured using a dynamically harmonized Fourier transform ion cyclotron resonance cell at 7 T. Anal. Chem. 84(5), 2275–2283 (2012)
Russell, D.H., Edmondson, R.D.: High-resolution mass spectrometry and accurate mass measurements with emphasis on the characterization of peptides and proteins by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. J. Mass Spectrom. 32, 263–276 (1997)
Werlen, R.C.: Effect of resolution on the shape of mass spectra of proteins: some theoretical considerations. Rapid Commun. Mass Spectrom. 8(12), 976–980 (1994)
Marshall, A.G., Hendrickson, C.L.: High-resolution mass spectrometers. Annu. Rev. Anal. Chem. 1(1), 579–599 (2008)
Michalski, A., Damoc, E., Lange, O., Denisov, E., Nolting, D., Muller, M., Viner, R., Schwartz, J., Remes, P., Belford, M., Dunyach, J.J., Cox, J., Horning, S., Mann, M., Makarov, A.: Ultra high resolution linear ion trap Orbitrap mass spectrometer (Orbitrap Elite) facilitates top down LC MS/MS and versatile peptide fragmentation modes. Mol. Cell. Proteom. (2012). doi:10.1074/mcp.O111.013698
Valkenborg, D., Mertens, I., Lemiere, F., Witters, E., Burzykowski, T.: The isotopic distribution conundrum. Mass Spectrom. Rev. 31(1), 96–109 (2012)
Olson, M., Yergey, A.: Calculation of the isotope cluster for polypeptides by probability grouping. J. Am. Soc. Mass Spectrom. 20, 295–302 (2009)
Claesen, J., Dittwald, P., Burzykowski, T., Valkenborg, D.: An efficient method to calculate the aggregated isotopic distribution and exact center-masses. J. Am. Soc. Mass Spectrom. 23, 753–763 (2012)
Rockwood, A.L., Van Orden, S.L., Smith, R.D.: Rapid calculation of isotope distributions. Anal. Chem. 67(15), 2699–2704 (1995)
Rockwood, A.L., Van Orman, J.R., Dearden, D.V.: Isotopic compositions and accurate masses of single isotopic peaks. J. Am. Soc. Mass Spectrom. 15(1), 12–21 (2004)
Rockwood, A., Haimi, P.: Efficient calculation of accurate masses of isotopic peaks. J. Am. Soc. Mass Spectrom. 17, 415–419 (2006)
Senko, M.W.: IsoPro computer program 3.0.
Snider, R.: Efficient calculation of exact mass isotopic distributions. J. Am. Soc. Mass Spectrom. 18, 1511–1515 (2007)
Rockwood, A.L., Van Orden, S.L., Smith, R.D.: Ultrahigh resolution isotope distribution calculations. Rapid Commun. Mass Spectrom. 10(1), 54–59 (1996)
Li, L., Kresh, J.A., Karabacak, N.M., Cobb, J.S., Agar, J.N., Hong, P.: A hierarchical algorithm for calculating the isotopic fine structures of molecules. J. Am. Soc. Mass Spectrom. 19(12), 1867–1874 (2008)
Fernandez-de Cossio, J.: Efficient packing Fourier-transform approach for ultrahigh resolution isotopic distribution calculations. Anal. Chem. 82(5), 1759–1765 (2010)
Brownawell, M.L., San Filippo, J.: Simulation of chemical instrumentation. II: A program for the synthesis of mass spectral isotopic abundances. J. Chem. Educ. 59(8), 663 (1982)
Yergey, J.A.: A general approach to calculating isotopic distributions for mass spectrometry. Int. J. Mass Spectrom. Ion Phys. 52, 337–349 (1983)
Rockwood, A.L.: Relationship of Fourier transforms to isotope distribution calculations. Rapid Commun. Mass Spectrom. 9(1), 103–105 (1995)
Cooley, J.W.: The rediscovery of the fast Fourier transform algorithm. Mikrochim. Acta 3, 33–45 (1987)
Dittwald, P., Claesen, J., Burzykowski, T., Valkenborg, D., Gambin, A.: BRAIN: a universal tool for high-throughput calculations of the isotopic distribution for mass spectrometry. Anal. Chem. 85(4), 1991–1994 (2013)
Rockwood, A.L., Van Orden, S.L.: Ultrahigh-speed calculation of isotope distributions. Anal. Chem. 68(13), 2027–2030 (1996)
Li, L., Karabacak, N.M., Cobb, J.S., Wang, Q., Hong, P., Agar, J.N.: Memory-efficient calculation of the isotopic mass states of a molecule. Rapid Commun. Mass Spectrom. 24(18), 2689–2696 (2010)
Fernandez-de Cossio Diaz, J., Fernandez-de Cossio, J.: Computation of isotopic peak center-mass distribution by Fourier transform. Anal. Chem. 84(16), 7052–7056 (2012)
Fernandez-de Cossio, J., Gonzalez, L.J., Satomi, Y., Betancourt, L., Ramos, Y., Huerta, V., Amaro, A., Besada, V., Padron, G., Minamino, N., Takao, T.: Isotopica: a tool for the calculation and viewing of complex isotopic envelopes. Nucleic Acids Res. 32(Web Server issue), W674–W678 (2004)
Beynon, J.H.: Mass Spectrometry and Its Applications to Organic Chemistry. Elsevier, New York (1960)
Kubinyi, H.: Calculation of isotope distributions in mass spectrometry. A trivial solution for a non-trivial problem. Anal. Chim. Acta 247(1), 107–119 (1991)
Margrave, J.L., Polansky, R.B.: Relative abundance calculations for isotopic molecular species. J. Chem. Educ. 39(7), 335 (1962)
Bohman, H.: Approximate Fourier analysis of distribution functions. Arkiv Matematik 4, 99–157 (1961)
Harris, F.J.: High-resolution spectral analysis with arbitrary spectral centers and arbitrary spectral resolutions. Comput. Electr. Eng. 3(2), 171–191 (1976)
Harris, F.: On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66(1), 5–83 (1978)
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C. The Art of Scientific Computing, 2nd edn. Cambridge University Press, New York, NY (1992)
Gould, H.: Fibonacci Q. 37(2), 135–140 (1999)
Elliott, M.H., Smith, D.S., Parker, C.E., Borchers, C.: Current trends in quantitative proteomics. J. Mass Spectrom. 44(12), 1637–1660 (2009)
Acknowledgments
The authors thank Alfred Yergey for sending them the NeutronCluster code, and Alan Rockwood for providing them with codes of Mercury, Emass, Qmass, and Mercury5. The authors thank the administrative group of the National Institutes of Health Biowulf Clusters, where all the computational tasks were carried out. They also thank the National Institutes of Health Fellows Editorial Board for editorial assistance. This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health. Funding for Open Access publication charges for this article was provided by the National Institutes of Health.
Appendix
1.1 Using the Convolution Theorem in the Discrete Fourier Transform
For completeness, we first review a few important properties of the DFT. Consider a function H sampled at L equally-spaced values. We shall denote H(x = n ∈) by H n with n = 0,1,2,…,L − 1. One may then consider the DFT of this function by constructing
with k = 0,1,2,…,L − 1. With h k given, one can also invert the expression above to yield
where the identity (when d is an integer)
is used. Evidently, ∈ is the spacing between each pair of sampled values along the variable x. Although we only specify H on L points, using the Fourier expression of H n , we can easily see that H n+L = H n . That is, the DFT effectively makes the function considered, say H, periodic with period L.
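For concreteness, the transform pair and the identity referred to above can be written as follows. The sign of the exponent and the placement of the 1/L factor are assumptions chosen for this sketch; any other self-consistent convention leads to the same conclusions.

$$h_k = \sum_{n=0}^{L-1} H_n\, e^{2\pi i k n / L}, \qquad H_n = \frac{1}{L}\sum_{k=0}^{L-1} h_k\, e^{-2\pi i k n / L},$$

$$\frac{1}{L}\sum_{k=0}^{L-1} e^{2\pi i k d / L} = \begin{cases} 1, & d \equiv 0 \pmod{L}, \\ 0, & \text{otherwise.} \end{cases}$$

Replacing n by n + L in the expression for H_n multiplies each term by e^{−2πik} = 1, which gives the periodicity H_{n+L} = H_n.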
Given two periodic functions H and G of period L, we may consider the following convolution
We can now compute
where we have employed the periodic property of H and
is the Fourier transform of G. The inverse transform of w k of course leads to W n without any leakage issue involved.
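The convolution property reviewed above can be checked numerically. The sketch below uses a plain O(L²) DFT written for clarity, and it assumes the convolution is the usual circular one, W_n = ∑_m H_m G_{(n−m) mod L}. The helper names `dft`, `idft`, and `circular_convolve`, the sign convention, and the sample values are assumptions for illustration and are not taken from MIDAs.

```python
import cmath


def dft(x, sign=1):
    """Plain O(L^2) discrete Fourier transform of the sequence x."""
    L = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / L)
                for n in range(L))
            for k in range(L)]


def idft(X):
    """Inverse of dft(): opposite exponent sign and a 1/L factor."""
    L = len(X)
    return [value / L for value in dft(X, sign=-1)]


def circular_convolve(H, G):
    """Direct circular convolution of two sequences of period L."""
    L = len(H)
    return [sum(H[m] * G[(n - m) % L] for m in range(L)) for n in range(L)]


if __name__ == "__main__":
    H = [0.5, 0.5, 0.0, 0.0]
    G = [0.1, 0.2, 0.3, 0.4]
    direct = circular_convolve(H, G)
    # Multiply the transforms term by term, then invert.
    via_dft = idft([h * g for h, g in zip(dft(H), dft(G))])
    for d, v in zip(direct, via_dft):
        print(f"{d:.6f}  {v.real:.6f}")
```

The two columns agree to rounding error, which is the sense in which the inverse transform of the product recovers the convolution without leakage.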
When applying the DFT to compute an ID, however, one also needs to pay attention to the fold-back issue. To illustrate this problem, let us consider a toy example where an element E has two isotopes with masses ∈ and 2∈. For simplicity, let us also assume that both isotopes occur with equal probability 1/2. If one chooses to use grid size L = 4 and computes the ID of the molecule E2 using the DFT, one starts with the following EFP
and raises it to the second power to yield
which yields, upon inverting back to the mass domain, probability 1/4 for masses 2∈ and zero and probability 1/2 for mass 3∈. The zero mass arises from the periodicity. By identifying zero with 4∈, one obtains the expected results: a probability of 1/4 for each of the masses 2∈ and 4∈, and a probability of 1/2 for mass 3∈. Since molecular masses of 1∈, 5∈, and so on are absent, the mass distribution of the molecule E2 is resolved and appears correctly within the mass range from ∈ to 4∈.
However, if one continues to keep the grid size L = 4 while considering the molecule E4, the Fourier transform of the molecule’s mass distribution becomes
which upon inversion yields masses zero, ∈, 2∈, and 3∈ with respective probabilities 2/2^4, 4/2^4, 6/2^4, and 4/2^4. Remembering the periodicity, one may recognize that it is the set of masses 4∈, 5∈, 6∈, and 7∈ (instead of zero, ∈, 2∈, and 3∈) that acquires these probabilities. However, a simple calculation gives the possible masses as 4∈, 5∈, 6∈, 7∈, and 8∈ with respective probabilities 1/2^4, 4/2^4, 6/2^4, 4/2^4, and 1/2^4. What has happened is that the mass 8∈ is folded back onto 4∈ because of the inherent periodicity of the DFT with L = 4. This example shows that, to avoid the fold-back artifact, one needs enough sample points that the mass range covered by the DFT is larger than the mass span of the molecule considered. However, if the tails of the mass distribution have very small probabilities, one might be able to use fewer sample points, with a weak fold-back effect that causes only negligible distortion of the ID profile.
In general, when the number L is fixed, the fold-back problem should be less severe for the CGID than for its FGID counterpart. This is because, if one keeps L fixed but decreases the mass difference between adjacent points, the effective mass range shrinks, and regions with significant probabilities may be folded back into a mass window where much smaller probabilities would be found if no fold-back occurred. It is for this reason that MIDAs does not fix the number of sampled points, but rather increases it in proportion to 1/∈.
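The toy example can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions: masses are expressed as integer multiples of the grid spacing, the function name `id_by_dft` and the sign convention are hypothetical, and a plain O(L²) transform is used instead of an FFT.

```python
import cmath


def id_by_dft(isotopes, n_atoms, L):
    """ID of a molecule made of n_atoms identical atoms, on a grid of L points.

    isotopes: list of (mass in units of the grid spacing, probability).
    Returns L probabilities; entry n collects every mass equal to n modulo L.
    """
    # Transform of the single-element distribution.
    element = [sum(p * cmath.exp(2j * cmath.pi * k * g / L)
                   for g, p in isotopes)
               for k in range(L)]
    # Raising to the n-th power corresponds to an n-fold convolution.
    molecule = [e ** n_atoms for e in element]
    # Invert back to the mass domain.
    return [(sum(molecule[k] * cmath.exp(-2j * cmath.pi * k * n / L)
                 for k in range(L)) / L).real
            for n in range(L)]


if __name__ == "__main__":
    E = [(1, 0.5), (2, 0.5)]       # masses of one and two grid units
    print(id_by_dft(E, 2, 4))      # E2 on four points: no overlap
    print(id_by_dft(E, 4, 4))      # E4 on four points: fold-back at index 0
    print(id_by_dft(E, 4, 8))      # E4 on eight points: resolved
```

With L = 4 the E4 result at index 0 is 2/16, the sum of the probabilities of the masses of four and eight grid units, whereas with L = 8 indices 0 and 4 each carry 1/16 and the five peaks are separated.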
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
About this article
Cite this article
Alves, G., Ogurtsov, A.Y. & Yu, YK. Molecular Isotopic Distribution Analysis (MIDAs) with Adjustable Mass Accuracy. J. Am. Soc. Mass Spectrom. 25, 57–70 (2014). https://doi.org/10.1007/s13361-013-0733-7
DOI: https://doi.org/10.1007/s13361-013-0733-7