BRAIN 2.0: Time and Memory Complexity Improvements in the Algorithm for Calculating the Isotope Distribution
DOI: 10.1007/s13361-013-0796-5
- Cite this article as:
- Dittwald, P. & Valkenborg, D. J. Am. Soc. Mass Spectrom. (2014) 25: 588. doi:10.1007/s13361-013-0796-5
Abstract
Recently, an elegant iterative algorithm called BRAIN (Baffling Recursive Algorithm for Isotopic distributioN calculations) was presented. The algorithm is based on the classic polynomial method for calculating aggregated isotope distributions, and it introduces algebraic identities using Newton-Girard and Viète’s formulae to solve the problem of polynomial expansion. Due to the iterative nature of the BRAIN method, the calculations must start from the lightest isotope variant. As such, the complexity of BRAIN scales quadratically with the mass of the putative molecule, since it depends on the number of aggregated peaks that need to be calculated. In this manuscript, we suggest two improvements of the algorithm to decrease both time and memory complexity in obtaining the aggregated isotope distribution. We also illustrate a concept to represent the element isotope distribution in a generic manner. This representation makes it possible to omit the root calculation of the element polynomial required in the original BRAIN method. A generic formulation for the roots is of special interest for higher-order element polynomials, because root-finding algorithms and their inaccuracies can then be avoided.
Key words
Isotopic distribution, Isotopic abundance’s ratios, Mass spectrometry, Proteomics, BRAIN algorithm
1 Introduction
During the last decade there has been a revived interest in methods that calculate the isotope distribution of a molecule from its molecular formula. Numerous publications and discussions in the literature attest to this trend [1-7]. A recent review article by Valkenborg et al. [8] gives an extensive overview of the methodology. In addition, Claesen et al. [9] introduced a new method called BRAIN (Baffling Recursive Algorithm for Isotopic distributioN calculations) that is able to compute the aggregated isotope distributions and their corresponding center-masses. The BRAIN method is based on the polynomial expansion of the element polynomials as described by [10, 11]. Of note, instead of expanding the polynomial using a symbolic approach [12-15], a fast Fourier transform (FFT) approach [16-20], or explicit polynomial multiplication [21, 22], the BRAIN method employs an iterative algorithm that exploits the algebraic identities of polynomial power series. For this purpose, BRAIN uses two polynomial generating functions that rely on the theory of Newton-Girard and Viète’s formulae. These two generating functions calculate the aggregated distribution and corresponding center-masses. Interestingly, the generating function approach for center-masses was also used by Fernandez-de-Cossio Diaz and Fernandez-de-Cossio and implemented in software using the FFT approach [6].
The advantage of BRAIN lies in its simple implementation, and it has been shown to be as accurate as the existing methods Emass and SIRIUS [4, 5, 7, 9, 23, 24]. However, it has been found by [6] that its computational complexity is asymptotically suboptimal in comparison with their fast Fourier-based algorithm. A first reason for this suboptimal behavior is that BRAIN requires starting the iteration at the lightest variant because of the nature of Newton-Girard’s identities [25]. In theory, one can start the iteration from the heaviest variant, but this non-standard use of the algorithm will not be discussed here. Second, for each aggregated isotope variant, two additional terms are stored in memory for further use during the iterative process. Together, these properties result in a BRAIN algorithm that has a computational complexity of order O(N^{2}), as described by [6].
In this paper, we introduce two improvements to the original BRAIN method that optimize the algorithm in terms of memory and time complexity without compromising its simplicity of implementation as an iterative algorithm. The gain in efficiency is especially noticeable for large molecules (e.g., when 50 or more aggregated isotope variants are needed to adequately span the isotope distribution). For small molecules, we therefore suggest reverting to the original BRAIN method [7]. It should be noted that the presented improvements are only intended for the calculation of the aggregated isotope distribution and not for the center-masses. Currently, we are investigating whether the improvements are also suitable for the center-mass calculation.
Furthermore, we introduce a new formulation to represent element polynomials in a generic form. Doing so, we avoid the calculation of the roots of the element polynomial, which are required in the original BRAIN approach. This third improvement is especially interesting when the molecular formula includes elements with many isotopes (e.g., platinum). Such a poly-isotopic element will result in a high-order element polynomial for which the roots cannot be calculated explicitly or are complicated to compute.
All proposed improvements are based on mathematical concepts that simplify the original BRAIN approach. We will provide an intuitive reasoning for each of these improvements. Since the BRAIN method has already been extensively validated in the literature, we will compare the impact of the improvements only to the original algorithm.
2 Methods
Before going into detail about the three improvements, we provide the basic concepts of the original BRAIN method. The overview is provided in the section about the standard BRAIN algorithm. The section about BRAIN 2.0 deals with the proposed improvements.
2.1 Standard BRAIN Algorithm
2.2 BRAIN 2.0
BRAIN 2.0 includes two improvements that reduce the complexity of the computation. The first improvement reduces the length of the summation in Equation (2) for accurately calculating the isotope variant q_{j}. The second improvement allows for a user-defined starting peak in the recursive procedure. Both steps lead to less demanding memory requirements and to a gain in computation time. The third improvement, a root omitting algorithm, is proposed to avoid the calculation of the roots of element polynomials used in ψ_{l}. Instead, the sums of powered roots are represented as a function of the coefficients of the element polynomial by using the theorem of Newton-Girard. This representation allows for a generic form and implementation of the elements in BRAIN 2.0.
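To make the first improvement concrete, the following is a minimal sketch and not the authors' implementation. It assumes that Equation (2) has the Newton-Girard form q_j = -(1/j) Σ_{l=1}^{j} q_{j-l} ψ_l, and it is restricted to elements with two isotopes (C, H, N), for which the single root of the element polynomial p0 + p1·I is r = -p0/p1. The function names, the abundance table (approximate natural abundances), and the test molecule are illustrative assumptions. Passing `d` truncates the summation to at most d terms.

```python
from math import prod

# Approximate natural abundances (lightest isotope, +1 Da isotope).
# Illustrative values, assumed for this sketch.
ABUNDANCE = {
    "C": (0.9893, 0.0107),
    "H": (0.999885, 0.000115),
    "N": (0.99636, 0.00364),
}


def psi(l, composition):
    """Power sum of reciprocal roots, weighted by atom counts.

    For a two-isotope element the polynomial p0 + p1*I has the single
    root r = -p0/p1, so r**(-l) equals (-p1/p0)**l.
    """
    return sum(
        n * (-ABUNDANCE[e][1] / ABUNDANCE[e][0]) ** l
        for e, n in composition.items()
    )


def aggregated_distribution(composition, n_peaks, d=None):
    """Assumed Newton-Girard recurrence started at the lightest variant.

    d=None keeps the full summation (all previous peaks);
    an integer d keeps only the d most recent peaks.
    """
    psis = [0.0] + [psi(l, composition) for l in range(1, n_peaks)]
    # Probability of the lightest variant: every atom is its lightest isotope.
    q = [prod(ABUNDANCE[e][0] ** n for e, n in composition.items())]
    for j in range(1, n_peaks):
        terms = j if d is None else min(j, d)
        q.append(-sum(q[j - l] * psis[l] for l in range(1, terms + 1)) / j)
    return q
```

Calling `aggregated_distribution({"C": 2000, "H": 3000, "N": 500}, 40, d=11)` keeps the work per peak bounded by d, because ψ_l shrinks geometrically with l for these elements and the dropped terms contribute very little.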
2.2.1 Recurrence of Constant Length [RCL]
2.2.2 Late Starting Point [LSP]
As already pointed out by [6, 25], a limitation of the original BRAIN method is that the iterative procedure has to start from the lightest variant. This restriction is inconvenient when calculating very large molecules (e.g., human dynein heavy chain; C_{23832}H_{37816}N_{6528}O_{7031}S_{170}) because the light isotope variants are not of interest: they often fall below the normal detection range of a mass spectrometer. The procedure has to start from the lightest or heaviest isotope variant for two reasons. First, the probability of occurrence has to be calculated exactly to obtain the aggregated isotope distribution as a probability distribution. Second, information about the previously calculated aggregated isotope variants is required to accurately calculate a new variant. However, when probabilities are not required, the relative isotope distribution (e.g., maximum peak normalized to 100 %) can be computed from any starting point. This is possible because Equation (2) is linear in the value at which the recursion starts: multiplying the starting value by a constant multiplies all subsequent values by the same constant. As a consequence, the ratios of consecutive peaks produced by the iterative procedure do not depend on the starting value. Therefore, the starting value can be set arbitrarily, e.g., to 1.
1. The recursion shall start at variant n_{start} − b with coefficient \( {q}_{n_{\mathrm{start}}-b}=1 \) because b burn-in steps are required to stably calculate the coefficients (see heuristic from formula 10 for exemplary values). The starting point n_{start} and stopping point n_{stop} are user-defined parameters;
2. The next values \( {q}_{n_{\mathrm{start}}-b+1},\dots \) are calculated using Equation (2) or Equation (4);
3. The maximum peak is normalized to 1.
As we start from an arbitrarily selected value, the burn-in period b is needed to recover the real proportions between consecutive peaks. It is crucial that the procedure converges before the calculation of the n_{start} variant because previous results are propagated in this calculation. The late starting option allows us to focus the calculation on the prominent part of the aggregated isotope distribution, similar in spirit to heterodyning in FFT-based algorithms [6, 16-20].
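The three steps above can be sketched as follows. This is a hedged illustration under stated assumptions, not the published implementation: Equation (2) is assumed to have the Newton-Girard form q_j = -(1/j) Σ_l q_{j-l} ψ_l, variants before the starting point are treated as zero, and only two-isotope elements (C, H, N) with approximate abundances are covered. The names `from_lightest` and `late_start` are hypothetical.

```python
from math import prod

# Approximate natural abundances (lightest isotope, +1 Da isotope); assumed values.
ABUNDANCE = {
    "C": (0.9893, 0.0107),
    "H": (0.999885, 0.000115),
    "N": (0.99636, 0.00364),
}


def psi(l, composition):
    """Atom-count-weighted power sum of reciprocal roots, r = -p0/p1 per element."""
    return sum(
        n * (-ABUNDANCE[e][1] / ABUNDANCE[e][0]) ** l
        for e, n in composition.items()
    )


def from_lightest(composition, n_stop):
    """Reference: assumed recurrence started at the lightest variant (index 0)."""
    q = [prod(ABUNDANCE[e][0] ** n for e, n in composition.items())]
    for j in range(1, n_stop + 1):
        q.append(-sum(q[j - l] * psi(l, composition) for l in range(1, j + 1)) / j)
    return q


def late_start(composition, n_start, n_stop, burn_in):
    """Relative intensities for variants n_start..n_stop, maximum peak set to 1."""
    first = max(n_start - burn_in, 0)  # step 1: start b variants early
    q = {first: 1.0}                   # arbitrary starting value
    for j in range(first + 1, n_stop + 1):
        # Step 2: same recurrence, but only peaks computed so far are available.
        q[j] = -sum(
            q[j - l] * psi(l, composition) for l in range(1, j - first + 1)
        ) / j
    window = [q[j] for j in range(n_start, n_stop + 1)]
    top = max(window)                  # step 3: normalize the maximum peak to 1
    return [x / top for x in window]
```

For example, `late_start({"C": 2000, "H": 3000, "N": 500}, 15, 35, 11)` returns 21 relative intensities. Dividing the corresponding slice of `from_lightest` by its maximum gives a reference to compare against.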
2.2.3 Root Omitting [RO]
(a) vector (v,w,x,y,z) denoting the element composition of the molecule;
(b) vector ((r_{C})^{− l},(r_{H})^{− l},(r_{N})^{− l},(r_{O,all,l}),(r_{S,all,l})) indicating the power sum of the element roots.
It should be noted that Equation (7) is a simple linear equation that can be calculated simultaneously with the iterative BRAIN procedure. Indeed, we do not have to calculate roots of polynomial Q_{S}(I) anymore. The root omitting procedure can be combined with the recurrence of constant length [RCL] method to keep the computational requirements constant in time as discussed in the Results and Discussion section.
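The idea of obtaining power sums of the roots directly from polynomial coefficients can be illustrated with Newton's identities. The sketch below is an assumption-laden illustration and not a reproduction of Equation (7): for an element polynomial p_0 + p_1 I + ... + p_k I^k with roots r_i, it computes ρ_l = Σ_i r_i^{-l} through the recursion p_0 ρ_l = -(l p_l + Σ_{i=1}^{l-1} p_i ρ_{l-i}), with p_m = 0 for m > k. The function name is hypothetical.

```python
def reciprocal_root_power_sums(coeffs, l_max):
    """Return [rho_1, ..., rho_l_max], where rho_l is the sum of r_i**(-l).

    coeffs[m] is the coefficient of I**m in the element polynomial
    (isotope abundance at mass shift m); coeffs[0] must be non-zero.
    No roots are computed at any point.
    """
    p0 = coeffs[0]

    def p(m):
        # Coefficients beyond the polynomial degree are zero.
        return coeffs[m] if m < len(coeffs) else 0.0

    rho = [0.0]  # index 0 is a placeholder and is never read
    for l in range(1, l_max + 1):
        acc = l * p(l) + sum(p(i) * rho[l - i] for i in range(1, l))
        rho.append(-acc / p0)
    return rho[1:]
```

As a usage example, an element with isotopes at mass shifts 0, 1, 2, and 4 (as is the case for sulfur) could be passed as `[0.9499, 0.0075, 0.0425, 0.0, 0.0001]`. These abundances are approximate and serve only to illustrate the input layout.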
3 Results and Discussion
The BRAIN method has already been extensively compared with other methods for isotope distribution calculation [4, 6, 7, 9]. For this reason, we restrict the evaluation of BRAIN 2.0 to a comparison with the original BRAIN method. To keep the comparison as transparent as possible, we have disabled the computation of center-masses in the original BRAIN method because BRAIN 2.0 cannot calculate this metric. As the presented improvements are mainly useful for large molecules, we restrict the comparison to the four heavy biomolecules displayed in Supplementary Table S1. For small molecules (e.g., peptides), the original BRAIN is better suited because the interest is mainly in the lighter isotope variants. Moreover, for light molecules, the isotope distribution contains too few isotope variants to enable the [RCL] option safely (i.e., to reach the point at which the contribution of earlier values of ψ_{l} becomes negligible). Furthermore, it should be noted that [RCL], [LSP], and [RO] are three innovations that can be implemented independently of each other. Since the focus of the evaluation is on the computational speed and accuracy of the calculated isotope distribution between BRAIN [27] and BRAIN 2.0, we only include [RCL] and [LSP] in the comparison. The root omitting procedure for all elements in the periodic table is implemented in the original BRAIN method in C++ and is available at https://code.google.com/p/brain-isotopic-distribution/. Root omitting [RO] has no impact on the asymptotic algorithmic efficiency, as stated by Hu et al. [25], but it represents molecules by generic equations that allow the sums of powered roots to be calculated recursively without numerical root-finding. The Bioconductor package in R does not include the root omitting option, as its current version mainly serves the calculation of peptides and proteins, which contain only C, H, N, O, and S atoms.
The criterion used to define the burn-in period b is sufficient, since the returned distributions agree well, as illustrated by the small values of the Pearson χ^{2} error statistic. For the molecules presented in Supplementary Table S2, at most 11 burn-in steps are required, which indicates that the relative isotope intensities converge quickly to the actual isotope ratios.
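For readers who want to reproduce such an agreement check, a minimal sketch of a Pearson χ^{2} error statistic is given below. The exact definition used for the table is not stated in this text, so the standard form Σ (approximation − reference)² / reference is an assumption, as is the function name.

```python
def pearson_chi2(reference, approximation):
    """Assumed standard Pearson chi-square error between two distributions.

    Both inputs are sequences of intensities on the same scale and of the
    same length; bins with a zero reference value are skipped.
    """
    return sum(
        (a - r) ** 2 / r
        for r, a in zip(reference, approximation)
        if r > 0
    )
```

Both distributions should be normalized in the same way (for example, maximum peak set to 1) before they are compared.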
[RCL] and [LSP] improvements tested for 4 heavy biomolecules from [26]. Speed is measured as elapsed time in seconds and averaged over 100 independent runs. For this comparison, we used the heuristic from [9] (cf. Equation (9)) for the original BRAIN and the heuristic from [6] (cf. Equation (11) with α = 10) for BRAIN 2.0 with both [RCL] and [LSP] improvements. Center-mass calculations are disabled in both cases.
i.d. | monoMass (Da) | b | d | χ^{2} | speed_{BRAIN} (s) | speed_{BRAIN 2.0} (s) | Improvement (speed ratio) |
---|---|---|---|---|---|---|---|
1 | 112824 | 11 | 11 | 2.39e-13 | 0.00873 | 0.00473 | 1.85 |
2 | 186387 | 11 | 11 | 9.79e-14 | 0.0138 | 0.0054 | 2.56 |
3 | 398470 | 11 | 11 | 5.02e-14 | 0.0336 | 0.007 | 4.8 |
4 | 533403 | 11 | 11 | 1.87e-14 | 0.0493 | 0.00766 | 6.43 |
The starting point of a calculation (i.e., n_{start} − b) cannot lie below the lightest isotope variant peak, so the earliest possible starting point is the lightest variant itself, as in the original BRAIN method. In contrast, [RCL] can always be applied on the condition that the returned number of peaks exceeds the constant memory d. If the iteration is started from the lightest isotope variant, exact values for the isotope probabilities are estimated regardless of whether [RCL] is disabled or enabled.
4 Conclusions
We illustrate that the iterative algebraic approach used in the BRAIN algorithm for calculating the isotope distribution can be optimized to promote a more efficient use of memory and computation time. For this purpose, we propose two developments. First, the recurrence of constant length [RCL] restricts the number of terms in the summations to a constant value. This development has an impact on the asymptotic complexity of the algorithm. The second development allows for a user-defined starting point [LSP], which enables more efficient heuristics to define the number of peaks returned by the algorithm. For example, the study of one particular isotope ratio (e.g., the ratio between the most abundant isotope peak and its consecutive peak) can be performed accurately with [LSP] and [RCL] switched on. Although the investigated peaks do not necessarily cover a large part of the whole distribution, the ratio is estimated very accurately. This approach was not possible in the original BRAIN method, where the iterative calculation had to start from the lightest isotope variant. The implementation of a recurrence of constant length [RCL] and late starting point [LSP] will be added as an option to the existing Bioconductor BRAIN package [27] (http://www.bioconductor.org/packages/release/bioc/html/BRAIN.html). Root omitting [RO] enables an elegant and generic representation of elements and avoids the calculation of roots. However, the procedure for root omitting [RO] is not implemented in the Bioconductor BRAIN package, as this version of the package is mainly intended to calculate isotope distributions for peptides and proteins. As mentioned earlier, root omitting is implemented in the C++ software available online for all the elements in the periodic table.
We applied the proposed concepts to biomolecules that contain only five elements (i.e., C, H, N, O, S). These concepts can be easily extended to other elements as well; however, caution should be applied when porting these principles to other elements. The numerical properties explained in the recurrence of constant length can differ for other elements, as they exhibit a different elemental isotope distribution. For instance, for elements such as bromine or chlorine, ψ_{l} will converge to negligible values at a slower rate. Therefore, depending on the atomic composition of a molecule, the parameter d that defines the length of the memory may vary. Finally, the achieved improvements in computation time are substantial but will seem negligible to the user for a single isotope calculation, since both BRAIN and BRAIN 2.0 are able to calculate the isotope distribution quickly. However, when the isotope distribution is required for large protein databases, or when BRAIN 2.0 is used to generate hypothetical isotope distributions in an optimization procedure, the [RCL] and [LSP] improvements will be noticeable to the user.
Acknowledgments
This research is supported in part by the Polish National Science Center grant 2011/01/B/NZ2/00864 and by the EU through the European Social Fund, contract number UDAPOKL. 04.01.01-00-072/09-00. D.V. and P.D. gratefully acknowledge the support of the bilateral FWO-PAS grant VS.005.13N/Innovative algorithms to detect protein modifications in mass spectrometry data. P.D. is supported by a START fellowship from the Foundation for Polish Science. D.V. acknowledges the support of the SBO grant InSPECtor (120025) of the Flemish agency for Innovation by Science and Technology (IWT).
The authors declare no competing financial interests.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.