Improved Peak Detection and Deconvolution of Native Electrospray Mass Spectra from Large Protein Complexes
- Cite this article as:
- Lu, J., Trnka, M.J., Roh, SH. et al. J. Am. Soc. Mass Spectrom. (2015) 26: 2141. doi:10.1007/s13361-015-1235-6
Native electrospray-ionization mass spectrometry (native MS) measures biomolecules under conditions that preserve most aspects of protein tertiary and quaternary structure, enabling direct characterization of large intact protein assemblies. However, native spectra derived from these assemblies are often partially obscured by low signal-to-noise as well as broad peak shapes because of residual solvation and adduction after the electrospray process. The wide peak widths together with the fact that sequential charge state series from highly charged ions are closely spaced means that native spectra containing multiple species often suffer from high degrees of peak overlap or else contain highly interleaved charge envelopes. This situation presents a challenge for peak detection, correct charge state and charge envelope assignment, and ultimately extraction of the relevant underlying mass values of the noncovalent assemblages being investigated. In this report, we describe a comprehensive algorithm developed for addressing peak detection, peak overlap, and charge state assignment in native mass spectra, called PeakSeeker. Overlapped peaks are detected by examination of the second derivative of the raw mass spectrum. Charge state distributions of the molecular species are determined by fitting linear combinations of charge envelopes to the overall experimental mass spectrum. This software is capable of deconvoluting heterogeneous, complex, and noisy native mass spectra of large protein assemblies as demonstrated by analysis of (1) synthetic mononucleosomes containing severely overlapping peaks, (2) an RNA polymerase II/α-amanitin complex with many closely interleaved ion signals, and (3) human TriC complex containing high levels of background noise.
Keywords: Native mass spectrometry, Deconvolution, Protein assemblies, Protein complexes
Mass spectrometry (MS) now plays an increasingly important role in characterizing large protein assemblies [1, 2, 3, 4]. Interacting surfaces between the component proteins of a large complex can be mapped using bottom-up proteomics strategies such as hydrogen-deuterium exchange (HDX), hydroxyl radical protein surface fingerprinting, or chemical cross-linking [7, 8]. In native mass spectrometry (native MS), electrospray is employed to ionize intact noncovalent protein complexes from nondenaturing solutions. Solution interactions between component proteins, ligands, nucleic acids, and other biomolecules are preserved, allowing determination of the intact mass of an assembly and, hence, the stoichiometry of the individual subunits. Native MS is therefore complementary to techniques such as cross-linking MS that measure proteolysis products. Through partial disruption of a protein complex either in solution or in the gas phase, dissociation pathways can be mapped and the topology of a complex deduced [12, 13, 14].
In native MS measurements, intact protein complexes are introduced to the gas phase by nanoelectrospray ionization from aqueous solutions buffered with volatile salts near neutral pH. Gas-phase ions of the complexes preserve the major structural and topological features of the complex. The resulting ions are present at higher m/z values and are distributed across a narrower range of charge states than typically observed under denaturing conditions. Observation of ions from native complexes requires mass spectrometers capable of detecting signals beyond 10,000 m/z. This has now been achieved using time-of-flight (TOF), Fourier transform ion cyclotron resonance (FT-ICR) [16, 17, 18], and Orbitrap detectors. The mass resolving power of FT-ICR is inversely proportional to m/z whereas the mass resolving power of TOF and Orbitrap analyzers is inversely proportional to the square root of m/z. In practice, however, the instrumental limits of mass resolving power are seldom achieved because of incomplete desolvation of ions from intact protein complexes [2, 15, 20].
Intact protein assemblies are introduced to the vacuum while still partially solvated and require increased collisional energy deposition in the instrument source region and/or a collision cell to achieve adequately resolved ion signals. However, the ions may still not be fully desolvated at the time of mass measurement. This is demonstrated by the observations that (1) native MS measurements consistently give higher molecular weight values than expected from a given complex, and (2) ion signals from native protein complexes are much wider than expected from the calculated isotopic distributions of known primary sequences. The detected signals therefore represent heterogeneous adducts between the protein complex, buffer ions, and water molecules. Collisional or thermal dissociation of these adducts must be balanced against the need to maintain intact assemblies. Since hydrogen bonding, and electrostatic and hydrophobic interactions are major determinants of protein structure, and water molecules and metal ions can form intrinsic structural elements, it is unlikely that complete desolvation while simultaneously maintaining secondary, tertiary, and quaternary structure of a native protein ion is an achievable or even desirable outcome.
The broad ion signals of native complexes therefore pose a complication to determining the intact mass of an assembly, particularly in spectra containing multiple species with overlapping and/or highly interleaved ion signals. Overlap resulting from the presence of multiple, unresolved ions distorts the signals in both m/z and intensity and confounds accurate measurement of the underlying species. Highly interleaved charge state distributions, on the other hand, can lead to errors in determining charge envelope membership, assigning charge states, and, ultimately, calculating molecular mass. Coupling online ion mobility separation to native MS analysis may reduce the severity of these issues by introducing a dimension of gas-phase separation of different molecular species. However, the availability and performance of suitable peak deconvolution algorithms remains an important problem, as it is likely that improved separation technologies will only prompt the analysis of increasingly complex samples.
Current deconvolution methods can be broadly classified as either working forwards from a theoretical model or backwards from the experimental data. Forward algorithms [24, 25, 26] start with a model describing the behavior of proteins in the mass spectrometer. This is used to determine the most probable set of molecular ions that reconstruct the experimental data when convolved by the model. Backwards algorithms [3, 9, 20, 27, 28, 29, 30, 31] start by detecting peaks in the experimental spectrum, assigning charge state distributions to sets of peaks, and from these inferring the protein composition. Forward algorithms side-step issues involved in explicit peak detection, but in turn are at risk of returning local minima without robust means of generating initial starting guesses at the protein composition.
Massign, developed by the Robinson lab, has been amongst the most effective backwards algorithms. Massign detects peaks and simulates charge envelopes from those peaks using the fact that charge states follow a Gaussian distribution over the charge domain. The modeled envelope that best fits the original spectrum represents the correct charge state. Because this method models both m/z and intensity, it can effectively assign envelope membership in highly interleaved spectra. Furthermore, overlapping signals are handled via a “peeling algorithm” in which envelopes are repeatedly fit to the spectra after subtracting out the previously determined envelopes. The intensity of overlapped peaks is therefore apportioned among the envelopes.
An alternative approach, Automass, uses an intensity-independent method that varies the charge state assignment of a set of peaks and examines the standard deviation of the mass for a minimum value as well as the periodicity of this deviation over different charge states. Overlapping signals are not directly assessed, as each peak is assumed to belong to exactly one envelope and boundaries between charge envelopes are modeled by a game theory-based treatment.
Many of the recent developments in deconvolution algorithms have thus focused on improved methods of assigning membership of peaks to charge state distributions. Detection of individual peaks, representing an earlier stage of data analysis, has been somewhat neglected and not tailored to the breadth and heterogeneity of ion signals observed from native complexes. Massign uses a moving average filter as well as a rolling signal threshold to detect shoulder peaks, whereas Automass uses a Savitzky-Golay filter. However, because these methods depend on local maxima detection, neither of these is able to handle more complex cases of overlap in which peak apexes are shifted by the overlapping signals, and the underlying signals cannot simply be apportioned among the charge envelopes. These cases are particularly serious for native mass spectra, in which the high mass, multiple components, and broadened ion signals can all contribute to overlapping peaks. The distorted peak shapes can be assigned to the wrong charge envelope, leading to inaccurate mass and abundance determination.
We focus on addressing the problem of peak overlap through improved peak detection and modeling, prior to and during charge envelope assignment. We evaluate multiple peak detection methods to discern true signals from noise. Overlapping signals are deconvoluted through application of a second derivative-based method for peak detection. The second derivative has been extensively used for peak detection in chromatography, nuclear magnetic resonance, and astronomical spectroscopy, but to our knowledge has not been applied to native mass spectrometry data analysis. Similar to the Robinson approach, our software simulates charge envelopes in order to best fit the peaks in the mass spectrum. The goodness of fit is determined by a scoring function that combines both mass error and intensity error. Moreover, to further account for overlapping signals, our method fits linear combinations of charge envelopes to the raw data simultaneously, rather than fitting charge state envelopes sequentially as in Massign. Our software is a comprehensive package that can be operated in either automated or manual mode, with processing options at every step to provide great flexibility in addressing spectra with varying amounts of noise and complexity. Furthermore, unlike Massign or Automass, PeakSeeker is freely distributed under an open source license, allowing users the freedom to improve upon the algorithm or adapt it as needed for specific experiments.
We demonstrate the software’s capability to deconvolute native mass spectra of synthetic nucleosomes containing the core histone octamer surrounded by a tight binding DNA segment, a complex of RNA polymerase II (pol II) with multiple copies of its inhibitor α-amanitin, and a megaDalton-sized human TCP-1 ring complex (TRiC). These spectra have been chosen to demonstrate the three main difficulties of deconvoluting native mass spectra: overlapping peaks, interleaved peaks, and poor signal-to-noise ratio.
Sample Preparation and Mass Spectrometry
Mononucleosomes with site-specific methyl lysine analogs (MLA) at histone 3 Lys 9 (H3K9) were prepared as described [35, 36]. Briefly, individual core histones from Xenopus laevis were expressed recombinantly with a histone 3 K9C point mutation. The non-native cysteine residue was alkylated by treatment with (2-chloroethyl)-methylammonium chloride to form specific methylated lysine analogs and the core histones were reconstituted to nucleosomes by addition of double stranded DNA with a tight binding sequence of 147 base-pairs prepared by Taq polymerase catalyzed PCR extension. The expected mass value of the MLA nucleosome was calculated from the measured masses of denatured H3K9 Me1 analog and H4, the UniProt sequences of H2A and H2B with the status of the N-terminal Met residue assigned from MS measurement in proteolytic digests (Glu-C and trypsin), and the calculated mass of the 147 bp Widom 601 sequence including an additional 3′-adenine on both strands from the Taq polymerase reaction.
RNA polymerase II (pol II) was purified from Saccharomyces cerevisiae as previously described. Pol II/α-amanitin binding samples were prepared by keeping the pol II concentration constant while adding increasing concentrations of α-amanitin, from 1 to 50 μM. The expected mass of pol II was calculated from the primary amino acid sequences obtained from the UniProt database. The presence or absence of N-terminal methionine and acetylation was determined from previous proteomics analysis of these preparations. In addition to the primary sequence, eight zinc ions and the active site magnesium were assumed to remain bound to the complex as in the crystal structure.
Human TRiC was purified from HeLa cells in a similar way to bovine TRiC except for cell lysis, which was performed as previously described. To achieve extra purity, TRiC fractions were incubated with 1 mM ATP for 15 min at 37°C and then reprocessed by Mono-Q HR 16/10 (GE Healthcare, USA) and Superose 6 10/300 GL columns (GE Healthcare) in sequence. TRiC’s folding activity was assessed by luciferase refolding as described. The theoretical mass of TRiC was calculated as described above for pol II, except that metal adducts were not considered.
Native MS measurements were made on an Exactive Plus EMR instrument (Thermo Scientific, Bremen, Germany). The instrument was calibrated in the extended mass range mode (m/z 350–20,000) by reference to an infused CsI solution, which forms well-defined clusters up to 12,000 m/z. Samples were buffer-exchanged into 100 mM ammonium acetate buffer at pH 6.8 by four spins on a 10 kDa cutoff centrifugal filter (Millipore). Samples were adjusted to a concentration between 1 and 5 μM and then introduced to the gas phase by static infusion nanospray ionization using a nanospray source. Spectra were acquired with source collision energy setting generally between 5 and 35 and HCD setting between 150 and 200. Operating pressures in the instrument were typically 1–2 mbar in the S-lens region, 10⁻⁴ mbar in the source chamber, and 10⁻⁹ mbar in the analyzer chamber. Five to 20 scans were averaged prior to data analysis.
Algorithms were implemented in Python using the scipy, numpy, and matplotlib libraries. Spectra exported from the mass spectrometry vendor software are read from text files or from the clipboard, then processed using the workflow illustrated in Supplementary Figure S1. This workflow consists of: (1) spectrum preprocessing, (2) initial peak detection, (3) deconvolution of peak overlap using the second derivative, (4) fitting Gaussian functions to the deconvoluted peaks, (5) charge state assignment by fitting to simulated charge envelopes, (6) repeating step 5 for additional masses, and (7) fitting a linear combination of the simulated charge envelopes to the spectrum. In addition to the summary given below, a detailed description of each step is provided in the Supplementary Information. Supplementary Table S1 provides a feature comparison between PeakSeeker and several of the other primary deconvolution algorithms.
Spectrum Processing and Peak Detection Methods
Spectra are first aggregated across several scans to enhance the signal-to-noise ratio before being exported to PeakSeeker. After optional smoothing by either a moving average or Savitzky-Golay filter and optional background subtraction, peaks are detected using two levels of processing algorithms. At the first level, one of the following three methods is used to identify the apparent peaks, whether or not they overlap. Method 1 simply detects local maxima above a fixed signal-to-noise ratio threshold. Method 2, adapted from the Massign peak detection algorithm, adjusts this threshold based on the intensity of the most recently detected peak to allow for detection of shoulder peaks. Method 3 uses the continuous wavelet transform to process noisy spectra. Continuous wavelet transform convolves the spectrum with a Mexican Hat wavelet across a range of peak widths. Narrow and high frequency noise is filtered out while true peaks register as local maxima across multiple widths. This method does not require prior smoothing or background subtraction [43, 44]. At the second level, overlapped peaks are detected using the second derivative across a peak range, defined as the set of consecutive data points whose intensities are above the noise level. The second derivative of a Gaussian shaped peak consists of two zero-crossings surrounding a central minimum. Hence, the number of zero-crossings can be used to derive the number of underlying peaks.
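The second-level search can be illustrated on synthetic data. The following is a minimal sketch and not PeakSeeker's actual code: it assumes two Gaussian signals of equal width on an arbitrary axis, estimates the second derivative with a Savitzky-Golay filter from scipy, and counts the negative minima of that derivative. The function names, the window size, and the 5% depth threshold are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter


def gaussian(x, height, center, sigma):
    """Gaussian peak shape used for the synthetic signals."""
    return height * np.exp(-0.5 * ((x - center) / sigma) ** 2)


def second_derivative_minima(x, y, window=11, polyorder=3, rel_depth=0.05):
    """Locate negative minima of the second derivative of y(x).

    Each sufficiently deep negative minimum is taken as one underlying peak.
    window, polyorder, and rel_depth are illustrative defaults (assumptions).
    """
    dx = x[1] - x[0]
    d2 = savgol_filter(y, window_length=window, polyorder=polyorder,
                       deriv=2, delta=dx)
    # Minima of d2 are maxima of -d2; require a depth of at least
    # rel_depth times the largest |d2| to ignore numerically flat regions.
    idx, _ = find_peaks(-d2, height=rel_depth * np.abs(d2).max())
    return idx, d2


# Synthetic peak range: a main peak plus a shoulder with no apex of its own.
x = np.linspace(-6.0, 9.0, 1501)
y = gaussian(x, 1.0, 0.0, 1.0) + gaussian(x, 0.4, 2.5, 1.0)

apexes, _ = find_peaks(y, height=0.01)        # local-maximum detection
minima, d2 = second_derivative_minima(x, y)   # second-derivative detection

print("local maxima at:", x[apexes])
print("second-derivative minima at:", x[minima])
```

In this synthetic case the summed signal has a single apex, while the second derivative shows one minimum per component. The minimum belonging to the shoulder is displaced slightly from the true center, which is one reason the positions are best treated as starting guesses rather than final values. Noisy spectra would likely need a wider smoothing window than the one assumed here.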
Using the second derivative derived parameters as a starting guess, we then use a Levenberg-Marquardt algorithm to fit Gaussian functions to the peak range. This method has advantages over other peak detection methods by finding peaks that do not have a local maximum and by better estimating the parameters of adjacent peaks (see the Results and Discussion section). All fitting and subsequent deconvolution are performed on the original (preprocessed) spectrum.
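As a hedged illustration of this fitting step, the sketch below seeds a two-Gaussian model with rough starting values of the kind a second derivative search might supply, and refines them with scipy's `curve_fit` using `method="lm"` (Levenberg-Marquardt). The model function, the synthetic data, and the starting guesses are assumptions made for the example; they are not taken from PeakSeeker.

```python
import numpy as np
from scipy.optimize import curve_fit


def two_gaussians(x, h1, c1, s1, h2, c2, s2):
    """Sum of two Gaussian peaks (height, center, sigma for each)."""
    return (h1 * np.exp(-0.5 * ((x - c1) / s1) ** 2)
            + h2 * np.exp(-0.5 * ((x - c2) / s2) ** 2))


# Synthetic, noise-free peak range containing a main peak and a shoulder.
true_params = (1.0, 0.0, 1.0, 0.4, 2.5, 1.0)
x = np.linspace(-6.0, 9.0, 601)
y = two_gaussians(x, *true_params)

# Hypothetical starting guesses: centers near the second-derivative minima,
# heights read off the spectrum, widths deliberately underestimated.
guess = (0.9, -0.05, 0.8, 0.3, 2.75, 0.8)

# method="lm" selects the unconstrained Levenberg-Marquardt solver.
fitted, covariance = curve_fit(two_gaussians, x, y, p0=guess, method="lm")

for name, value in zip(("h1", "c1", "s1", "h2", "c2", "s2"), fitted):
    print(f"{name} = {value:.4f}")
```

Because the shoulder has no apex of its own, a local-maximum search would provide no seed for the second component, and the fit would have to describe the whole range with a single, distorted Gaussian.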
Charge State Determination and Further Analysis
Possible charge states are iterated based on the most intense peak and signals matching the corresponding charge state series are determined. These are modeled by a Gaussian distribution over the charge domain [9, 45]. The quality of the fit is assessed with an automated scoring function that examines both mass error and height error (Supplementary Information Section 5). The charge state is assigned either automatically by the scoring function, or optionally by the user, after inspection of the best fitting options.
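The iteration over candidate charge states can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring function described in the Supplementary Information: it uses only peak positions (no height error and no Gaussian model over the charge domain), and it scores each candidate by the fraction of predicted peaks with no observed match plus the fraction of observed peaks with no predicted match. The names `series_score` and `assign_charge`, the 2 m/z tolerance, and the synthetic 200 kDa species are hypothetical.

```python
import numpy as np

PROTON = 1.007276  # mass of a proton in Da


def series_score(peak_mz, mass, tol):
    """Position-only score for one candidate mass; lower is better."""
    lo, hi = peak_mz.min(), peak_mz.max()
    # Charge states whose predicted m/z falls inside the observed window.
    z_min = max(int(np.ceil(mass / (hi + tol - PROTON))), 1)
    z_max = int(np.floor(mass / (lo - tol - PROTON)))
    if z_max < z_min:
        return np.inf
    charges = np.arange(z_min, z_max + 1)
    predicted = mass / charges + PROTON
    dist = np.abs(peak_mz[:, None] - predicted[None, :])
    missed_predicted = np.mean(dist.min(axis=0) > tol)  # predicted, not seen
    missed_observed = np.mean(dist.min(axis=1) > tol)   # seen, not predicted
    return missed_predicted + missed_observed


def assign_charge(peak_mz, anchor_mz, candidates, tol=2.0):
    """Try each candidate charge for the anchor peak; return the best one."""
    peak_mz = np.asarray(peak_mz, dtype=float)
    scored = []
    for z in candidates:
        mass = z * (anchor_mz - PROTON)
        scored.append((series_score(peak_mz, mass, tol), z, mass))
    score, z, mass = min(scored)
    return z, mass, score


# Synthetic series: a 200,000 Da species observed at charges 28+ to 34+.
true_charges = np.arange(28, 35)
peaks = 200000.0 / true_charges + PROTON
anchor = 200000.0 / 31 + PROTON  # stands in for the most intense peak

best_z, best_mass, best_score = assign_charge(peaks, anchor, range(20, 46))
print(best_z, round(best_mass, 1), best_score)
```

Penalizing unmatched peaks in both directions is a design choice made for this sketch: it keeps integer multiples or fractions of the true charge from scoring as well as the true charge. An intensity-aware score, such as the one the text describes, provides additional discrimination when envelopes are interleaved.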
This is repeated up to four more times using the most intense peak that has not already been assigned envelope membership. Aside from this initial peak, other peaks may still be assigned to the current charge state series even if they already have membership in a different envelope. This accounts for peak overlap that was not detected by the second derivative search. A linear combination of up to five simulated charge envelopes is fit to the original raw spectrum using least squares regression. This method of deconvoluting peak signals is distinct from Massign, which subtracts envelopes from the spectrum one at a time and refits the remaining envelopes. Our method reduces the bias introduced by the sequential subtraction procedure. Finally, the masses and abundances of the molecular species are reported and the residual spectrum is evaluated for the quality of the fit. This process is iterated over the residuals until all peaks are determined.
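A minimal sketch of the simultaneous fit is given below, assuming that each simulated envelope is a column of a design matrix and that the abundances are the least squares coefficients. The envelope model (Gaussian peaks in m/z weighted by a Gaussian over the charge domain), the two masses, and all widths are hypothetical values chosen so that the two species overlap heavily; `numpy.linalg.lstsq` stands in for whatever regression routine the software actually uses.

```python
import numpy as np

PROTON = 1.007276  # mass of a proton in Da


def charge_envelope(mz, mass, z_center, z_sigma, peak_sigma):
    """Simulated envelope: one Gaussian peak per charge state, with heights
    following a Gaussian distribution over the charge domain."""
    charges = np.arange(int(z_center - 4 * z_sigma),
                        int(z_center + 4 * z_sigma) + 1)
    weights = np.exp(-0.5 * ((charges - z_center) / z_sigma) ** 2)
    centers = mass / charges + PROTON
    peaks = weights[:, None] * np.exp(
        -0.5 * ((mz[None, :] - centers[:, None]) / peak_sigma) ** 2)
    return peaks.sum(axis=0)


mz = np.linspace(5000.0, 8000.0, 6001)

# Two hypothetical species 400 Da apart, so that neighbouring peaks overlap.
env_a = charge_envelope(mz, 200000.0, 31, 1.5, 4.0)
env_b = charge_envelope(mz, 200400.0, 31, 1.5, 4.0)

# Synthetic "observed" spectrum with known abundances.
spectrum = 3.0 * env_a + 1.5 * env_b

# Fit both envelopes at once instead of subtracting them one at a time.
design = np.column_stack([env_a, env_b])
coeffs, *_ = np.linalg.lstsq(design, spectrum, rcond=None)

print("abundances:", coeffs)
print("relative abundances:", coeffs / coeffs.sum())
print("largest residual:", np.abs(spectrum - design @ coeffs).max())
```

Fitting all envelopes in one regression lets the shared intensity in overlapped peaks be divided according to what best explains the whole spectrum, which is the motivation the text gives for preferring this approach over sequential subtraction. With real data, a non-negativity constraint on the coefficients may also be worth considering.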
Results and Discussion
Peak Widening Due to Incomplete Desolvation
Peaks in native mass spectra of large complexes are typically much wider than theoretically predicted based on their isotope distributions. This is typically not a limitation of the instrumental mass resolving power. For instance, the Orbitrap Exactive Plus EMR instrument used in these studies was operated at a nominal resolution setting of 8750 determined at 200 m/z. Signals acquired from clusters of CsI ions, which lack isotope peaks, were consistent with these nominal resolution settings (data not shown). Inadequate desolvation and/or additional adduct formation are the primary limiting factors affecting the observed resolution of native mass spectra.
In the presence of 2% DMSO (Fig. 1b), the width of the peak is decreased to 350 Da, and the mass shift decreased to 220 Da. This observation provides further evidence that adduction determines the effective peak resolution in native MS measurements, as a decreased mass shift represents fewer adducts that are more narrowly distributed. This finding was true across all charge states of pol II. DMSO enhanced the resolution and increased the mass accuracy of native MS and also shifted the observed charge state distribution to higher z while preserving intact decameric and dodecameric pol II complexes (Supplementary Figure S2). The charge state range was shifted from 43–51+ without DMSO to 55–77+ with DMSO for decameric pol II with similar findings for dodecameric pol II. In pure water, the theoretical upper charge state limit (Zmax = 0.078 × √M) for a spherical complex can be estimated from the Rayleigh charging model [46, 47], in which M is the mass of the complex in Da. The Rayleigh limit for the decamer is 54+, near the maximum charge observed in aqueous ammonium acetate (51+). The observation that the maximum charge is shifted to 77+ in the presence of 2% DMSO is consistent with previous findings in which DMSO promotes partial unfolding of proteins in the electrospray droplet, inducing a shift to higher charge states [48, 49, 50]. Furthermore, these findings suggest that water molecules and other adducts are intrinsic structural elements necessary to maintain the solution-state conformation of the protein complex in the gas phase.
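The Rayleigh estimate quoted above is simple enough to evaluate directly. The short sketch below applies the relation from the text, with M in Da, to a complex of 1,000,000 Da, a round figure chosen for illustration that is close to the size quoted for TRiC later in the text. The function name is hypothetical.

```python
import math


def rayleigh_charge_limit(mass_da):
    """Upper charge state limit for a spherical complex in pure water,
    using the relation from the text: 0.078 * sqrt(M), with M in Da."""
    return 0.078 * math.sqrt(mass_da)


# A 1 MDa complex: sqrt(1e6) = 1000, so the limit is about 78+.
limit = rayleigh_charge_limit(1.0e6)
print(f"Rayleigh limit for 1 MDa: {limit:.1f}+")
```

Because the limit scales with the square root of the mass, doubling the mass of a complex raises the estimated maximum charge by only about 41%.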
Limitation of Peak Deconvolution Algorithms
Figure 2e shows the range in which the second derivative procedure provides additional deconvoluting power when the relative height and separation (plotted relative to the peak width) of the two simulated peaks are varied. The largest improvement is in finding low intensity shoulder peaks (Fig. 2a, b, e). The second derivative requires ~20% less separation to resolve peak overlap than the local maxima method when the second peak is less than half the height of the first. As the intensities of the two signals approach each other, the advantage of the second derivative declines, and the two methods are comparable for equally sized peaks (Fig. 2c, d, e). In this case, overlapping peaks may also be detected by the unusually large width of a peak compared with its neighbors.
There is a substantial range (separation <~1.0) in which neither method can deconvolve the simulated peaks. However, when the two peaks are very closely spaced (separation <0.48) no distortion is introduced by summing the overlaid signals. Therefore, such overlap does not inhibit charge state assignment. This overlap can be accounted for without explicitly detecting two peaks by apportioning such signals across multiple charge envelopes in the peak range (see the Experimental section). Between a separation of 0.48 and approximately 1.0, however, summed peaks will both distort the original peak parameters and escape detection by the second derivative.
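A simulation of this kind can be approximated in a few lines. The sketch below is not the procedure used to produce Figure 2; it assumes two noise-free Gaussians of equal width, expresses separation in units of the FWHM, and sweeps the separation to find the smallest value at which each method reports two peaks. The grid, the relative height of 0.4, and the detection thresholds are illustrative assumptions, so the numerical limits it prints need not match those quoted in the text.

```python
import numpy as np
from scipy.signal import find_peaks

FWHM_PER_SIGMA = 2.0 * np.sqrt(2.0 * np.log(2.0))


def two_peaks(x, rel_height, separation_fwhm, sigma=1.0):
    """Main peak at 0 plus a second peak `separation_fwhm` FWHMs away."""
    d = separation_fwhm * FWHM_PER_SIGMA * sigma
    return (np.exp(-0.5 * (x / sigma) ** 2)
            + rel_height * np.exp(-0.5 * ((x - d) / sigma) ** 2))


def count_local_maxima(x, y):
    """Number of apexes in the summed signal."""
    peaks, _ = find_peaks(y, height=1e-3 * y.max())
    return len(peaks)


def count_second_derivative_minima(x, y):
    """Number of negative minima in the numerical second derivative."""
    d2 = np.gradient(np.gradient(y, x), x)
    minima, _ = find_peaks(-d2, height=1e-3 * np.abs(d2).max())
    return len(minima)


def min_resolved_separation(counter, rel_height, separations, x):
    """Smallest separation at which `counter` reports at least two peaks."""
    for s in separations:
        if counter(x, two_peaks(x, rel_height, s)) >= 2:
            return s
    return None


x = np.linspace(-8.0, 16.0, 4801)
separations = np.arange(0.2, 2.0001, 0.02)

sep_max = min_resolved_separation(count_local_maxima, 0.4, separations, x)
sep_d2 = min_resolved_separation(count_second_derivative_minima, 0.4,
                                 separations, x)
print(f"local maxima resolve two peaks from {sep_max:.2f} FWHM")
print(f"second derivative resolves two peaks from {sep_d2:.2f} FWHM")
```

Repeating the sweep over a range of relative heights would trace out a boundary for each method, analogous in spirit to the map described for Figure 2e. On noisy data the second derivative amplifies noise, so the practical limits would depend on the smoothing applied beforehand.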
Additionally, we tested a method in which peaks are initially detected and fit using local maxima detection, followed by examination of the residuals in the subtracted spectrum for any new local maxima. When we tested the residuals procedure by itself, we found it difficult to determine if peaks in the subtracted spectrum are indeed true peaks or simply consequences of an imperfect peak line fit. On the other hand, the second derivative bypasses these confounding factors by searching the original signal for changes in concavity. The first derivative is as sensitive to changes in concavity as the second derivative, but is more difficult to interpret because peaks and valleys represent inflection points in the original spectrum. In contrast, the second derivative shows peaks in the original spectrum clearly as minima separated by zero-crossings (inflection points in the original spectrum). This convenient feature provides a starting guess for the centroid m/z to seed the least squares fitting procedure and simplifies user interpretation.
Native MS Spectrum of Nucleosome
The utility of the second derivative in deconvoluting overlapping ion signals is demonstrated by application to a spectrum obtained from synthetic mononucleosomes. Nucleosomes are the smallest packing units of DNA, consisting of approximately two turns of DNA wrapped around an octameric histone core particle. The post-translational state of the histone tails and the chromatin structural state of a gene are both fundamental determinants of gene expression, and direct observation of nucleosomes is a first step in developing MS-based structural analysis for chromosome architecture.
For comparison, we analyzed the spectrum using MagTran (Amgen Inc.), an implementation of the ZSCORE algorithm; Massign (University of Oxford); and Protein Deconvolution (Thermo Scientific), which implements the ReSpect (Positive Probability) maximum entropy algorithm. MagTran finds only a single charge state envelope, corresponding to the 199392 Da peak (Supplementary Figure S3). Massign provides more flexible peak fitting parameters for addressing shoulder peaks, and is consequently able to detect either the lower m/z ions or the higher m/z ions but not both (Supplementary Figure S3). The inability of these algorithms to deconvolute the overlapping signals is predicted from our analysis of the required peak separation and spacing (Fig. 2). The spacing of the signal centroids (9–15 m/z units) is approximately equal to the peak widths (FWHM), in the range where second derivative peak processing outperforms detection of local maxima. When the same spectrum is processed with Protein Deconvolution, which performs deconvolution in the forward direction and bypasses explicit peak detection, only charge state signals corresponding to the two higher mass species are detected (Supplementary Figure S3).
Native MS Spectrum of RNA-Polymerase II/α-Amanitin Complex
PeakSeeker deconvolutes eight distinct masses from the spectrum of pol II bound to α-amanitin (Fig. 4). These correspond to states of both the 12-mer and the 10-mer with 1–4 molecules of amanitin bound. The average difference between the interleaved masses was 983 ± 12 Da, which suggests that α-amanitin (919 Da) forms a complex with additional metal ions or buffer components. The deconvoluted masses are within 0.12% of the theoretical values and showed positive deviations because of incomplete desolvation. The relative abundance of 10-mer versus 12-mer was determined to be 57% 10-mer to 42% 12-mer.
Using Massign, we were able to retrieve the same masses and abundances as found with PeakSeeker, indicating that both peak detection methods perform equally well on well-resolved peaks in interleaved envelopes (Supplementary Figure S4). MagTran, on the other hand, produced no output because of the complexity of the spectrum (Supplementary Figure S4). This spectrum poses difficulty for algorithms not designed for native MS. The large number of closely spaced peaks in any given charge state requires discriminating between many combinations of charge state envelopes. PeakSeeker and Massign both extract charge state distributions efficiently from the original data by modeling the intensity of the series as Gaussian distributions. Both methods also allow a user to select the charge state distribution manually.
Thermo Protein Deconvolution was able to find five of the eight interleaved charge envelopes, but only after extensive tuning of the parameters (Supplementary Figure S4). For example, the software did not find any of the 12-mer species when the mass range was set from 400 to 600 kDa, but was able to find them when the mass range was extended to 700 kDa. Thus, current implementations of maximum entropy methods such as Protein Deconvolution can give inconsistent output depending on parameter settings.
Native MS Spectrum of Human TRiC Complex
Protein assemblies purified from human cell lines, although clearly more relevant to health related studies, are typically more heterogeneous than complexes reconstituted from individual components (e.g., our nucleosome preparations) or purified from yeast (e.g., our pol II preparation). The heterogeneity can manifest in native spectra as poorly resolved signals amidst high levels of noise and background. We tested our peak detection methods on one such noisy spectrum: a preparation of the human TCP-1 ring complex (TRiC), isolated from a HeLa cell line. TRiC is a 1 megaDalton sized chaperonin containing two stacked hetero-octameric rings whose central chambers facilitate folding of client proteins. A broad range of newly synthesized polypeptides from diverse cellular functions are substrates of TRiC.
Deconvolution and charge state assignment of electrospray mass spectra from native assemblies present challenges distinct from those of denatured proteins, including the observation of wide peak widths that are caused by the presence of adducts and incomplete desolvation. We have developed software to specifically address problems we encountered while analyzing native MS spectra from large complexes using other freely and commercially distributed software packages available within our Research Resource. PeakSeeker is capable of mass assignment from native spectra that contain wide ion signals in the context of overlapping and highly interleaved charge distributions and low signal-to-noise ratio. Multiple methods of peak detection and other signal processing options as well as automated and user-dependent modes provide users with great flexibility in addressing the challenges of a particular analysis. We demonstrate the applicability of this software to the analysis of several large protein complexes with relevance to transcriptional regulation (nucleosomes and pol II) and protein folding (TRiC) and representing leading-edge protein purification techniques. Although other software has been developed to address native MS spectra specifically, we present a flexible algorithm that is distributed under the principles of free software, allowing other users to improve upon the algorithms and adapt them for their specific applications.
The work was supported by the Bio-Organic Biomedical Mass Spectrometry Resource at UCSF (A. L. Burlingame, Director) supported by the Biomedical Technology Research Centers program of the NIH National Institute of General Medical Sciences, NIH NIGMS 8P41GM103481. The ThermoScientific Exactive Plus EMR instrument was supported by NIH NIGMS 8P41GM103481, and UCSF 2013 ETAC Shared Equipment Grant. Initial native MS experiments were also supported by ThermoScientific (Bremen). Additional funding is from NIH P41GM103832, PN2EY016525, GM49985, GM36659, AI21144, AG021601, AG002132, AG010770.
We thank Dr. Zhongqi Zhang for providing the updated MagTran software with extended mass range.
The software is licensed under the terms of the GNU General Public License as published by the Free Software Foundation. The software is available for download at https://github.com/lujonathanh/PeakSeeker.