From Raw Data to Biological Discoveries: A Computational Analysis Pipeline for Mass Spectrometry-Based Proteomics
In the last two decades, computational tools for mass spectrometry-based proteomics data analysis have evolved from a few stand-alone software solutions serving specific goals, such as the identification of amino acid sequences based on mass spectrometry spectra, to large-scale complex pipelines integrating multiple computer programs to solve a collection of problems. This software evolution has been mostly driven by the appearance of novel technologies that allowed the community to tackle complex biological problems, such as the identification of proteins that are differentially expressed in two samples under different conditions. The achievement of such objectives requires a large suite of programs to analyze the intricate mass spectrometry data. Our laboratory addresses complex proteomics questions by producing and using algorithms and software packages. Our current computational pipeline includes, among other things, tools for mass spectrometry raw data processing, peptide and protein identification and quantification, post-translational modification analysis, and protein functional enrichment analysis. In this paper, we describe a suite of software packages we have developed to process mass spectrometry-based proteomics data and we highlight some of the new features of previously published programs as well as tools currently under development.
Keywords: Proteomics; Bioinformatics; Computational biology; Algorithms; Statistics; Peptide identification; Protein identification; Quantitative proteomics; Functional enrichment analysis; Database
Computer science and bioinformatics played a crucial role in the development of the field of mass spectrometry (MS)-based proteomics. As MS instruments evolved, the data generated grew increasingly intricate. In response, computational strategies were developed to dramatically increase our capacity to extract knowledge from MS data. Foremost among these is the SEQUEST algorithm, which allowed the high-throughput identification of thousands of tandem MS (MS/MS) spectra in a reasonable time; such a feat was unthinkable 20 years ago. Since then, multiple peptide and protein sequence database search engines have been proposed (Mascot, OMSSA, X!Tandem, MS-GF, Andromeda, and, more recently, Morpheus). However, protein identification remains an important challenge, and MS-based proteomics now involves a larger variety of computational and data analysis problems.
MS now allows, among other things, the identification and quantification of post-translationally modified proteins as well as interacting proteins and the differential quantification of proteins in samples under different experimental conditions [8, 9, 10, 11]. These analyses are achieved with a large collection of software packages for processing MS data and for performing statistical analyses. A large number of MS-based proteomics analyses can typically be broken down into three initial steps. First, the raw data produced by mass spectrometers need to be processed and transformed into a suitable input for programs performing data analysis. Second, the spectra are matched to peptides using protein sequence database searching, spectral library searching, or de novo sequencing. Third, the statistical significance of peptide and protein identifications needs to be assessed. Subsequent steps may vary based on the goals of the experiment. A variety of computational tools may be used to quantify peptides and proteins (e.g., ProRata, pQuant, and MaxQuant), as well as identify PTMs (e.g., Ascore, InsPecT). Of note, the three previously enumerated initial steps may differ in MS-based proteomics workflows that involve methods such as selected reaction monitoring (SRM) and data-independent acquisition (DIA) [17, 18]. There is also a collection of publicly available software packages that can be used to filter out nonspecific protein–protein interactions (PPIs) and identify those of high confidence in datasets produced using affinity purification coupled to MS. While we do not explicitly describe these tools here, we count among them software packages such as SAINT, Decontaminator [20, 21], Mist, and Compass.
The large number of complex software packages required to perform the computational analysis transforming the mass spectrometer raw data into meaningful proteomics and biological discoveries can make the processing of large-scale datasets quite challenging, especially when relying on multiple applications from different sources. Since the output of a given program often serves as input for another computational tool, paired outputs and inputs have to remain compatible over time from a file format and conceptual perspective throughout the computational pipeline. A number of computational proteomics pipelines that include the complete set of tools necessary to perform all or a subset of the analysis of an MS-based proteomics experiment have been proposed. Some of these pipelines present a comprehensive software solution for MS-based proteomics data analysis and include among their modules computational tools for peptide and protein identification and quantification (e.g., MaxQuant, pFind Studio, PEAKS, OpenMS Proteomics Pipeline (TOPP), and Trans-Proteomic Pipeline (TPP)). Others focus on the quantitative analysis and biological interpretation of MS-based proteomics data (e.g., PatternLab) or on the protein identification and the statistical validation of the results (e.g., Mascot-Percolator interface package). In addition, a set of modular open-source cross-platform computational tools and libraries processing and analyzing MS data are available under the ProteoWizard software project. ProteoWizard includes among other things the Skyline software package, which allows the analysis of quantitative MS-based proteomics data acquired using a variety of methods such as SRM and DIA.
Mass Spectrometry Raw Data Processing
The first step of the vast majority of MS-based proteomics experiments remains peak extraction from MS raw data. Our laboratory previously developed a tool named RawExtract that converts binary Thermo Fisher Scientific RAW files to text-based files in MS1/MS2 format, which can be used for peptide identification. RawExtract has evolved over time and its latest version, named RawConverter, introduces a set of novel features. RawConverter can convert RAW files to MS1/MS2, but also to MGF (Mascot Generic Format) or mzXML formats. It also now provides file format conversions from mzXML and mzML files to MS1/MS2 and MGF files, and from MGF files to MS2 files. These conversions conveniently fulfill the MS file format requirements of the vast majority of downstream data analysis tools that are publicly available, including those of our computational pipeline.
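To illustrate the kind of text-based conversion discussed above, the following Python sketch turns MGF text into a minimal MS2-style listing. It is not RawConverter's code: the function names `mgf_to_ms2` and `parse_charges` are hypothetical, only positive charge states are handled, spectra are numbered sequentially because MGF does not require scan numbers, and the output is reduced to H, S, and Z lines followed by peak lines.

```python
PROTON = 1.007276  # proton mass in Da


def parse_charges(value):
    """Parse an MGF CHARGE value such as '2+' or '2+ and 3+' (positive only)."""
    tokens = value.replace("and", ",").split(",")
    return [int(tok.strip().rstrip("+")) for tok in tokens if tok.strip()]


def mgf_to_ms2(mgf_text):
    """Return MS2-style text for every BEGIN IONS/END IONS block in mgf_text."""
    out = ["H\tComments\tconverted from MGF by an illustrative sketch"]
    scan = 0
    pepmass, charges, peaks, in_ions = None, [], [], False
    for raw in mgf_text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            pepmass, charges, peaks, in_ions = None, [], [], True
        elif line == "END IONS":
            if in_ions and pepmass is not None:
                scan += 1  # sequential numbering, an assumption of this sketch
                out.append(f"S\t{scan}\t{scan}\t{pepmass:.5f}")
                for z in charges:
                    # Z lines carry the singly protonated mass [M+H]+
                    mh = pepmass * z - (z - 1) * PROTON
                    out.append(f"Z\t{z}\t{mh:.5f}")
                out.extend(f"{mz} {inten}" for mz, inten in peaks)
            in_ions = False
        elif in_ions and line:
            if line[0].isdigit():
                parts = line.split()
                if len(parts) >= 2:
                    peaks.append((parts[0], parts[1]))
            elif "=" in line:
                key, value = line.split("=", 1)
                if key == "PEPMASS":
                    pepmass = float(value.split()[0])  # ignore optional intensity
                elif key == "CHARGE":
                    charges = parse_charges(value)
    return "\n".join(out) + "\n"
```

Peak values are kept as the original strings so that no precision is lost in the round trip.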
In addition to extracting the MS or MS/MS data from RAW files generated by Thermo Fisher Scientific instruments, RawConverter addresses problems regarding the selection of the monoisotopic peaks of precursor ions. The inaccurate precursor mass assignment of an MS/MS spectrum significantly increases the probability of introducing false positives in subsequent peptide and protein identifications. Mass spectrometers from certain manufacturers tend to label the peak with the highest intensity in a given isolation window as the precursor peak when spectra are collected using data-dependent acquisition (DDA). Generally, this precursor peak can be considered as the monoisotopic peak for short peptides (with mass equal to or less than 1,500 Da) since their monoisotopic peaks typically have the highest intensities. However, the monoisotopic ions of longer peptides (with mass greater than 1,500 Da) may not be those with the highest intensities, making the above straightforward peak picking strategy far from ideal for large peptides. Existing tools available in the ProteoWizard software project and pParse have recently made progress on tackling this issue. The RawConverter algorithm also addresses this problem by combining the Averagine model and the information of successive MS1 scans to select the monoisotopic peak for each precursor ion.
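The idea of comparing an observed isotopic envelope against a model envelope can be sketched in Python as follows. This is a simplified illustration and not the RawConverter algorithm: it uses a single MS1 scan rather than successive scans, it replaces the Averagine model with a crude Poisson approximation, and the constant `LAMBDA_PER_DA` as well as all function names are assumptions made for this example.

```python
import math

PROTON = 1.007276          # proton mass in Da
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da
LAMBDA_PER_DA = 5.4e-4     # assumed mean number of heavy isotopes per Da


def theoretical_envelope(mass, n_peaks=6):
    """Poisson approximation of the relative isotopic peak heights."""
    lam = mass * LAMBDA_PER_DA
    probs = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n_peaks)]
    total = sum(probs)
    return [p / total for p in probs]


def intensity_near(peaks, target_mz, tol):
    """Highest intensity among peaks within tol of target_mz, else 0."""
    hits = [inten for mz, inten in peaks if abs(mz - target_mz) <= tol]
    return max(hits) if hits else 0.0


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def pick_monoisotopic(ms1_peaks, reported_mz, charge,
                      max_shift=3, tol=0.01, n_peaks=6):
    """Try shifting the reported precursor down by 0..max_shift isotope steps
    and keep the candidate whose observed envelope best fits the model."""
    step = ISOTOPE_SPACING / charge
    best_mz, best_score = reported_mz, -1.0
    for shift in range(max_shift + 1):
        candidate = reported_mz - shift * step
        observed = [intensity_near(ms1_peaks, candidate + i * step, tol)
                    for i in range(n_peaks)]
        mass = (candidate - PROTON) * charge
        score = cosine(observed, theoretical_envelope(mass, n_peaks))
        if score > best_score:
            best_mz, best_score = candidate, score
    return best_mz
```

Under this approximation the monoisotopic peak is the tallest for a 1,000 Da peptide but not for a 3,000 Da peptide, which is the situation where the most intense peak and the monoisotopic peak diverge.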
Another problem arises in the context of DIA methods. Most MS instruments do not supply precursor m/z and charge information of the peptides in a given wide acquisition window; they merely provide the middle m/z and the size of the isolation windows. Providing more accurate precursor information to database search engines would allow a reduction of the search space and, therefore, an increased discriminative power when scoring peptide-spectrum matches (PSMs). This would benefit the majority of the current tools assessing the confidence of protein and peptide identifications [32, 42, 43]. RawConverter attempts to address this issue by enumerating and evaluating all possible precursor isotopic envelopes in a given DIA isolation window. Users can define the maximum number of precursor isotopic envelopes reported. The most confident envelopes are selected to determine the possible precursor m/z values and charge states. MS/MS spectra can then be duplicated for each possible precursor, thereby allowing each spectrum to be individually searched for peptide identification.
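The duplication step described above can be sketched in Python as follows. The `Spectrum` and `Envelope` types, the `score` field, and the function name are hypothetical; how RawConverter actually scores envelopes is not described here, so the score is simply taken as given.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Spectrum:
    scan: int
    precursor_mz: float   # for DIA input: the middle of the isolation window
    charge: int           # 0 when unknown
    peaks: Tuple[Tuple[float, float], ...]


@dataclass(frozen=True)
class Envelope:
    mono_mz: float
    charge: int
    score: float          # hypothetical confidence score, higher is better


def duplicate_for_precursors(spectrum: Spectrum, envelopes: List[Envelope],
                             window_center: float, window_width: float,
                             max_envelopes: int = 3) -> List[Spectrum]:
    """Return one copy of the MS/MS spectrum per retained precursor envelope."""
    lo = window_center - window_width / 2
    hi = window_center + window_width / 2
    in_window = [e for e in envelopes if lo <= e.mono_mz <= hi]
    best = sorted(in_window, key=lambda e: e.score, reverse=True)[:max_envelopes]
    # Each copy keeps the fragment peaks but gets its own precursor m/z and charge
    return [replace(spectrum, precursor_mz=e.mono_mz, charge=e.charge) for e in best]
```

The `max_envelopes` argument plays the role of the user-defined maximum number of reported envelopes.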
Peptide and Protein Identification and Statistical Significance Assessment
Once extracted, the MS data output by RawConverter (MS2 file format) is searched by our protein sequence database search engine ProLuCID to identify PSMs. ProLuCID, which is implemented as a platform-independent Java program, is inspired by the SEQUEST algorithm and, therefore, uses a modified cross-correlation (XCorr) calculation to score a spectrum against the theoretical spectrum of a given peptide sequence. ProLuCID supports low- and high-resolution MS data. It also introduces the computation of a binomial probability serving as a preliminary score to pre-filter spectra and to improve computational performance. In addition, it reports a Z-score for the PSM with the highest XCorr for each spectrum, which can be used to filter search results.
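The two scores mentioned above can be illustrated with a short Python sketch. It shows a simplified SEQUEST-style cross-correlation (the dot product at zero offset minus the mean dot product over neighbouring offsets, here excluding the zero offset) and a generic binomial tail probability. It is not ProLuCID's modified XCorr or its exact preliminary score; spectrum preprocessing is omitted and the function names and the `max_lag` default are assumptions.

```python
import math


def xcorr(observed, theoretical, max_lag=75):
    """Simplified cross-correlation score between two binned spectra.

    observed and theoretical are equal-width binned intensity vectors.
    """
    def dot_at(lag):
        return sum(
            observed[i] * theoretical[i + lag]
            for i in range(len(observed))
            if 0 <= i + lag < len(theoretical)
        )

    # Average similarity at non-zero offsets estimates the random background
    background = sum(
        dot_at(lag) for lag in range(-max_lag, max_lag + 1) if lag != 0
    ) / (2 * max_lag)
    return dot_at(0) - background


def binomial_tail(n, k, p):
    """Probability of at least k matches among n theoretical fragment ions
    when each one matches by chance with probability p."""
    return sum(
        math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )
```

A small tail probability means the observed number of matched fragments is unlikely to be random, so the candidate peptide would be kept for the more expensive cross-correlation scoring.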
The significance of the set of PSMs produced by most database search engines needs to be statistically assessed by computational tools, such as Percolator. In our pipeline, PSMs found with ProLuCID are statistically assessed using DTASelect [32, 43]. DTASelect also identifies the set of proteins present in the analyzed sample based on ProLuCID’s PSMs. Over the years, DTASelect’s PSM filtering accuracy has been improved. Furthermore, the latest version (DTASelect 2) now outputs the set of spectra, peptides, or proteins identified under a user-selected false discovery rate for a given experiment. Thanks to recent improvements, DTASelect 2 provides great filtering flexibility by allowing users to define up to 150 different parameters. For example, a filter can be set so that the only reported proteins are those identified by a certain number of peptides, at least one of which obtained an XCorr above a given threshold. These new filters allow the generation of high-confidence datasets by increasing the number of protein identifications under a given false discovery rate.
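Filtering under a user-selected false discovery rate can be sketched with a generic target-decoy estimate. The sketch below assumes that each PSM carries a single score and a flag marking decoy hits, and it estimates the FDR as the ratio of decoys to targets above a score cutoff. This is a common convention used here for illustration only; it is not DTASelect's filtering procedure, and the function name is hypothetical.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the loosest score cutoff whose estimated FDR
    (decoys / targets) does not exceed max_fdr.

    psms is a list of (score, is_decoy) tuples; higher scores are better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cut = 0
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cut = i  # remember the deepest position that still passes
    return [p for p in ranked[:best_cut] if not p[1]]
```

Scanning the whole ranked list, rather than stopping at the first failure, keeps the largest set of identifications that satisfies the requested rate.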
While the vast majority of the samples analyzed by MS/MS in our laboratory involve organisms for which most protein sequences are known, we sometimes process samples originating from organisms for which the genomes are not yet sequenced. This renders a typical sequence database search impossible. For such samples, peptides are usually identified using the de novo sequencing software package pNovo.
Peptide and Protein Quantification
The recently released Census ver. 2 comprises novel features for the analysis of peptides quantified using tandem mass tag (TMT) reporter ions. Included among these are a reporter ion impurity correction, a reporter ion minimum intensity threshold filter, and an optional weighted normalization that corrects mixing errors. Census 2 can also process data from MS experiments performed using HCD, CID/HCD double-play, HCD MS3, or MultiNotch MS3 [54, 55]. Additional features have also been included for quantification using metabolic labeling (e.g., 15N, SILAC). Among these are the calculation of a reverse ratio for the label swap experimental setup [56, 57] and the assessment of the statistical significance of differentially expressed peptides. More recently, improvements were implemented in Census 2 for the calculation of 15N enrichment ratios. Census 2 uses the elemental composition of amino acids to calculate the isotopic distributions of 15N enriched peptides. As 15N labeling shifts the mass of peptides based on the number of nitrogen atoms they contain, Census 2 uses all possible theoretical isotope distributions and maps them to the experimental ones to find the best match using a linear regression. Census 2 computes the atomic enrichment for each peptide independently, as it can vary based on a given protein’s turnover rate.
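The matching of theoretical to experimental isotope distributions described above can be sketched in Python under strong simplifying assumptions. Only nitrogen is modelled (each of the peptide's nitrogen atoms is 15N with probability equal to the enrichment, giving a binomial distribution), the contribution of carbon, hydrogen, oxygen, and sulfur isotopes is ignored, and the candidate enrichments are scanned on a fixed grid. The function names and the grid are assumptions; this is not the Census 2 implementation.

```python
import math


def n15_distribution(n_nitrogen, enrichment):
    """Relative abundance of species carrying k = 0..n_nitrogen 15N atoms."""
    return [
        math.comb(n_nitrogen, k)
        * enrichment ** k
        * (1 - enrichment) ** (n_nitrogen - k)
        for k in range(n_nitrogen + 1)
    ]


def r_squared(x, y):
    """Coefficient of determination of the linear regression of y on x
    (0 when the fit is degenerate or the slope is not positive)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0 or sxy <= 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def estimate_enrichment(observed, n_nitrogen, candidates=None):
    """Return the candidate enrichment whose theoretical distribution best
    fits the observed intensities (one value per possible number of 15N)."""
    if candidates is None:
        candidates = [i / 1000 for i in range(500, 1000)]  # 0.500 .. 0.999
    return max(
        candidates,
        key=lambda e: r_squared(n15_distribution(n_nitrogen, e), observed),
    )
```

Because the regression fit is insensitive to overall scale, raw intensities can be passed in without normalization, and the estimate can be computed independently for every peptide.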
Post-Translational Modifications Discovery and Assessment
While ProLuCID’s database search algorithm allows the identification of post-translationally modified peptides, some proteomics studies might require a more in-depth analysis of certain PTM events. Hence, our computational pipeline also includes the probability-based phosphorylation localization tool Ascore, which was developed by Beausoleil et al. In addition, we developed an algorithm named Debunker that uses a support vector machine binary classifier to assess the phosphorylation status of a given peptide. Debunker may be used in combination with Ascore to assess the confidence of phosphorylation status and site localization for a given peptide.
Functional Enrichment Analysis
Functional enrichment analyses are often used to generate hypotheses regarding the underlying mechanisms revealed in MS-based proteomics datasets. Such analyses can consist of identifying gene ontology (GO) annotations that are over-represented in a set of differentially expressed proteins or finding biological pathways that are significantly implicated in a set of protein–protein interactions (PPIs). Our computational pipeline currently includes the use of the Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, CA, USA www.qiagen.com/ingenuity) and of DAVID as well as Ontologizer for GO enrichment analyses.
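Over-representation of an annotation in a protein set is commonly assessed with a hypergeometric test, which can be sketched in Python as follows. This is a generic textbook formulation given for illustration; it does not reproduce the specific statistics implemented by IPA, DAVID, or Ontologizer, and the function and parameter names are assumptions.

```python
import math


def hypergeom_enrichment_p(population, annotated, sample, sample_annotated):
    """P-value for seeing at least sample_annotated annotated proteins.

    population        number of proteins in the background set
    annotated         background proteins carrying the GO annotation
    sample            proteins in the set of interest
    sample_annotated  proteins of interest carrying the GO annotation
    """
    total = math.comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        math.comb(annotated, k) * math.comb(population - annotated, sample - k)
        for k in range(sample_annotated, upper + 1)
    )
    return favourable / total
```

For example, with a background of 10 proteins of which 5 are annotated, drawing 2 proteins that are both annotated has probability 10/45. In practice the resulting p-values would still need a multiple-testing correction across all annotations tested.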
We also recently developed our own functional enrichment analysis algorithm for MS-based proteomics quantification experiments (PSEA-Quant). Enrichment analyses of gene/protein sets, such as those obtained from the GO and the Molecular Signature databases, were often developed for genomic datasets and are poorly adapted to some of the particularities of MS-based proteomics data, including the level of reproducibility of MS quantitative measurements. They also tend to require the use of arbitrary thresholds to define the subset of proteins in a dataset on which the enrichment analysis should be performed. PSEA-Quant addresses both of these problems. This web-based user-friendly algorithm, inspired by GSEA, entails a novel protein set enrichment analysis for label-free and labeled MS-based quantitative proteomics. Unlike GSEA, PSEA-Quant allows the analysis of proteomics samples originating from single or multiple conditions. This Java program uses Census’ output, while supporting other file formats, to identify protein sets that are significantly enriched among abundant proteins quantified with high reproducibility across a set of replicates. PSEA-Quant was used to highlight putative underlying mechanisms of cystic fibrosis, among other things.
Algorithms and Tools Currently in Development
This article presents a snapshot of the state of our current computational pipeline. However, our suite of software packages is under constant evolution. It continuously adapts to novel technologies with the goal of producing high impact discoveries. For instance, we are currently developing algorithms for the identification of cross-linked peptides. In addition, we are designing a novel blind search algorithm that allows the identification of peptides with any unspecified PTMs or amino acid substitutions using MS/MS spectra. We also continue to improve our intact protein (top-down) computational analysis pipeline, which includes tools such as ProSightPC (Thermo Fisher Scientific) and ByOnic, by developing algorithms that improve protein identification sensitivity and estimate the number of protein species in a dataset analyzed through top-down MS.
A recent addition to our computational pipeline is a tool named PrOntoNet, which infers PPIs from a list of identified proteins in organisms for which there are very few known PPIs in the literature and for which the proteins are largely uncharacterized functionally. PrOntoNet uses the output of DTASelect to recover known PPIs from identified proteins in the STRING and BioGRID databases. It then uses blast2GO to associate GO annotations to the proteins in the dataset. PrOntoNet finally uses the proportion of overlap between the GO annotations of all protein pairs in the dataset in order to infer interactions between them.
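The final step, scoring protein pairs by the overlap of their GO annotations, can be sketched in Python as follows. The exact overlap measure and decision rule used by PrOntoNet are not specified here, so the Jaccard index, the `threshold` parameter, and the function names are assumptions made for illustration.

```python
from itertools import combinations


def go_overlap(terms_a, terms_b):
    """Jaccard index: shared GO terms divided by all GO terms of the pair."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def infer_interactions(go_annotations, threshold=0.5):
    """Return (protein_a, protein_b, overlap) for every pair of proteins whose
    annotation overlap reaches the threshold.

    go_annotations maps a protein identifier to its set of GO term identifiers.
    """
    inferred = []
    for a, b in combinations(sorted(go_annotations), 2):
        score = go_overlap(go_annotations[a], go_annotations[b])
        if score >= threshold:
            inferred.append((a, b, score))
    return inferred
```

Sorting the identifiers before pairing makes the output order deterministic, which helps when comparing runs on the same dataset.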
The tools described above provide a large amount of information on the PSMs and the proteins identified in a given proteomics dataset. The results from several bioinformatics analyses and data from public external resources and databases are typically processed and combined in order to convey to the scientific community a comprehensive and meaningful message about a proteomics study. This task often involves the manual integration of in-house software and script results, as well as data from spreadsheets, into customized summary tables that are later published in an article. The large-scale datasets and results produced with MS-based proteomics pose important challenges in terms of long-term storage and data reusability. Reproducing the creation process of complex summary tables can be difficult after a long period of time, especially when attempted by someone other than the original author. Databases such as PRIDE, MassIVE (massive.ucsd.edu), and PeptideAtlas, which are all included in the ProteomeXchange consortium, allow the long-term storage and distribution of data. However, for the most part they do not permit the full integration of custom computational and statistical significance results. We are currently tackling both of these challenges simultaneously by designing a software package named Proteomics INTegrator (PINT). PINT is an open-source platform-independent Java program built on top of a MySQL database engine that provides an environment to store data and integrate results from various computational analyses (including ProLuCID, DTASelect, and Census), which may originate from different proteomics approaches. The protein annotations available in the UniProtKB repository are automatically appended to all datasets stored in PINT. PINT allows users to easily import, visualize, and download data through a user-friendly web interface, thereby facilitating its dissemination and accessibility.
In addition, our tool provides a powerful and flexible query system allowing the retrieval of specific elements and values (e.g., quantification measurements, confidence scores, statistical significance results, manual annotations, UniProtKB protein annotations) in the data of a given experiment or across several experiments simultaneously.
Our computational pipeline has come a long way since the publication of the SEQUEST algorithm. We strive to continuously produce and use novel algorithms that take advantage of new MS technological improvements. We strongly believe that computational approaches that aim to make the most out of cutting-edge technologies will continue to yield high impact biological discoveries.
The authors are grateful to Claire M. Delahunty for helpful discussions and comments. They acknowledge funding from the following National Institutes of Health grants: P41 GM103533, R01 MH067880, R01 MH100175, UCLA/NHLBI Proteomics Centers (HHSN268201000035C), and 1U54GM114833. M.L.A. holds a postdoctoral fellowship from the Fonds de recherche du Québec – nature et technologies (FRQNT).