Journal of The American Society for Mass Spectrometry, Volume 26, Issue 11, pp 1820–1826

From Raw Data to Biological Discoveries: A Computational Analysis Pipeline for Mass Spectrometry-Based Proteomics

  • Mathieu Lavallée-Adam
  • Sung Kyu Robin Park
  • Salvador Martínez-Bartolomé
  • Lin He
  • John R. Yates III
Focus: 20 Year Anniversary of SEQUEST: Account & Perspective

Abstract

In the last two decades, computational tools for mass spectrometry-based proteomics data analysis have evolved from a few stand-alone software solutions serving specific goals, such as the identification of amino acid sequences based on mass spectrometry spectra, to large-scale complex pipelines integrating multiple computer programs to solve a collection of problems. This software evolution has been mostly driven by the appearance of novel technologies that allowed the community to tackle complex biological problems, such as the identification of proteins that are differentially expressed in two samples under different conditions. The achievement of such objectives requires a large suite of programs to analyze the intricate mass spectrometry data. Our laboratory addresses complex proteomics questions by producing and using algorithms and software packages. Our current computational pipeline includes, among other things, tools for mass spectrometry raw data processing, peptide and protein identification and quantification, post-translational modification analysis, and protein functional enrichment analysis. In this paper, we describe a suite of software packages we have developed to process mass spectrometry-based proteomics data and we highlight some of the new features of previously published programs as well as tools currently under development.

Keywords

Proteomics; Bioinformatics; Computational biology; Algorithms; Statistics; Peptide identification; Protein identification; Quantitative proteomics; Functional enrichment analysis; Database

Introduction

Computer science and bioinformatics played a crucial role in the development of the field of mass spectrometry (MS)-based proteomics. As MS instruments evolved, the data generated grew increasingly intricate. In response, computational strategies were developed to dramatically increase our capacity to extract knowledge from MS data. Foremost among these is the SEQUEST algorithm [1] that allowed the high-throughput identification of thousands of tandem MS (MS/MS) spectra in a reasonable time; such a feat was unthinkable 20 years ago. Since then, multiple peptide and protein sequence database search engines have been proposed (Mascot [2], OMSSA [3], X!Tandem [4], MS-GF [5], Andromeda [6], and, more recently, Morpheus [7]). However, protein identification still remains an important challenge, and MS-based proteomics now involves a larger variety of computational and data analysis problems.

MS now allows, among other things, the identification and quantification of post-translationally modified proteins as well as interacting proteins and the differential quantification of proteins in samples under different experimental conditions [8, 9, 10, 11]. These analyses are achieved with a large collection of software packages for processing MS data and for performing statistical analyses. A large number of MS-based proteomics analyses can be typically broken down into three initial steps: first, the raw data produced by mass spectrometers need to be processed and transformed into a suitable input for programs performing data analysis. Second, the spectra are matched to peptides using protein sequence database searching, spectral library searching, or de novo sequencing. Third, the statistical significance of peptide and protein identifications needs to be assessed. Subsequent steps may vary based on the goals of the experiment. A variety of computational tools may be used to quantify peptides and proteins (e.g., ProRata [12], pQuant [13], and MaxQuant [14]), as well as identify PTMs (e.g., Ascore [15], InsPecT [16]). Of note, the three previously enumerated initial steps may differ in MS-based proteomics workflows that involve methods such as selected reaction monitoring (SRM) and data-independent acquisition (DIA) [17, 18]. There is also a collection of publicly available software packages that can be used to filter out nonspecific protein–protein interactions (PPIs) and identify those of high confidence in datasets produced using affinity purification coupled to MS. While we do not explicitly describe these tools here, we count among them software packages such as SAINT [19], Decontaminator [20, 21], Mist [22], and Compass [23].

The large number of complex software packages required to perform the computational analysis transforming the mass spectrometer raw data into meaningful proteomics and biological discoveries can make the processing of large-scale datasets quite challenging, especially when relying on multiple applications from different sources. Since the output of a given program often serves as input for another computational tool, paired outputs and inputs have to remain compatible over time from a file format and conceptual perspective throughout the computational pipeline. A number of computational proteomics pipelines that include the complete set of tools necessary to perform all or a subset of the analysis of an MS-based proteomics experiment have been proposed. Some of these pipelines present a comprehensive software solution for MS-based proteomics data analysis and include among their modules computational tools for peptide and protein identification and quantification (e.g., MaxQuant [14], pFind Studio [24], PEAKS [25], OpenMS Proteomics Pipeline (TOPP) [26], and Trans-Proteomic Pipeline (TPP) [27]). Others focus on the quantitative analysis and biological interpretation of MS-based proteomics data (e.g., PatternLab [28]) or on the protein identification and the statistical validation of the results (e.g., Mascot-Percolator interface package [29]). In addition, a set of modular open-source cross-platform computational tools and libraries processing and analyzing MS data are available under the ProteoWizard software project [30]. ProteoWizard includes among other things the Skyline software package [31] that allows the analysis of quantitative MS-based proteomics data acquired using a variety of methods such as SRM and DIA.

Our group has also developed a collection of computational tools to perform the computational analysis of MS-based proteomics datasets from the raw data to its biological interpretation [1, 32, 33, 34, 35]. Our tools were developed using a set of requirements (over which we have complete control) in order to ensure the compatibility of each tool throughout the evolution of our pipeline. Figure 1 illustrates our main computational pipeline. While all of the tools developed by our group are publicly available as stand-alone applications (http://fields.scripps.edu/researchtools.php), a large number of these software packages are also available in a comprehensive web-based environment, called Integrated Proteomics Pipeline (IP2) (Integrated Proteomics Applications), that allows the analysis of large-scale MS-based proteomics datasets. While some users prefer executing each computational tool in the pipeline independently, IP2 facilitates the analysis of large-scale complex datasets by providing a single interface and work environment for the implementation of the complete analysis pipeline. In this article, we discuss numerous tools that compose our computational pipeline for MS-based proteomics analysis, with an emphasis on some of their new features. We also present our most recent software packages and the algorithms that are currently under development in our laboratory.
Figure 1

Schematic representation of the main computational pipeline of our laboratory. For a given experiment, the computational tools in the large black box may be used based on the goals of the experiment. Green boxes represent software packages developed in our laboratory and yellow boxes illustrate those designed by third parties

Computational Pipeline

Mass Spectrometry Raw Data Processing

The first step of the vast majority of MS-based proteomics experiments remains peak extraction from MS raw data. Our laboratory previously developed a tool named RawExtract that converts binary Thermo Fisher Scientific RAW files to text-based files in MS1/MS2 format [36], which can be used for peptide identification. RawExtract has evolved over time and its latest version, named RawConverter, introduces a set of novel features. RawConverter can convert RAW files to MS1/MS2, but also to MGF (Mascot Generic Format) or mzXML [37] formats. It also now provides file format conversions from mzXML and mzML files [38] to MS1/MS2 and MGF files, and from MGF files to MS2 files. These conversions conveniently fulfill the MS file format requirements of the vast majority of downstream data analysis tools that are publicly available, including those of our computational pipeline.
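As an illustration of what such a format conversion involves, the following minimal Python sketch turns MGF text into MS2-style records. It is not RawConverter's code. The function name mgf_to_ms2 is hypothetical, scan numbers are assigned sequentially because MGF records need not carry them, only a single charge per spectrum (e.g., CHARGE=2+) is handled, and MS2 header (H) lines are omitted.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def mgf_to_ms2(mgf_text):
    """Convert MGF-formatted text into a list of MS2-style lines (sketch)."""
    out = []
    scan = 0
    in_ions = False
    mz, charge, peaks = None, None, []
    for raw in mgf_text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            in_ions, mz, charge, peaks = True, None, None, []
        elif line == "END IONS":
            if in_ions and mz is not None:
                scan += 1  # assumption: sequential scan numbers
                out.append(f"S\t{scan}\t{scan}\t{mz:.5f}")
                if charge:
                    # The Z line reports the singly protonated mass (M+H)+.
                    mh = mz * charge - (charge - 1) * PROTON_MASS
                    out.append(f"Z\t{charge}\t{mh:.5f}")
                out.extend(peaks)
            in_ions = False
        elif in_ions and line.startswith("PEPMASS="):
            # PEPMASS may be followed by an optional precursor intensity.
            mz = float(line.split("=", 1)[1].split()[0])
        elif in_ions and line.startswith("CHARGE="):
            # Assumption: one positive charge state such as "2+".
            charge = int(line.split("=", 1)[1].rstrip("+"))
        elif in_ions and line and line[0].isdigit():
            frag_mz, intensity = line.split()[:2]
            peaks.append(f"{float(frag_mz):.4f} {float(intensity):.1f}")
    return out
```

Each BEGIN IONS/END IONS block becomes one S line, an optional Z line, and its fragment peak list; other MGF keys such as TITLE are ignored in this sketch.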

In addition to extracting the MS or MS/MS data from RAW files generated by Thermo Fisher Scientific instruments, RawConverter addresses problems regarding the selection of the monoisotopic peaks of precursor ions. The inaccurate precursor mass assignment of an MS/MS spectrum significantly increases the probability of introducing false positives in subsequent peptide and protein identifications. Mass spectrometers from certain manufacturers tend to label the peak with the highest intensity in a given isolation window as the precursor peak when spectra are collected using data-dependent acquisition (DDA). Generally, this precursor peak can be considered as the monoisotopic peak for short peptides (with mass equal to or less than 1,500 Da) since their monoisotopic peaks typically have the highest intensities [39]. However, the monoisotopic ions of longer peptides (with mass greater than 1,500 Da) may not be those with the highest intensities [39], making the above straightforward peak picking strategy far from ideal for large peptides. Existing tools available in the ProteoWizard software project [30] and pParse [40] have recently made progress on tackling this issue. The RawConverter algorithm also addresses this problem by combining the Averagine model [41] and the information of successive MS1 scans to select the monoisotopic peak for each precursor ion.
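The mass dependence described above can be reproduced with a short Python sketch based on the Averagine composition. This is a deliberately simplified illustration and not the RawConverter algorithm: it models only 13C (ignoring the heavy isotopes of H, N, O, and S), it does not use successive MS1 scans, and the function names are hypothetical. Because of the carbon-only simplification, the mass at which the monoisotopic peak stops being the most intense is only approximate.

```python
from math import comb

AVERAGINE_MASS = 111.1254    # average residue mass (Da) in the Averagine model
AVERAGINE_CARBONS = 4.9384   # average number of carbon atoms per residue
P_C13 = 0.0107               # natural abundance of 13C
ISOTOPE_SPACING = 1.00335    # mass difference between 13C and 12C (Da)
PROTON_MASS = 1.007276       # Da


def carbon_isotope_pattern(neutral_mass, n_peaks=6):
    """Relative abundances of the first isotopic peaks (13C only)."""
    n_carbons = round(neutral_mass / AVERAGINE_MASS * AVERAGINE_CARBONS)
    return [
        comb(n_carbons, k) * P_C13 ** k * (1 - P_C13) ** (n_carbons - k)
        for k in range(n_peaks)
    ]


def most_intense_isotope(neutral_mass):
    """Index of the most intense isotopic peak (0 = monoisotopic)."""
    pattern = carbon_isotope_pattern(neutral_mass)
    return max(range(len(pattern)), key=pattern.__getitem__)


def guess_monoisotopic_mz(observed_mz, charge):
    """Shift an instrument-reported precursor m/z (assumed to be the most
    intense isotopic peak) back to the expected monoisotopic m/z."""
    neutral_mass = (observed_mz - PROTON_MASS) * charge
    offset = most_intense_isotope(neutral_mass)
    return observed_mz - offset * ISOTOPE_SPACING / charge
```

Under these assumptions, most_intense_isotope(1000.0) returns 0, whereas most_intense_isotope(3000.0) returns 1, which is the situation in which labeling the most intense peak as the precursor assigns a mass that is off by one isotopic spacing.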

Another problem arises in the context of DIA methods. Most MS instruments do not supply precursor m/z and charge information of the peptides in a given wide acquisition window; they merely provide the middle m/z and the size of the isolation windows [18]. Providing more accurate precursor information to database search engines would allow a reduction of the search space and, therefore, an increased discriminative power when scoring peptide-spectrum matches (PSMs). This would benefit the majority of the current tools assessing the confidence of protein and peptide identifications [32, 42, 43]. RawConverter attempts to address this issue by enumerating and evaluating all possible precursor isotopic envelopes in a given DIA isolation window. Users can define the maximum number of precursor isotopic envelopes reported. The most confident envelopes are selected to determine the possible precursor m/z values and charge states. MS/MS spectra can then be duplicated for each possible precursor, thereby allowing each spectrum to be individually searched for peptide identification.
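The enumeration step can be sketched as follows in Python. This is a hypothetical illustration rather than RawConverter's implementation: the scoring (summed intensity of the matched isotopic peaks), the tolerance, the minimum envelope length, and the function name are all assumptions, and a real implementation would also prune envelopes that are subsets of better ones.

```python
ISOTOPE_SPACING = 1.00335  # approximate spacing between isotopic peaks (Da)


def enumerate_envelopes(peaks, window_low, window_high, max_charge=4,
                        tol=0.01, min_peaks=3, max_reported=5):
    """Enumerate candidate precursor envelopes in a DIA isolation window.

    peaks: list of (m/z, intensity) tuples from an MS1 scan.
    Returns up to max_reported (m/z, charge) candidates, best first.
    """
    peaks = sorted(peaks)
    candidates = []
    for mz, intensity in peaks:
        if not window_low <= mz <= window_high:
            continue
        for charge in range(1, max_charge + 1):
            matched = [intensity]
            k = 1
            while True:
                # Expected position of the k-th isotopic peak at this charge.
                target = mz + k * ISOTOPE_SPACING / charge
                hits = [i for m, i in peaks if abs(m - target) <= tol]
                if not hits:
                    break
                matched.append(max(hits))
                k += 1
            if len(matched) >= min_peaks:
                # Assumed score: summed intensity of the matched peaks.
                candidates.append((sum(matched), mz, charge))
    candidates.sort(reverse=True)
    return [(mz, charge) for _, mz, charge in candidates[:max_reported]]
```

The max_reported parameter plays the role of the user-defined maximum number of envelopes; each returned (m/z, charge) pair could then be attached to a copy of the MS/MS spectrum before database searching.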

Peptide and Protein Identification and Statistical Significance Assessment

Once extracted, the MS data output by RawConverter (MS2 file format) is searched by our protein sequence database search engine ProLuCID [44] to identify PSMs. ProLuCID, which is implemented as a platform-independent Java program, is inspired by the SEQUEST algorithm and, therefore, uses a modified cross-correlation (XCorr) calculation to score a spectrum against the theoretical spectrum of a given peptide sequence. ProLuCID supports low- and high-resolution MS data. It also introduces the computation of a binomial probability serving as a preliminary score to pre-filter spectra and to improve computational performance. In addition, it reports a Z-score for the PSM with the highest XCorr for each spectrum, which can be used to filter search results.
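To make the idea of a binomial preliminary score concrete, the Python sketch below computes the probability of matching at least a given number of theoretical fragment ions by chance. The exact formulation used by ProLuCID is not given here, so the per-ion random match probability p_random, the negative log transform, and the function names are assumptions made for illustration.

```python
from math import comb, log10


def binomial_tail(n_theoretical, n_matched, p_random):
    """Probability of matching at least n_matched of n_theoretical
    fragment ions by chance, given a per-ion match probability."""
    return sum(
        comb(n_theoretical, k)
        * p_random ** k
        * (1 - p_random) ** (n_theoretical - k)
        for k in range(n_matched, n_theoretical + 1)
    )


def preliminary_score(n_theoretical, n_matched, p_random):
    """Higher is better: negative log10 of the chance probability."""
    return -log10(binomial_tail(n_theoretical, n_matched, p_random))
```

A score of this kind is inexpensive to compute, so it can be used to discard unlikely candidate peptides before a more costly cross-correlation is calculated for the remaining ones.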

The significance of the set of PSMs produced by most database search engines needs to be statistically assessed by computational tools, such as Percolator [42]. In our pipeline, PSMs found with ProLuCID are statistically assessed using DTASelect [32, 43]. DTASelect also identifies the set of proteins present in the analyzed sample based on ProLuCID’s PSMs. Over the years, DTASelect’s PSM filtering accuracy has been improved. Furthermore, the latest version (DTASelect 2) now outputs the set of spectra, peptides, or proteins identified under a user-selected false discovery rate for a given experiment. Thanks to recent improvements, DTASelect 2 provides great filtering flexibility by allowing users to define up to 150 different parameters. For example, a filter can be set so that the only reported proteins are those identified by a certain number of peptides, at least one of which obtained an XCorr above a given threshold. These new filters allow the generation of high-confidence datasets by increasing the number of protein identifications under a given false discovery rate.
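The example filter mentioned above (a minimum number of peptides per protein, at least one of which exceeds an XCorr threshold) can be expressed in a few lines of Python. This sketch only illustrates the logic; it does not reproduce DTASelect's parameters or file formats, and the function name, input layout, and default thresholds are hypothetical.

```python
from collections import defaultdict


def filter_proteins(psms, min_peptides=2, min_best_xcorr=2.5):
    """Keep proteins supported by enough distinct peptides, at least one
    of which reaches the XCorr threshold.

    psms: iterable of (protein, peptide, xcorr) tuples.
    Returns a sorted list of protein identifiers.
    """
    best_xcorr = defaultdict(dict)  # protein -> {peptide: best XCorr}
    for protein, peptide, xcorr in psms:
        previous = best_xcorr[protein].get(peptide, float("-inf"))
        best_xcorr[protein][peptide] = max(previous, xcorr)
    return sorted(
        protein
        for protein, peptides in best_xcorr.items()
        if len(peptides) >= min_peptides
        and max(peptides.values()) >= min_best_xcorr
    )
```

Repeated PSMs for the same peptide count once, so a protein identified many times through a single peptide does not pass a two-peptide requirement.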

While the vast majority of the samples analyzed by MS/MS in our laboratory involve organisms for which most protein sequences are known, we sometimes process samples originating from organisms for which the genomes are not yet sequenced. This renders a typical sequence database search impossible. For such samples, peptides are usually identified using the de novo sequencing software package pNovo [45].

Peptide and Protein Quantification

Peptide and protein quantification is the next step of the data analysis in several MS-based proteomics pipelines. Computational tools such as ProRata [12], pQuant [13], and MaxQuant [14] have been proposed to calculate accurate proteomics quantitative measurements. We have also previously introduced a software tool named Census [33]. Census is capable of using the vast majority of quantitative MS-based proteomics strategies and can take as its input DTASelect’s output as well as pepXML and mzXML files. Census can quantify peptides and proteins labeled using a variety of labeling strategies (e.g., 15N [46], SILAC [47], iTRAQ [48], TMT [49], dimethyl [50, 51], 18O [52]), as well as label-free strategies for both high- and low-resolution MS data [spectral counting and extracted-ion chromatogram (XIC)-based quantification] (see Figure 2). Census can be used to analyze large-scale quantitative datasets and, when coupled to high-resolution MS data, shows improved quantification efficiency [33]. Among its numerous features, Census quantifies peptides originating from DIA data with high sensitivity by filtering out noisy fragment ions and also builds chromatograms from peak lists to quantify stable isotope-labeled peptides.
Figure 2

Main MS-based quantitative proteomics approaches supported by Census

The recently released Census ver. 2 [53] comprises novel features for the analysis of peptides quantified using tandem mass tag (TMT) reporter ions. Included among these are a reporter ion impurity correction, a reporter ion minimum intensity threshold filter, and an optional weighted normalization that corrects mixing errors. Census 2 can also process MS experiments performed using HCD, CID/HCD double-play, HCD MS3, or MultiNotch MS3 [54, 55] data. Additional features have also been included for quantification using metabolic labeling (e.g., 15N, SILAC). Among these are the calculation of a reverse ratio for the label swap experimental setup [56, 57] and the assessment of the statistical significance of differentially expressed peptides. More recently, improvements were implemented in Census 2 for the calculation of 15N enrichment ratios. Census 2 uses the elemental composition of amino acids to calculate the isotopic distributions of 15N enriched peptides. As 15N labeling shifts the mass of peptides based on the number of nitrogen atoms they contain, Census 2 uses all possible theoretical isotope distributions and maps them to the experimental ones to find the best match using a linear regression. Census 2 computes the atomic enrichment for each peptide independently, as it can vary based on a given protein’s turnover rate.
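The dependence of the 15N mass shift on nitrogen content can be illustrated with the Python sketch below. It is not the Census 2 algorithm: it considers only the number of heavy nitrogen atoms (ignoring the natural isotopes of other elements), and it replaces the linear regression described above with a plain least-squares grid search over candidate enrichment values. All function names are hypothetical.

```python
from math import comb

# Nitrogen atoms per amino acid residue (backbone plus side chain).
NITROGENS = {
    "A": 1, "R": 4, "N": 2, "D": 1, "C": 1, "E": 1, "Q": 2, "G": 1,
    "H": 3, "I": 1, "L": 1, "K": 2, "M": 1, "F": 1, "P": 1, "S": 1,
    "T": 1, "W": 2, "Y": 1, "V": 1,
}
N15_SHIFT = 0.997035  # mass difference between 15N and 14N (Da)


def nitrogen_count(peptide):
    return sum(NITROGENS[aa] for aa in peptide)


def full_label_shift(peptide):
    """Mass shift (Da) of a completely 15N-labeled peptide."""
    return nitrogen_count(peptide) * N15_SHIFT


def heavy_nitrogen_distribution(peptide, enrichment):
    """Probability of carrying k heavy nitrogens, for k = 0..n."""
    n = nitrogen_count(peptide)
    return [
        comb(n, k) * enrichment ** k * (1 - enrichment) ** (n - k)
        for k in range(n + 1)
    ]


def best_enrichment(peptide, observed, grid=None):
    """Enrichment whose theoretical distribution is closest to the
    observed one (sum of squared differences, grid search)."""
    grid = grid or [i / 100 for i in range(50, 101)]

    def squared_error(enrichment):
        theory = heavy_nitrogen_distribution(peptide, enrichment)
        return sum((t - o) ** 2 for t, o in zip(theory, observed))

    return min(grid, key=squared_error)
```

Because the fit is performed per peptide, two peptides from proteins with different turnover rates can be assigned different enrichment values, in line with the per-peptide calculation described above.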

Post-Translational Modifications Discovery and Assessment

While ProLuCID’s database search algorithm allows the identification of post-translationally modified peptides, some proteomics studies might require a more in-depth analysis of certain PTM events. Hence, our computational pipeline also includes the probability-based phosphorylation localization tool Ascore, which was developed by Beausoleil et al. [15]. In addition, we developed an algorithm named Debunker [34] that uses a support vector machine binary classifier to assess the phosphorylation status of a given peptide. Debunker may be used in combination with Ascore to assess the confidence of phosphorylation status and site localization for a given peptide.

Functional Enrichment Analysis

Functional enrichment analyses are often used to generate hypotheses regarding the underlying mechanisms revealed in MS-based proteomics datasets. Such analyses can consist of identifying gene ontology (GO) [58] annotations that are over-represented in a set of differentially expressed proteins or finding biological pathways that are significantly implicated in a set of protein–protein interactions (PPIs). Our computational pipeline currently includes the use of the Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, CA, USA www.qiagen.com/ingenuity) and of DAVID [59] as well as Ontologizer [60] for GO enrichment analyses.
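Over-representation of a GO annotation in a protein set is commonly assessed with a one-sided hypergeometric test, sketched below in Python. This illustrates the general principle only; the tools cited above each apply their own statistics and multiple-testing corrections, which are not reproduced here, and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, study, study_annotated):
    """One-sided hypergeometric p-value.

    population:      number of proteins in the background
    annotated:       background proteins carrying the GO annotation
    study:           number of proteins in the set of interest
    study_annotated: proteins in that set carrying the annotation
    Returns the probability of observing at least study_annotated
    annotated proteins when drawing the set at random.
    """
    upper = min(annotated, study)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, study - k)
        for k in range(study_annotated, upper + 1)
    )
    return favorable / comb(population, study)
```

For example, if 5 of 10 background proteins carry an annotation and all 5 proteins of a study set carry it, the p-value is 1/252. In practice one p-value is computed per annotation and the resulting values are corrected for multiple testing.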

We also recently developed our own functional enrichment analysis algorithm for MS-based proteomics quantification experiments (PSEA-Quant [35]). Enrichment analyses of gene/protein sets, such as those obtained from the GO and the Molecular Signature databases [61], were often developed for genomic datasets [62] and are poorly adapted to some of the particularities of MS-based proteomics data, including the level of reproducibility of MS quantitative measurements. They also tend to require the use of arbitrary thresholds to define the subset of proteins in a dataset on which the enrichment analysis should be performed. PSEA-Quant addresses both of these problems. This web-based user-friendly algorithm, inspired by GSEA [62], entails a novel protein set enrichment analysis for label-free and labeled MS-based quantitative proteomics. Unlike GSEA, PSEA-Quant allows the analysis of proteomics samples originating from single or multiple conditions. This Java program uses Census’ output, while supporting other file formats, to identify protein sets that are significantly enriched among abundant proteins that are quantified with high reproducibility across a set of replicates. PSEA-Quant was used to highlight putative underlying mechanisms of cystic fibrosis [35], among other things.

Algorithms and Tools Currently in Development

This article presents a snapshot of the state of our current computational pipeline. However, our suite of software packages is under constant evolution. It continuously adapts to novel technologies with the goal of producing high impact discoveries. For instance, we are currently developing algorithms for the identification of cross-linked peptides. In addition, we are designing a novel blind search algorithm that allows the identification of peptides with any unspecified PTMs or amino acid substitutions using MS/MS spectra. We also continue to improve our intact protein (top-down) computational analysis pipeline, which includes tools such as ProSightPC (Thermo Fisher Scientific) and ByOnic [63], by developing algorithms that improve protein identification sensitivity and estimate the number of protein species in a dataset analyzed through top-down MS [64].

A recent addition to our computational pipeline is a tool named PrOntoNet, which infers PPIs from a list of identified proteins in organisms for which there are very few known PPIs in the literature and for which the proteins are largely uncharacterized functionally. PrOntoNet uses the output of DTASelect to recover known PPIs from identified proteins in the STRING [65] and BioGRID [66] databases. It then uses blast2GO [67] to associate GO annotations to the proteins in the dataset. PrOntoNet finally uses the proportion of overlap between the GO annotations of all protein pairs in the dataset in order to infer interactions between them.
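The final inference step can be sketched in Python as a pairwise comparison of GO annotation sets. The overlap measure actually used by PrOntoNet is not specified here, so this sketch assumes the Jaccard index (shared annotations divided by all annotations of the pair) and an arbitrary cutoff; the function names and the min_overlap parameter are hypothetical.

```python
from itertools import combinations


def go_overlap(terms_a, terms_b):
    """Proportion of GO annotations shared by two proteins (Jaccard)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def infer_interactions(annotations, min_overlap=0.5):
    """Return (protein_a, protein_b, overlap) for every protein pair
    whose GO annotation overlap reaches min_overlap.

    annotations: dict mapping protein identifier -> set of GO terms.
    """
    inferred = []
    for a, b in combinations(sorted(annotations), 2):
        overlap = go_overlap(annotations[a], annotations[b])
        if overlap >= min_overlap:
            inferred.append((a, b, overlap))
    return inferred
```

Proteins without any annotation never pass the cutoff in this sketch, which is why a GO annotation step is needed beforehand for functionally uncharacterized proteins.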

The tools described above provide a large amount of information on the PSMs and the proteins identified in a given proteomics dataset. The results from several bioinformatics analyses and data from public external resources and databases are typically processed and combined in order to convey to the scientific community a comprehensive and meaningful message about a proteomics study. This task often involves the manual integration of in-house software and script results, as well as data from spreadsheets, into customized summary tables that are later published in an article. The large-scale datasets and results produced with MS-based proteomics pose important challenges in terms of long-term storage and data reusability. Reproducing the creation process of complex summary tables can be difficult after a long period of time, especially when attempted by someone other than the original author. Databases such as PRIDE [68], MassIVE (massive.ucsd.edu), and PeptideAtlas [69], which are all included in the ProteomeXchange consortium [70], allow the long-term storage and distribution of data. However, for the most part they do not permit the full integration of custom computational and statistical significance results. We are currently tackling both of these challenges simultaneously by designing a software package named Proteomics INTegrator (PINT). PINT is an open-source platform-independent Java program built on top of a MySQL database engine that provides an environment to store data and integrate results from various computational analyses (including ProLuCID, DTASelect, and Census), which may originate from different proteomics approaches. The protein annotations available in the UniProtKB repository [71] are automatically appended to all datasets stored in PINT. PINT allows users to easily import, visualize, and download data through a user-friendly web interface, thereby facilitating its dissemination and accessibility. In addition, PINT provides a powerful and flexible query system allowing the retrieval of specific elements and values (e.g., quantification measurements, confidence scores, statistical significance results, manual annotations, UniProtKB protein annotations) in the data of a given experiment or across several experiments simultaneously.

Conclusion

Our computational pipeline has come a long way since the publication of the SEQUEST algorithm. We strive to continuously produce and use novel algorithms that take advantage of new MS technological improvements. We strongly believe that computational approaches that aim to make the most out of cutting-edge technologies will continue to yield high impact biological discoveries.

Acknowledgments

The authors are grateful to Claire M. Delahunty for helpful discussions and comments. They acknowledge funding from the following National Institutes of Health grants: P41 GM103533, R01 MH067880, R01 MH100175, UCLA/NHLBI Proteomics Centers (HHSN268201000035C), and 1U54GM114833. M.L.A. holds a postdoctoral fellowship from the Fonds de recherche du Québec – nature et technologies (FRQNT).

Copyright information

© American Society for Mass Spectrometry 2015

Authors and Affiliations

  • Mathieu Lavallée-Adam (1)
  • Sung Kyu Robin Park (1)
  • Salvador Martínez-Bartolomé (1)
  • Lin He (1)
  • John R. Yates III (2)

  1. Department of Chemical Physiology, The Scripps Research Institute, La Jolla, USA
  2. Department of Chemical Physiology and Molecular and Cellular Neurobiology, The Scripps Research Institute, La Jolla, USA
