Introduction

Targeted proteomics experiments focus on the measurement of an explicit subset of peptides instead of the broader measurements employed in global proteomics experiments [1]. The most common method for targeted proteomics is selected reaction monitoring (SRM), where peptides are detected and measured via the precursor m/z and a limited set of fragment ions. Parallel reaction monitoring (PRM), a high resolution analog of SRM in which all fragment ions are measured simultaneously, has recently become popular. Unlike in SRM, the fragment ions used for quantification in PRM are chosen post hoc, based upon relative intensity and lack of interferences. Designing a targeted proteomics experiment, that is, identifying the most responsive peptides and transitions for a given protein and sample of interest, is time-consuming, and numerous publications have reported important considerations for assay design [2–4]. Most of these methods use previously identified peptide/spectrum matches, both to determine which peptides are most reliably observed for a given protein and to identify highly abundant fragment ion peaks. Although several large repositories store tandem mass spectrometry data for proteomics [5–7], they are typically built from a mass spectrometry perspective, without higher-order biological context.

Data-independent acquisition (DIA) is an untargeted data acquisition method [8]. For DIA, an MS scan for the entire mass range of interest is followed by a series of MS/MS scans that fragment all peptides within a specific m/z window (typically 10–25 Da). Successive MS/MS spectra tile consecutive m/z windows to cover the entire range. Thus, all ions are fragmented and measured. To analyze the data, a spectral library is leveraged to generate extracted ion chromatograms for m/z values corresponding to parent and fragment ions of expected peptides. Therefore, DIA data files provide a digital fingerprint for samples, allowing for repeated hypothesis-driven querying of data without requiring reacquisition of MS data.
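The window tiling described above can be illustrated with a short sketch. This is not taken from any instrument method or from Skyline; the function names and the example mass range are hypothetical, and real acquisition schemes may use overlapping or variable-width windows.

```python
def tile_isolation_windows(mz_start, mz_end, width):
    """Return consecutive (low, high) m/z windows that cover [mz_start, mz_end]."""
    if width <= 0 or mz_end <= mz_start:
        raise ValueError("width must be positive and mz_end must exceed mz_start")
    windows = []
    low = mz_start
    while low < mz_end:
        # The last window is clipped so the tiling never runs past mz_end.
        high = min(low + width, mz_end)
        windows.append((low, high))
        low = high
    return windows


def window_for_precursor(windows, precursor_mz):
    """Return the index of the window whose MS/MS scan fragments this precursor."""
    for index, (low, high) in enumerate(windows):
        if low <= precursor_mz < high:
            return index
    return None  # precursor lies outside the tiled range


# Hypothetical example: a 400-500 m/z range tiled with 25 Da windows.
example_windows = tile_isolation_windows(400, 500, 25)
```

Because every precursor in the tiled range falls into exactly one window, any peptide in a spectral library can later be queried against the same data file by looking up its window and extracting its fragment ion chromatograms.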

Skyline is a robust and powerful software tool for analyzing mass spectrometry data. Although originally designed for targeted proteomics measurements [9], it has expanded to become a general tool box for SRM, PRM [10], data dependent acquisition (DDA) [11], and DIA [12, 13] mass spectrometry data. An important feature of Skyline is the intuitive visualization of results to assist users in manual data curation. These key strengths have led to the broad adoption of Skyline within the community. In both SRM/PRM assay design and DIA data analysis, users benefit greatly from working in Skyline with a spectral library. Currently, however, users must manually enter peptide and protein data and upload a spectral library. The PNNL Biodiversity Plugin provides an intuitive, pathway-oriented interface for loading these data into Skyline. In addition to the vast spectral libraries for >100 organisms, users can point the plugin to additional libraries and use the same pathway navigation to push their own personal data into Skyline.

Experimental

Methods

Software and Availability

The PNNL Biodiversity Plugin is written in C# and targets the .NET Framework 4.5.2. The software is open source and is available at our group’s GitHub account, https://github.com/PNNL-Comp-Mass-Spec/PNNL-Biodiversity-Library-Plugin. The plugin executable is available through Skyline as an external tool and can be found through the tool store or online at https://skyline.gs.washington.edu/. Due to changes in the Skyline API for external tools, the PNNL Biodiversity Library plugin does not work with Windows XP. The version of Skyline used with the plugin must be version 3.1.1.7490 or greater.

Kyoto Encyclopedia of Genes and Genomes (KEGG) Mapping

The mapping of protein identifications from the PNNL Biodiversity Library and the Draft Human Proteome is as previously described [14]. This mapping is kept within the plugin as a SQLite file noting observed proteins. Mapping of user uploaded data is done as follows. For uploads of type ‘supplement’ or ‘replace’, the Uniprot identifiers present in the Bibliospec library uploaded by the user are indexed by updating the SQLite file. For adding a new organism, we first obtain all the genes for a specific organism using KEGG’s REST API (http://rest.kegg.jp/link/ko/<org_code>). We then update the SQLite tables with this information and the protein identifiers from the Bibliospec library.
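As an illustration of the new-organism step, the sketch below retrieves and parses the gene-to-ortholog table returned by the KEGG link endpoint quoted above. It is a minimal Python sketch rather than the plugin's C# code; the function names are hypothetical, and it assumes the response is plain text with one tab-separated gene/ortholog pair per line. The sample rows are illustrative only.

```python
from collections import defaultdict
from urllib.request import urlopen

# URL pattern quoted in the text; org_code is a KEGG organism code.
KEGG_LINK_URL = "http://rest.kegg.jp/link/ko/{org_code}"


def fetch_kegg_links(org_code):
    """Download the gene-to-KO table for one organism (requires network access)."""
    with urlopen(KEGG_LINK_URL.format(org_code=org_code)) as response:
        return response.read().decode("utf-8")


def parse_kegg_links(text):
    """Parse 'gene<TAB>ko:Kxxxxx' lines into {gene_id: [ko_id, ...]}."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        gene_id, ko_entry = line.split("\t")
        # Drop the 'ko:' prefix so only the ortholog identifier is stored.
        mapping[gene_id].append(ko_entry.split(":", 1)[-1])
    return dict(mapping)


# Illustrative rows in the assumed response format (not fetched from KEGG).
sample_response = "eco:b0002\tko:K12524\neco:b0003\tko:K00872\n"
sample_mapping = parse_kegg_links(sample_response)
```

A gene can map to more than one ortholog, which is why each value is a list. The parsed mapping would then be written to the SQLite tables together with the protein identifiers from the Bibliospec library.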

Spectral Data

All spectral data presented by the plugin and transferred to Skyline are stored in Bibliospec libraries. There are two sources of data: the PNNL Biodiversity Library and the Pandey Lab Draft Human Proteome. All PNNL data were aggregated and searched as described [14], and are posted at the UCSD server MassIVE, http://massive.ucsd.edu/, ProteomeXchange identifier PXD001860, MassIVE identifier MSV000079053. Data for each organism are stored separately and are provided in raw, mzML, mzIdentML, and Bibliospec formats.

The human data available in the plugin are from the Pandey Lab Draft Human Proteome. We downloaded the raw spectral files from ProteomeXchange (data set identifier PXD000561). These RAW files were converted to mzML and searched with MSGF+ using the following parameters: semi-tryptic protease specificity, 10 ppm parent ion tolerance, HCD fragmentation type, cysteine alkylation as a static modification, and oxidized methionine as a dynamic modification. The database for the search was the human reference proteome from Uniprot downloaded on May 20, 2015. The resulting .mzIdentML files were converted to the Bibliospec library format using the following command line parameters:

  • ~>BlibBuild.exe -c 0.9999 C:\path_to\file1.mzid Library.blib

  • ~>BlibBuild.exe -c 0.9999 C:\path_to\file2.mzid Library.blib

  • #as many BlibBuild commands as there are identification files

  • ~>BlibFilter.exe -b 1 Library.blib FinalLibrary.blib

Because of the interest in proteomes of specific human tissues, we have included with the plugin a separate bibliospec library for each of the 30 tissues studied in the Pandey Lab manuscript. In addition, we have created a single library for ‘Homo sapiens all tissues’.

User uploaded data is taken as a Bibliospec file. It is important to use the version of Bibliospec that embeds protein identifications within the file. We suggest the following command line for file creation:

  • ~>BlibBuild.exe -c 0.9999 C:\path_to\file1.mzid Library.blib

  • ~>BlibBuild.exe -c 0.9999 C:\path_to\file2.mzid Library.blib

  • #as many BlibBuild commands as there are identification files

  • ~>BlibFilter.exe -b 1 Library.blib FinalLibrary.blib
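When many identification files are involved, the command sequence above can be generated programmatically. The following Python sketch only assembles the argument lists; the function name is hypothetical, and the only flags used are the two shown in the commands above (-c for BlibBuild and -b for BlibFilter). Running the commands would additionally require the Bibliospec executables to be on the system path.

```python
def build_library_commands(mzid_files, cutoff=0.9999,
                           redundant_library="Library.blib",
                           final_library="FinalLibrary.blib"):
    """Return one BlibBuild command per identification file, then one BlibFilter command."""
    commands = [
        ["BlibBuild.exe", "-c", str(cutoff), path, redundant_library]
        for path in mzid_files
    ]
    # BlibFilter is run once, after every identification file has been added.
    commands.append(["BlibFilter.exe", "-b", "1", redundant_library, final_library])
    return commands


# Hypothetical input paths, mirroring the placeholders used in the text.
example_commands = build_library_commands(
    [r"C:\path_to\file1.mzid", r"C:\path_to\file2.mzid"]
)
```

Each inner list could be passed to subprocess.run to execute the step; keeping command construction separate from execution makes the sequence easy to inspect before it is run.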

The plugin does no additional filtering of this file. While data from the PNNL Biodiversity Library have been filtered to q < 0.0001, users are free to upload any quality of data and are responsible for understanding the limitations of their data quality. The plugin only cross-references protein identifications with KEGG orthologs for convenient mapping on the interface.

FASTA File Creation

The Skyline program organizes peptides around their parent protein, as specified in a protein sequence. To obtain a user’s specific subset of proteins, we dynamically create a FASTA file via a web API from uniprot.org. This customized FASTA file is transferred to Skyline utilizing the Tool Client object API.
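The request format used by the plugin is not detailed here, so the sketch below illustrates only the final assembly step: given FASTA text that has already been retrieved, keep the records for the selected accessions and write them out as one custom file. The function names are hypothetical, the example accessions and sequences are invented, and headers are assumed to follow the Uniprot style in which the accession is the second pipe-delimited field.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession_of(header):
    """Extract the accession from a 'db|ACCESSION|NAME description' header."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header.split()[0]


def subset_fasta(text, wanted_accessions, line_width=60):
    """Return FASTA text containing only the records for the wanted accessions."""
    lines = []
    for header, sequence in parse_fasta(text):
        if accession_of(header) in wanted_accessions:
            lines.append(">" + header)
            lines.extend(sequence[i:i + line_width]
                         for i in range(0, len(sequence), line_width))
    return "\n".join(lines) + ("\n" if lines else "")


# Invented records used purely for illustration.
sample_fasta = ">sp|P00001|A_TEST first\nMKTAYIAK\nQR\n>sp|P00002|B_TEST second\nMKV\n"
```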

Proteotypic Peptide Viewer

This software program is also available at our group’s GitHub account and at the Skyline tool store. After users have chosen proteins and organisms, we dynamically create a FASTA file as described above and use the EBI web-API for the MUSCLE alignment program. Peptides are then mapped onto the aligned sequences and displayed.
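Mapping a peptide onto an aligned sequence amounts to converting a position in the ungapped protein into a column of the gapped alignment. The sketch below shows one way to do this in Python; it is not the viewer's own code, the function name is hypothetical, and it assumes gaps are written as '-' and that the first exact match of the peptide is the one of interest.

```python
def map_peptide_to_alignment(aligned_sequence, peptide, gap="-"):
    """Return (start, end) alignment columns spanned by the peptide, or None.

    Columns are 0-based and the end column is exclusive. The span may include
    gap columns if the alignment inserted gaps inside the peptide.
    """
    if not peptide:
        return None
    # Alignment column of every residue, in ungapped order.
    residue_columns = [i for i, ch in enumerate(aligned_sequence) if ch != gap]
    ungapped = aligned_sequence.replace(gap, "")
    position = ungapped.find(peptide)
    if position == -1:
        return None  # peptide does not occur in this protein
    first = residue_columns[position]
    last = residue_columns[position + len(peptide) - 1]
    return first, last + 1


# Invented aligned sequence with a two-column gap after the second residue.
example_span = map_peptide_to_alignment("MK--TAYIAK", "TAYIAK")
```

Expressing every peptide in alignment columns puts observations from different organisms in a shared coordinate system, which is what allows them to be displayed together.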

Results and Discussion

The primary goal of the Biodiversity plugin is to help users get MS/MS data into Skyline for review and use during targeted proteomics assay design or hypothesis-driven DIA data analysis. There are numerous public resources with proteomics data, but they are not indexed in an easily navigable interface. To provide a more intuitive navigation, we created a pathway-oriented interface that allows users to think first and foremost about the biological question at hand and be presented with relevant resources. We seeded the plugin with data from the PNNL Biodiversity Library, containing 3 million peptides from 230,000 proteins in 112 bacteria and archaea [14]. Numerous biomedical and environmental organisms are contained in the library, including most model microbial organisms. Additionally, we included the draft human proteome, which has spectra collected from 30 different tissues [15]. Thus, the plugin comes prepackaged with an incredible breadth of publicly available data.

Pathway-Centric Data Browsing

The plugin walks users through data selection in a wizard to promote intuitive and efficient browsing (Figure 1). The plugin is designed around the concept of biological pathways and networks. Since most biological investigations focus on cellular functions that are typically described and categorized as pathways, it is a natural entry point for researchers when they begin to set up their assay. The first step is to select an organism. The second tab assists users in their selection of pathway(s) of interest. Pathways are listed in categories as organized by KEGG, which contains hundreds of pathways annotated across thousands of organisms, including pathways for metabolism, signaling, regulation, and disease-related processes [16]. The pathway coverage, calculated as the number of proteins in a pathway for which proteomics data is present in the Library, is provided alongside the name to help indicate the depth of proteomics data for a specific pathway.
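The coverage value described above is a simple set intersection. A minimal Python sketch follows, with a hypothetical function name and invented identifiers; the plugin itself computes this from its SQLite tables.

```python
def pathway_coverage(pathway_proteins, observed_proteins):
    """Return (observed_count, total_count, fraction) for one pathway.

    pathway_proteins: identifiers annotated to the pathway for the organism.
    observed_proteins: identifiers with proteomics data in the library.
    """
    pathway = set(pathway_proteins)
    observed = pathway & set(observed_proteins)
    fraction = len(observed) / len(pathway) if pathway else 0.0
    return len(observed), len(pathway), fraction


# Invented identifiers: two of the four pathway proteins have library data.
example_coverage = pathway_coverage(["K1", "K2", "K3", "K4"], ["K2", "K4", "K9"])
```

Proteins observed in the library but not annotated to the pathway (K9 in this example) do not contribute to the count.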

Figure 1

Biodiversity Library Plugin. The wizard interface of the plugin walks users through the steps of choosing data and importing it into Skyline

The pathway visualization tab puts the peptide and protein data on top of KEGG pathway maps. This dynamically created image quickly conveys the proteomic coverage of specific proteins in the pathway (Figure 2). The goal of the plugin is to rapidly get users to this view, where they can determine the extent of proteome coverage for their biological question of choice. Here users can deselect individual proteins in the pathway image to remove them from consideration.

Figure 2

Pathway display. Pathway images put the proteomics data in the library into biological context. Proteins for which there exists data in the library are shown in red. Proteins that are annotated in the organism’s genome but lack proteomics data are shown in blue. Proteins that are not annotated in the organism’s genome are left white. All annotations come from KEGG

The Review and Export tab gives a quick overview of the selected data, listing the proteins in specified pathways along with the accessions and identifiers. When the user confirms this selection, the plugin loads the spectral library into Skyline and creates a custom FASTA file corresponding to the selected proteins. In Skyline, users can then browse spectra and identify the best transitions for SRM assay design or use the Bibliospec library for DIA data analysis.

Using Personal Data

By uploading personal data into the plugin, users have the ability to use the same pathway-oriented interface to load their data directly into Skyline. Given a properly formatted Bibliospec file using Uniprot identifiers, any local data can leverage the tool and be imported into Skyline. There are three options for customization: to replace data, to supplement data for an existing organism, or to add an entirely new organism. When replacing, all data for an existing organism will be removed and replaced with the custom data provided by the user. This would be appropriate if, for example, the user needed data from a different instrument type than is currently available in the library. In addition, using a local spectral library provides the retention time information that is critical for DIA data analysis. For supplementing, the data provided by the Biodiversity Plugin are combined with the custom data that the user provides. Supplementing is an appropriate choice when users have an experimental condition or sample with proteins that were not seen in the library. The final option creates a new organism entry populated with the data provided by the user. The plugin wizard walks through these options, verifies the input data, and gives an overview of the changes that will take place. Once the user confirms the changes, their personal copy of the plugin database will be updated. Any changes made will persist on the user’s machine only, ensuring that other users’ data are not modified.
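The three customization options differ only in how the uploaded identifiers are combined with what is already stored. The Python sketch below models the database as a dictionary of sets purely for illustration; the function name and the mode label 'add' are hypothetical, and the plugin itself works on its SQLite file.

```python
def apply_upload(database, organism, uploaded_proteins, mode):
    """Return a new {organism: set_of_protein_ids} mapping after one upload.

    mode is 'replace', 'supplement', or 'add' (new organism).
    The input database is left unmodified.
    """
    updated = {name: set(ids) for name, ids in database.items()}
    uploaded = set(uploaded_proteins)
    if mode == "replace":
        if organism not in updated:
            raise KeyError("cannot replace data for an unknown organism")
        updated[organism] = uploaded  # existing data are discarded
    elif mode == "supplement":
        if organism not in updated:
            raise KeyError("cannot supplement an unknown organism")
        updated[organism] |= uploaded  # existing and custom data are combined
    elif mode == "add":
        if organism in updated:
            raise ValueError("organism already exists; use replace or supplement")
        updated[organism] = uploaded
    else:
        raise ValueError("unknown mode: " + repr(mode))
    return updated


# Invented starting state with a single organism.
library = {"E. coli": {"P1", "P2"}}
```

Returning a modified copy rather than changing the input mirrors the confirm-before-commit behaviour of the wizard: the proposed state can be reviewed before it replaces the stored one.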

Hypothesis-Driven Analysis of DIA Data

Many investigators are interested in designing targeted assays for specific biochemical or disease pathways. However, SRM and PRM assays can be limited in terms of the number of proteins/peptides that can be analyzed in a single injection. This number varies with the scan speed of the instrument and with the ability to perform scheduled analysis for peptides of interest. In addition, if the analysis of data from a targeted experiment implicates other proteins/pathways, additional injections of the same sample are required in order to quantify new proteins of interest. DIA does not suffer from these limitations, since all peptide species are fragmented in a single injection and the data can therefore be repeatedly inspected to test new hypotheses. The Biodiversity Plugin can be very useful in facilitating repeated, hypothesis-driven interrogation of DIA data files. To accomplish this analysis, users simply select a different set of proteins with the plugin and re-import them into Skyline. In those cases when follow-up studies are needed in order to analyze a greater number of samples with the higher sensitivity provided by SRM, the DIA data can be used to facilitate SRM experimental design [17].

For this type of analysis to be most effective, a custom in-house spectral library is necessary in order to provide retention time information for peptides corresponding to proteins of interest. However, the iRT algorithm [18], using synthetic or common internal [19] retention time standard peptides, can provide a mechanism for inclusion of peptides from the external spectral library by estimating a predicted retention time window. This approach could be valuable in those instances where additional peptides are needed for a protein of interest, or when proteins of interest are not represented in the user’s spectral library.
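The idea behind retention time calibration with standard peptides can be sketched as a linear fit. The Python example below is a simplified illustration and not the implementation in Skyline or in reference [18]: it assumes a linear relationship between library retention indices and observed retention times, and the function names, the example values, and the window half-width are all hypothetical.

```python
def fit_retention_calibration(index_values, observed_times):
    """Least-squares line mapping retention indices of standards to observed times."""
    n = len(index_values)
    if n < 2 or n != len(observed_times):
        raise ValueError("need at least two matched standards")
    mean_x = sum(index_values) / n
    mean_y = sum(observed_times) / n
    sxx = sum((x - mean_x) ** 2 for x in index_values)
    if sxx == 0:
        raise ValueError("standards must span more than one index value")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(index_values, observed_times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predicted_window(index_value, slope, intercept, half_width=2.5):
    """Predicted (start, end) retention time window for one library peptide."""
    center = slope * index_value + intercept
    return center - half_width, center + half_width


# Invented standards: retention indices and observed times in minutes.
slope, intercept = fit_retention_calibration([0, 50, 100], [10, 35, 60])
```

Once the line is fitted from the standards in a run, any peptide in an external library that carries a retention index can be assigned a predicted window in that run, even if it was never observed locally.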

Identifying Proteotypic Peptides Across Species

Given the wide diversity of organisms used in various biological experiments, it is impossible to have available mass spectrometry data for each. Biomedical and environmental research frequently identifies novel strains or species. To assist such investigations in understanding what peptides are likely to be observed in proteomics data of a new organism, we have created an additional plugin, the Proteotypic Peptide Viewer, which helps compare a given protein sequence with the data present in the Biodiversity Library. The plugin aligns homologous proteins from the Biodiversity Library with a user specified protein and then displays the observed peptides (Figure 3).

Figure 3

Peptide homology. Using the Proteotypic Peptide Viewer plugin, users can view peptides observed from a single protein across several organisms. In the image, the aconitate hydratase protein is shown for nine alpha proteobacteria. Eight organisms have proteomics data from the Biodiversity Library. The Wolbachia protein was uploaded separately for comparison

One use case for this viewer is to help identify peptides likely to be observed in a sample consisting of a multi-organism consortium. In such a metaproteomics experiment, the exact species present in the consortium may not be known beforehand. In this scenario, one may wish to use a marker protein that is reliably observed in many organisms and build a targeted assay to determine the presence/abundance of an organism of interest. The proteotypic peptide viewer helps to facilitate this design by showing which regions of a protein are reliably observed as peptides across taxa.

Conclusions

For mass spectrometry to be broadly adopted in biological laboratories as a routine experimental protocol, we must improve the ease of experimental design and the utility of the results. Targeted proteomics is an appealing approach for proteomic characterization, from the standpoint of always returning a quantified abundance for every requested target. This is in sharp contrast to global or shotgun proteomics, which all too often fails to quantify one or more of the proteins that are critical to the hypotheses under investigation. One of the significant drawbacks to SRM-based proteomics is the complexity and time involved in designing assays before running the experiment. For non-mass spectrometry experts, this barrier is still too high. We have attempted to address this usability problem with the PNNL Biodiversity Plugin for Skyline, which summarizes available mass spectrometry data in a familiar pathway-centric visualization and seamlessly interacts with Skyline so that users can approach the data in the same way that they approach the biological experiments.