Introduction

Peptidomics is defined as the comprehensive characterization of “native” peptides in a biological sample [1]. Because proteins are not digested into peptides with trypsin or other proteases, as they are in conventional bottom-up proteomics, peptidomics preserves the endogenous information carried by the peptides in a biological sample, including post-translational modifications (PTMs) and proteolytic products that reveal which natural proteases participated in the proteolytic processes [2]. In contrast to the earlier, more limited studies of small neuropeptides [3–6], broad and large-scale studies are increasingly being conducted to characterize endogenous peptides in different biological samples, including cell lines [7, 8], body fluids [9, 10], and tissues [11], for biomarker screening or clinically related studies.

The peptidome coverage and quantification performance of a peptidomics pipeline depend on every step of the pipeline, including peptide extraction and/or enrichment/fractionation, separation, LC-MS data acquisition, and the subsequent informatics analysis [12]. Despite many recent advances in both sample preparation [13, 14] and instrumentation [15], very limited progress has been made toward improved data analysis for peptidomics studies. Many proteomics software tools, such as MS-GF+ [16], SEQUEST [17], Mascot [18], and MS-Align+ [19], have been applied directly to peptidomics studies, as reviewed recently [20]. However, to the best of our knowledge, no detailed study has been reported that compares the performance of these proteomics software tools for peptide identification in peptidomics analysis.

Different stable isotope labeling strategies such as 18O-labeling [21] and other chemical labeling [8] have been applied for quantification of peptidomic changes. Although the labeling approaches can provide accurate quantification, they are typically associated with increased cost, sample loss, and increased sample processing time. In contrast, label-free quantification, a simple yet effective method, can be highly reliable in well-controlled experiments [22]. In addition, there are also well-established methods, such as the accurate mass and time (AMT) tag approach [23], which provide both improved measurement throughput and reliable label-free quantification. More recently, we have developed a new software tool, informed quantification (IQ), which capitalizes on peptide LC elution and high-accuracy mass information, and is capable of accurate de-isotoping, peak matching, as well as label-free quantification, independent of MS/MS data [11]. Such MS/MS-independent proteomics analysis strategies have the potential of providing both increased coverage and reliable quantification in large-scale peptidomics studies.

In this study, we applied both the AMT tag and the augmented IQ informatics pipelines to data sets from our recent peptidomics study on potential ischemia effects in ovarian cancer tumors [11], and evaluated their performance in greater detail. Both the AMT tag and IQ analyses use a database consisting of peptides identified by conventional database searching of the MS/MS data from each individual peptidomics data set in the study. The study included an evaluation of different search engines, including MS-GF+, SEQUEST, and MS-Align+, for their effectiveness in identifying peptidome peptides. The results showed that MS-GF+ identified many more unique peptidome peptides than SEQUEST and MS-Align+. Both the AMT tag and IQ approaches provided more unique peptide identifications per data set than the database searching methods, which greatly reduced missing data across the entire set of analyses. In addition to the good correlation observed between the AMT tag and IQ quantification results, IQ provided slightly higher peptidome coverage and less missing data than the AMT tag approach. Taken together, our results demonstrate that integrating MS-GF+ database searching with IQ analysis for label-free quantification drastically improves peptidome coverage, reduces missing data, and represents an optimized informatics pipeline for large-scale, comprehensive, and quantitative peptidomics analysis.

Experimental

Tumor Samples, Peptidomics Sample Preparation, and LC-MS/MS Analysis

The ovarian tumor samples, sample preparation methods, and LC-MS/MS instrument analysis methods used for generating the peptidomics data sets have been described in detail previously [11]. Briefly, tumor tissues collected from three patients with high-grade serous ovarian carcinoma (A, B, and C) were rapidly dissected into four contiguous, adjacent specimen strips, which were placed into cryovials and frozen in liquid nitrogen at four different time points (0, 5, 30, and 60 min at room temperature). The ovarian cancer tumor samples were further processed by cryopulverization and acid extraction (using 0.25% acetic acid and a protease inhibitor cocktail) to obtain peptidome peptides, followed by LC-MS/MS analysis using a nanoACQUITY UPLC system (Waters Corporation, Milford, MA, USA) coupled on-line to an LTQ Orbitrap Velos mass spectrometer (Thermo Fisher Scientific, Waltham, MA, USA). A 110 cm × 75 μm i.d. fused-silica capillary column packed with 3 μm Jupiter C18 bonded particles (Phenomenex, Torrance, CA, USA) was used at a flow rate of 200 nL/min for analysis of the ovarian tumor samples. The mobile phases, 0.1% formic acid in water (A) and 0.1% formic acid in acetonitrile (B), were run with the following effective gradient profile (min:%B): 0:1, 6:8, 60:12, 225:35, 291:45, 300:95. The LTQ Orbitrap Velos mass spectrometer was operated in data-dependent mode, acquiring high-resolution CID scans (R = 15,000, 5 × 10^4 target ions) after each full MS scan (R = 60,000, 1 × 10^6 target ions) for the six most abundant ions within the mass range of 400 to 2000 m/z. An isolation window of 2 Th and a normalized collision energy of 35 were used for CID. The dynamic exclusion time was 60 s.

Mass Spectrometry Data Analysis

The resulting MS data were first processed with DTARefinery [24] to correct the overall mass measurement deviation before database searching. The corrected spectra were then searched against human protein sequences from UniProt (UniProt Knowledgebase release 2013_09) using MS-GF+ [16] with the following parameters: no enzyme digestion, precursor mass tolerance of 50 ppm, methionine oxidation as a variable modification, and a target-decoy strategy for false discovery rate (FDR) calculation. The database search results were finally filtered to a spectrum-level FDR of less than 1% and a precursor mass error of less than 10 ppm, and only these confidently identified peptides were kept for the next step of the analysis.
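The two filters described above can be illustrated with a minimal sketch. It is not the MS-GF+ implementation: the dictionary keys, the function names, and the convention that a higher score is better are all assumptions made for illustration, and the FDR is estimated simply as the ratio of decoy to target matches above a score threshold.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed precursor mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def filter_psms(psms, max_fdr=0.01, max_ppm=10.0):
    """Keep target matches that pass a spectrum-level FDR and a ppm cutoff.

    Each match is a dict with hypothetical keys: 'score' (assumed here to be
    higher-is-better), 'is_decoy', 'observed_mass', 'theoretical_mass'.
    """
    ranked = sorted(psms, key=lambda p: p["score"], reverse=True)
    targets = decoys = 0
    n_accepted = 0  # size of the largest top-ranked list with FDR <= max_fdr
    for i, psm in enumerate(ranked, start=1):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        # Target-decoy estimate of the FDR among the top i matches.
        if targets > 0 and decoys / targets <= max_fdr:
            n_accepted = i
    kept = []
    for psm in ranked[:n_accepted]:
        if psm["is_decoy"]:
            continue
        error = ppm_error(psm["observed_mass"], psm["theoretical_mass"])
        if abs(error) <= max_ppm:
            kept.append(psm)
    return kept
```

Search engines that report a lower-is-better statistic would need the sort order reversed; the thresholds default to the 1% FDR and 10 ppm values quoted in the text.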

After the initial database searching, the data sets were further analyzed using the AMT tag approach developed at Pacific Northwest National Laboratory (PNNL) [25]. Confidently identified peptides from all the individual data sets were assembled into an AMT tag database containing both the theoretical masses (calculated from the peptide sequences) and the LC elution times of those peptides. LC-MS features in each individual data set were then matched against the AMT tags in the database for peptide identification using VIPER [26], with a mass error tolerance of <10 ppm and a normalized elution time (NET) error tolerance of <2.5%. The AMT tag matching results were further filtered by requiring a uniqueness probability (UP) higher than 0.5, to reduce ambiguous matches to multiple AMT tags, and finally by Statistical Tools for AMT tag Confidence (STAC) value filtering to ensure an FDR of less than 2.5% [27]. Integrated MS peak area was used to derive changes in abundance for the peptide identifications.
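The mass and NET matching step can be sketched as follows. This is an illustration only and not VIPER's algorithm: the function name and data layout are hypothetical, NET is assumed to be scaled from 0 to 1 so that a 2.5% tolerance corresponds to 0.025, and the subsequent UP and STAC filters are not implemented.

```python
def match_features(features, amt_tags, ppm_tol=10.0, net_tol=0.025):
    """Match LC-MS features to AMT tags by mass and normalized elution time.

    features: list of (monoisotopic_mass, net) tuples observed in one run.
    amt_tags: dict mapping a peptide identifier to (theoretical_mass, net).
    Returns {feature_index: [peptide identifiers within both tolerances]}.
    """
    matches = {}
    for index, (mass, net) in enumerate(features):
        hits = []
        for peptide, (tag_mass, tag_net) in amt_tags.items():
            mass_error_ppm = abs(mass - tag_mass) / tag_mass * 1e6
            net_error = abs(net - tag_net)
            if mass_error_ppm < ppm_tol and net_error < net_tol:
                hits.append(peptide)
        if hits:
            matches[index] = hits
    return matches
```

A feature that matches more than one tag appears with several identifiers in its list; such ambiguous cases are what a uniqueness filter like the UP threshold described above is meant to remove.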

The MS data sets were also analyzed using the IQ approach recently developed in-house [11], a derivative, improved approach that provides better de-isotoping and peak selection. Briefly, in IQ the m/z values of the theoretical isotopic profile (derived from the peptide sequences included in the AMT tag database) are used to guide the extraction of the observed isotopic profile from the summed mass spectra. Least-squares fitting of the theoretical isotopic profile to the observed profile is then performed [28], providing a measure of how well the observed isotopic profile matches the theoretical one. This measure is called the “fit score” and is a key metric for distinguishing correct from incorrect features. A key step in IQ, as in the AMT tag approach, is the alignment of observed masses and LC elution times to database values in order to correct for variations in mass and elution time measurements across multiple data sets. This alignment makes it possible to narrow both the mass tolerance used in generating extracted ion chromatograms (XICs) and the elution time window used for selecting the correct chromatographic peak. Currently, VIPER is also used in a first-pass analysis to output mass and NET alignment information, which is then loaded into IQ and used for mass and NET correction during subsequent processing. Data processed by the IQ approach were initially filtered by fit score (<0.1), NET tolerance (<2.5%), and mass accuracy (<10 ppm), followed by manual validation to eliminate false positives [29] (a method for computing FDR for IQ is currently under development). If a chromatographic peak has been selected for a given peptide/charge state target, IQ then performs a final step of extracting the abundance information. Currently, this consists of summing a total of five mass spectra centered on the apex scan of the elution profile. The abundances from the different charge states are then summed for each peptide for quantification.
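The fit score and abundance steps can be sketched in a few lines. The exact formula used by IQ is not given here, so the normalized least-squares residual below is only one plausible definition (0 means a perfect match, consistent with a "<0.1" filter); the function names are hypothetical, and the apex summation is simplified to summing per-scan intensities of an XIC rather than summing spectra.

```python
def fit_score(theoretical, observed):
    """Normalized least-squares misfit between two isotopic profiles.

    The theoretical profile is scaled by the least-squares optimal factor,
    and the residual sum of squares is divided by the observed sum of squares.
    """
    scale_den = sum(t * t for t in theoretical)
    total = sum(o * o for o in observed)
    if scale_den == 0 or total == 0:
        return 1.0  # nothing to fit: treat as the worst score
    scale = sum(t * o for t, o in zip(theoretical, observed)) / scale_den
    residual = sum((o - scale * t) ** 2 for t, o in zip(theoretical, observed))
    return residual / total


def apex_abundance(xic, window=5):
    """Sum `window` scans of an extracted ion chromatogram around its apex."""
    if not xic:
        return 0.0
    apex = max(range(len(xic)), key=xic.__getitem__)
    half = window // 2
    start = max(0, apex - half)
    stop = min(len(xic), apex + half + 1)
    return sum(xic[start:stop])


def peptide_abundance(xics_by_charge):
    """Add up the apex-window abundances of all charge states of a peptide."""
    return sum(apex_abundance(xic) for xic in xics_by_charge.values())
```

For example, an observed profile that is an exact multiple of the theoretical profile scores 0, whereas an observed profile dominated by the wrong isotopic peak scores close to 1 and would be rejected by the fit score filter.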

Peptide-to-protein mapping was performed using IDPicker3 [30]. All the quantification results from the AMT tag and IQ analyses were imported into the DanteR program [31] for processing and plotting. The data were first log10 transformed and then median normalized before further analysis; hierarchical clustering analysis was performed with Euclidean distance as the distance metric and average linkage for clustering; and principal component analysis was performed with default parameters.
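The transformation step can be illustrated as follows. This is a sketch under an assumption, not DanteR's code: "median normalization" is read here as subtracting each sample's median log10 abundance so that all samples are centered at zero, and the function name and nested-dictionary layout are hypothetical.

```python
import math
from statistics import median


def log10_median_normalize(abundances):
    """Log10-transform abundances and center each sample on its median.

    abundances: {sample_name: {peptide: raw_abundance}}.
    Non-positive abundances are treated as missing and dropped.
    """
    normalized = {}
    for sample, values in abundances.items():
        logged = {pep: math.log10(v) for pep, v in values.items() if v > 0}
        if not logged:
            normalized[sample] = {}
            continue
        center = median(logged.values())
        normalized[sample] = {pep: x - center for pep, x in logged.items()}
    return normalized
```

The centered values can then be passed to any clustering or principal component routine; those steps are not reproduced here.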

Results and Discussion

MS-GF+ Outperforms SEQUEST and MS-Align+ in Peptide Identification for Peptidomics Analysis

In total, 6845 unique peptides were confidently identified using MS-GF+ after 1% FDR filtering. As shown in Figure 1a (white), the precursor monoisotopic masses ranged from 785 to ~6000 Da, with a median of 2059.3 Da. These peptides were further mapped to 1136 non-redundant protein groups using IDPicker3 (Supplementary Table S1). In our previous study [11], both SEQUEST and MS-Align+ were used to search the same peptidomics data, resulting in 3977 and 2843 unique peptides after filtering (FDR <1%), respectively. MS-GF+ thus identified many more unique peptides than SEQUEST and MS-Align+ (Figure 1b). Similar results were obtained for the breast tumor peptidomics data in our previous study (data not shown). Furthermore, 91.05% of the peptides identified by SEQUEST (3621) were also identified by MS-GF+, and their mass distribution is depicted in Figure 1a (blue); similarly, 61.66% of the peptides identified by MS-Align+ (1753) were also identified by MS-GF+, and their mass distribution is depicted in Figure 1a (red). In agreement with a previous report [19], SEQUEST appears to better identify peptides of relatively lower molecular weight, whereas MS-Align+ favors peptides of higher molecular weight (Figure 1a). In comparison, the peptide identifications from MS-GF+ covered a broader molecular weight range, almost as broad as the combined range of the other two methods. To confirm the quality of the MS-GF+ search results, we manually inspected the 50 worst-scoring spectra among the peptides identified only by MS-GF+. The results suggest that the majority of these spectra are of acceptable quality for peptide identification (Supplementary Figure 1). Taken together, our data suggest that MS-GF+ outperforms SEQUEST and MS-Align+ in peptide identification for peptidomics analysis by providing significantly more unique peptide identifications and better coverage of the molecular weight range.
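The overlap percentages quoted above follow directly from the reported counts. The short sketch below shows the arithmetic on sets of peptide sequences; the function name is hypothetical, and the check at the end uses only the counts given in the text.

```python
def overlap_with_reference(peptides, reference):
    """Return (shared count, percent of `peptides` also found in `reference`)."""
    peptides = set(peptides)
    shared = peptides & set(reference)
    if not peptides:
        return 0, 0.0
    return len(shared), 100.0 * len(shared) / len(peptides)


# Counts reported in the text: shared with MS-GF+ / total for that engine.
sequest_percent = 100.0 * 3621 / 3977   # about 91.05
msalign_percent = 100.0 * 1753 / 2843   # about 61.66
```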

Figure 1

Comparison of the peptidome peptide identification results from MS-GF+, SEQUEST, and MS-Align+. (a) Distribution of precursor monoisotopic mass for MS-GF+ (white), SEQUEST (blue), and MS-Align+ (red). (b) Venn diagram showing the overlap of the identifications resulting from the MS-GF+, SEQUEST, and MS-Align+ analyses

AMT Tag and IQ Provide Significantly Higher Peptidome Coverage and Fewer Missing Data Than MS-GF+

Owing to the low stoichiometry of the peptidome and the undersampling inherent in typical data-dependent acquisition of MS/MS data, MS/MS-based peptide identification usually shows very low run-to-run consistency, leading to poor peptidome coverage and a significant amount of missing data in label-free quantification [32]. Indeed, although a total of 6845 unique peptides were identified by MS-GF+ from all 12 data sets, on average only approximately 2500 unique peptides were detected in each data set (blue in Figure 2a; see also Table 1). Furthermore, only about 500 peptides were consistently identified and quantifiable across all 12 samples, whereas more than 2000 peptides were detected in only one sample (Figure 2b, blue). This led to much lower peptidome coverage and much more missing data in the individual sample analyses, both of which are well-known and significant issues for label-free quantification.

Figure 2

Comparison of identification and quantification results from MS-GF+, AMT tag, and IQ analyses. (a) Comparison of number of identified unique peptides in each sample for MS-GF+ (blue), AMT (red) and IQ (green). (b) Distribution of quantification frequency among all samples for MS-GF+ (blue), AMT tag (red), and IQ (green)

Table 1 Number of Unique Peptidome Peptides Identified by Different Informatics Approaches in Each Tumor Sample

To improve the quality of identification and quantification in each individual analysis, we next applied both the AMT tag and IQ approaches to the same 12 data sets, taking advantage of an LC-MS database created from the MS-GF+ peptide identifications. Both approaches are expected to yield more comprehensive quantification results because of LC elution time alignment and, hence, effective LC-MS peak matching. As depicted in Figure 2a, the average numbers of unique peptides identified per data set by the AMT tag and IQ approaches were much larger than that from MS-GF+: 4630 and 5000, respectively, as opposed to 2500. Moreover, the number of unique peptides detected across all 12 data sets was significantly higher for the AMT tag (2112) and IQ (2182) approaches than for MS-GF+ (421), and the number of peptides detected in only one sample was greatly decreased (243 for AMT tag and 133 for IQ), significantly reducing missing data (Figure 2b and Table 2). The peptide identifications resulting from the MS-GF+, AMT tag, and IQ analyses of each individual peptidomic data set are provided in Supplementary Tables 2–4.
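The quantification frequency compared above (how many samples each peptide was detected in) can be tallied with a short sketch. The function name and input layout are hypothetical; the input is assumed to be one set of peptide identifiers per sample.

```python
from collections import Counter


def quantification_frequency(peptides_by_sample):
    """Count peptides by the number of samples in which they were detected.

    peptides_by_sample: {sample_name: iterable of peptide identifiers}.
    Returns {number_of_samples: number_of_peptides}.
    """
    samples_per_peptide = Counter()
    for peptides in peptides_by_sample.values():
        # set() guards against a peptide listed twice within one sample.
        samples_per_peptide.update(set(peptides))
    return dict(Counter(samples_per_peptide.values()))
```

With 12 samples, the entry for key 12 gives the number of peptides with no missing data, and the entry for key 1 gives the peptides detected in a single sample only.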

Table 2 The Distribution of Quantification Frequency (Number of Unique Peptide Identifications Common Across the Different Samples) for the Three Different Informatics Approaches

When the performance of the two MS/MS-independent LC-MS analysis approaches was compared, IQ provided slightly more unique peptide identifications (2.34%–14.69% more, average 8.02%), and thus less missing data, than the AMT tag approach. This is likely because IQ uses all isotopic peaks, whereas the AMT tag approach relies on individually de-isotoped spectra; using the full isotopic profile improves sensitivity, better distinguishes overlapping features, and improves reproducibility. Taken together, our results demonstrate that, by significantly reducing undersampling, direct LC-MS analysis pipelines such as the AMT tag and IQ approaches provide significantly higher peptidome coverage and, more importantly, less missing data across the entire set of peptidomics data than even the best-performing MS/MS-dependent analysis approaches such as MS-GF+.

Both the AMT Tag and IQ Approaches Provide Robust Label-Free Quantification for Peptidomics

The AMT tag approach is well established for label-free quantification [23, 33], so it is informative to compare its quantification performance with that of the newer IQ approach. Altogether, 1503 unique peptides were quantified by both the IQ and AMT tag analyses with no missing data across all 12 samples. Pearson correlation was first calculated for each of these 1503 peptides to assess the consistency between the AMT tag and IQ quantification results. As shown in Figure 3a, 91.62% of the peptides displayed a correlation coefficient of at least 0.8, whereas only 50 peptides (3.32%) had a correlation lower than 0.5. Pearson correlations between the AMT tag and IQ quantification results were also calculated for each sample. With no Pearson correlation coefficient lower than 0.93, the quantification results from the AMT tag and IQ analyses were well correlated (Table 3). As an example, the AMT tag and IQ quantification results for one sample (A_0), shown in Figure 3b, displayed excellent consistency, with most of the data points aligned along the diagonal and an overall correlation of 0.94.
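The per-peptide comparison can be sketched as follows: for each peptide, the abundances from the two approaches across the samples form two vectors whose Pearson correlation is computed, and the share of peptides above a threshold is then reported. The function names are hypothetical, and the inputs are assumed to be already normalized abundances in matching sample order.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def fraction_at_least(correlations, threshold):
    """Fraction of correlation values greater than or equal to `threshold`."""
    return sum(1 for r in correlations if r >= threshold) / len(correlations)
```

The same `pearson` function applies to the per-sample comparison, where the two vectors are the abundances of all shared peptides within one sample.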

Figure 3

Comparison of the label-free quantification results from AMT tag and IQ analyses. (a) Distribution of the Pearson correlation values for peptides common in all samples quantified by AMT tag and IQ approaches. (b) Scatter plot showing the correlation of all peptides quantified by AMT tag and IQ approaches from sample A_0

Table 3 Pearson Correlation of the AMT Tag and IQ Label-Free Quantification Results for Each Tumor Sample

The IQ and AMT tag label-free quantitation results also produced very similar hierarchical clustering analysis (HCA) and principal component analysis (PCA) plots (Supplementary Figure 2). Consistent with previously reported results [11], in both the IQ and AMT tag analyses of the same peptidomics data sets, the HCA heatmaps and PCA plots showed that the peptidomic profiles from the four time points of the same patient sample clustered together. This indicates that potential changes in the peptidome due to post-excision delay (up to 1 h) were much smaller than those due to patient heterogeneity, and that both the IQ and AMT tag informatics pipelines provide robust quantitation for peptidomics analysis.

Conclusion

We describe an improved LC-MS-based informatics workflow for comprehensive and quantitative peptidomics analysis, which consists of MS-GF+ for initial database searching and the IQ (or AMT tag) approach for improved identification and more robust label-free quantification. MS-GF+ provides significantly more peptide identifications, spanning a broader molecular weight range, than the frequently used SEQUEST and MS-Align+ MS/MS search engines for identifying peptidome peptides. Owing to the direct LC-MS analysis strategy employed, both the AMT tag and IQ approaches significantly alleviate the undersampling issue and provide much better peptidome coverage and much less missing data for each sample than even the best MS/MS-based analysis methods such as MS-GF+. In addition to correlating well with the quantification results from the AMT tag approach, the IQ approach further improved peptidome coverage and reduced missing data across the entire set of analyses, likely because of better peak picking and retention time alignment. We believe that the combination of MS-GF+ and the IQ (or AMT tag) approach represents an optimal peptidomics informatics pipeline, and we expect this pipeline to be broadly applied for large-scale, highly effective, and robust peptidomics analysis.