Evaluation of linear models and missing value imputation for the analysis of peptide-centric proteomics
Several methods to handle data generated from bottom-up proteomics via liquid chromatography-mass spectrometry, particularly for peptide-centric quantification dealing with post-translational modification (PTM) analysis like reversible cysteine oxidation are evaluated. The paper proposes a pipeline based on the R programming language to analyze PTMs from peptide-centric label-free quantitative proteomics data.
Our methodology includes variance stabilization, normalization, and missing data imputation to account for the large dynamic range of PTM measurements. It also corrects biases from an enrichment protocol and reduces the random and systematic errors associated with label-free quantification. The performance of the methodology is tested by performing proteome-wide differential PTM quantitation using linear models analysis (limma). We objectively compare two imputation methods along with significance testing when using multiple-imputation for missing data.
Identifying PTMs in large-scale datasets is a problem with distinct characteristics that require new methods for handling missing data imputation and differential proteome analysis. Linear models in combination with multiple-imputation could significantly outperform a t-test-based decision method.
Keywords: Post-translational modifications; Redox proteome; Mass spectrometry; Multiple imputation; Linear regression models
Abbreviations
FPR: False positive rate
LFC: Logarithmic fold change
limma: Linear Models for Microarray Data
MAR: Missing at random
MCAR: Missing completely at random
MNAR: Missing not at random
ROC: Receiver operating characteristics
TPR: True positive rate
UPS1: Universal Proteomics Standard Set 1
Covalent post-translational modifications (PTMs) have a significant impact on protein function and activity, while greatly increasing the proteome complexity of an organism. Enzyme-catalyzed PTMs, such as acetylation, phosphorylation, or ubiquitination, can occur at one or multiple amino acid residues in a nascent protein following translation and folding, or on mature proteins as part of signal transduction pathways or regulatory/control processes. PTMs modify the activation state of enzymes, change the subcellular localization, or modify the stability of proteins. The frequency with which PTMs occur, their stoichiometry, timing, and location within a protein and the cell are crucial aspects for understanding the function of proteins and linking the dynamics of the whole proteome with physiological and pathological phenotypes of an organism.
Advances in mass spectrometry (MS) have accelerated the identification of PTM sites with high resolution in proteomes, where thousands of modifications are now routinely discovered following enrichment and quantitative methods. Technological progress has also prompted the development of new protocols for the quantitative analysis of PTMs. Most published work has focused on detecting and quantifying well-known, classical PTMs such as phosphorylation, glycosylation, ubiquitination, and acetylation [4, 5, 6, 7, 8, 9]. However, research on new types of PTMs is emerging; specifically, redox-mediated PTMs are the focus of numerous recent studies as a basis to understand the roles that thiol oxidation plays in cellular signaling [10, 11, 12]. Reactive oxygen species (ROS) produced intracellularly under physiological conditions as metabolic byproducts or in response to various environmental stress factors can directly affect proteins. The sulfur atoms in cysteine and methionine residues of proteins are primarily susceptible to oxidation by ROS. Oxidized intermediates of cysteine and methionine have important catalytic roles in the active site of some enzymes, or can affect the functions of ROS-sensitive proteins.
Developing approaches to analyze mass spectrometry data both accurately and comprehensively is a challenge in proteomics and a considerable bottleneck in determining biological significance. In particular, quantitative analysis of protein PTMs via bottom-up proteomics in various organisms or experimental systems remains a challenge. The main issues in PTM quantification and analysis are twofold. Biological factors, including a low abundance of modified proteins and the transitory nature of PTMs, decrease experimental reproducibility and hamper their comprehensive spatial and temporal analysis. Also, technical factors, such as the intrinsic variability of PTM enrichment protocols and sensitivity in PTM detection, where modified peptides may be more difficult to identify from their fragmentation spectra than unmodified counterparts, further influence the reproducibility of PTM identification and quantification. These challenges result in measurements with a large proportion of missing data, technical errors, batch biases, and data sets that are difficult to normalize and variance-stabilize. Additional constraints are identified in redox proteomics, including technical problems like sample handling and preparatory issues that can artificially shift the oxidation status of proteins and difficulties in the post-MS quantitative analysis, where precise stoichiometric information is required to characterize dynamic and transient redox proteomes.
Protein extraction for background lysate
Steps taken to culture wild-type Chlamydomonas reinhardtii CC-2895 and extract proteins were identical to our previous study. For samples used to enrich reversible oxidation, iodoacetamide (IAM) was added to the lysis buffer to alkylate reduced cysteines. Final lysates in the global proteomics and reversible oxidation studies were suspended in 50 mM Tris, pH 8.0 with 0.5% SDS and 8 M urea at 1 mg/mL concentration.
Spiking standard proteins/proteomes into background lysate
The Universal Proteomics Standard Set 1 (UPS1) was purchased from Sigma-Aldrich (St. Louis, MO, USA). For the global proteomics study, a vial of UPS1 containing 5 pmol each of 48 human proteins was resuspended in 50 μL of lysate to make an initial stock concentration of 100 fmol UPS1 per 1 μg lysate. Serial dilutions were performed twice to prepare both 50 and 25 fmol/μg UPS1. Each sample had a final volume of 25 μL and corresponded to roughly 25 μg total protein in subsequent processing. There were four technical replicates for each UPS1 concentration.
Intact Mass Spec-Compatible Yeast Protein Extract was from Promega (Madison, WI, USA). For the reversible oxidation study, a vial containing 1 mg of yeast proteome was resuspended in 1 mL of lysate to make an initial stock concentration of 1000 ng yeast per 1 μg lysate. A ten-fold dilution was performed by adding 100 μL stock to 900 μL lysate to make 100 ng/μg yeast. This sample was then two-fold diluted in 500 μL of lysate for a 50 ng/μg yeast. Each sample had a final volume of 500 μL and corresponded to roughly 500 μg total protein in subsequent processing. There were three technical replicates for each yeast proteome concentration.
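The spike-in ratios quoted above follow from simple dilution arithmetic, which can be checked with a short sketch. This is an illustrative Python calculation, not part of the authors' R pipeline; the helper name `spike_ratio` is hypothetical, and it assumes the lysate is at 1 μg/μL (1 mg/mL, as stated earlier) and that all dilutions were made with the same lysate, so that only the spike-to-lysate ratio changes.

```python
# Worked arithmetic for the spike-in series (amounts and volumes taken from the text).
# Assumption: background lysate is 1 ug/uL and is also the diluent at every step.

def spike_ratio(spike_amount, lysate_volume_ul, lysate_conc_ug_per_ul=1.0):
    """Amount of spiked standard per microgram of background lysate."""
    return spike_amount / (lysate_volume_ul * lysate_conc_ug_per_ul)

# UPS1: 5 pmol (5000 fmol) of each protein resuspended in 50 uL of lysate.
ups1_stock = spike_ratio(5000.0, 50.0)                    # fmol per ug lysate
ups1_series = [ups1_stock / 2 ** i for i in range(3)]     # stock plus two two-fold dilutions

# Yeast: 1 mg (1,000,000 ng) resuspended in 1 mL (1000 uL) of lysate.
yeast_stock = spike_ratio(1_000_000.0, 1000.0)            # ng per ug lysate
yeast_100 = yeast_stock * 100.0 / (100.0 + 900.0)         # 100 uL stock + 900 uL lysate
yeast_50 = yeast_100 / 2                                  # two-fold dilution in lysate

print(ups1_series, yeast_stock, yeast_100, yeast_50)
```

Under these assumptions the sketch reproduces the 100, 50, and 25 fmol/μg UPS1 levels and the 100 and 50 ng/μg yeast levels used in the two studies.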
Preparation of samples for bottom-up proteomics
Steps taken for protein-level reversibly oxidized cysteine enrichment and subsequent LC-MS/MS analysis have been described previously. For global proteomics, each sample was held at 30 °C and reduced using 10 mM dithiothreitol (DTT) for 1 h, alkylated with 30 mM IAM for 1 h, and then diluted four-fold with 75 μL of 50 mM Tris, pH 8. Proteins were digested with 1 μg of Trypsin Gold (Promega) for 16 h at 25 °C before quenching with 5 μL of 5% TFA. Following solid-phase extraction and vacuum centrifugation, samples were resuspended in 125 μL of water with 0.1% TFA before LC-MS/MS analysis identical to the reversible oxidation study.
Database searching and label-free quantification
Acquired spectral files (*.wiff) were imported into Progenesis QI for proteomics (Nonlinear Dynamics, version 2.0). A reference spectrum was automatically assigned, and total ion chromatograms were then aligned to minimize run-to-run differences in peak retention time. Each sample received a unique factor to normalize all peak abundance values for systematic experimental variation. Alignment was validated (> 80% score) and a combined peak list (*.mgf) for all runs was exported for peptide sequence determination and protein inference by Mascot (Matrix Science, version 2.5.1). Database searching was performed against a combined database (19,603 entries total) containing C. reinhardtii JGI v5.5 proteins (https://phytozome.jgi.doe.gov/pz/portal.html; downloaded June 2016) and entries from the NCBI chloroplast (BK000554.2) and mitochondrial (NC_001638.1) databases. Sequences for either the 48 UPS1 proteins (www.sigmaaldrich.com/content/dam/sigma-aldrich/life-science/proteomics-and-protein/ups1-ups2-sequences.fasta; downloaded May 2016) or 6721 yeast proteins from UniProtKB (UP000002311; downloaded April 2016) were appended to the database. Searches of MS/MS data used a trypsin protease specificity with the possibility of two missed cleavages, peptide/fragment mass tolerances of 15 ppm/0.1 Da, and variable modifications of acetylation at the protein N-terminus and oxidation at methionine. Carbamidomethylation at cysteine was a fixed modification for global proteomics and variable for the reversible oxidation study. Significant peptide identifications above the identity or homology threshold were adjusted to less than 1% peptide FDR using the embedded Percolator algorithm and uploaded to Progenesis for peak matching. Identifications with a score less than 13 were removed from consideration in Progenesis before exporting ‘peptide measurements’ and ‘protein measurements’ from the ‘Review Proteins’ stage.
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org/cgi/GetDataset) via the PRIDE partner repository.
After analyzing raw spectral files in Progenesis, we prepared data from the global proteomics experiment (UPS1) for statistical analysis following the steps described below (implemented in Additional file 1). Our dataset initially consisted of 15,234 peptides matched to MS1 features in the peptide measurements export. After filtering for Mascot score above 13 and removing hits to the contaminant database, 13,452 peptides remained. Some features were matched with peptides having identical sequence, modifications, and Mascot score, but alternate protein accessions. These groups were reduced to satisfy the principle of parsimony and represented by the protein accession with the highest number of unique peptides or, failing that, the protein with the largest confidence score assigned by Progenesis. Additionally, certain features were duplicated with differing peptide identifications and were reduced to a single peptide with the highest Mascot score. These steps reduced the dataset to 12,792 peptides. Identifiers were formed by concatenating protein accession to the peptide sequence, and duplicate identifiers were binned together. The final dataset consisted of 10,599 identifiers (10,207 from Chlamydomonas and 392 from UPS1 proteins).
Our dataset from the reversible oxidation study consisted of 4388 peptides initially and underwent similar processing steps (implemented in Additional file 2). Filtering for Mascot score and removing contaminants left 3752 peptides. We then consolidated groups with duplicate peak features, reducing the dataset to 3547 peptides. For this study, results were limited to peptides with one or more Cys-sites of reversible oxidation, defined here as the absence of carbamidomethylation on at least one cysteine residue in the peptide sequence. This filter left 3162 peptides with previous sites of reversible cysteine oxidation. An identifier was then made by joining the protein accession of each feature with the particular site of Cys-oxidation in the protein sequence. Data were then reduced to unique identifiers by summing the abundance of all contributing features (i.e., peptide charge states, missed cleavages, and combinations of additional variable modifications). Each group was represented by the peptide with the highest Mascot score, leaving 2235 identifiers for statistical evaluation (1786 from Chlamydomonas and 449 from yeast).
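The filtering and consolidation steps described above can be sketched in a few lines. The authors' implementation is the R script referenced in Additional files 1 and 2; the following is only a simplified Python illustration. The record keys (`feature`, `accession`, `site`, `score`, `abundance`), the function name, the identifier format, and the example rows are all hypothetical, and contaminant removal and the parsimony step are omitted.

```python
from collections import defaultdict

MASCOT_CUTOFF = 13  # identifications scoring below 13 are removed, as stated in the text

def collapse_features(rows, cutoff=MASCOT_CUTOFF):
    """Filter by score, keep one identification per MS1 feature, sum per identifier."""
    # 1) Drop identifications below the Mascot score cut-off.
    kept = [r for r in rows if r["score"] >= cutoff]
    # 2) For features with several identifications, keep the highest-scoring one.
    best = {}
    for r in kept:
        current = best.get(r["feature"])
        if current is None or r["score"] > current["score"]:
            best[r["feature"]] = r
    # 3) Sum abundances of all features that share an accession + Cys-site identifier
    #    (e.g. different charge states or missed cleavages of the same site).
    totals = defaultdict(float)
    for r in best.values():
        totals[f'{r["accession"]}_C{r["site"]}'] += r["abundance"]
    return dict(totals)

# Made-up example records, for illustration only.
rows = [
    {"feature": 1, "accession": "PROT_A", "site": 45, "score": 40, "abundance": 1000.0},
    {"feature": 1, "accession": "PROT_B", "site": 12, "score": 20, "abundance": 1000.0},
    {"feature": 2, "accession": "PROT_A", "site": 45, "score": 30, "abundance": 500.0},
    {"feature": 3, "accession": "PROT_C", "site": 7, "score": 10, "abundance": 200.0},
]
result = collapse_features(rows)
print(result)
```

In this toy input, the lower-scoring duplicate of feature 1 and the sub-threshold feature 3 are discarded, and the two remaining features are summed under a single identifier.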
Data analysis pipeline
We used multiple imputations in conjunction with binomial testing to decide on statistically significant changes in peptide abundance. Imputation was performed n times, generating n datasets. Relative changes between conditions in peptide abundance were analyzed using limma’s function lmFit with or without the method = “robust” flag, followed by eBayes using the default settings and false-discovery rate correction. The logarithmic fold change (LFC) in base two was finally calculated for each feature, comparison, and dataset.
We modeled the outcome of the data imputation using a binomial distribution, where each trial of an imputed peptide in a comparative analysis could have two outcomes: significantly changed (having the p-value below some set alpha level and the LFC above some cut-off level), or insignificantly changed. After counting the outcomes of the n imputations, we performed a right-tailed binomial test (using R’s core feature binom.test) for observing an outcome (probability of success) significantly higher than 0.5, at a significance level 0.05.
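To make the idea of generating n imputed datasets concrete, here is a minimal Python sketch for a single peptide. It is not the authors' R implementation and it does not reproduce limma's moderated statistics. It assumes log2-transformed abundances, marks missing values as `None`, and, purely as an illustrative assumption, draws replacements from a normal distribution whose mean and standard deviation are estimated from that peptide's observed values. The function names are hypothetical.

```python
import random
import statistics

def impute_once(values, rng, mu, sigma):
    """Replace missing entries (None) with random draws from N(mu, sigma)."""
    return [rng.gauss(mu, sigma) if v is None else v for v in values]

def lfc_over_imputations(cond_a, cond_b, n=10, seed=0):
    """Return n log2 fold changes (mean of A minus mean of B), one per imputed dataset.

    cond_a and cond_b hold log2 abundances of one peptide across replicates.
    Assumption: imputation parameters come from this peptide's observed values.
    """
    rng = random.Random(seed)
    observed = [v for v in cond_a + cond_b if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    lfcs = []
    for _ in range(n):
        a = impute_once(cond_a, rng, mu, sigma)
        b = impute_once(cond_b, rng, mu, sigma)
        lfcs.append(statistics.mean(a) - statistics.mean(b))
    return lfcs

# One missing replicate in each condition; every imputation gives a slightly different LFC.
print(lfc_over_imputations([12.0, None, 12.5, 11.8], [10.1, 9.9, None, 10.0], n=5))
```

Each of the n fold changes would then be paired with a test statistic from the linear model fitted to the corresponding imputed dataset.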
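The right-tailed binomial decision can be written out explicitly. The sketch below is a plain-Python illustration of the tail probability that R's binom.test reports for alternative = "greater" with p = 0.5; the function names are hypothetical and the code is not taken from the authors' pipeline.

```python
from math import comb

def binom_right_tail(successes, n, p=0.5):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(successes, n + 1))

def consistently_significant(successes, n, level=0.05):
    """True if 'significant' outcomes occur more often than chance across n imputations."""
    return binom_right_tail(successes, n) < level

# With n = 10 imputations: 9 significant outcomes give p = 11/1024 (about 0.011),
# while 8 significant outcomes give p = 56/1024 (about 0.055).
print(binom_right_tail(9, 10), consistently_significant(9, 10))
print(binom_right_tail(8, 10), consistently_significant(8, 10))
```

For ten imputations, a peptide therefore needs to be called significant in at least nine of the imputed datasets to pass this test at the 0.05 level.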
Results and discussion
PTM quantification by MS has largely been peptide-centric, which involves digesting proteins into peptides using protease(s) before enriching samples for a particular modification and detecting with liquid chromatography-tandem mass spectrometry (LC-MS/MS). Although methods available for high-throughput analysis of intact proteoforms are steadily growing [31, 32], a majority of laboratories currently use bottom-up proteomics and peptide-centric quantitation for PTM studies. To develop a robust data analysis workflow for such datasets, we analyzed two distinct experiments featuring: (1) UPS1 standards and (2) enrichment of protein cysteines from the yeast proteome. In both experiments, the standards were spiked into Chlamydomonas protein extract at different concentrations to determine our ability to identify changing abundances correctly.
We compared our analysis with an FDR corrected t-test using the core R functions (t.test with default settings and p.adjust with the method = “fdr” flag). We also included a hybrid decision method where an LFC cut-off threshold was added to the statistical decision at significance level alpha (default 0.05). The alpha and LFC cut-off were varied ten times for both the pipeline and the t-test (alpha between 0.05 and 0.001, and LFC between 0 and 2). The LFC and alpha were changed separately or both at the same time. For the t-test, a single imputed dataset was used.
Comparison with random forest imputation (which has been shown to display high performance in MS metabolomics data) was performed using the R package missForest. Default settings were used except for maxiter, which was set to allow for 20 iterations. A single dataset was imputed using random forest and processed through limma using the same hybrid decision method (LFC cut-off and alpha).
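The FDR correction and the hybrid decision rule can be illustrated with a short Python sketch. The first function computes Benjamini-Hochberg adjusted p-values, which is the adjustment R's p.adjust applies for method = "fdr". The second combines the adjusted p-value with an LFC cut-off; comparing the absolute LFC to the cut-off is an assumption made here for illustration, and both function names are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest and enforce monotonicity with a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def hybrid_call(adj_p, lfc, alpha=0.05, lfc_cutoff=1.0):
    """Call a change significant only if it passes both the alpha and the LFC threshold.

    Assumption: the LFC cut-off is applied to the absolute fold change.
    """
    return adj_p < alpha and abs(lfc) > lfc_cutoff

adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(adjusted)
print(hybrid_call(adjusted[0], lfc=1.5), hybrid_call(adjusted[0], lfc=0.5))
```

Varying `alpha` and `lfc_cutoff` over a grid, as described above, yields one set of calls per threshold combination for each decision method.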
Performance analysis using Chlamydomonas total proteome-UPS1 dataset
The Chlamydomonas-UPS1 (hereafter referred to as just UPS1) dataset is a global proteomics dataset containing 10,952 features (after filtering) and three different concentrations of spiked-in UPS1: 25, 50, or 100 fmol per 1 μg Chlamydomonas lysate (referred to as 25, 50, and 100, respectively) with four replicates each. In each condition, there were 403 confident peptides of UPS1, which were considered true positives (TP). Since every replicate in every condition was prepared from the same Chlamydomonas lysate, the Chlamydomonas peptides were considered true negatives (TN).
For the largest difference in concentration (100/25), there was clear discrimination between TP and TN relative to comparisons made between 100/50 and 50/25 (Additional file 3: Figure S1). The comparison with an LFC = 2 had better discrimination between TN and TP and a larger variation of TP (Additional file 3: Figure S1A). At LFC = 1 (100/50 and 50/25), the TP had lower variation, but it was harder to discriminate from TN (Additional file 3: Figure S1B and S1C). There were a total of 1083 (0.82%) missing values in 471 (4.3%) features for which imputation was performed.
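The expected fold changes and the missing-value percentages quoted above can be verified with a few lines of arithmetic. This is an illustrative Python check using only numbers from the text; the helper names are hypothetical, and the sample count of 12 assumes three concentrations with four replicates each.

```python
from math import log2

def expected_lfc(conc_a, conc_b):
    """Expected log2 fold change between two spike-in concentrations."""
    return log2(conc_a / conc_b)

def percent(part, whole):
    """Percentage of part in whole."""
    return 100.0 * part / whole

n_features, n_samples = 10952, 3 * 4   # 3 concentrations x 4 replicates

lfc_100_25 = expected_lfc(100, 25)     # 2.0
lfc_100_50 = expected_lfc(100, 50)     # 1.0
lfc_50_25 = expected_lfc(50, 25)       # 1.0

missing_values_pct = percent(1083, n_features * n_samples)
features_with_missing_pct = percent(471, n_features)

print(lfc_100_25, lfc_100_50, lfc_50_25)
print(round(missing_values_pct, 2), round(features_with_missing_pct, 1))
```

The 100/25 comparison therefore corresponds to an expected LFC of 2, and the two adjacent comparisons to an expected LFC of 1, matching the discrimination pattern described for the scatterplots.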
Performance analysis using redox-enriched Chlamydomonas-yeast dataset
The Chlamydomonas-yeast dataset is from an enrichment method for proteins bearing reversibly oxidized cysteine residues. It contained six technical replicates of Chlamydomonas lysate, where three were spiked with 50 ng yeast proteome per 1 μg Chlamydomonas lysate while the other three received 100 ng yeast per 1 μg Chlamydomonas lysate. In total, it contained 2229 peptides with previously oxidized Cys, of which 449 were TP yeast features. Compared to the UPS1 dataset, discrimination between TP and TN in this dataset was more difficult due to increased overlap of the TP and TN distributions in the compared conditions (see scatterplots in Additional file 3: Figure S1 and Additional file 5: Figure S3). It had a total of 963 (7.2%) missing values in 435 (20%) features for which imputation was performed.
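Because the spiked-in features are known, each set of significance calls can be scored as one point on a receiver operating characteristic (ROC) curve. The Python sketch below shows that bookkeeping under the labeling used here (spike-in features as TP, background Chlamydomonas features as TN); the function name and the toy input are hypothetical.

```python
def roc_point(called_significant, is_spike_in):
    """Return (FPR, TPR) for one set of calls against known spike-in labels."""
    pairs = list(zip(called_significant, is_spike_in))
    true_pos = sum(1 for called, spike in pairs if called and spike)
    false_pos = sum(1 for called, spike in pairs if called and not spike)
    n_pos = sum(1 for _, spike in pairs if spike)
    n_neg = len(pairs) - n_pos
    return false_pos / n_neg, true_pos / n_pos

# Toy example: five features, two of which are spike-ins.
calls = [True, True, False, False, True]
truth = [True, False, True, False, False]
fpr, tpr = roc_point(calls, truth)
print(fpr, tpr)
```

Repeating this for every alpha and LFC cut-off combination traces out the ROC curve used to compare decision methods.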
Computationally, our method is inexpensive since it samples normal distributions to impute missing data. Improved missing data models can be developed from information on the missing-ness mechanisms related to the PTM quantification protocols. Stochastic sampling methods can be used for left-censored missing value imputation for MNAR in combination with bootstrapping and other statistical resampling methods for MAR/MCAR imputation when the percentage of missing data is expected to be large.
PTM detection and analysis in comparative assays introduce new challenges in data imputation. Enrichment protocols may exhibit high variation among technical replicates and structure-dependent bias, leading to an increase in the percentage of missing data. Here we set up a benchmark dataset to analyze the performance of a pipeline developed in the R programming language including data imputation, limma analysis, and multiple imputation binomial testing, in comparison to a traditional pipeline including statistical testing using t-test and FDR correction for multiple testing. Robust regression methods were expected to outperform typical statistical tests used in MS data analysis. Here we conducted a performance evaluation of the pipelines for total proteome quantitation and differential analysis of the redox proteome. Our results indicate that a significant improvement in performance can be obtained when using a robust estimation of missing data distribution parameters combined with linear modeling and binomial testing.
We also compared our imputation method with random forest imputation and found that our method had a slightly better performance at high FPRs, while performing similarly at lower FPRs. Given that our multiple-imputation method can use a learning strategy to optimize performance over the entire pipeline, further gains can be obtained when compared with a bootstrap imputation that is sampling only the input data.
This research was supported by NSF-MCB 1552522 and NSF-MCB 1714405 awarded to L.M.H. and NSF-MCB 1714157 to S.C.P. and G.V.P.
Availability of data and materials
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org/cgi/GetDataset) via the PRIDE partner repository with identifiers PXD009694 for analysis of the Chlamydomonas proteome with UPS1 spike-in and PXD009693 for analysis of reversibly oxidized cysteines from the Chlamydomonas proteome with yeast spike-in. The data preprocessing implementation is described in Additional files 1 and 2. The R script used in data preprocessing can be found in the Additional file 7 (EWM_ProgenesisLFQ_redox_v2.R). The R scripts of the computational pipeline can be found in the Additional file 7 archive; a tutorial on using the pipeline is included in Additional file 8.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 2, 2019: Proceedings of the 15th Annual MCBIOS Conference. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-2.
PB implemented the computational pipeline and performed the analysis. EWM generated the experimental datasets and analyzed the MS data using Progenesis and the R data processing script. GVP designed the computational pipeline and supervised the data analysis. LMH and SCP supervised experimental design and analysis. All authors contributed to the manuscript. All authors have read and approved the final manuscript.
The authors declare no competing interests.
25. Tukey JW. Exploratory data analysis. Addison-Wesley series in behavioral science. Reading: Addison-Wesley Pub. Co. xvi; 1977. p. 688.
29. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2018.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.