Introduction

Single-run proteomic analysis, in which liquid chromatography is coupled with high-resolution mass spectrometry, has evolved into an important technology for studying large numbers of proteins in an unbiased manner. It requires less starting material, no pre-fractionation or biochemical fractionation, and less measuring time in the mass spectrometer. It has been used to study the deep proteome of yeast and a mammalian cell line [1]. Mass spectrometry-based proteomics has been widely used to study protein-protein interactions [2,3,4], cell signaling [3,4,5,6], subcellular organelles [2, 3, 5], and protein complexes [2, 3, 5].

Liquid chromatography has great power for separating analytes [7], and it works particularly well when combined with high-resolution mass spectrometry for their identification. To improve chromatographic separation and resolution, columns of various lengths have been used, including 66-cm [8], 80-cm [9], and 350-cm [10] columns, 70-cm × 20-μm inner diameter silica-based monolithic columns [11], and a 60-cm long triphasic capillary column [12], together with different packing materials such as 1.5-μm nonporous octadecylsilane-modified silica particles [8], 1.7-μm C18 particles [13], 3-μm C18-bonded particles [9, 13], and 100-μm inner diameter monolithic silica-C18 [10]. For deep proteome analysis of complex mixtures, protein or peptide fractionation is generally performed before the liquid chromatography-tandem mass spectrometry runs. However, this requires large amounts of starting material and long measurement times. In-depth coverage of the E. coli proteome has been attempted by several groups, including Wiśniewski and Rakus [14], Macek and coworkers [15], and Heinemann and coworkers [16]. These studies [14,15,16] relied on expensive filtration devices for Filter-Aided Sample Preparation (FASP) [14], on protein or peptide fractionation techniques that require more starting material and more measuring time, and on a more advanced version of the linear trap quadrupole (LTQ)-Orbitrap Velos mass spectrometer, accompanied by various quantification techniques such as label-free quantification [16], selected reaction monitoring (SRM) [16], stable isotope dilution (SID) [16], the Super-Stable Isotope Labeling by/with Amino acids in Cell culture (SILAC) approach [15], and the total protein approach [14].

Earlier, we used 8-h liquid chromatography runs with an in-house made column oven and a 50-cm column packed with 1.8-μm C18 beads, coupled with a high-resolution LTQ-Orbitrap Velos mass spectrometer, to identify 2990 yeast proteins and 5376 mammalian cell line proteins in triplicate analysis without any pre-fractionation steps [1].

Here, we have increased the duration of the liquid chromatography runs from 8 h [1] to 12 h. Using LTQ-Orbitrap Velos mass spectrometry together with our previously described column preparation method [1] and column oven [1], along with the intensity-based absolute quantification (iBAQ) approach [17], we set out to investigate the in-depth proteome of E. coli using a single enzyme, trypsin, and no pre-fractionation. We demonstrate the coverage of E. coli cellular pathways, biological processes, signaling pathways, and the protein-protein interaction network in an unbiased manner. Further, we investigate how a single run of 12 h achieves almost the same coverage as the quadruplicate runs. We employed the iBAQ approach [17] to study the dynamic range of the proteins identified in E. coli. The iBAQ algorithm [17] estimates each protein’s approximate abundance by dividing the summed intensities of its peptides by the number of theoretically observable peptides for that protein.
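The iBAQ normalization described above can be illustrated with a minimal Python sketch. The function name `ibaq`, the peptide intensities, and the count of theoretical peptides are hypothetical values chosen for illustration; they are not taken from this dataset, and the implementation in MaxQuant may differ in detail.

```python
def ibaq(peptide_intensities, n_theoretical_peptides):
    """Return an iBAQ-style value: the summed peptide intensity divided
    by the number of theoretically observable peptides of the protein."""
    if n_theoretical_peptides <= 0:
        raise ValueError("protein needs at least one theoretical peptide")
    return sum(peptide_intensities) / n_theoretical_peptides


# Hypothetical protein: three measured peptides, ten theoretical peptides.
example_value = ibaq([2.0e6, 5.0e5, 1.5e6], 10)
print(example_value)
```

Dividing by the theoretical peptide count is what makes values comparable between proteins of different length, since a longer protein yields more peptides and therefore more summed intensity at the same molar amount.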

Materials and Methods

Preparation of E. coli Cell Lysates

Escherichia coli (strain K12) was grown in 10 ml LB medium at 37 °C to an OD600 of 0.7. Cells were harvested by centrifugation and resuspended in 500 μL of 8 M urea, 100 mM Tris-HCl pH 7.4. Five hundred microliters of zirconia beads were added, and cells were lysed in a FastPrep-24 (MP Biomedical) three times for 45 s. Lysates were cleared by centrifugation at 17000×g for 5 min, and the protein concentration was measured. Fifty micrograms of total protein were reduced with 1 mM DTT and alkylated with iodoacetamide.

In-Solution Digestion

E. coli proteins were digested overnight with trypsin at an enzyme-to-protein ratio of 1:50 (w/w) at room temperature. After digestion, samples were acidified with TFA and loaded on Stage Tips (5 μg/Stage Tip) as described previously [18]. Peptides were eluted from the Stage Tips using buffer B (80% acetonitrile, 0.5% acetic acid).

LC-MS Analysis

The peptides were loaded on an in-house made 50-cm column (75-μm inner diameter, 1.8-μm C18 beads; Reprosil-AQ Pur, Dr. Maisch) using nano-HPLC at a flow rate of 75 nl/min. The column was coupled to a linear trap quadrupole (LTQ)-Orbitrap Velos mass spectrometer (Thermo Fisher Scientific). Peptides were loaded onto the column with buffer A (0.5% acetic acid) and eluted with a 12-h linear gradient. Data were recorded in data-dependent mode. Survey scans were acquired in the Orbitrap mass analyzer, and the 25 most intense peaks were selected for fragmentation as described earlier [1]. Peptides were fragmented using the HCD method [19].

Data Analysis

The acquired raw data were processed using MaxQuant [15] version 1.2.2.5 with the Andromeda search engine [20]; the database used was ncbi.Ecoli_K12_substrDH10B.25-Jan-2010.fasta. The search included cysteine carbamidomethylation as a fixed modification, and protein N-acetylation and methionine oxidation as variable modifications. Up to two missed cleavages were allowed for protease digestion. Only peptides with a minimum length of six amino acids were considered for identification. The false discovery rate (FDR) for the identification of peptides and proteins was 1%. The iBAQ algorithm was used for protein quantification [17]. The identified proteins were filtered to remove reverse hits and contaminants. Proteins were annotated with Gene Ontology (GO) biological process, molecular function, and cellular component terms [21, 22] and with Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] pathways. We used the STRING database version 9.1 (http://string-db.org/) for protein-protein interactions and considered only those interactions in our dataset with a confidence score ≥ 0.7 (high confidence). We also visualized these interactions using Cytoscape [24].
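The removal of reverse hits and contaminants mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names "Reverse" and "Contaminant", the "+" flag, and the example rows are hypothetical, and the column names in a given MaxQuant output table may differ.

```python
def filter_protein_groups(rows):
    """Drop rows flagged as reverse (decoy) hits or contaminants.

    Assumes each row is a dict in which a flagged entry carries "+" in
    the "Reverse" or "Contaminant" field; these field names are assumptions.
    """
    return [
        row for row in rows
        if row.get("Reverse", "") != "+" and row.get("Contaminant", "") != "+"
    ]


# Hypothetical rows for illustration only.
rows = [
    {"Protein": "P1", "Reverse": "", "Contaminant": ""},
    {"Protein": "REV__P2", "Reverse": "+", "Contaminant": ""},
    {"Protein": "CON__P3", "Reverse": "", "Contaminant": "+"},
]
kept = filter_protein_groups(rows)
print([row["Protein"] for row in kept])
```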

Results and Discussion

iBAQ Quantification of E. coli Proteome

We chose to investigate E. coli K-12 because it is the genetically best understood prokaryotic organism and a biologically safe vehicle for conducting biological experiments [25]. To investigate the number of proteins identifiable in single-run analysis of E. coli without pre-fractionation, we used a long 50-cm column packed with 1.8-μm C18 particles and a 12-h gradient on an LTQ-Orbitrap Velos, which has a high scanning speed. We performed quadruplicate analysis of E. coli peptides to ensure uniformity in the analysis and quantified 2068 proteins in about 2 days of measurement time (Supplementary Table S1). Here, we refer to quadruplicate runs as 4 single runs × 12 h, whereas a single run is 1 single run × 12 h.

The four 12-h single runs identified and quantified 2038, 2025, 2031, and 1950 proteins, respectively. The Pearson’s correlation coefficient of iBAQ protein quantification between the quadruplicate runs and each single run was 0.984 for Experiment 1 (2038 proteins), 0.985 for Experiment 2 (2025 proteins), 0.985 for Experiment 3 (2031 proteins), and 0.956 for Experiment 4 (1950 proteins) (Supplementary Figure S1). Notably, the single runs covered 98.5%, 97.9%, 98.2%, and 94.3%, respectively, of the 2068 proteins quantified in the quadruplicate analysis. Each single run is therefore almost equivalent to the quadruplicate runs (Table 1). The analysis identified and quantified about 52% of the total proteins encoded in the E. coli genome according to the UniProt database. Notably, with an 8-h single LC-MS/MS run, we previously achieved 87% coverage of triplicate LC-MS/MS runs of the HEK 293 cell line proteome [1], whereas with the 12-h single LC-MS/MS run used here, we achieve 98% coverage of quadruplicate LC-MS/MS runs of the E. coli proteome (Table 1). This suggests that each 12-h single run is sufficient to identify enough E. coli proteins to decipher signaling and proteome information. Confidence in the single-run data was increased by running replicates (duplicate, triplicate, and quadruplicate runs) and identifying the same proteins again. To improve the dynamic range of the single-run analysis, “match between runs” was performed in the MaxQuant environment [26] by matching the peptide intensities of the single runs of the quadruplicate analysis with each other. Together, these results indicate that each 12-h single run approaches saturation and has sufficient dynamic range to detect and identify most of the proteins obtained by the quadruplicate run analysis.
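The coverage percentages quoted above follow directly from the reported protein counts, as the short Python check below shows. The counts are the ones given in the text; the variable names are illustrative.

```python
# Proteins quantified in each 12-h single run and in the combined
# quadruplicate analysis, as reported in the text.
single_run_counts = {
    "Experiment 1": 2038,
    "Experiment 2": 2025,
    "Experiment 3": 2031,
    "Experiment 4": 1950,
}
quadruplicate_total = 2068

# Percentage of the quadruplicate result recovered by each single run.
coverage_percent = {
    name: round(100 * count / quadruplicate_total, 1)
    for name, count in single_run_counts.items()
}
for name, pct in coverage_percent.items():
    print(f"{name}: {pct}% of {quadruplicate_total} proteins")
```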

Table 1 Identification of proteins in single- and quadruplicate runs

To roughly estimate the abundance of each protein detected in the proteome, we used the intensity-based absolute quantification (iBAQ) algorithm, in which the intensities of the peptides identifying a protein are summed and this total is then normalized by the number of theoretically observable peptides for that protein. These normalized protein intensities are converted to protein copy numbers based on the total amount of protein in the analyzed sample [17]. iBAQ values provide an estimate of the absolute amount of each protein [27]. We do not claim absolute quantification in the absence of spiked-in standards; however, we do not expect any global deviation from the reported values. To obtain a global view of the dynamic range of the E. coli proteome, we plotted the cumulative contribution of each protein to the total protein abundance. The 22 most abundant proteins account for 25% of the proteome (Figure 1), with outer membrane protein A, DNA-binding protein HU1, Braun lipoprotein, and glyceraldehyde-3-phosphate dehydrogenase each constituting more than 1% of the proteome.
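A cumulative-contribution analysis of this kind can be sketched as follows. The helper name `proteins_to_reach` and the toy iBAQ values are hypothetical and are not taken from Supplementary Table S1; the sketch only shows how one might count the number of top-ranked proteins needed to reach a given share of total abundance.

```python
def proteins_to_reach(ibaq_values, fraction):
    """Count how many of the most abundant proteins are needed for their
    summed iBAQ values to reach the given fraction of the total."""
    total = sum(ibaq_values)
    running = 0.0
    # Walk from the most to the least abundant protein.
    for rank, value in enumerate(sorted(ibaq_values, reverse=True), start=1):
        running += value
        if running >= fraction * total:
            return rank
    return len(ibaq_values)


# Hypothetical iBAQ values for six proteins (illustration only).
toy_ibaq = [50.0, 30.0, 10.0, 5.0, 3.0, 2.0]
top_for_quarter = proteins_to_reach(toy_ibaq, 0.25)
top_for_most = proteins_to_reach(toy_ibaq, 0.85)
print(top_for_quarter, top_for_most)
```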

Figure 1

Quantitation of identified E. coli K-12 proteins by iBAQ. Cumulative protein abundance distribution from high-abundance to low-abundance proteins. Some of the highly abundant proteins are labeled in the graph

We also calculated the median iBAQ values of the E. coli proteome and plotted the absolute abundance of the 2068 quantified E. coli proteins. The iBAQ values span more than five orders of magnitude (Figure 2A). The 25 most highly expressed proteins include glycyl radical cofactor, outer membrane protein, DNA-binding protein, Braun lipoprotein, glyceraldehyde-3-phosphate dehydrogenase, ribosomal proteins, and chaperonin (Figure 2B). The 25 least abundant proteins include ATP-binding protein, glycerate kinase, ABC transporter periplasmic-binding protein, uncharacterized proteins, acriflavine resistance protein, cell filamentation protein, heat-shock regulatory protein, lipoprotein, and other proteins (Figure 2C). The values stated above are approximate estimates of the abundance of the expressed proteins, and abundance may be underestimated for proteins expressed at low levels. The proteome recorded by LC-MS/MS is not complete, probably because some E. coli proteins are expressed at low levels and because no pre-fractionation of E. coli peptides or proteins was performed. Despite these limitations, the single runs covered highly abundant proteins to considerable depth, along with a substantial number of low-abundance proteins.
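The dynamic range expressed in orders of magnitude is the base-10 logarithm of the ratio between the highest and lowest abundance values. The sketch below uses hypothetical values, not the measured iBAQ intensities, to show the calculation.

```python
import math


def orders_of_magnitude(values):
    """Return log10(max / min) over the strictly positive values."""
    positive = [v for v in values if v > 0]
    if not positive:
        raise ValueError("need at least one positive value")
    return math.log10(max(positive) / min(positive))


# Hypothetical abundance values (illustration only).
toy_values = [1.0e3, 4.0e5, 2.5e8]
span = orders_of_magnitude(toy_values)
print(round(span, 2))
```

Zero or missing intensities are excluded here because the logarithm is undefined for them; how such values are handled in practice is a separate analysis choice.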

Figure 2

(a) Dynamic distribution of E. coli proteins based on their iBAQ intensity. (b) Distribution of the 25 most abundant proteins. (c) Distribution of the 25 least abundant proteins

Coverage of Cellular Components, Biological Processes and Pathways

To check the coverage of known cellular pathways, cellular compartments, and biological processes in our dataset, we compared our quadruplicate dataset with the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [23] and Gene Ontology (GO) annotations [21, 22] (Figure 3). Interestingly, 93 of 158 KEGG pathways were represented by at least 80% of their members, and 139 of 158 by at least 50% of their members (Figure 3C, Supplementary Table S2). Examples include the citrate cycle (TCA cycle; 26 of 27 proteins) and mismatch repair (19 of 22 proteins). Moreover, a comparison of some important KEGG pathways covered by the quadruplicate runs with each single run (Experiments 1 to 4) shows that a single run is almost as good as the quadruplicate runs (Supplementary Figure S2).
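The pathway coverage calculation can be sketched as a set overlap between annotated pathway members and detected proteins. In the Python sketch below, the function names, pathway names, and member identifiers are hypothetical placeholders; only the two member counts at the end (26 of 27 and 19 of 22) come from the text.

```python
def pathway_coverage(pathway_members, detected):
    """Map each pathway to the fraction of its members that were detected."""
    detected = set(detected)
    return {
        name: len(set(members) & detected) / len(set(members))
        for name, members in pathway_members.items()
        if members  # skip pathways without annotated members
    }


def count_at_least(coverage, threshold):
    """Count pathways whose covered fraction meets the threshold."""
    return sum(1 for fraction in coverage.values() if fraction >= threshold)


# Hypothetical pathways and detected proteins (illustration only).
toy_pathways = {
    "pathway_A": ["a1", "a2", "a3", "a4", "a5"],
    "pathway_B": ["b1", "b2"],
    "pathway_C": ["c1", "c2", "c3", "c4"],
}
toy_detected = ["a1", "a2", "a3", "a4", "b1", "c1"]
toy_coverage = pathway_coverage(toy_pathways, toy_detected)
n_80 = count_at_least(toy_coverage, 0.8)
n_50 = count_at_least(toy_coverage, 0.5)
print(n_80, n_50)

# Member counts reported in the text, expressed as percentages.
tca_percent = round(100 * 26 / 27, 1)
mismatch_repair_percent = round(100 * 19 / 22, 1)
print(tca_percent, mismatch_repair_percent)
```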

Figure 3

Distribution of different GO annotations and KEGG pathways for the observed proteins in E. coli in relation to the total number of proteins. (a) Cellular compartments (b) biological processes (c) KEGG pathways

We examined folate biosynthesis (Figure 4) and found that our dataset covered all but two of its proteins. The proteins folE, moaC, moaD, moaE, sscR, ygcF, queC, queF, phoA, folB, folK, folP, folC, folA, pabC, and pabA were detected and identified by LC-MS/MS with 9, 7, 3, 4, 4, 7, 7, 11, 2, 4, 2, 10, 13, 6, 3, and 2 unique peptides, respectively (Supplementary Table S3). Each single run covered almost all of the folate biosynthesis proteins found in the quadruplicate runs (Supplementary Figure S3).

Figure 4

Schematic representation of proteins involved in folate biosynthesis. Proteins identified in our analysis are shown in yellow, whereas proteins not found in our analysis are shown in white

For the GO annotations, the quadruplicate run dataset covered almost all cellular compartments and processes. Remarkably, 675 of 1262 biological processes and 64 of 119 cellular components were represented by at least 80% of their members (Supplementary Tables S4 and S5). Some of the major cellular compartments and biological processes covered are shown in Figure 3A, B. This indicates that almost all KEGG pathways, cellular compartments, and biological processes are covered within the dynamic range of the quadruplicate LC-MS/MS runs. Furthermore, each single run covers nearly all of the cellular compartments and biological processes covered by the quadruplicate runs (Supplementary Figures S4 and S5). Note that not all members of the signaling pathways and biological processes are expressed in E. coli grown under standard laboratory conditions.

Protein-Protein Interactions

We wanted to investigate the known protein-protein interaction networks of some important biological processes in E. coli using our quadruplicate LC-MS/MS dataset. We therefore took the proteins of some of the biological processes from Figure 3B and used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, which provides physical and functional interactions [28]. STRING generated protein-protein interaction networks with tightly connected clusters. These are shown in Figures 5 and 6 for some of the important biological processes, such as cell cycle (Figure 5A), DNA repair (Figure 5B), ion transport (Figure 5C), ubiquinone biosynthetic process (Figure 5D), pseudouridine synthesis (Figure 6A), peptidoglycan biosynthetic process (Figure 6B), RNA processing (Figure 6C), and translation (Figure 6D). The proteins in these interaction networks were detected and identified by different numbers of unique peptides (Supplementary Table S6). We also compared the quadruplicate runs with each of the single runs for some of the biological processes from Figure 3B, which demonstrated that a single run is almost as good as the quadruplicate runs (Supplementary Figures S6 and S7).
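The high-confidence cutoff applied to STRING interactions (confidence score ≥ 0.7, as stated in the Data Analysis section) can be sketched as a simple filter. The tuple layout, the function name, the protein labels, and the scores below are hypothetical; the sketch does not reproduce the STRING export format or any measured interaction.

```python
def high_confidence_edges(edges, min_score=0.7):
    """Keep interactions whose confidence score meets the threshold.

    Each edge is a (protein_a, protein_b, score) tuple with the score on
    a 0-1 scale; this layout is an assumption for illustration.
    """
    return [(a, b, score) for a, b, score in edges if score >= min_score]


# Hypothetical interactions with placeholder protein labels.
toy_edges = [
    ("protA", "protB", 0.95),
    ("protB", "protC", 0.70),
    ("protC", "protD", 0.40),
]
kept_edges = high_confidence_edges(toy_edges)
print(kept_edges)
```

The resulting edge list is the kind of input that a network viewer such as Cytoscape can display, with proteins as nodes and retained interactions as edges.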

Figure 5

Protein-protein interaction networks in (a) cell cycle, (b) DNA repair, (c) ion transport, and (d) ubiquinone biosynthetic process identified by our LC-MS/MS dataset of E. coli. These were analyzed by STRING and visualized by Cytoscape

Figure 6

Protein-protein interaction network in (a) pseudouridine synthesis, (b) peptidoglycan biosynthetic process, (c) RNA processing, and (d) translation identified by our LC-MS/MS dataset of E. coli. These were analyzed by STRING and visualized by Cytoscape

To investigate the common proteins and protein-protein interactions between different biological processes, we combined the proteins of the cell cycle and the peptidoglycan biosynthetic process in STRING and found a dense interaction network between these two processes, with the common proteins mraY, murF, murE, murG, murD, and murC identified by 2, 6, 16, 12, 12, and 15 unique peptides, respectively (Supplementary Figure S8, Supplementary Table S6). The proteins from pseudouridine synthesis, RNA processing, and translation also form a protein interaction network when combined in STRING (Supplementary Figure S9). The proteins common to both pseudouridine synthesis and RNA processing are truA, truD, and truB, identified by 6, 13, and 13 unique peptides, respectively (Supplementary Figure S9, Supplementary Table S6). The rplA protein is common to RNA processing and translation and is identified by 21 unique peptides. Similarly, the transcription and translation processes form a protein-protein interaction network when combined in STRING, and the proteins in these networks are identified by different numbers of unique peptides (Supplementary Figure S10, Supplementary Table S6). This shows that different protein-protein interaction networks of E. coli are covered by our quadruplicate run dataset and identified with different numbers of unique peptides.

Conclusions and Outlook

Here, we show that, without pre-fractionation, the E. coli proteome quantified by iBAQ was covered to a considerable depth in quadruplicate LC-MS/MS run analysis using a single enzyme, trypsin. The quantified proteins represent about 52% of the total proteins encoded in the E. coli genome according to the UniProt database. Our dataset captured a large part of the GO annotations and signaling pathways, along with protein-protein interactions. Most of the proteins of folate biosynthesis were observed in a single run. The protein-protein interaction networks covered in this quadruplicate run analysis are derived from the various active interaction sources in STRING. We observed biological processes that share common proteins, as well as protein-protein interaction networks that connect different biological processes. Our data show high sensitivity and coverage of the quantified E. coli proteome using an LTQ-Orbitrap Velos mass spectrometer and Proxeon-Nano ESI liquid chromatography pumps, with no requirement for expensive UPLC at very high pressure.

This single-run LC-MS/MS analysis involves considerably reduced sample preparation, a few micrograms of E. coli protein, an economical in-house made long chromatography column and column oven, and reduced measurement time in the mass spectrometer compared with extensive fractionation. Confidence in a single LC-MS/MS run is increased by running replicates (duplicate, triplicate, or quadruplicate runs) and identifying the same proteins again. Our single-run method using 12-h liquid chromatography runs achieved 98% coverage of the quadruplicate runs of the E. coli proteome. This method is very useful for quickly understanding the complex cellular system of E. coli. The protein-protein interaction network shown here in an unbiased way is an important step toward interactome mapping, as it captures important properties. It can serve as a platform for many hypothesis-driven experiments through the combined interplay of different signaling pathways, protein interaction networks, and quantified proteins. Moreover, mutations can also be carried out without affecting the functions of networking proteins. In the future, advances in technology may help to identify the whole E. coli proteome in a single run without any requirement for replicates or pre-fractionation experiments.

Accession Information

All described nano-LC-MS/MS data (raw files) may be downloaded from ProteomeXchange (http://proteomecentral.proteomexchange.org). The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier (project accession) PXD010139. Login for reviewers is provided under Username: reviewer22855@ebi.ac.uk and Password: 8fraiZ3Z.