Protein–protein interactions are an important element in the understanding of protein function, and chemical cross-linking shotgun mass spectrometry is rapidly becoming a routine approach to identify these specific interfaces and topographical interactions. Protein cross-link data analysis is aided by dozens of algorithm choices, but hindered by the lack of a common format for representing results. Consequently, interoperability between algorithms and pipelines utilizing chemical cross-linking remains a challenge. pepXML is an open, widely used format for representing spectral search algorithm results that has facilitated information exchange and pipeline development for typical shotgun mass spectrometry analyses. We describe an extension of this format to incorporate cross-linking spectral search results. We demonstrate application of the extension by representing results of multiple cross-linking search algorithms. In addition, we demonstrate adapting existing pepXML-supporting software pipelines to analyze protein cross-linking results formatted in pepXML.
Protein interactions and topologies are increasingly investigated with mass spectrometry (MS). Because of speed, simplicity, and affordability, chemical cross-linking followed by reversed-phase liquid chromatography and shotgun mass spectrometry (LC-MS/MS) has become one of the most common approaches for the analysis of protein interactions and three-dimensional structure [1–3]. In this approach, proteins and protein complexes are linked together using any of a myriad of available cross-linking reagents. The cross-linked proteins are then enzymatically digested to peptides and analyzed by LC-MS/MS, where cross-linked peptide sequences can be identified using specialized database search algorithms [4, 5]. These cross-linked peptide spectrum matches (PSMs) contain two distinct peptide sequences that are used to infer proximal regions of protein structure from linearly distal domains, or domains from different proteins.
One of the biggest challenges when performing protein cross-linking and mass spectrometry (XL-MS) is data analysis. Parallel to the historic advances of shotgun MS, early efforts in cross-linked spectral analysis have given rise to a wealth of database search algorithms. They are too numerous to list fully, but popular programs include xQuest, pLink, Crux, Protein Prospector, Kojak, and StavroX. Each software package has devised its own means of reporting results and, where applicable, visualizing them. Frequently, results and visualization tools for different algorithms are not readily transferable, leaving further algorithm development and downstream analyses to proceed in isolation from the rest of the field.
A significant development in shotgun data analysis was the community desire to present results first in open common formats, such as pepXML, and later in open standardized formats, such as mzIdentML. Through the use of common and standard formats, spectral search results from any algorithm could be easily contrasted, visualized, and plugged into any analytical pipeline that makes use of these common formats. Indeed, similar adoption of open formats (e.g., mzXML and mzML) allowed for analysis of data from multiple instruments and vendors because a common format was provided to represent the data. However, the existing formats for spectral search results do not currently allow for the representation of cross-linked PSMs, limiting the utility of results obtained from any particular cross-linking algorithm.
There are numerous examples of mass spectrometry software tools that have become cornerstones for data analysis in many laboratories. Program suites such as the Trans-Proteomic Pipeline (TPP) and Skyline, for example, have enjoyed widespread adoption in large part due to their free availability, open source architecture, and adherence to open data standards and formats. Supporting these open formats makes the software accessible to a wider audience, promotes sharing and collaboration across laboratories, and drives community development of technology innovations. Thus, a critical step in the advancement of cross-linking analysis algorithms and pipelines is the establishment of a common format for passing information between software and laboratories.
Here we present an extension to the widely used, open data format, pepXML, for the storage and exchange of cross-linked PSMs. The extension supplements the current format, without impacting existing pepXML-based results from standard search algorithms. To illustrate the utility of the pepXML extension, we also present a new spectrum viewing application for cross-linked PSMs which, using the pepXML format, displays cross-linked search results from several different algorithms. Further use of pepXML is explored by demonstrating use of existing tools supporting pepXML to analyze cross-linking results, opening the possibility to incorporate multiple cross-linking algorithms into a common pipeline.
Materials and Methods
pepXML Format Extension
pepXML is an open file format for the storage of PSMs and subsequent peptide-level analysis. The format makes use of the XML framework, which encapsulates elements of a shotgun MS/MS run with corresponding data analysis results. The schema for pepXML is provided at: http://tools.proteomecenter.org/wiki/index.php?title=Formats:pepXML. Within the format, spectrum_query elements contain MS/MS scan events, and encapsulate search_result elements that contain one or more search_hit elements relating to database search algorithm PSMs. The search_hit elements contain required information, such as peptide sequence, and further encapsulate other information, such as static and differential modification masses and search algorithm scores.
To represent cross-linked PSMs in pepXML, multiple updates to the schema were required; these are shown in Figure 1 and listed in detail in Table 1. Because the search_hit element allows for only a single peptide sequence, a new attribute, xlink_type, was added. Valid values for the attribute are “xl” for two cross-linked peptides, “loop” for loop-linked (self-linked) PSMs, or “na” for single peptide PSMs. For backwards compatibility, a search_hit without the xlink_type attribute is treated as having the value “na”. For “loop” and “na” values, all attributes pertaining to the PSM in the search_hit are used as normal. However, for “xl” values, these attributes are ignored in favor of two linked_peptide elements contained inside the search_hit element, as described below. Two exceptions, the calc_neutral_pep_mass and massdiff attributes, describe the full cross-linked PSM mass and its difference in mass from the observed precursor ion, respectively (Supplementary Figure 1a).
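As an illustration of the xlink_type rule, the following Python sketch classifies search_hit elements and falls back to “na” when the attribute is absent. It is a minimal sketch, not code from any of the tools discussed here: the XML fragment is invented, omits the pepXML namespace and most required attributes, and the helper name xlink_type_of is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only: no XML namespace, and only the
# attributes needed to show the xlink_type rule are present.
FRAGMENT = """
<search_result>
  <search_hit hit_rank="1" peptide="PEPTIDEK"/>
  <search_hit hit_rank="1" peptide="LKPEPKR" xlink_type="loop"/>
  <search_hit hit_rank="1" peptide="-" xlink_type="xl"/>
</search_result>
"""

VALID_TYPES = {"na", "loop", "xl"}


def xlink_type_of(search_hit):
    """Return the cross-link type, treating a missing attribute as 'na'."""
    value = search_hit.get("xlink_type", "na")
    if value not in VALID_TYPES:
        raise ValueError(f"unexpected xlink_type: {value!r}")
    return value


hits = ET.fromstring(FRAGMENT).findall("search_hit")
types = [xlink_type_of(hit) for hit in hits]
print(types)
```

Because the default is applied at read time, files written before the extension can be handled by the same code path as files that use it.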
When the xlink_type attribute is set to “loop” or “xl”, a novel xlink element must be provided inside the search_hit. The xlink must contain two attributes: an identifier for the cross-linker, and its mass contribution to the PSM. For self-linked peptides, xlink_score elements are used to describe the linked amino acid sites, and are contained within the xlink element (Supplementary Figure 1b). For cross-linked PSMs, two linked_peptide elements are used to describe both peptides. Many of the attributes of linked_peptide elements are the same as the search_hit element (e.g., peptide sequence, protein name, peptide neutral mass, etc.). Additional attributes include complement_mass, which is the difference in mass between the precursor ion and this peptide, and designation, which is used to label the peptide sequence as “alpha” or “beta”. The linked_peptide element encapsulates one or more xlink_score elements that are used to describe the linked amino acid site, plus any additional algorithm-specific scores that are only applicable to the individual peptides of the cross-linked PSM. Finally, any search algorithm scores applicable to the entire PSM, and not specifically to either cross-linked peptide, are specified in the typical search_score element contained within the search_hit.
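The structure of a cross-linked search_hit can be sketched in Python as follows. Several details are assumptions made for illustration and are not confirmed by the text above: the attribute names identifier and mass on xlink, the name/value attributes on xlink_score, the score name "link", the nesting of linked_peptide inside xlink, and all sequences, masses, and scores. The final check also assumes that the full PSM mass equals the two peptide masses plus the cross-linker mass contribution.

```python
import math
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Invented cross-linked hit; attribute names not given in the text are assumptions.
HIT_XML = """
<search_hit hit_rank="1" xlink_type="xl" calc_neutral_pep_mass="2138.0" massdiff="0.0">
  <xlink identifier="BS3" mass="138.0">
    <linked_peptide peptide="AAKAAR" protein="protA" designation="alpha"
                    calc_neutral_pep_mass="900.0" complement_mass="1238.0">
      <xlink_score name="link" value="3"/>
    </linked_peptide>
    <linked_peptide peptide="GGKGGGR" protein="protB" designation="beta"
                    calc_neutral_pep_mass="1100.0" complement_mass="1038.0">
      <xlink_score name="link" value="3"/>
    </linked_peptide>
  </xlink>
  <search_score name="expect" value="1e-5"/>
</search_hit>
"""


@dataclass
class LinkedPeptide:
    designation: str
    sequence: str
    protein: str
    mass: float
    complement_mass: float
    link_site: int


def parse_cross_linked_hit(hit):
    """Extract linker, per-peptide, and PSM-level information from an 'xl' hit."""
    if hit.get("xlink_type", "na") != "xl":
        raise ValueError("not a cross-linked search_hit")
    xlink = hit.find("xlink")
    peptides = []
    # iter() finds linked_peptide at any depth, so the nesting assumption is not critical.
    for lp in hit.iter("linked_peptide"):
        scores = {s.get("name"): s.get("value") for s in lp.findall("xlink_score")}
        peptides.append(LinkedPeptide(
            designation=lp.get("designation"),
            sequence=lp.get("peptide"),
            protein=lp.get("protein"),
            mass=float(lp.get("calc_neutral_pep_mass")),
            complement_mass=float(lp.get("complement_mass")),
            link_site=int(scores["link"]),
        ))
    # Scores for the whole PSM are direct children of search_hit.
    psm_scores = {s.get("name"): float(s.get("value")) for s in hit.findall("search_score")}
    return xlink.get("identifier"), float(xlink.get("mass")), peptides, psm_scores


hit = ET.fromstring(HIT_XML)
linker_id, linker_mass, peptides, psm_scores = parse_cross_linked_hit(hit)
total_mass = sum(p.mass for p in peptides) + linker_mass
print(linker_id, [p.designation for p in peptides], psm_scores)
print(math.isclose(total_mass, float(hit.get("calc_neutral_pep_mass"))))
```

Keeping peptide-level scores on the linked_peptide and PSM-level scores on the search_hit lets a reader of the file request either level without knowing which algorithm produced it.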
Reference information for the cross-linker that is relevant to the search parameters or downstream analysis is also included in the pepXML extension. The new cross_linker element resides in the existing msms_run_summary element of pepXML. An identifier attribute names the cross-linker and should match the identifier attribute of xlink elements contained within search_hit elements. Additional attributes describe cross-linker mass, site reactivity, and isotopic labeling. The cross_linker element also contains a list of zero or more cross_linker_info elements for additional information specific to any particular cross-linker (e.g., spacer length) to be stored.
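One consequence of this design is that a reader can check whether every xlink in a run refers to a declared cross_linker. The Python sketch below shows such a check under stated assumptions: the fragment is invented, the mass attribute name and the name/value attributes of cross_linker_info are guesses, and the function undeclared_linkers is a hypothetical helper rather than part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Invented run summary; the second hit deliberately uses an undeclared linker.
RUN_XML = """
<msms_run_summary base_name="run01">
  <cross_linker identifier="BS3" mass="138.0">
    <cross_linker_info name="spacer_length" value="11.4"/>
  </cross_linker>
  <spectrum_query spectrum="run01.00001.00001.2">
    <search_result>
      <search_hit peptide="AKAAKR" xlink_type="loop">
        <xlink identifier="BS3" mass="138.0"/>
      </search_hit>
    </search_result>
  </spectrum_query>
  <spectrum_query spectrum="run01.00002.00002.3">
    <search_result>
      <search_hit peptide="-" xlink_type="xl">
        <xlink identifier="DSS" mass="138.0"/>
      </search_hit>
    </search_result>
  </spectrum_query>
</msms_run_summary>
"""


def undeclared_linkers(run):
    """Return xlink identifiers used in hits but missing from cross_linker elements."""
    declared = {cl.get("identifier") for cl in run.findall("cross_linker")}
    used = {x.get("identifier") for x in run.iter("xlink")}
    return used - declared


missing = undeclared_linkers(ET.fromstring(RUN_XML))
print(sorted(missing))
```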
Analysis of MS/MS Spectra
LC-MS/MS data of human TFIIH samples cross-linked with BS3 have been previously described and were obtained from the authors. Briefly, samples were digested with trypsin, fractionated by strong cation exchange (SCX) and analyzed on a Thermo Fisher Scientific Orbitrap Velos with HCD fragmentation. The data were searched using multiple cross-linking database search algorithms: Kojak ver. 1.4.3, pLink ver. 1.23, and Protein Prospector ver. 5.16.0. Search parameters common to all algorithms included the 10 TFIIH protein sequences, static and differential modification masses, search mass tolerances, and enzyme cleavage rules. Detailed listings of all parameters are provided in the Supplementary Information. Kojak results were exported to pepXML using the configuration parameter. Protein Prospector and pLink results were exported in tab-delimited text and converted to pepXML using conversion applications developed in house. Briefly, the conversion tools parse the tab-delimited text to identify precursor ion mass, precursor ion charge, peptide sequences, cross-linked sites, protein inferences, and PSM score metrics. These values are then exported for all PSMs to the proposed pepXML format extension. The conversion applications are freely available at http://www.kojak-ms.org/.
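The general shape of such a conversion can be sketched in Python. This is not the in-house converter described above: the column names, the single example row, the score name, and the helper row_to_spectrum_query are all hypothetical, and real exports from pLink or Protein Prospector use different layouts. The sketch only shows how parsed values might be mapped onto the extended elements.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column layout and values; real search engine exports differ.
TSV = (
    "scan\tcharge\tprecursor_mass\tpeptide_a\tsite_a\tprotein_a\t"
    "peptide_b\tsite_b\tprotein_b\tscore\n"
    "1523\t3\t2138.002\tAAKAAR\t3\tprotA\tGGKGGGR\t3\tprotB\t4.2\n"
)


def row_to_spectrum_query(row, linker_id, linker_mass):
    """Build a spectrum_query holding one cross-linked search_hit from a parsed row."""
    query = ET.Element("spectrum_query", {
        "start_scan": row["scan"],
        "end_scan": row["scan"],
        "assumed_charge": row["charge"],
        "precursor_neutral_mass": row["precursor_mass"],
    })
    result = ET.SubElement(query, "search_result")
    hit = ET.SubElement(result, "search_hit", {"hit_rank": "1", "xlink_type": "xl"})
    xlink = ET.SubElement(hit, "xlink", {"identifier": linker_id, "mass": str(linker_mass)})
    for column, designation in (("a", "alpha"), ("b", "beta")):
        peptide = ET.SubElement(xlink, "linked_peptide", {
            "peptide": row[f"peptide_{column}"],
            "protein": row[f"protein_{column}"],
            "designation": designation,
        })
        # The linked site is stored as an xlink_score (score name is an assumption).
        ET.SubElement(peptide, "xlink_score", {"name": "link", "value": row[f"site_{column}"]})
    # PSM-level score goes on the search_hit itself.
    ET.SubElement(hit, "search_score", {"name": "score", "value": row["score"]})
    return query


rows = csv.DictReader(io.StringIO(TSV), delimiter="\t")
queries = [row_to_spectrum_query(row, "BS3", 138.0) for row in rows]
print(ET.tostring(queries[0], encoding="unicode"))
```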
Visualization of cross-linked PSMs
A graphical spectrum viewing application was designed to highlight matched fragment ions from cross-linked PSMs against acquired MS/MS spectra. The application accepts multiple data file formats (including mzXML and mzML) and PSMs from any search algorithm results formatted in pepXML. The software is written in C++, and is open source and freely available at http://www.kojak-ms.org/. For convenience, pre-compiled binaries are also provided for both Windows and Linux operating systems. Though initially designed to be packaged and used with the Kojak cross-linking algorithm, the viewer is available in stand-alone format for use with results from any search algorithm.
Results and Discussion
The cross-linked data acquired from shotgun analysis of the human TFIIH complex were analyzed using multiple database search algorithms as described in the methods. The set comprises nine MS/MS runs totaling 92,348 spectra. Results from the Kojak algorithm were exported natively to the pepXML format using the configuration parameters. Results from pLink and Protein Prospector were converted from their default tab-delimited text formats to pepXML using the conversion tools described in the methods.
Each search algorithm uses diverse score metrics to describe cross-linked PSMs. The pepXML format extension accounts for these differences, allowing the results from any of the algorithms to be represented in a common format. For example, pLink relies primarily on an expect score that is assigned to cross-linked PSMs, and the peptide-specific information is limited to differential modifications and site of linkage (Supplementary Figure 2). Protein Prospector and Kojak, on the other hand, have additional metrics applied to each cross-linked PSM and their component peptides (Supplementary Figures 3 and 4). Protein Prospector in particular provides both PSM-level algorithm scores and expect values, and additionally, peptide-level algorithm scores and expect values, whose metrics are useful in downstream validation processes. The pepXML schema extensions allow for all of this information to be stored in a common format, which can then be extracted as necessary by any downstream data analysis application.
Additionally, the algorithms pLink and Kojak provide PSMs for loop-linked and non-linked peptides, the latter often including peptides to which a cross-linker has bound but no linkage to another site was possible prior to quenching the reaction. Though not as useful as cross-linked PSMs, these identifications can be helpful when assessing solvent-accessible sites on protein surfaces, or assessing experimental conditions. The pepXML schema extensions allow for these loop-linked and non-linked PSMs to be represented in the same results file, so that the entire data analysis from any search engine can be stored in a single file, and the different types of PSMs can be parsed as necessary using the attributes defined in the schema extension.
In accordance with the pepXML format, results from multiple searches can be combined into a single file, for example, when using the InteractParser from the TPP [12, 16]. The TFIIH data were collected in multiple runs originating from SCX fractions prior to reversed phase LC-MS. pLink and Protein Prospector can combine their search results into a single output file, but Kojak cannot. After conversion of the results to pepXML, InteractParser was used to combine all the results into a single pepXML file for each search algorithm. Similarly, analyses that require multiple-pass searches to identify multiple cross-linker chemistries in the same data can be combined into a single pepXML file because the format allows for multiple cross-linker and search parameter designations. Furthermore, pepXML files from multiple algorithm searches of the same data set can be combined into a single data set file, opening the possibility for multi-algorithm PSM validation analysis within the pepXML schema extension for cross-linked data sets. This technique has been used to improve peptide identification and validation in typical shotgun MS analyses.
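The reason combination works is that each msms_run_summary carries its own cross_linker declarations, so runs can be placed side by side under one root. The Python sketch below illustrates that idea only; it is not InteractParser and does not reproduce its behavior. The two input documents, the linker identifier "linkerB", and the function combine_runs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two invented single-run documents, each declaring its own cross-linker.
FRACTION_1 = """<msms_pipeline_analysis>
  <msms_run_summary base_name="fraction01">
    <cross_linker identifier="BS3" mass="138.0"/>
  </msms_run_summary>
</msms_pipeline_analysis>"""

FRACTION_2 = """<msms_pipeline_analysis>
  <msms_run_summary base_name="fraction02">
    <cross_linker identifier="linkerB" mass="96.0"/>
  </msms_run_summary>
</msms_pipeline_analysis>"""


def combine_runs(documents):
    """Collect the msms_run_summary elements of several documents under one root."""
    combined = ET.Element("msms_pipeline_analysis")
    for text in documents:
        combined.extend(ET.fromstring(text).findall("msms_run_summary"))
    return combined


combined = combine_runs([FRACTION_1, FRACTION_2])
run_names = [run.get("base_name") for run in combined.findall("msms_run_summary")]
linkers = {cl.get("identifier") for cl in combined.iter("cross_linker")}
print(run_names, sorted(linkers))
```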
Visualizing PSMs in pepXML Files
To demonstrate the utility of a common format for the representation of cross-linked PSMs, and to facilitate visual inspection and evaluation of cross-linked PSMs, a spectrum visualization tool was developed. The visualization tool accepts as input a pepXML file. It then parses pepXML content to locate the corresponding data files and displays the spectra in the view window. Overlaid on each spectrum are the fragment ions from the associated PSM search results (Figure 2). PSM information, such as peptide sequence, differential modifications, and fragment ions are provided in a customization pane on the right, and the user can toggle which information to highlight on top of the spectrum. Cross-linked PSMs containing two peptide sequences can be toggled to highlight the fragment ions from either or both peptides. Loop-linked PSMs identify both sites of cross-linker attachment and apply the appropriate mass corrections to the corresponding fragment ions. Non-linked PSMs, if provided in the search results, are also displayed, thus not limiting the spectrum visualization tool solely to cross-linked search results.
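The mass correction applied to fragment ions of a cross-linked peptide can be illustrated with a short Python sketch. This is not the viewer's implementation. It assumes singly charged b and y ions, ignores static and differential modifications, and assumes that any fragment containing the linked residue is shifted by the complement_mass (the cross-linker plus the other peptide). The function fragment_ions and the example peptide are hypothetical.

```python
import math

# Monoisotopic residue masses (Da) for the few residues used in this example.
RESIDUE_MASS = {"A": 71.03711, "G": 57.02146, "K": 128.09496, "R": 156.10111}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(sequence, link_site, complement_mass):
    """Singly charged b and y m/z values for one peptide of a cross-linked pair.

    link_site is the 1-based position of the linked residue. Fragments that
    contain that residue carry the complement mass.
    """
    n = len(sequence)
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    for i in range(1, n):
        # b_i covers residues 1..i
        b = sum(masses[:i]) + PROTON
        if i >= link_site:
            b += complement_mass
        b_ions.append(b)
        # y_i covers residues n-i+1..n
        y = sum(masses[n - i:]) + WATER + PROTON
        if n - i + 1 <= link_site:
            y += complement_mass
        y_ions.append(y)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("AKA", link_site=2, complement_mass=1000.0)
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

In this example b1 and y1 do not contain the linked lysine and keep their usual masses, while b2 and y2 are shifted by the full complement mass.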
Included in the visualization tool is a table summary of all PSMs contained in the pepXML. Each row is a PSM, and the columns list the scoring metrics reported by the algorithm that generated the PSMs. Selecting any row of the table instantly displays the spectrum and fragment ion masses of the corresponding peptide(s) in the PSM. The table can be customized to sort and filter the PSMs on the relevant score metrics for any algorithm. For example, top-ranked PSMs can be obtained by sorting expect values of pLink and Protein Prospector results, or the score value of Kojak results. Filters can be applied to limit the results to proteins of interest, or any particular metrics provided from the search results.
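The sort and filter behavior described above amounts to ranking PSMs on an algorithm-specific metric, with the direction of "better" depending on the metric. A minimal Python sketch, with invented PSM records and a hypothetical rank_psms helper, might look like this. It assumes that lower expect values are better and is not taken from the viewer's source code.

```python
# Invented PSM records; spectrum names, protein pairs, and values are illustrative.
psms = [
    {"spectrum": "run01.01523", "proteins": ("XPB", "XPD"), "expect": 3e-4},
    {"spectrum": "run01.02210", "proteins": ("p62", "p44"), "expect": 1e-7},
    {"spectrum": "run02.00871", "proteins": ("XPD", "MAT1"), "expect": 5e-6},
]


def rank_psms(psms, metric, lower_is_better, protein=None):
    """Optionally keep PSMs involving one protein, then sort best-first on a metric."""
    selected = [p for p in psms if protein is None or protein in p["proteins"]]
    return sorted(selected, key=lambda p: p[metric], reverse=not lower_is_better)


best_first = rank_psms(psms, "expect", lower_is_better=True)
xpd_only = rank_psms(psms, "expect", lower_is_better=True, protein="XPD")
print([p["spectrum"] for p in best_first])
print([p["spectrum"] for p in xpd_only])
```

For a metric where larger values are better, the same function would be called with lower_is_better=False.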
Compatibility with Existing Pipelines and Alternative Formats
Many tools currently exist that make use of the pepXML format for input and output, including PSM visualization, PSM validation, quantitation, and results organization [12, 17, 22–26]. By extending the pepXML format to include cross-linking search results, these tools become potentially extensible to analysis of cross-linked PSMs. Furthermore, it is far easier to update existing tools to include the pepXML extension, rather than recode them to support the many individual formats generated by each algorithm, or write a novel algorithm that replicates the same functionality for each of the cross-linking pipelines in existence. For example, the TPP PepXML Viewer was upgraded to support this new cross-linking extension. The tool can now be used to browse, filter, categorize, and export custom spreadsheets for the cross-linking results from all cross-linking algorithms that make use of pepXML (Figure 3). In particular, the PepXML Viewer was used to create custom reports for the pepXML formatted results from the three algorithms used here. This is just one example of how a common pepXML format enables many different analysis tools to be plugged into cross-linking data analysis pipelines.
The proteomics field has made great strides in establishing open community standards for data representation. The HUPO Proteomics Standards Initiative [27, 28] has defined several open formats that allow data and results from many different instruments and pipelines to be represented in a common language. The current standard for database search results, mzIdentML, does not support cross-linked PSMs at the time of this writing. The pepXML schema extension presented here provides a means to support the increasingly common cross-linked peptide analyses, until such time that a community standard is established. The transition to new standards is also not immediate, as is seen with the persistence of other data formats such as MGF and mzXML, and the tools that require use of them. Thus, support for cross-linked PSMs in pepXML will help ease the burden of the eventual transition to mzIdentML by extending the life of existing tools and pipelines that support pepXML until upgrades and alternatives are developed.
Conclusions
pepXML is an open format in which results from many different search algorithms are represented uniformly, enabling easy integration of diverse results into robust analytical pipelines. The schema for pepXML was extended to represent results from shotgun cross-linking spectral searching algorithms. With these schema extensions, we demonstrate how results from different cross-linking search algorithms can be easily transformed for use by downstream software applications, such as the cross-linking spectral viewer. By extending an open format, it is possible to apply many existing tools and pipelines to cross-linking data analysis, as demonstrated with the minor upgrades to the TPP. The format extension presented here has the potential to increase integration of cross-linking analysis in mass spectrometry analytical workflows.
References
Bruce, J.E.: In vivo protein complex topologies: sights through a cross-linking lens. Proteomics 12, 1565–1575 (2012)
Sinz, A.: The advancement of chemical cross-linking and mass spectrometry for structural proteomics: from single proteins to protein interaction networks. Expert Rev. Proteom. 11, 733–743 (2014)
Walzthoeni, T., Leitner, A., Stengel, F., Aebersold, R.: Mass spectrometry supported determination of protein complex structure. Curr. Opin. Struct. Biol. 23, 252–260 (2013)
Mayne, S.L., Patterton, H.G.: Bioinformatics tools for the structural elucidation of multi-subunit protein complexes by mass spectrometric analysis of protein–protein cross-links. Brief. Bioinform. 12, 660–671 (2011)
Sinz, A., Arlt, C., Chorev, D., Sharon, M.: Chemical cross-linking and native mass spectrometry: A fruitful combination for structural biology. Protein Sci. 24, 1193–1209 (2015)
Rinner, O., Seebacher, J., Walzthoeni, T., Mueller, L.N., Beck, M., Schmidt, A., Mueller, M., Aebersold, R.: Identification of cross-linked peptides from large sequence databases. Nat. Methods 5, 315–318 (2008)
Yang, B., Wu, Y.J., Zhu, M., Fan, S.B., Lin, J., Zhang, K., Li, S., Chi, H., Li, Y.X., Chen, H.F., Luo, S.K., Ding, Y.H., Wang, L.H., Hao, Z., Xiu, L.Y., Chen, S., Ye, K., He, S.M., Dong, M.Q.: Identification of cross-linked peptides from complex samples. Nat. Methods 9, 904–906 (2012)
McIlwain, S., Draghicescu, P., Singh, P., Goodlett, D.R., Noble, W.S.: Detecting cross-linked peptides by searching against a database of cross-linked peptide pairs. J. Proteome Res. 9, 2488–2495 (2010)
Trnka, M.J., Baker, P.R., Robinson, P.J., Burlingame, A.L., Chalkley, R.J.: Matching cross-linked peptide spectra: only as good as the worse identification. Mol. Cell. Proteom. 13, 420–434 (2014)
Hoopmann, M.R., Zelter, A., Johnson, R.S., Riffle, M., MacCoss, M.J., Davis, T.N., Moritz, R.L.: Kojak: efficient analysis of chemically cross-linked protein complexes. J. Proteome Res. 14, 2190–2198 (2015)
Gotze, M., Pettelkau, J., Schaks, S., Bosse, K., Ihling, C.H., Krauth, F., Fritzsche, R., Kuhn, U., Sinz, A.: StavroX—a software for analyzing crosslinked products in protein interaction studies. J. Am. Soc. Mass Spectrom. 23, 76–87 (2012)
Keller, A., Eng, J., Zhang, N., Li, X.J., Aebersold, R.: A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 1, 2005.0017 (2005)
Jones, A.R., Eisenacher, M., Mayer, G., Kohlbacher, O., Siepen, J., Hubbard, S.J., Selley, J.N., Searle, B.C., Shofstahl, J., Seymour, S.L., Julian, R., Binz, P.A., Deutsch, E.W., Hermjakob, H., Reisinger, F., Griss, J., Vizcaino, J.A., Chambers, M., Pizarro, A., Creasy, D.: The mzIdentML data standard for mass spectrometry-based proteomics results. Mol. Cell. Proteom. 11, M111.014381 (2012)
Pedrioli, P.G., Eng, J.K., Hubley, R., Vogelzang, M., Deutsch, E.W., Raught, B., Pratt, B., Nilsson, E., Angeletti, R.H., Apweiler, R., Cheung, K., Costello, C.E., Hermjakob, H., Huang, S., Julian, R.K., Kapp, E., McComb, M.E., Oliver, S.G., Omenn, G., Paton, N.W., Simpson, R., Smith, R., Taylor, C.F., Zhu, W., Aebersold, R.: A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol. 22, 1459–1466 (2004)
Martens, L., Chambers, M., Sturm, M., Kessner, D., Levander, F., Shofstahl, J., Tang, W.H., Rompp, A., Neumann, S., Pizarro, A.D., Montecchi-Palazzi, L., Tasman, N., Coleman, M., Reisinger, F., Souda, P., Hermjakob, H., Binz, P.A., Deutsch, E.W.: mzML—a community standard for mass spectrometry data. Mol. Cell. Proteom. 10, R110 000133 (2011)
Deutsch, E.W., Mendoza, L., Shteynberg, D., Slagel, J., Sun, Z., Moritz, R.L.: Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteom. Clin. Appl. 9, 745–754 (2015)
MacLean, B., Tomazela, D.M., Shulman, N., Chambers, M., Finney, G.L., Frewen, B., Kern, R., Tabb, D.L., Liebler, D.C., MacCoss, M.J.: Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 26, 966–968 (2010)
Luo, J., Cimermancic, P., Viswanath, S., Ebmeier, C.C., Kim, B., Dehecq, M., Raman, V., Greenberg, C.H., Pellarin, R., Sali, A., Taatjes, D.J., Hahn, S., Ranish, J.: Architecture of the human and yeast general transcription and DNA repair factor TFIIH. Mol. Cell 59, 794–806 (2015)
Chalkley, R.J., Baker, P.R., Medzihradszky, K.F., Lynn, A.J., Burlingame, A.L.: In-depth analysis of tandem mass spectrometry data from disparate instrument types. Mol. Cell. Proteom. 7, 2386–2398 (2008)
Leitner, A., Joachimiak, L.A., Unverdorben, P., Walzthoeni, T., Frydman, J., Forster, F., Aebersold, R.: Chemical cross-linking/mass spectrometry targeting acidic residues in proteins and protein complexes. Proc. Natl. Acad. Sci. U. S. A. 111, 9455–9460 (2014)
Shteynberg, D., Deutsch, E.W., Lam, H., Eng, J.K., Sun, Z., Tasman, N., Mendoza, L., Moritz, R.L., Aebersold, R., Nesvizhskii, A.I.: iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteom. 10, M111.007690 (2011)
Kall, L., Canterbury, J.D., Weston, J., Noble, W.S., MacCoss, M.J.: Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007)
Ma, Z.Q., Dasari, S., Chambers, M.C., Litton, M.D., Sobecki, S.M., Zimmerman, L.J., Halvey, P.J., Schilling, B., Drake, P.M., Gibson, B.W., Tabb, D.L.: IDPicker 2.0: improved protein assembly with high discrimination peptide identification filtering. J. Proteome Res. 8, 3872–3881 (2009)
Mohammed, Y., Mostovenko, E., Henneman, A.A., Marissen, R.J., Deelder, A.M., Palmblad, M.: Cloud parallel processing of tandem mass spectrometry based proteomics data. J. Proteome Res. 11, 5101–5108 (2012)
Park, C.Y., Klammer, A.A., Kall, L., MacCoss, M.J., Noble, W.S.: Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 7, 3022–3027 (2008)
Park, S.K., Aslanian, A., McClatchy, D.B., Han, X., Shah, H., Singh, M., Rauniyar, N., Moresco, J.J., Pinto, A.F., Diedrich, J.K., Delahunty, C., Yates III, J.R.: Census 2: isobaric labeling data analysis. Bioinformatics 30, 2208–2209 (2014)
Orchard, S., Hermjakob, H., Apweiler, R.: The proteomics standards initiative. Proteomics 3, 1374–1376 (2003)
Deutsch, E.W., Albar, J.P., Binz, P.A., Eisenacher, M., Jones, A.R., Mayer, G., Omenn, G.S., Orchard, S., Vizcaino, J.A., Hermjakob, H.: Development of data representation standards by the human proteome organization proteomics standards initiative. J. Am. Med. Inform. Assoc. 22, 495–506 (2015)
Acknowledgments
The authors thank Dr. Jie Luo and Dr. Jeff Ranish for access to the TFIIH data analyzed in this publication. This work was funded in part by National Institutes of Health from the National Institute of General Medical Sciences under grant nos. R01GM087221, S10RR027584, and the 2P50 GM076547/Center for Systems Biology.
Cite this article
Hoopmann, M.R., Mendoza, L., Deutsch, E.W. et al. An Open Data Format for Visualization and Analysis of Cross-Linked Mass Spectrometry Results. J. Am. Soc. Mass Spectrom. 27, 1728–1734 (2016). https://doi.org/10.1007/s13361-016-1435-8
Keywords
- Mass spectrometry
- Data format
- Data analysis
- Open source