Introduction

Nonribosomal peptides (NRPs) are most commonly produced by bacteria and fungi using nonribosomal peptide synthetases [1]. NRP building blocks comprise hundreds of monomers, including proteinogenic and non-proteinogenic amino acids, hydroxy acids, residues with N-terminally attached fatty acid chains, N- and C-methylated residues, N-formylated residues, and many others. The popularity of NRPs stems from their diverse biological activities [2]. The structures of NRPs have been reviewed elsewhere and include linear, cyclic, branched, and branch-cyclic NRPs [3]. The structural characterization of ribosomal peptides by mass spectrometry involves targeted analysis with peptide/protein or genome database searches [4, 5], de novo peptide sequencing algorithms [6], and/or sequence-tag methods [7]. When the building block databases are extended, some of these methods can also be applied to NRPs.

Commercial software tools for fast dereplication of natural products were reviewed recently [8]. AntiMarin is a commercial database containing about 50,000 compounds from marine and terrestrial microorganisms and represents a merger of the AntiBase and MarinLit databases. In addition, there are many other useful repositories (Dictionary of Natural Products, SciFinder, Beilstein Commander, KEGG, Metlin, Human Metabolome Database, and Norine, a database of NRPs [9]). Norine is used by NRP-Dereplication [10] as well as by iSNAP [11]. iSNAP is not just a dereplication tool; it also provides correct identification (if the peptide is present in the database) or an excellent similarity search (if an analogous peptide has previously been reported). It searches against a fragment database generated in silico from 1107 compounds that were extracted from Norine, PubChem, Journal of Antibiotics, Journal of Natural Products, and the KEGG peptide databases. Its performance is hence limited to dereplication of already known fragments. NRPquest [12] is a tool for identification of cyclic and branch-cyclic NRPs that performs modification- and mutation-tolerant searches of experimental spectra against a database of theoretical spectra of putative NRPs.

Many tools for de novo sequencing of linear ribosomal peptides have been proposed [13]. For example, PEAKS is one of the most widely used commercial tools [14], and Lutefisk [15] and PepNovo [16] are popular open-source tools. While Lutefisk lacks a graphical user interface (GUI), DeNovoGUI [17] has been provided for PepNovo.

A few methods have been proposed for de novo identification of cyclic NRPs from low-resolution mass spectrometry data. NRP-Tagging [10] is a tool that works with MS3 data. A method based on multistage mass spectrometry has also been proposed [18]. However, the length of a peptide sequence must be predicted before the search. Both de novo tools report lists of building block masses as output instead of NRP sequences. A method based on multiplex de novo sequencing, which provides much better results, has also been proposed [19]. In addition to library matching, there are multiple annotation tools [20–22].

In our previous work, we introduced Cyclone, a hypertext preprocessor (PHP) script for identification and de novo sequencing of cyclic NRPs [23]. It had a simple text interface and required manual transfer of NRP sequence candidates from one PHP script to another. Working with peak lists was not user-friendly, and the search algorithm had multiple limitations. In this paper, we present CycloBranch, a stand-alone, cross-platform, and open-source de novo NRP sequence identification engine. This software has an intuitive GUI, and its applications were extended from cyclic peptides to linear, branched, and branch-cyclic peptides. To the best of our knowledge, CycloBranch represents the first true de novo search engine that works for both ribosomal and nonribosomal peptides of various molecular structures and is independent of any intact peptide or fragment ion database. The application can be downloaded for free from http://ms.biomed.cas.cz/cyclobranch/ and includes an annotated database of nonribosomal building blocks comprising 287 residues with unique elemental compositions. Including the isobaric isomers, the overall number of building blocks is 521. In addition to de novo sequencing, the database search of MS2 spectra against an in-house collection of NRPs and Norine is supported. The identification of compounds in MS spectra can also be performed via database search.

Materials and Methods

Mass Spectra, Software, and Building Block Databases

Product ion mass spectra were collected on a 12 T SolariX FTICR mass spectrometer (Bruker Daltonics, Billerica, MA, USA) equipped with a dual ESI/MALDI source. Standard peptides were infused in picomolar aqueous solutions by a linear syringe pump. Gramicidin C, Substance P, surfactin C, and valinomycin were from Sigma Aldrich, Czech Republic. The remaining standards were from an in-house peptide collection. The testing set of peptides comprised 11 compounds: two linear peptides (gramicidin C and Substance P), five cyclic NRPs (beauverolide I, roseotoxin A, cyclosporin A, surfactin C, and valinomycin), two branched peptides [linearized pseudacyclin A and the synthetic peptide N-acetyl-ESL(KNFI)DQYGNH2, referred to as T-peptide below], and two branch-cyclic NRPs (pseudacyclin A and pyoverdin Pa A).

The standard mass/charge error in all product ion mass spectra was better than 2 ppm. CycloBranch was implemented in C++ and utilizes integral databases with up to 287 nonredundant building blocks. All monomers in the database were annotated by reference numbers publicly available through the chemical databases ChemSpider, PubChem, Protein Data Bank (PDB), and Norine. Three databases of building blocks were used for software evaluation: a database D 19 of proteinogenic amino acids (19 building blocks where leucine and isoleucine were not distinguished), a database D 33 (all building blocks involved in the 11 testing peptides), and D 287 (the complete nonredundant nonribosomal database) [23]. Data processing was performed on a desktop personal computer with an Intel Core i7-3770 processor (3.4 GHz), 8 GB RAM, and Windows 7 (64-bit). CycloBranch supports several data file formats: mzML and mzXML (requiring OpenMS 1.11 to be installed) [24], mgf (Mascot generic format), txt [containing a tab-separated mass-to-charge (m/z) ratio and intensity on each line], and baf (a native file format of Bruker Daltonics requiring prior installation of CompassXPort 3.0). A list of detected NRP sequence candidates can be exported as a comma-separated values (csv) file or as a web page. A configuration of the engine can be stored in a file (*.ini). Sample files are distributed with the engine. CycloBranch can run in parallel on multiple threads.
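As an illustration of the txt peak-list format and of a mass error expressed in ppm, the following Python sketch parses tab-separated m/z and intensity pairs and computes a relative mass error. The function names and the 2 ppm default tolerance are assumptions made for this example; they are not taken from CycloBranch, which is implemented in C++.

```python
from typing import List, Tuple


def parse_peak_list(text: str) -> List[Tuple[float, float]]:
    """Parse lines of 'm/z<TAB>intensity' into (mz, intensity) pairs."""
    peaks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        mz_field, intensity_field = line.split("\t")[:2]
        peaks.append((float(mz_field), float(intensity_field)))
    return peaks


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Relative error of an observed m/z in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float,
                     tol_ppm: float = 2.0) -> bool:
    """True if the observed m/z lies within tol_ppm of the theoretical one.

    The 2 ppm default is an assumption for this sketch.
    """
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm
```

For instance, an observed value of 500.0005 against a theoretical value of 500.0 corresponds to an error of about 1 ppm and would be accepted at a 2 ppm tolerance.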

Software Algorithm

The algorithm for the de novo identification of NRPs covers the four main steps: (1) The construction of a de novo graph from an experimental spectrum using the database of building blocks; (2) the detection of a set of peptide sequence candidates by traversing the graph; (3) the comparison of theoretical spectra of candidates with the experimental spectrum using a selected scoring function; and (4) the reporting of NRP sequence candidates.

The graph construction is similar to a common de novo approach, but a database with hundreds of monomers is used instead of the set of proteinogenic amino acids. The experimental m/z values are represented by nodes in the graph, and two additional nodes are added. The first one corresponds to the m/z value of a proton as the starting point of a b-series, or to the m/z value of H3O+ in the case of a y-series. The last node corresponds to the m/z value of the precursor ion (Figure 1).
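A graph of this kind can be sketched in Python as follows. This is a simplified illustration, not the CycloBranch implementation: it handles only a b-series (the proton node as the start), connects two nodes only when their m/z difference matches a single building block (combinations of blocks are omitted), and uses a hypothetical tolerance rule scaled by the larger m/z. All function names are invented for the example.

```python
from typing import Dict, List, Tuple

PROTON = 1.007276  # m/z of a proton, used as the start node of a b-series


def build_graph(peaks: List[float], blocks: Dict[str, float],
                precursor_mz: float, tol_ppm: float = 2.0
                ) -> Tuple[List[float], Dict[int, List[Tuple[int, str]]]]:
    """Return sorted nodes and edges: node index -> [(next index, block name)]."""
    # Nodes: start node, experimental peaks, and the precursor ion as end node.
    nodes = sorted(set([PROTON] + list(peaks) + [precursor_mz]))
    edges: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(len(nodes))}
    for i, low in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j] - low
            tol = nodes[j] * tol_ppm * 1e-6  # assumed absolute tolerance
            for name, mass in blocks.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, name))
    return nodes, edges


def enumerate_paths(nodes: List[float],
                    edges: Dict[int, List[Tuple[int, str]]]) -> List[List[str]]:
    """Depth-first traversal from the start node to the precursor node."""
    target = len(nodes) - 1
    results: List[List[str]] = []

    def walk(node: int, path: List[str]) -> None:
        if node == target:
            results.append(list(path))
            return
        for nxt, name in edges[node]:
            path.append(name)
            walk(nxt, path)
            path.pop()

    walk(0, [])
    return results
```

With the residue masses Gly 57.02146, Ala 71.03711, and Val 99.06841, the peaks 58.028736 and 129.065846, and a precursor at 228.134256 (a toy cyclic tripeptide), the traversal yields the single path Gly, Ala, Val. A missing peak, such as the b1 ion in Figure 1, would break this simplified version because edges for combinations of building blocks are not generated.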

Figure 1

A de novo graph created from the experimental spectrum of beauverolide I (C8:0-Me(4)-OH(3) stands for 3-hydroxy-4-methyloctanoic acid). The N-terminal b i ion series is shown; the b 1 ion is missing in the spectrum

The IUPAC mass of a proton is also used as the starting point for cyclic NRPs; both starting points are applicable for linear, branched, and branch-cyclic NRPs. An edge is inserted if the difference between any two m/z values matches the mass of any building block or of a combination of building blocks. A sequence of edges corresponds to a linear or cyclic peptide candidate. Edges that occur because of branching can cause ambiguities in the case of branched and branch-cyclic NRPs. Thus, multiple NRP sequence candidates are generated from a sequence of edges (for details see the CycloBranch manual). In the next step, theoretical mass spectra generated for NRP sequence candidates are compared with the experimental spectrum using the scoring functions S 1 and S 2 , which reflect the number of matched peaks and the sum of relative intensities, respectively. Finally, candidates with the best scores are reported.
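The two scoring functions can be illustrated with a short Python sketch. It assumes that S 1 counts experimental peaks matched by any theoretical m/z value and that S 2 sums the relative intensities of those matched peaks, expressed as a percentage of the base peak. The matching rule, the tolerance, and the function name are assumptions for this example rather than the exact definitions used by CycloBranch.

```python
from typing import List, Tuple


def score_candidate(theoretical_mzs: List[float],
                    experimental_peaks: List[Tuple[float, float]],
                    tol_ppm: float = 2.0) -> Tuple[int, float]:
    """Return (S1, S2) for one candidate.

    S1: number of experimental peaks matched by a theoretical m/z.
    S2: sum of relative intensities (percent of base peak) of matched peaks.
    """
    base_peak = max(intensity for _, intensity in experimental_peaks)
    matched_count = 0
    intensity_sum = 0.0
    for mz, intensity in experimental_peaks:
        # A peak is matched if any theoretical value lies within the tolerance.
        if any(abs(mz - t) <= t * tol_ppm * 1e-6 for t in theoretical_mzs):
            matched_count += 1
            intensity_sum += intensity / base_peak * 100.0
    return matched_count, intensity_sum
```

Ranking by the first value favors candidates that explain many peaks, whereas ranking by the second favors candidates that explain the most intense peaks; the two orderings can therefore differ.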

Peptide Fragmentation

The generation of theoretical mass spectra depends on the type of NRP. With linear NRPs, the standard N-terminal b i and a i ions as well as C-terminal y i ions are generated identically to linear ribosomal peptides [25]. In cyclic peptides, up to k possible primary ring opening sites exist for a cyclic NRP with k building blocks. The initial ring cleavage may create up to k different linear ions i-j b k (i-j stands for the positions of the building blocks between which the ring is primarily opened) [23, 26]. For example, the ions 4-3 b 4 , 3-2 b 4 , 2-1 b 4 , and 1-4 b 4 exist when k = 4. Any ion i-j b k may undergo further fragmentation, so the spectrum of a cyclic peptide usually represents the superposition of up to k spectra corresponding to different linear sequences [10]. As the peptide is cyclic and lacks a C-terminus, b-ions (or other N-terminal ions) are exclusively observed [20]. Sometimes, a b-ion may cyclize by a head-to-tail mechanism; its reopening between two other monomers and further fragmentation provides b-ions corresponding to a scrambled sequence of building blocks [27]. Thus, k series of b-ions must be generated in a theoretical spectrum of a cyclic NRP, and ions with scrambled sequences should be considered.
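The superposition of k b-ion series for a cyclic peptide can be sketched in Python as follows. The example is a minimal illustration under stated assumptions: only singly charged b-ions are generated, scrambled sequences, a-ions, and neutral losses are ignored, and the function names are hypothetical.

```python
from typing import Dict, List

PROTON = 1.007276  # m/z of a proton


def ring_openings(residues: List[str]) -> List[List[str]]:
    """Return the k linear sequences obtained by opening the ring at each site."""
    k = len(residues)
    return [residues[i:] + residues[:i] for i in range(k)]


def b_ions(sequence: List[str], masses: Dict[str, float]) -> List[float]:
    """m/z values of singly charged b-ions of a linear sequence."""
    ions = []
    total = PROTON
    for name in sequence:
        total += masses[name]
        ions.append(total)
    return ions


def cyclic_theoretical_spectrum(residues: List[str],
                                masses: Dict[str, float]) -> List[float]:
    """Union of the b-ion series over all k ring openings, sorted by m/z."""
    spectrum = set()
    for linear in ring_openings(residues):
        for mz in b_ions(linear, masses):
            spectrum.add(round(mz, 6))  # rounding merges identical ions
    return sorted(spectrum)
```

For a toy ring of Gly, Ala, and Val, the three ring openings give three distinct b1 ions, three distinct b2 ions, and one shared b3 ion, so the merged theoretical spectrum contains seven m/z values.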

By definition, a branched peptide is represented by a core and a single lateral branch. The core can be longer and always has one N-terminus and one C-terminus. The branch can potentially be shorter and is either N- or C-terminated. Thus, two series of b-ions (or other N-terminal ions) and two series of y-ions (or other C-terminal ions) may be present (Figure 2a, b). These four series of fragment ions can be generated in a theoretical spectrum of a branched NRP when a terminal modification of the branch is detected. Since modifications are commonly defined for certain termini, one can determine whether the branch is N- or C-terminated. All six nonredundant series (Figure 2a, b) must be generated when the branch is not modified.

Figure 2

(a), (b) Ion series originating from the protonated molecule of a branched peptide; theoretical fragment ion series generated because of a ring opening of a branch-cyclic NRP – (c) a branched fragment, (d) a linear fragment

The branch-cyclic NRPs contain a ring with a lateral branch (Figure 2c, d). Their theoretical spectra are generated similarly to cyclic peptides but their ring cleavages provide branched fragment ions instead of linear ones. A theoretical spectrum of a branch-cyclic peptide is a superposition of up to k'-2 theoretical spectra (where k' stands for the number of building blocks forming the ring) corresponding to k'-2 different branched sequences and two spectra corresponding to linear sequences (peptide cleavage from both sides of the branch may occur). Each branched fragment gives six theoretical series of b-ions (N-terminated branch) or four series of b-ions and two series of y-ions (C-terminated branch). In analogy, each linear fragment theoretically provides two series of b-ions or one series of b-ions and one series of y-ions, respectively.
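The counting described above can be made explicit with a small Python helper. It only restates the arithmetic of this paragraph (upper bounds on the number of theoretical ion series for a ring of k' building blocks); the function name and the "N"/"C" argument are hypothetical.

```python
from typing import Dict


def max_series_counts(ring_size: int, branch_terminus: str) -> Dict[str, int]:
    """Upper bounds on b- and y-ion series for a branch-cyclic peptide.

    ring_size is k', the number of building blocks forming the ring.
    branch_terminus is "N" or "C" for an N- or C-terminated branch.
    """
    if ring_size < 3:
        raise ValueError("a ring needs at least three building blocks")
    branched = ring_size - 2  # ring openings that give branched fragments
    linear = 2                # ring openings next to the branch
    if branch_terminus == "N":
        # six b-series per branched fragment, two b-series per linear fragment
        return {"b": 6 * branched + 2 * linear, "y": 0}
    if branch_terminus == "C":
        # four b-series and two y-series per branched fragment,
        # one b-series and one y-series per linear fragment
        return {"b": 4 * branched + linear, "y": 2 * branched + linear}
    raise ValueError("branch_terminus must be 'N' or 'C'")
```

For a ring of five building blocks, this gives at most 22 b-ion series for an N-terminated branch, or at most 14 b-ion series and 8 y-ion series for a C-terminated branch.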

Results and Discussion

CycloBranch (v. 1.0.1216) has been tested on a set of 11 linear, cyclic, branched, and branch-cyclic peptides. With the linear pentadecapeptide gramicidin C, the search for a sequence tag took less than 1 s and the correct LAVVVWL tag was displayed at the positions 1–2 for D 33 database. The same result was returned using the D 287 unrestricted database with the same S 1 scoring criterion (Table 1). For the linear undecapeptide Substance P, the correct tag QFFGLMNH2 was reported as the top hit in less than 1 s using D 33 . In addition, it was reported at the third position in 9 s using D 287 and S 2 . Note that Substance P is ribosomally encoded; thus, the spectrum can also be searched against a restricted database of proteinogenic amino acids D 19 .

Table 1 NRPs Identified or Sequence Tag Determination by CycloBranch

Beauverolide I, a tetradepsipeptide cyclo([C8:0-Me(4)-OH(3)]-[Phe]-[Ala]-[Leu]), was identified by CycloBranch in less than 1 s using both D 33 and D 287 . The sequence was reported at the positions 1–2 of the resulting list of NRP sequence candidates. Beauverolide can also be easily identified by Cyclone using D 33 , provided that the number of monomers (four) in the peptide is predicted in advance [23].

Roseotoxin A is a hexadepsipeptide cyclo([C5:0-Me(4)-OH(2)]-[Me-Pro]-[Ile]-[NMe-Val]-[NMe-Ala]-[bAla]) where C5:0-Me(4)-OH(2) stands for 2-hydroxy-4-methylpentanoic acid, Me-Pro for 3-methylproline, NMe-Val for N-methyl-valine, NMe-Ala for N-methyl-alanine, and bAla for beta alanine. The correct assignment was returned as the top hit in less than 1 s by CycloBranch using D 33 . It appeared as the second hit using D 287 and S 2 in 2 s. Since a small monomer ethanolamine (43 Da) is also present in D 287 , CycloBranch reported another candidate containing this block as the top hit. If a user predicted the number of monomers, Cyclone returned 216 NRP sequence candidates using D 33 . With that software interface, the user had to retype manually all 216 sequences to another PHP script in order to obtain results consistent with what we can obtain using CycloBranch.

Cyclosporin A is a cyclic undecapeptide cyclo([MeBmt]-[Abu]-[Sar]-[Me-Leu]-[Val]-[Me-Leu]-[Ala]-[Ala]-[Me-Leu]-[Me-Leu]-[NMe-Val]) where MeBmt stands for N-methyl-butenylthreonine, Abu for 2-aminobutanoic acid, Sar for sarcosine, and Me-Leu for N-methylleucine. Sequence tag [Val]-[Me-Leu]-[Ala]-[Ala]-[Me-Leu]-[Me-Leu]-[NMe-Val] was identified as the second hit in 6 s using D 33 . It was identified at the fifth position in 25 s using D 287 . S 2 was applied in both cases.

We also tested a two-phase identification of cyclosporin A on D 33 . In the first phase, the minimum threshold of relative intensity was set up to 5% and the identification of sequence tags was enabled (number of b-ions and dehydrated b-ions was used as a scoring function). The tag [Me-Leu]-[Ala]-[Ala]-[Me-Leu]-[Me-Leu]-[NMe-Val] was reported as a top hit. In the second phase, the relative intensity threshold was lowered, the identification of tags was disabled, and only NRP sequence candidates having the previously detected tag were processed. The correct sequence of cyclosporin was reported as a top hit (see Tutorials 2 and 3 in Supplemental material).
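The second phase of this procedure, in which only candidates containing a previously detected tag are processed, can be sketched in Python. The sketch assumes that a cyclic candidate contains a tag if the tag occurs as a contiguous run in any rotation of the candidate read in one direction; the function names are hypothetical, and the matching rule used by CycloBranch may differ.

```python
from typing import List


def contains_tag_cyclic(candidate: List[str], tag: List[str]) -> bool:
    """True if tag is a contiguous run in some rotation of a cyclic candidate."""
    k = len(candidate)
    if not tag or len(tag) > k:
        return False
    doubled = candidate + candidate  # covers runs that wrap around the ring
    return any(doubled[start:start + len(tag)] == tag for start in range(k))


def filter_by_tag(candidates: List[List[str]],
                  tag: List[str]) -> List[List[str]]:
    """Keep only the cyclic candidates that contain the given sequence tag."""
    return [c for c in candidates if contains_tag_cyclic(c, tag)]
```

Applied to the cyclosporin A sequence given above, the tag [Me-Leu]-[Ala]-[Ala]-[Me-Leu]-[Me-Leu]-[NMe-Val] is found starting at the sixth building block, so the correct candidate passes the filter, while candidates lacking this run are discarded before scoring.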

Surfactin C is an octadepsipeptide cyclo([C14:0-Me(13)-OH(3)]-[Glu]-[Leu]-[Leu]-[Val]-[Asp]-[Leu]-[Leu]) where C14:0-Me(13)-OH(3) stands for 3-hydroxy-13-methyltetradecanoic acid. It was reported at the positions 2–3 in less than 1 s for D 33 (Figures 3 and 4) or at the same positions in 1 s for D 287 and S 1 .

Figure 3

Initial graphic CycloBranch output indicating a list of NRP sequence candidates. The experimental spectrum of surfactin C was searched against a D 33 database

Figure 4

Examining the hits retrieved by CycloBranch. The theoretical surfactin C spectrum was compared with its experimental one

Valinomycin is a dodecadepsipeptide cyclo([Val]-[Lac]-[Val]-[Hiv]-[Val]-[Lac]-[Val]-[Hiv]-[Val]-[Lac]-[Val]-[Hiv]) where Lac stands for lactic acid and Hiv for 2-hydroxyisovaleric acid. A tag [Val]-[Hiv]-[Val]-[Lac]-[Val]-[Hiv] was reported among sequence tag candidates at the positions 1–2 in less than 1 s using D 33 and S 2 . Cyclosporin, surfactin, and valinomycin could not be identified on D 33 or on D 287 by Cyclone as they contain more than six monomers.

Linearized pseudacyclin A is a branched peptide with two N-termini and one C-terminus (Figure S-1a in the Supplemental Section). When the acquired spectrum was analyzed by CycloBranch, the sequence of this compound was reported among candidates at the positions 1–4 or 1–24 in result sets in less than 1 s using D 33 or in 40 s using D 287 , respectively (S 1 only).

T-peptide is also a branched peptide with two N-termini and one C-terminus (Figure S-1b). Since the peptide is synthetic and composed exclusively of proteinogenic amino acids, the spectrum was searched against D 19 . The correct peptide sequence was reported among sequence candidates at the positions 1–144 in 2 min and 4 s. The high number of false positive candidates was caused by all series being incomplete, especially by the missing ions b 1 , b 2 , and b 9 in both N-terminal series, and by the missing ions y 1 and y 9 in both C-terminal series (Figure 5).

Figure 5

The theoretical T-peptide spectrum was compared with its experimental one

We also performed a two-phase identification of T-peptide on D 19 . In the first phase, the peptide was treated as linear and the detection of sequence tags was enabled. Two tags, [Asp]-[Gln] and [Ser], were reported. In the second phase, the peptide was treated as branched, the detection of sequence tags was disabled, and only NRP sequence candidates containing both tags were processed. CycloBranch reported 24 NRP sequence candidates sharing the same backbone N-acetyl-EXX(KXXX)DQYGNH2 in 31 s (see Tutorial 4 in Supplemental material).

Pseudacyclin A, a hexapeptide cyclo([Phe]-[Pro]-[Ile]-[Ile]([Orn]-[N-Ac-Ile])), contains five building blocks in a ring and one building block [N-Ac-Ile], which forms a side chain (Orn stands for ornithine, N-Ac-Ile for N-acetylisoleucine). It has not been identified by CycloBranch as a branch-cyclic peptide due to predominant elimination of [N-Ac-Ile] during the mass spectrometric analysis. Since the branch is short, pseudacyclin data were treated as for a cyclic peptide. CycloBranch reported its sequence as the top hit using both databases D 33 and D 287 in less than 1 s (S 1 ). Similarly to roseotoxin, pseudacyclin can be identified on D 33 by Cyclone but the number of monomers has to be predicted. Cyclone returns 1240 NRP sequence candidates. The user has to retype manually all 1240 candidates to another script to obtain the same results that are obtained with CycloBranch in a single and automated approach.

Pyoverdin Pa A is a decapeptide [Suc]-[ChrPaA]-[Ser]-[Arg]-[Ser]-[Fo-OH-Orn]-cyclo([Lys]-[Fo-OH-Orn]-[Thr]-[Thr]) where Suc, ChrPaA, and Fo-OH-Orn stand for succinic acid, pyoverdin Pa A chromophore, and N6-formyl-hydroxyornithine, respectively (Figure S-2). The compound was not identified by CycloBranch as a branch-cyclic peptide, as no b-ions arising from the ring opening were observed. Since the branch was long, it was treated as a linear peptide having the C-terminus cyclized (i.e., 18 Da have to be subtracted from C-terminal fragment ions masses). The correct NRP sequence was reported among candidates at the positions 1–12 in 150 s using D 33 and S 2 . A tag [Arg]-[Ser]-[Fo-OH-Orn] was reported as a top hit using D 33 or at the sixth position using D 287 and S 2 in 1 s.

Conclusion

The stand-alone and open-source de novo peptide identification engine CycloBranch was effectively utilized for identification or sequence tag determination of NRPs from accurate product ion mass spectra. It represents the first true de novo engine working for nonribosomal peptides. It supports sequencing of linear, branched, and branch-cyclic NRPs as well as cyclic peptides. NRP sequence tags were provided in the output even when a building block was not present in a database or the spectrum contained incomplete fragment ion series. The parts of an unknown structure that are not covered by database building blocks or their combinations are returned as exact monoisotopic masses in the suggested sequence tags. CycloBranch has a user-friendly graphical interface and can run in parallel on multiple threads.

The remaining challenge is the automated detection of the type of peptide to be identified. Although a method for distinguishing linear and cyclic peptide spectra has been proposed for ribosomal peptides [5], distinguishing the various types of NRPs is still a nontrivial task. It may be advantageous to run the engine repeatedly for different types of NRPs. For example, if y-ions are not observed and multiple overlapping b-ion series and scrambled fragment ions occur, the peptide likely contains a cycle. Otherwise, it may correspond to a linear or a branched NRP. The tool is designed to be extended for the identification of other types of NRPs (e.g., bicyclic or multiply branched) and structures containing other building blocks (e.g., saccharides, nucleotides). It is worth noting that a de novo engine for top-down sequencing of proteins is also desperately needed by the proteomic community.