Introduction

Nonribosomal peptides (NRPs) are natural products commonly produced by bacteria and fungi [1]. NRPs have a wide range of use in human and veterinary medicine, agriculture, the food industry, environmental protection, and other fields. They serve as antibiotics, immunosuppressants, cytostatics, toxins, surfactants, siderophores, etc. The siderophores [2, 3] are iron carriers and promising markers of infectious diseases caused, e.g., by Aspergillus fumigatus or Pseudomonas aeruginosa [4, 5]. While ribosomal peptides are described by linear sequences of amino acid residues (i.e., monomers or building blocks), NRPs frequently form cyclic and branch-cyclic structures. In contrast to ribosomal peptides, the sequences of NRPs consist of various building blocks, including proteinogenic and non-proteinogenic amino acids, N- and C-methylated residues, N-formylated residues, hydroxy acids, residues with N-terminally attached fatty acid chains, chromophores, and many others [6]. Many siderophores are not peptides but polyketides (PKs) [3, 7].

Natural products are commonly analyzed using mass spectrometry [8]. Molecules in a sample are ionized and separated by mass-to-charge (m/z) ratios. Mass spectra contain m/z ratios and intensities of collected ions or their fragments in the case of tandem mass spectrometry. The fragmentation principles of peptides have been described many times in the literature [9]. If collision-induced dissociation is used as a fragmentation technique, the spectra of linear peptides commonly contain series of y, b, and a-ions. Because a ring of a cyclic peptide can theoretically be opened between any two consecutive building blocks prior to fragmentation, we observe multiple overlapping series of b-ions in the spectra of cyclic peptides [6].
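
The overlapping b-ion series of a cyclic peptide can be illustrated with a minimal Python sketch. It assumes that a singly charged b-ion has an m/z equal to the sum of the residue masses plus the mass of a proton. The residue masses are standard monoisotopic values. The function names `b_ions` and `cyclic_b_ion_series` are hypothetical and are not part of any tool mentioned here.

```python
# Standard monoisotopic residue masses (Da) of a few amino acids.
RESIDUE_MASS = {
    "Gly": 57.021464,
    "Ala": 71.037114,
    "Pro": 97.052764,
    "Val": 99.068414,
    "Ile": 113.084064,
    "Phe": 147.068414,
}
PROTON = 1.007276  # mass of H+ in Da


def b_ions(blocks):
    """Singly charged b-ion m/z values b1..b(n-1) for a linear block order."""
    mz, total = [], PROTON
    for block in blocks[:-1]:  # the full-length sum is left to the precursor
        total += RESIDUE_MASS[block]
        mz.append(total)
    return mz


def cyclic_b_ion_series(blocks):
    """One b-ion series per ring opening, keyed by the index of the first block."""
    return {i: b_ions(blocks[i:] + blocks[:i]) for i in range(len(blocks))}
```

For a ring of n blocks the sketch yields n series, one for each position at which the ring may open, which is why the spectra of cyclic peptides are harder to read than those of linear ones.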

In this work, we focus on high-resolution tandem mass spectra of NRPs and siderophores with NRP/PK structures [6, 7]. An automated interpretation of NRP mass spectra still remains a challenging task due to the complex structures of the molecules. Moreover, PK siderophores contain uncommon building blocks that turn over the orientation of peptide bonds in molecules (e.g., desferrioxamines or ornibactins) and further complicate mass spectra interpretation [7]. For example, desferrioxamine B is composed of two types of building blocks Hpd (N-hydroxy-1,5-pentanediamine) and Suc (succinic semialdehyde) which alternate regularly (Fig. 1). In consequence, the molecule has two N-termini, and uncommon series of ions arise during mass spectra analysis. The observed peaks correspond to b-ions and ions whose masses are increased by the mass of two hydrogens [7].

Fig. 1

The structure, building blocks, and fragmentation of linear polyketide siderophore desferrioxamine B. The nominal masses of fragment ions are reported. Because the molecule has two N-termini, we use the letters L (left) and R (right) to distinguish between the two series of b-ions

The spectra of natural products are commonly annotated using database search. Recent advances in microbial chemistry give rise to various databases of natural products. For example, the Natural Products Atlas [10] collects microbial natural products published in peer-reviewed scientific literature. At the time of writing this paper, it contained over 29 thousand compounds. COCONUT [11] is another database collecting data from more than 50 resources and has almost 407 thousand unique natural products. Also, the sizes of other chemical databases grow rapidly. In May 2021, the PubChem database contained about 110 million [12], ChemSpider over 100 million [13], and ChEBI over 59 thousand compounds [14].

The process of identifying already known compounds is referred to as dereplication. Several web-based tools have been described for the dereplication of NRPs from tandem mass spectra. An early tool was iSNAP, which included a built-in database containing about 1100 NRP structures compiled from Antibase and the Dictionary of Natural Products [15, 16]. Users could upload a text file containing custom structures in SMILES format [17, 18]. They could define sites where a theoretical structure was fragmented, e.g., peptide and ester bonds. Input experimental spectra were then compared with theoretical spectra. The Dereplicator [19] is a recent tool accessible on the GNPS platform [20] that allows the annotation of known peptidic natural products using an in silico fragmentation tree. Two variants of Dereplicator have also been presented: Dereplicator+ [21], which enables the annotation of non-peptidic natural products, and Dereplicator VarQuest [22], which allows modification-tolerant database search of mass spectra.

NRPro [23] is another recently released tool that compares the input experimental spectra with theoretical spectra generated from the sequences of building blocks, which are available in the Norine database [24]. This database of NRPs and their building blocks was released in 2008. Nowadays, it contains 544 blocks and 1740 structures of NRPs composed of these monomers [25]. Users can contribute missing NRPs and monomers to the database using the interface called MyNorine [26]. The monomeric structures of NRPs can be created using the Smiles2Monomers [27] or rBAN [28] tools. However, users cannot create custom databases of building blocks and NRP sequences. Moreover, it is not always easy to export flawless sequences of building blocks for tandem mass spectra analysis.

An alternative approach for NRP mass spectra interpretation is de novo sequencing [6]. In 2015, we released CycloBranch – an open-source, cross-platform, and stand-alone tool implemented in C++ and dedicated to de novo analysis and dereplication of linear, cyclic, branched, and branch-cyclic NRPs from tandem mass spectra [6, 29, 30]. The de novo sequencing was based on an input database of NRP building blocks. A graph was created from an experimental spectrum where vertices corresponded to m/z values. An edge was inserted into the graph if a difference between any two m/z values fitted a mass of a block or combination of more blocks (pairs, triplets, etc.). The graph was then browsed starting from an artificial vertex corresponding to the m/z value of H+ or H3O+ ion if b-ion or y-ion series were searched, respectively. The final vertex corresponded to the m/z value of the precursor ion. NRP sequence candidates were generated from the paths assigned to b-ion or y-ion series. Their theoretical spectra were then compared with the experimental spectrum and sorted using a scoring function. The dereplication was also supported but required an input database of sequences composed of NRP building blocks. In 2017, CycloBranch was extended to support the de novo sequencing and dereplication of NRP/PK siderophores from tandem mass spectra [7]. We also added the ability for dereplication and de novo analysis of natural products in conventional, liquid-chromatography, and imaging mass spectra using fine structures of isotopic peaks [7, 31]. Recently, CycloNovo was also released for de novo sequencing of cyclic peptides [32]. The tool is based on the construction of de Bruijn graphs from experimental spectra and the prediction of k-mers (short sequences composed of k amino acids). Although the algorithm was designed for cyclic peptides composed of proteinogenic amino acids, it was also successfully applied to NRPs. 
While CycloBranch is focused on single spectra analysis, CycloNovo was designed for high-throughput analysis of mass spectra and is available on the GNPS platform [20]. A comprehensive overview of other related approaches has been reported by Mohimani and Pevzner [33].
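
The graph construction described above can be sketched in a few lines of Python. This is a simplified illustration and not the CycloBranch implementation: it considers single blocks only (no pairs or triplets), uses an absolute tolerance in Da, and the names `build_spectrum_graph` and `enumerate_paths` are hypothetical.

```python
from itertools import combinations

PROTON = 1.007276  # m/z of the artificial start vertex for a b-ion series


def build_spectrum_graph(peaks_mz, block_masses, tol=0.005):
    """Return sorted peaks and edges (i, j, block) for fitting m/z differences."""
    peaks = sorted(peaks_mz)
    edges = []
    for i, j in combinations(range(len(peaks)), 2):
        diff = peaks[j] - peaks[i]
        for name, mass in block_masses.items():
            if abs(diff - mass) <= tol:  # assumed absolute tolerance in Da
                edges.append((i, j, name))
    return peaks, edges


def enumerate_paths(edges, start, end):
    """List the block sequences along all paths from vertex start to vertex end."""
    outgoing = {}
    for i, j, name in edges:
        outgoing.setdefault(i, []).append((j, name))
    paths = []

    def walk(vertex, blocks):
        if vertex == end:
            paths.append(blocks)
            return
        for nxt, name in outgoing.get(vertex, []):
            walk(nxt, blocks + [name])

    walk(start, [])
    return paths
```

Because every edge points from a lower to a higher m/z value, the graph is acyclic and the recursive walk always terminates. In use, the peak list would contain the artificial start vertex and the precursor m/z in addition to the measured fragment peaks.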

Here, we present MassSpecBlocks [34]—an open-source and web-based tool for easy construction of NRP/PK sequence (Fig. 2) and building block (Additional file 1: Fig. S1) databases, which can be used in CycloBranch software to compare the experimental tandem mass spectra of NRPs/PKs with theoretical spectra [6, 7]. Although the databases of building blocks and sequences can be created directly in CycloBranch, the process requires a lot of manual labor. The structures of compounds must be searched in a database by a web browser. Residue formulas of building blocks must be calculated manually from structural formulas. Acronyms of building blocks, their residue formulas, and sequences of building blocks must be typed manually into CycloBranch. MassSpecBlocks allows automatization of this process. Anyone can create custom databases of NRP/PK sequences and their building blocks in a user-friendly graphical interface. The databases can be shared with collaborators and easily exported into a plain text format to be used for mass spectra analysis in CycloBranch software. The molecular structures in SMILES format can be searched in PubChem, ChemSpider, ChEBI, NP Atlas, COCONUT, and Norine databases covering a wide range of known compounds. MassSpecBlocks is freely available at https://ms.biomed.cas.cz/msb and can be installed entirely offline on a local computer. The tool provides a REST API to access the data. The source codes and installation instructions are available on GitHub [35].

Fig. 2

Sample list of NRP and PK sequences

Implementation

The architecture of MassSpecBlocks is shown in Fig. 3. The front-end has been implemented in TypeScript using the JavaScript library React (v. 16.13.1) [36] and the Node.js runtime environment (v. 15.14.0) [37]. It has become common for chemical databases to offer an Application Programming Interface (API) to access their data. PubChem [38], ChemSpider [39], NP Atlas [40], COCONUT [11], and Norine [26] provide REST (Representational State Transfer) APIs, while the ChEBI database has a SOAP (Simple Object Access Protocol) API [41]. MassSpecBlocks utilizes these APIs to collect the data about chemical structures using the Finder component. The range of supported databases can be extended if a new class implementing the Finder’s interface is created. The List component is used to obtain the data from a local database and visualize it in interactive tables, which can be edited using a mouse-click on an individual item. The compounds are drawn using SmilesDrawer (a modified version forked from v. 1.2.0) [42]. A disadvantage of SmilesDrawer was that the structures could not be edited in graphics mode. Thus, we used the freely available JavaScript Molecule Editor JSME (v. 2017-02-26) to support this feature [43].

Fig. 3

Architecture of MassSpecBlocks

The database is stored in MySQL [44], and the backend has been implemented in the PHP framework Symfony (v. 5.2) [45]. It provides a REST API to access the data and takes care of the user access rights, data verification, transformation into SQL queries, etc. The REST API can also be used to cooperate with other tools. The documentation is available on our server [46] or at https://localhost:port/rest/doc if the backend is installed on a local machine, where port is the number of the port on which the server is running. The API currently has 77 endpoints to manipulate the database. The individual endpoints can be tested directly from the documentation page. The results are returned in the JSON format. Alternatively, the data can be exported into plain text files accepted by CycloBranch software.

Data model

The data model includes entities representing building blocks, sequences, and terminal modifications of sequences (Additional file 1: Fig. S2). Sequences and building blocks with similar structures are grouped into families. A list of organisms can be assigned to a sequence. Every entity has a relationship with a container (Additional file 1: Fig. S3). The container groups data (sequences, blocks, modifications, etc.) owned by a user. A registered user can create/edit multiple private containers, which can be cloned and shared with collaborators (Additional file 1: Fig. S4). An administrator can also create public containers. An unregistered user can access compounds in public containers and can search in PubChem, ChEBI, NP Atlas, COCONUT, and Norine. The registered user can also search in ChemSpider if an API key is provided. The data in containers can be exported into (or imported from) a plain text format accepted by CycloBranch software (i.e., lists of sequences, blocks, and modifications). The default installation includes three public containers: (1) Nonribosomal Peptides and Siderophores is a sample container that includes 85 building blocks and 146 sequences of amphibactins, aquachelins, beauverolides, cyclosporins, desferrioxamines, ferrichromes, fusarinines, gramicidins, ornibactins, pseudacyclins, pyoverdines, roseotoxins, etc. (2) The container Proteinogenic Amino Acids includes the twenty basic building blocks. It can be cloned and used for user purposes. (3) Siderophores and Secondary Metabolites (MS) is a container that includes a list of 709 compounds taken from the reference [2]. The last container does not include the sequences of building blocks and can be used by CycloBranch software to detect whole compounds in conventional, liquid-chromatography, and imaging mass spectrometry data [7, 31].

Frontend

The main screen of MassSpecBlocks is shown in Fig. 4. Users can search compounds in public databases using one of the following properties: a name, a molecular formula, a monoisotopic mass, a database identifier, or a molecular structure in SMILES format (if the corresponding API supports the respective feature). The Find button starts a search process. Missing properties in the input form are filled automatically when the process is completed. The molecular structure can be edited by a direct change of the SMILES notation or using the Edit button, which opens JSME. The SMILES format can handle markups that are used in stereochemistry, but this information cannot commonly be utilized in mass spectra analysis. Thus, the Generic SMILES button removes these markups from the notation. Because a structure in SMILES format can be defined in many ways, another button, Unique SMILES, turns the notation into its canonical form using the algorithms CANON and GENES described by Weininger et al. [47]. A list of building blocks is generated using the Build Blocks button (Fig. 5). The SMILES strings of building blocks are searched in a local database first (i.e., in a selected container) to obtain the remaining properties of the blocks. If a string is not present in the local database, the metadata is searched in PubChem. Once the user has checked the results, the sequence, blocks, and remaining metadata can be stored in the current container using the Save button.

Fig. 4

Annotation of peptide bonds in the structure of pseudacyclin A

Building blocks and sequences

The other pages in MassSpecBlocks are dedicated to the administration of building blocks (Additional file 1: Fig. S1), sequences (Fig. 2), terminal modifications (Additional file 1: Fig. S5), containers (Additional file 1: Fig. S3) and account settings. A building block has the following properties: a name, a unique acronym, molecular formula and monoisotopic mass of residue, a list of neutral losses, a family, and a reference to a public database. The monoisotopic mass is not mandatory because it is always calculated from the formula. The list of neutral losses can include multiple formulas separated by a semicolon (e.g., "H2O;NH3"). A reference to an external database is composed of a database identifier and an identifier of a referenced item. For example, alanine can be referenced by "CID: 602", "CSID: 582", and "CHEBI: 16449" in PubChem, ChemSpider, and ChEBI databases, respectively.
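
The calculation of a monoisotopic mass from a residue formula, and the splitting of a semicolon-separated list of neutral losses, can be sketched as follows. This is an illustrative Python sketch under simple assumptions (flat formulas without brackets or isotope labels); the function names are hypothetical and do not reflect the MassSpecBlocks code base.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes.
ATOM_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307400,
    "O": 15.99491462,
    "S": 31.97207117,
}


def parse_formula(formula):
    """Turn e.g. 'C6H11NO' into {'C': 6, 'H': 11, 'N': 1, 'O': 1}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in a flat formula."""
    return sum(ATOM_MASS[el] * n for el, n in parse_formula(formula).items())


def parse_neutral_losses(text):
    """Split a list such as 'H2O;NH3' into individual formulas."""
    return [item.strip() for item in text.split(";") if item.strip()]
```

With these helpers, the residue formula of isoleucine (C6H11NO) gives a mass of about 113.0841 Da, and each neutral loss from "H2O;NH3" can be converted to a mass in the same way.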

NRP/PK sequences have the following properties: a compound name, a structure type (linear, cyclic, branched, branch-cyclic, linear-polyketide, cyclic-polyketide, and other), a sequence of building blocks, a formula and monoisotopic mass of the neutral molecule, lists of families and organisms, terminal modifications, and a reference into a public database. The format of sequence notation was taken from the CycloBranch software. The sequences of linear and cyclic NRPs/PKs are written in the format [A]-[B]-[C]-[D]-[E], where A to E are acronyms of building blocks. In the case of cyclic NRPs/PKs, we assume that block E is connected to block A. The sequences of single branched and branch-cyclic NRPs have the notation [A]\([B]-[C]\)[D]-[E], where B is the branching block (e.g., ornithine in pseudacyclin A [6, 48]) and C is the last block of the branch. In the case of branch-cyclic NRPs, E is connected to A. The tool automatically determines the sequence type from the molecular structure. Multiply branched NRPs are automatically annotated as the type other due to the data compatibility with CycloBranch software, which currently does not support multiply branched peptides in this notation. Depending on the type of analyzed NRP/PK, we can optionally define the names of up to three terminal modifications. In this case, the modifications must not be included in the respective building blocks and must be defined separately. For example, if a terminal block corresponds to N-acetyl-isoleucine, we can replace the block with isoleucine and define the acetylation as an N-terminal modification (i.e., C2H2O, +42.010565 Da). Every modification is defined using a name (or an acronym), a formula, a monoisotopic mass, and the information whether the modification is N-terminal or C-terminal (Additional file 1: Fig. S5). While linear NRPs may have N-terminal and C-terminal modifications, cyclic NRPs do not have any terminal modifications.
A branch modification (possibly N-terminal or C-terminal) may be defined for single branched and branch-cyclic NRPs. A linear PK may have two modifications of the same type (e.g., N-terminal). Note that due to the compatibility with CycloBranch software, the tool does not deal with internal modifications of building blocks. From principle, we define the modified building blocks as new blocks.
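
To make the sequence notation concrete, the following Python sketch splits a sequence string into block acronyms and locates the branch, if any. It assumes that the branch is delimited by a backslash followed by a parenthesis, as in the notation above; `parse_sequence` is a hypothetical helper, not a function of CycloBranch or MassSpecBlocks.

```python
import re

# Matches the branch-open marker, the branch-close marker, or one [block].
TOKEN = re.compile(r"\\\(|\\\)|\[[^\]]+\]")


def parse_sequence(notation):
    """Return (blocks, branch) for a sequence string.

    blocks is the list of acronyms in written order; branch is a pair
    (first, last) of indices of the blocks inside the branch markers,
    or None when the sequence has no branch.
    """
    blocks, branch_start, branch = [], None, None
    for token in TOKEN.findall(notation):
        if token == "\\(":
            branch_start = len(blocks)
        elif token == "\\)":
            branch = (branch_start, len(blocks) - 1)
        else:
            blocks.append(token[1:-1])  # strip the square brackets
    return blocks, branch
```

For the branched example, the helper returns the five acronyms A to E together with the index pair (1, 2), i.e., the branching block B and the last block of the branch C.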

Families and organisms

MassSpecBlocks automatically recommends families corresponding to the analyzed structures. A premise is that a similar structure is already defined in a container, and a family is assigned to it. The family can be created and assigned to the sequence in the Family menu if the new sequence is being created (Fig. 5). Optionally, it can be defined in the list of sequences (Fig. 2) or container configuration (Additional file 1: Fig. S4). The families are recommended using a simple text search or Tanimoto similarity. The administrator can select the model used by the application in the web interface. The text similarity was the first model we implemented, and it was preserved for historical reasons for its simplicity. A substring S of a compound name was compared with the names of compounds in the current container. The families corresponding to matched names were recommended. For example, we defined pseudacyclin A and assigned the family name pseudacyclins to it. If the sequence of pseudacyclin B was being created, the substring S = “pseudacyclin” was taken from the name pseudacyclin B. Because the substring S was found in the sequence name pseudacyclin A, the family pseudacyclins was recommended.
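
The text-based recommendation can be sketched as below. The text does not specify how the substring S is chosen, so this Python sketch assumes that S is the first whitespace-separated word of the compound name; that choice and the function name `recommend_families_by_name` are assumptions made for illustration only.

```python
def recommend_families_by_name(new_name, container):
    """Recommend families of stored compounds whose names contain S.

    container maps the name of a stored compound to its list of families.
    """
    parts = new_name.split()
    if not parts:
        return []
    stem = parts[0].lower()  # assumed choice of the substring S
    families = []
    for name, assigned in container.items():
        if stem in name.lower():
            for family in assigned:
                if family not in families:  # keep order, avoid duplicates
                    families.append(family)
    return families
```

With a container that holds pseudacyclin A in the family pseudacyclins, the name pseudacyclin B yields S = "pseudacyclin" and the family pseudacyclins is recommended, as in the example above.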

Fig. 5

Building blocks of pseudacyclin A

The second approach works the same way with the difference that we do not pair the names of compounds, but we pair their structures using Tanimoto similarity. The Tanimoto similarity [49] is calculated as \(T(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A|+|B|-|A \cap B|}\), where A and B are the sets of blocks in the respective structures. For example, pseudacyclin A is composed of six building blocks (Figs. 2 and 4). In pseudacyclin B, one isoleucine is substituted by valine [48]. The similarity of these two pseudacyclins is \(T(A,B) = \frac{5}{5+6-5} = \frac{5}{6}\). If we have a sequence and search for similar sequences in a container using Tanimoto similarity, we can get multiple results with the same similarity. In this case, a sequence with the smallest difference in the total number of blocks is selected, and the corresponding family is recommended. If more than one family is assigned to the sequence, all the families are recommended. For example, we create the sequence of desferrioxamine B and assign two families, ferrioxamines and siderophores, to it. If we would like to add a new sequence of desferrioxamine E, both families are recommended. Currently, the Tanimoto similarity is used on our web server.
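
The Tanimoto-based recommendation, including the tie-break on the total number of blocks, can be sketched in Python as follows. The sketch treats A and B as sets of distinct block acronyms, so the two isoleucines of pseudacyclin A count once and |A| = 5, and it assumes that pseudacyclin B still contains one isoleucine, which gives |B| = 6. Skipping candidates with zero similarity is an additional assumption, and the function names are hypothetical.

```python
def tanimoto(blocks_a, blocks_b):
    """Tanimoto similarity of the sets of distinct blocks in two sequences."""
    a, b = set(blocks_a), set(blocks_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_families(new_blocks, container):
    """Recommend the families of the most similar stored sequence.

    container is a list of (blocks, families) pairs for stored sequences.
    """
    best = None
    for blocks, families in container:
        score = tanimoto(new_blocks, blocks)
        if score == 0:
            continue  # assumption: unrelated sequences recommend nothing
        size_gap = abs(len(new_blocks) - len(blocks))
        key = (-score, size_gap)  # highest similarity, then closest length
        if best is None or key < best[0]:
            best = (key, families)
    return list(best[1]) if best else []
```

For the two pseudacyclins, the block lists share five distinct blocks out of six in the union, so the similarity is 5/6, in agreement with the formula above.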

A list of organisms can be assigned to the sequence in the Organism menu (Fig. 5). Similar to families, it can also be defined in the list of sequences (Fig. 2) or container configuration (Additional file 1: Fig. S4). However, the organisms are not automatically recommended if the button Build Blocks is pressed.

SmilesDrawer

SmilesDrawer is a modern and open-source JavaScript library for parsing and drawing SMILES-encoded molecular structures. It does not require any form of client-server communication, and the rendered images are pretty. For our purpose, we had to make the following modifications in the SmilesDrawer’s code: (1) We had to mark peptide (and ester) bonds in a molecular graph of a compound to generate the building blocks. SmilesDrawer represents a molecular graph as an array of vertices (atoms) and an array of edges (bonds). The latter array was traversed, and edges corresponding to peptide bonds were annotated. If a resulting building block would include three or fewer vertices, the corresponding bond was skipped. Thus the terminal modifications like acetylation did not form independent blocks. (2) Because some molecular structures are complex and the automated annotation of peptide bonds may not always produce the desired result, it is advantageous to let a user control this process before a molecular structure is split into building blocks. For this purpose, a mouse interaction feature was added. After a mouse-click event, the edges of the molecular graph were traversed and checked if they overlapped with the current mouse position. If an edge overlapped with the cursor position, the corresponding bond was un/marked as the red break-up point (Fig. 4). (3) The depth-first search (DFS) was implemented to browse the molecular graph and generate the structures of building blocks in SMILES format. The graph was traversed so that the blocks were generated in the correct order from N-terminus to C-terminus. The SMILES strings were fixed so that carboxyl groups were attached at C-termini instead of formyl groups. Finally, the DFS was used to estimate if the resulting structure was linear, cyclic, branched, branch-cyclic and if it corresponded to an NRP or polyketide. In the beginning, the structure was considered to be linear. 
If a block was found more than once during the DFS, the structure was marked as cyclic. If some block was connected with more than two neighbors, the structure was marked as branched. If both properties were satisfied, the structure was annotated as branch-cyclic. Because CycloBranch currently supports only singly branched and branch-cyclic NRPs, the structure was marked as the type other if multiple branches were detected. An NRP building block has N-terminus and C-terminus. If a building block was found which did not have the C-terminus (for example, see the block Hpd in Fig. 1 having nitrogen atoms on both termini), the structure was marked as a polyketide. As in the case of NRPs, we determined if the polyketide was linear or cyclic.
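
The classification rules can be illustrated on a graph whose vertices are building blocks rather than atoms. The Python sketch below is a simplification of the approach described above (the actual modification lives in the JavaScript code of SmilesDrawer and works on the molecular graph); the function name, the block-level graph, and the way multiple branches are detected are assumptions.

```python
def classify_structure(neighbors, lacks_c_terminus=()):
    """Classify a connected, non-empty graph of building blocks.

    neighbors maps a block id to the ids of the blocks bonded to it;
    lacks_c_terminus lists ids of blocks that have no C-terminus.
    """
    seen, cyclic = set(), False
    stack = [(next(iter(neighbors)), None)]
    while stack:  # iterative depth-first search
        node, parent = stack.pop()
        if node in seen:
            cyclic = True  # a block reached twice closes a ring
            continue
        seen.add(node)
        for nxt in neighbors[node]:
            if nxt != parent:
                stack.append((nxt, node))

    degrees = [len(adjacent) for adjacent in neighbors.values()]
    branch_points = sum(1 for d in degrees if d > 2)
    if branch_points > 1 or max(degrees) > 3:
        return "other"  # multiple branches are not supported
    if branch_points == 1:
        return "branch-cyclic" if cyclic else "branched"
    kind = "cyclic" if cyclic else "linear"
    return f"{kind}-polyketide" if lacks_c_terminus else kind
```

For pseudacyclin A, the ring of five blocks with N-acetyl-isoleucine attached to ornithine gives one branching block inside a ring, so the sketch returns branch-cyclic. A chain of alternating Hpd and Suc blocks, with Hpd flagged as lacking a C-terminus, is returned as linear-polyketide.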

Results

In this section, we show a sample decomposition of two NRPs and two linear PK siderophores into building blocks by MassSpecBlocks. We also show how to annotate mass spectra in CycloBranch and compare our approach with other methods. The complete data processing workflow is shown in Fig. 6. Further instructions on how to process the data are available in a video tutorial on our website [50]. The databases of sequences, building blocks, and terminal modifications (i.e., the content of a container) were exported from MassSpecBlocks into a simple plain text format (every row in a file corresponds to an item; the values on each row are tab-separated). The high-resolution tandem mass spectra were acquired on a 12T SolariX FTICR mass spectrometer (Bruker Daltonics, Billerica, MA, USA). They are freely available on CycloBranch’s homepage [29] and in a public GNPS library [20]. We used the spectra of pseudacyclin A (CCMSLIB00000531485 in GNPS library), roseotoxin A (CCMSLIB00000531484), ornibactin C8 (CCMSLIB00001059073), desferrioxamine B (CCMSLIB00001059065), and ferrioxamine B (CCMSLIB00001059066). CycloBranch v. 2.0.8 was used to compare the experimental spectra with theoretical spectra generated from the exported databases.

Fig. 6

Data processing workflow

Nonribosomal peptides

We have chosen NRP pseudacyclin A with the branch-cyclic structure as the first showcase example [6, 48]. The main screen of the MassSpecBlocks application is shown in Fig. 4. The structure of pseudacyclin A in SMILES format and other properties were found in PubChem using the compound name. The peptide is composed of six building blocks. The ring corresponds to the sequence cyclo(Pro-Ile-Ile-Orn-Phe), where Orn stands for ornithine. The last block, Ac-Ile (N-acetyl-isoleucine), is connected to Orn. The tool automatically recognized the peptide bonds and marked them in red. The peptide bond connecting Ac to Ile on the branch was not marked because the resulting block Ac would contain only three nonhydrogen atoms. At this stage, a user could mark or unmark any bond using the computer mouse.

The decomposition of the molecule into building blocks is shown in Fig. 5. MassSpecBlocks correctly determined the branch-cyclic structure of the molecule. The building blocks Ile, Pro, and Phe were found in a local database of building blocks that contained 20 proteinogenic amino acids. Ornithine and N-acetyl-isoleucine were automatically found in PubChem using their structures in SMILES format. Similar to the whole peptide structure, every building block could be modified in JSME using the respective Edit button. Finally, the sequence and building blocks were stored in a local database when the Save button was pressed. The data was exported from the container into the plain text files using the Export button (Additional file 1: Fig. S3). CycloBranch successfully compared the experimental spectrum of pseudacyclin A with the theoretical spectrum generated from the input sequence of building blocks. The sample software configuration is shown in Additional file 1: Fig. S6, and the result of spectra comparison in Additional file 1: Fig. S7.

Another example of annotated peptide bonds and an ester bond in the molecular structure of a depsipeptide roseotoxin A [6, 51] is shown in Additional file 1: Fig. S8. The structure was found in PubChem using the molecular formula C31H53N5O7. MassSpecBlocks listed sixty-six thumbnails of candidate structures. The correct structure of the cyclic peptide was selected manually. The corresponding building blocks are reported in Additional file 1: Fig. S9.

Polyketide siderophores

In the following example, we have chosen two linear PK siderophores ornibactin C8 and desferrioxamine B. Ornibactins contain C-terminally attached putrescine (Put)—a PK building block that is terminated by amine groups on both sides (Additional file 1: Fig. S10). It’s a common practice that residue formulas of building blocks are calculated using water elimination (e.g., the molecular formula of Ile is C6H13NO2, but the residue formula is C6H11NO). However, because Put does not include any oxygen, the water molecule cannot be eliminated. For this reason, we calculated the residue formula as the elimination of two hydrogens. So, the molecular formula of Put is C4H12N2, but the residue formula is C4H10N2. We can see that MassSpecBlocks correctly determined this building block during the decomposition and annotated ornibactin as the linear polyketide.

The structure of desferrioxamine B and building blocks generated by MassSpecBlocks are shown in Additional file 1: Fig. S11. Analogously to Put, the block Hpd is terminated with amine groups on both sides. Assuming no hydroxyl group is attached to amine in Hpd, the water molecule cannot be eliminated. Thus we define the residue formula of Hpd as C5H12N2O, although the molecular formula is C5H14N2O. The second block Suc has the residue formula C4H4O2 and is terminated by formyl groups on both sides. This leads to the idea that the molecular structure of the block should include carboxyl groups on both sides. Because the exact positions of chemical elements inside a building block cannot be determined by mass spectrometry, we added only one hydroxyl group. So, following the standard definition, we added a water molecule to the residue formula to get the molecular formula of Suc (C4H6O3). We can see that MassSpecBlocks correctly determined the structures of Hpd and Suc. Similar to the previous example, it correctly annotated the structure of desferrioxamine B as the linear polyketide.
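
The residue formulas discussed for Ile, Put, Hpd, and Suc can be reproduced with a short Python sketch. The eliminated group is passed explicitly (H2O by default, H2 for blocks such as Put and Hpd) because the choice depends on the chemistry of the block and cannot be read from the formula alone. The function names are hypothetical, and the parser assumes flat formulas without brackets.

```python
import re


def parse_formula(formula):
    """Turn e.g. 'C4H12N2' into {'C': 4, 'H': 12, 'N': 2}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


def format_formula(counts):
    """Write counts as a formula: C first, H second, the rest alphabetically."""
    order = [el for el in ("C", "H") if el in counts]
    order += sorted(el for el in counts if el not in ("C", "H"))
    parts = []
    for el in order:
        n = counts[el]
        if n > 0:
            parts.append(el if n == 1 else f"{el}{n}")
    return "".join(parts)


def residue_formula(molecular, eliminate="H2O"):
    """Subtract the eliminated group (H2O or H2) from a molecular formula."""
    counts = parse_formula(molecular)
    for element, n in parse_formula(eliminate).items():
        counts[element] = counts.get(element, 0) - n
        if counts[element] < 0:
            raise ValueError(f"cannot eliminate {eliminate} from {molecular}")
    return format_formula(counts)
```

Calling `residue_formula("C6H13NO2")` returns C6H11NO for Ile, while `residue_formula("C4H12N2", "H2")` returns C4H10N2 for Put. Requesting water elimination from Put raises an error, since the block contains no oxygen.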

The presence of PK building blocks in the molecule impacts its fragmentation, and thus uncommon series of ions arise in mass spectra. The annotated spectra and characteristic fragmentation patterns of ornibactins and desferrioxamines have been discussed in our previous work [7]. Sample configuration of CycloBranch for the spectrum of desferrioxamine B is shown in Additional file 1: Fig. S12. The spectrum was compared with the list of 146 NRP and siderophore sequences in less than a second. In Additional file 1: Fig. S13, we show the output report obtained when a precursor mass filter was enabled. The sequence of desferrioxamine B was reported as a single hit. To show the strength of our approach, we also show the report obtained when the precursor filter was disabled. We can see that the correct sequence was still reported as the top hit. CycloBranch implements several scoring functions, including the number of matched peaks, the sum of relative intensities of matched peaks, the weighted ratio of matched peaks to all peaks (i.e., the sum of relative intensities of matched peaks divided by the sum of intensities of all peaks), etc. In this case, we used the weighted ratio, which was 96.4% for the top hit. The annotated spectrum of desferrioxamine B is shown in Additional file 1: Fig. S14 and was opened by a double-click on the respective row in the output report. We can see that the fragmentation corresponds to that shown in Fig. 1. CycloBranch correctly annotated isotopes of fragment ions because the isolation window used during the mass spectrometry analysis was set to m/z 561.36 ± 2 Da.
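
The weighted ratio of matched peaks can be written down in a few lines. The Python sketch below follows the definition given above and matches peaks within a relative tolerance in ppm; it is an illustration and not the scoring code of CycloBranch, and the function name and default tolerance are assumptions.

```python
def weighted_ratio(experimental, theoretical_mz, ppm=5.0):
    """Percentage of experimental intensity explained by theoretical peaks.

    experimental is a list of (mz, relative_intensity) pairs and
    theoretical_mz is a list of theoretical m/z values.
    """
    total = sum(intensity for _, intensity in experimental)
    if total == 0:
        return 0.0
    matched = 0.0
    for mz, intensity in experimental:
        # A peak is matched if any theoretical m/z lies within the tolerance.
        if any(abs(mz - t) / t * 1e6 <= ppm for t in theoretical_mz):
            matched += intensity
    return 100.0 * matched / total
```

Because the score is weighted by intensity, a few intense matched peaks contribute more than many weak ones, which is why the weighted ratio can be high even when the count of annotated peaks is modest.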

Comparison with other approaches

We used the spectra mentioned above to compare our approach with NRPro (1.0), Dereplicator (1.2.8), and Dereplicator+ (1.0.0). Dereplicator and NRPro were designed mainly for mass spectra annotation of peptidic natural products. These tools visualized the annotated peaks in mass spectra and matching fragments in molecular structures of respective compounds. Dereplicator+ also supported polyketides, but matched fragments were not visualized in molecular structures. CycloBranch used a simplified visualization of building blocks (Additional file 1: Fig. S14). An advantage of NRPro was the ability to retrieve structures missing in Norine from ChEBI, NP Atlas, and PubChem. While Dereplicator and NRPro could analyze spectra only with predefined sodium and potassium adducts (i.e., [M+Na]+ or [M+K]+ ions), CycloBranch offered an ability to annotate spectra with custom adducts, e.g., [M+Fe-2H]+, [M+Fe-3H+Na]+, [M+Fe-3H+K]+, [M+Al-2H]+, etc. Dereplicator+ did not offer us any possibility to define the adducts.
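
The m/z value of an ion with a custom adduct follows directly from the neutral monoisotopic mass. The Python sketch below is an illustration under stated assumptions: the atomic masses are standard monoisotopic values, the neutral mass of desferrioxamine B (about 560.3534 Da for C25H48N6O8) is derived here from the building blocks and is not quoted from the text, and `adduct_mz` is a hypothetical helper.

```python
# Monoisotopic masses (Da) of the atoms used in the adducts below.
ATOM_MASS = {
    "H": 1.00782503,
    "Na": 22.98976928,
    "K": 38.96370649,
    "Fe": 55.93493633,
}
ELECTRON = 0.00054858  # electron mass in Da


def adduct_mz(neutral_mass, gained, lost, charge=1):
    """m/z of a positive ion that gains and loses the given atoms.

    gained and lost map element symbols to atom counts, e.g.
    gained={'Fe': 1}, lost={'H': 2} for an [M+Fe-2H]+ ion.
    """
    mass = neutral_mass
    mass += sum(ATOM_MASS[el] * n for el, n in gained.items())
    mass -= sum(ATOM_MASS[el] * n for el, n in lost.items())
    return (mass - charge * ELECTRON) / charge
```

For a neutral mass of 560.35336 Da, the sketch gives about 561.3606 for [M+H]+, consistent with the isolation window centred at m/z 561.36, and about 614.2721 for [M+Fe-2H]+.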

In comparison to other tools, CycloBranch supports annotation of fine structures of isotopic peaks and visualization of profile mass spectra in addition to line spectra. A custom list of neutral losses can be defined in addition to common losses like H2O and NH3. If necessary, additional lists of neutral losses can be defined for individual building blocks. Dereplicator and Dereplicator+ can process the input experimental spectra in the mgf, mzXML, and mzML file formats. NRPro currently supports mgf and mzXML. CycloBranch supports spectra in a plain text file format (txt), in the open formats mgf, mzXML, and mzML, and in the native file formats of Bruker (baf), Waters (raw), and Thermo (raw).

Table 1 Comparison with other approaches—the total numbers of peaks in the experimental spectra of NRPs/PKs and the numbers of peaks annotated by the tools. The weighted ratio of matched peaks is also reported for CycloBranch in percent (i.e., the sum of intensities of matched peaks divided by the sum of intensities of all experimental peaks)

The numbers of peaks annotated by the tools are reported in Table 1. The precursor and fragment ion m/z error tolerances were set to 5 ppm for CycloBranch and NRPro. The values used in Dereplicator and Dereplicator+ were 0.005 Da. Because the input mgf/mzXML files included line spectra, and the other tools did not offer an option to crop low-intensity peaks, we set the minimum threshold of relative peak intensity to zero. To unify the configuration of CycloBranch with NRPro, we searched only for b-ions and a-ions and allowed only one neutral loss of H2O or NH3. See the sample configuration for pseudacyclin A in Additional file 1: Fig. S6 and for desferrioxamine B in Additional file 1: Fig. S12. Because the input databases were different and CycloBranch was a stand-alone tool while the other tools were web-based, we could not compare their performance exactly. However, while CycloBranch and NRPro commonly returned results within a second, the average data-processing time was 2 minutes for Dereplicator and 4 minutes for Dereplicator+.
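The two kinds of tolerance used above (5 ppm and 0.005 Da) are not equivalent across the m/z range, which a short Python sketch can make explicit. The function names and the example peak positions are assumptions chosen for illustration; none of the compared tools is claimed to implement matching in exactly this way.

```python
def ppm_to_da(mz, ppm):
    """Absolute width in Da of a relative tolerance given in ppm at a given m/z."""
    return mz * ppm * 1e-6


def within_ppm(theoretical_mz, experimental_mz, ppm):
    """Relative (ppm) match, as configured here for CycloBranch and NRPro."""
    return abs(experimental_mz - theoretical_mz) <= ppm_to_da(theoretical_mz, ppm)


def within_da(theoretical_mz, experimental_mz, tolerance_da):
    """Absolute (Da) match, as configured here for Dereplicator and Dereplicator+."""
    return abs(experimental_mz - theoretical_mz) <= tolerance_da


# 5 ppm and 0.005 Da coincide only at m/z 1000.
print(ppm_to_da(1000.0, 5.0))
print(ppm_to_da(561.36, 5.0))

# Invented peak pair near m/z 561.36 that differs by 0.004 Da.
theoretical, experimental = 561.3606, 561.3646
print(within_da(theoretical, experimental, 0.005))
print(within_ppm(theoretical, experimental, 5.0))
```

At m/z 561.36, 5 ppm corresponds to roughly 0.0028 Da, so the invented peak pair is accepted by the 0.005 Da window but rejected by the 5 ppm window. Below m/z 1000 the absolute setting is therefore the more permissive of the two, which should be kept in mind when comparing the numbers of annotated peaks.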

Pseudacyclin A was reported by all the tools as the single hit. Roseotoxin A was found by CycloBranch and NRPro. NRPro also reported destruxin B1 as an isomer of roseotoxin A whose theoretical peaks fit the same experimental peaks. Dereplicator reported bursaphelocide B (another isomer of roseotoxin A) as the only hit. Dereplicator+ assigned the spectrum to roseocardin whose structure corresponds to roseotoxin A. Ornibactin C8 was reported by all the tools as the most significant hit. The advantage of NRPro was that it also annotated peaks of internal fragment ions, which CycloBranch does not currently support in the case of linear polyketides. On the other hand, CycloBranch marked 55 peaks if we enabled the annotation of peaks with up to three neutral losses. The weighted ratio of matched peaks was 58% in this case. Desferrioxamine B was found by CycloBranch and NRPro. Both tools annotated all the monoisotopic peaks shown in Additional file 1: Fig. S14. CycloBranch was the only tool that annotated the ferri-form of this compound (i.e., the spectrum of [M+Fe-2H]+ ion).

Note that the analysis of desferrioxamines and other PK siderophores requires knowledge of characteristic fragmentation patterns (Fig. 1). Some pre-existing knowledge is also required if de novo sequencing of NRPs is performed because the combinations of building blocks frequently form isomers, and we commonly get many NRP sequence candidates that fit the experimental spectrum well. If we perform a database search, the situation is much easier. Because CycloBranch is focused on the annotation of high-resolution mass spectra, the most important parameters to be tuned are the m/z error tolerance and the minimum threshold of intensity. Users should also check the literature to find out which compounds can be produced by the organisms they study. Further instructions to fine-tune the tool's configuration for various use cases can be found in the Tutorials section of the CycloBranch website [29] and in our previous work [6, 7, 52].

Conclusion

MassSpecBlocks was presented as an open platform for creating custom databases of NRPs/PKs in which the structures are stored as sequences of building blocks. The tool can be accessed online or installed offline on a local computer; it provides a REST API and can be extended to support the compound search in various public repositories. In combination with the CycloBranch software, we provided a robust workflow for annotating tandem mass spectra of NRPs, PKs, siderophores, and other natural products that can be represented as sequences of building blocks.