The implementation of the NP-likeness scorer web application and database comprises two parts: (i) the training of the scorer together with the creation of the database, and (ii) the development of the web application that connects to this database.
The NP-likeness score of a molecule is based on the sum of the frequencies of its fragments among NPs and SMs. Here, the fragments are represented by atom signatures, which are canonical circular descriptors of an atom’s environment in the molecule. An atom signature is computed for each atom in a molecule represented as a directed acyclic graph, where every node is an atom and the edges are the bonds between them. The number of neighbourhood levels considered around an atom is the height of that atom’s signature and determines the overall size of the fragment. In the present study, atom signatures of height 2 were calculated. This height provides better structural accuracy than height 1 and is not as excessively large as height 3, which avoids over-training. A molecular signature is the sum of all its atom signatures.
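To make the notion of signature height concrete, the following minimal sketch collects the atoms around a root atom level by level, up to a given height. It is a deliberate simplification and not the canonical atom signature algorithm used in the study: each atom is visited only once, bond orders are ignored, and the output string is an ad hoc notation. The names `neighbourhood_levels` and `simple_fragment`, and the adjacency-list input, are assumptions made for illustration.

```python
def neighbourhood_levels(adjacency, root, height):
    """Group atom indices by bond distance (0..height) from `root`.

    Simplification: every atom is listed once, at its shortest distance.
    """
    levels = [[root]]
    seen = {root}
    frontier = [root]
    for _ in range(height):
        next_frontier = []
        for atom in frontier:
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        levels.append(sorted(next_frontier))
        frontier = next_frontier
    return levels


def simple_fragment(symbols, adjacency, root, height):
    """Build an ad hoc (non-canonical) string for the neighbourhood of `root`."""
    levels = neighbourhood_levels(adjacency, root, height)
    return "|".join(
        "".join(sorted(symbols[atom] for atom in level)) for level in levels
    )


# Heavy atoms of ethanol: C0 - C1 - O2
symbols = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
print(simple_fragment(symbols, adjacency, 0, 2))
```

With height 1 the fragment rooted at the terminal carbon only sees its carbon neighbour, whereas height 2 also reaches the oxygen, which illustrates why a larger height yields more specific fragments.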
Training data
The training data was extracted and combined from several public and open databases: for natural products, it was integrated from ZINC, ChEBI [3], ChEMBL [4], PubChem [5], the Traditional Chinese Medicine DataBase (TCMDB [6]), NPAtlas [7], AfroDB [8], SANCDB [9], NuBBE [10], HIT [11], NPACT [12], StreptomeDB [13], UNPD [14], the manually curated data used for the study published in 2012 [2] and some other datasets not associated with any database or publication such as UEFS (accessed through ZINC). Datasets from two companies that synthesise and sell the compounds, SelleckChem [15] and InterBioScreen [16], were also used, as they openly provide reliable molecular structures for natural products. The Super Natural II database [17] was excluded from the training dataset due to uncertainty about its data quality and provenance (e.g. this database lists dodecahedrane, which is not an NP), but NP-likeness scores were computed for the molecules stored in it and are displayed on the web application and in the MySQL database. Synthetic molecules were randomly selected from the ZINC database, excluding all natural products, metabolites and other biogenic molecules. In total, the training set consists of 364,807 natural products and 489,780 synthetic molecules.
Cheminformatic processing in NaPLeS is realised with the Chemistry Development Kit (CDK) [18]. First, each molecule undergoes a curation process. The stereochemistry is removed from all molecules because its presence and depiction vary widely between databases. This step is particularly important to avoid fragment redundancy. The molecule is then checked for disconnected parts and only the biggest one is kept for further curation. Molecules with fewer than 6 atoms or containing non-organic atoms (allowed atoms: C, H, N, O, P, S, Cl, F, As, Se, Br, I, B, Na, Si, K, Fe) are discarded, as suggested by Ertl et al. [1]. Then, molecules that are redundant between databases are eliminated based on their structural identity using their InChI. Next, linear and circular sugar moieties are removed from all molecules in order to omit moieties that, albeit commonly present in NPs, are less distinctive due to their repetitive and redundant nature.
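The size and element filters of the curation can be sketched as follows. This is an illustrative stand-alone sketch and not the CDK-based code of NaPLeS: a molecule is assumed to be given as a list of element symbols plus a list of bonds (pairs of atom indices), the function names are hypothetical, and the size check counts whatever atoms are listed. Stereochemistry removal, InChI-based deduplication and sugar removal require a cheminformatics toolkit and are not shown.

```python
# Allowed elements, as listed in the text above.
ALLOWED = {"C", "H", "N", "O", "P", "S", "Cl", "F", "As", "Se",
           "Br", "I", "B", "Na", "Si", "K", "Fe"}
MIN_ATOMS = 6


def largest_component(n_atoms, bonds):
    """Return the sorted atom indices of the biggest connected part."""
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, best = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first traversal of one connected part
            atom = stack.pop()
            component.append(atom)
            for neighbour in adjacency[atom]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) > len(best):
            best = component
    return sorted(best)


def curate(symbols, bonds):
    """Return the symbols of the kept part, or None if the molecule is discarded."""
    kept = [symbols[i] for i in largest_component(len(symbols), bonds)]
    if len(kept) < MIN_ATOMS:
        return None  # too small
    if any(symbol not in ALLOWED for symbol in kept):
        return None  # contains a non-allowed element
    return kept


# Six-membered carbon ring with a disconnected chloride counter-ion.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(curate(["C"] * 6 + ["Cl"], ring))
```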
Atom signatures [19] (fragments) of height 2 are calculated for each molecule. For each fragment, its frequency among natural products compared to synthetic molecules is computed with Eq. 1, where NPi is the number of occurrences of the fragment i in natural products, SMi the number of occurrences of the fragment i in synthetic molecules, NPt is the total number of natural products and SMt is the total number of synthetic molecules. If the fragment is present several times in one molecule, its occurrence is counted accordingly (e.g. if the fragment occurs three times in one molecule, the total number of occurrences of this fragment in the corresponding molecule category will be increased by 3). The NP-likeness score of a molecule corresponds to the sum of frequencies of fragments in this molecule, corrected by its size (Eq. 2).
$$Frag_{i} = \log\left[\frac{NP_{i}}{SM_{i}} \cdot \frac{SM_{t}}{NP_{t}}\right] \tag{1}$$
$$NPls = \frac{\sum_{i = 0}^{N} Frag_{i}}{N} \tag{2}$$
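Eqs. 1 and 2 can be illustrated with a short, self-contained sketch. It rests on several assumptions that the text does not specify: molecules are represented simply as lists of fragment strings, the natural logarithm is used, N is taken to be the number of fragments of the scored molecule, and fragments seen in only one of the two classes are skipped because Eq. 1 is undefined for a zero count. The function names are hypothetical.

```python
import math
from collections import Counter


def fragment_scores(np_molecules, sm_molecules):
    """Eq. 1: log-ratio of fragment occurrences in NPs versus SMs."""
    # Repeated fragments within one molecule are counted every time.
    np_counts = Counter(frag for mol in np_molecules for frag in mol)
    sm_counts = Counter(frag for mol in sm_molecules for frag in mol)
    np_total, sm_total = len(np_molecules), len(sm_molecules)
    scores = {}
    # Assumption: only fragments present in both classes are scored.
    for frag in np_counts.keys() & sm_counts.keys():
        ratio = (np_counts[frag] / sm_counts[frag]) * (sm_total / np_total)
        scores[frag] = math.log(ratio)
    return scores


def np_likeness(fragments, scores):
    """Eq. 2: sum of the fragment scores divided by the number of fragments."""
    return sum(scores[frag] for frag in fragments) / len(fragments)


# Toy data: two "natural products" and two "synthetic molecules".
nps = [["a", "a", "b"], ["a"]]
sms = [["a"], ["b", "b"]]
scores = fragment_scores(nps, sms)
print(np_likeness(["a", "b"], scores))
```

In this toy example fragment "a" is three times more frequent in NPs than in SMs and therefore receives a positive score, while fragment "b" receives a negative one.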
NP-likeness database
The MySQL 5.8 Docker image is used to store the molecules, molecular fragments and the corresponding scores. The table ‘ori_molecule’ contains the information about the molecules from the public databases uploaded for the NP-likeness scorer training. Each molecule is described by the identifier from its original database, a SMILES, an InChIKey, the submission date, its original (i.e. source) database and its status (natural product, synthetic molecule or biogenic). In this table molecules can be redundant. The ‘molecule’ table contains unique, fully connected molecules that have at least 6 atoms and do not contain non-organic atoms (for the definition, see above). Each molecule is associated with a unique identifier, its structural information (SMILES), whether it is an NP or not, whether it contains sugar moieties, the NP-likeness scores computed for the molecule with and without the sugar moieties, and various parameters such as the heavy and total atom counts (with and without the sugars), the number of rings, the number of repeated fragments (if 0, all fragments that constitute the molecule are found only once in it) and the number of predominant heavy atoms (carbons, oxygens and nitrogens). The tables ‘fragment_with_sugar’ and ‘fragment_without_sugar’ contain the atom signature of each fragment (SMILES-like notation), a unique numerical fragment identifier, the atom signature height (currently only height 2 is stored) and the relative frequency of the fragment in natural products computed with Eq. 1. The table ‘molecule_fragment_cpd’ stores the relations between the fragment and molecule identifiers, whether they concern fragments computed with sugar removal or not, and the number of occurrences of the fragment in the molecule.
Two additional tables are required for the web application to run correctly: ‘user_uploaded_molecule’ and ‘user_uploaded_molecule_fragment_cpd’ that temporarily store the molecules submitted by the web application user and the information computed for them.
Database filler
The code for the training described in the previous section is available in the NPdatabaseFiller application. This is an application containerised with Docker, running with Spring Boot and a MySQL database. The communication between the Java code and the MySQL database is handled by the Hibernate Object Relational Mapping (ORM).
NPdatabaseFiller uses the previously described training data and fills the MySQL database used by the NaPLeS web application. It can also be used as a stand-alone application to compute NP-likeness scores locally for a large number of molecules or to recreate the NP-likeness database from scratch.
Three execution options for NPdatabaseFiller are available and can be selected by editing the docker-compose.yml file of the application: (a) computing the NP-likeness scores from scratch for all submitted molecules, which requires molecular files in an appropriate format and an equivalent number of natural products and synthetic molecules for the training; (b) computing the NP-likeness scores for one molecular file without updating the whole database; and (c) updating the scores of all fragments in the database and the NP-likeness scores of all molecules present in the database. The last option is useful when a number of new molecules have been inserted in the database and the user wants to use them to re-train the scorer. A schematic presentation of the workflow is shown in Fig. 1a.
Web application
The NaPLeS web application was developed with the Spring Boot framework and is composed of two Docker containers: the back- and front-end, built on an openjdk:8u171-slim image, and the database in a MySQL 5.8 container described in the previous section. The web application computes the NP-likeness scores for submitted molecules from a large number of molecular fragments in a reasonable time (5 to 10 s for molecules with up to 20 heavy atoms, up to 20 s for molecules with more than 60 heavy atoms). The back-end is written in Java 8 using the Spring Boot framework and relies on the Hibernate ORM for the communication with the MySQL database, and on Thymeleaf as the server-side Java template engine to serve dynamic content to the front-end.
Molecules can be submitted for NP-likeness scoring in three ways: uploading a file in SDF, MOL or SMI (SMILES) format containing a maximum of 200 molecules, pasting a SMILES string, or drawing a molecule in the chemical editor. The limit of 200 molecules is defined only for the public instance of NaPLeS to allow a pleasant user experience and avoid long waiting times. It can be overridden in a locally installed or cloud instance.
The molecular editor uses the OpenChemLib JavaScript libraries (https://github.com/cheminfo/openchemlib-js). The submitted molecules have their sugar moieties removed and are then fragmented into atom signatures; the scores of the matched fragments are retrieved, summed and normalised by the size of the molecule. The computed NP-likeness scores are reported in a results table together with some additional molecular information, the depiction of the submitted molecules and, if they exist, the identifiers of the submitted molecules in public databases. The results table is enhanced with the DataTables.js library (https://datatables.net) and allows an easy export of tabular data in CSV and Excel formats, copying to the clipboard, sorting the results by any column and a dynamic search. If any of the submitted molecules contains a fragment that is not in the database, the user is alerted, and the fragment is excluded from the score computation. The results page also depicts the distributions of NP-likeness scores and shows where the computed results are situated among them. The schema of the NaPLeS query workflow is shown in Fig. 1b.
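The handling of fragments that are missing from the database can be sketched as below. This is a hypothetical illustration, not the NaPLeS back-end code (which is written in Java): fragment scores are assumed to be available as a plain dictionary, the function name `score_query` is invented, and the molecule size used for normalisation is assumed to equal the total number of fragments of the submitted molecule, including the unknown ones.

```python
def score_query(fragments, known_scores):
    """Score one submitted molecule and report fragments absent from the database.

    Returns a tuple (score, unknown_fragments).
    """
    if not fragments:
        raise ValueError("the molecule has no fragments to score")
    # Unknown fragments are reported and excluded from the sum.
    unknown = sorted({frag for frag in fragments if frag not in known_scores})
    total = sum(known_scores[frag] for frag in fragments if frag in known_scores)
    # Assumption: normalise by the full fragment count of the molecule.
    return total / len(fragments), unknown


known_scores = {"a": 1.0, "b": -0.5}
score, unknown = score_query(["a", "b", "x", "a"], known_scores)
if unknown:
    print("Fragments not found in the database:", unknown)
print(score)
```

Returning the unknown fragments alongside the score lets the caller decide how to alert the user, instead of silently dropping them.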