Background

In the past two decades high throughput screening (HTS) has provided a large amount of experimental data on compound biological activities. Data mining and machine learning methods provide an in silico counterpart building predictive models based on chemical structure features and other properties, and training sets of known bioactivities. Despite these capabilities quantitative methods do not tend to model the biochemical and physiological process well. Recent developments in machine learning have focused on the exploration of large data sets with non–congeneric molecules. The applicability of Quantitative Structure Activity Relationship (QSAR) studies to predict toxicity is very limited. The rationale behind the use of machine learning is to discover patterns and signatures in data sets from high throughput in-vitro assays. Nonetheless, the development of in-silico models as alternative approaches to mutagenicity assessment of chemicals without animal testing is constantly increasing and has attracted researchers in the field of Quantitative Biological Activity Relationship (QBAR) [1] and even toxicology.

Mutagenicity is the ability of a substance to cause genotoxicity. Experimentally, mutagenicity is assessed by Ames test performed on Salmonella typhimurium bacterial strains where each bacterial strain is sensitive to specific chemical mutagen [2]. It has been found that the predictive power of positive Ames test for rodent carcinogenicity is high, ranging from 77% to 90% [3]. Kazius et al. [4] assembled a data set of 4337 compounds and derived 29 toxicophores with an error rate of 18% in training set and 15% in a validation test set. Helma et al. [5] reported MOLFEA algorithm for generation of descriptors based on molecular fragments for non-congeneric compounds and compared various machine learning algorithms with its data set of 684 compounds derived from Carcinogenic Potency Database (CPDB: http://potency.berkeley.edu/). The data set gave an accuracy of 78% with 10 folds of cross validation. Hansen et al. [6] reported a unique new public Ames Mutagenicity data set with 6500 compounds and compared results with commercial and non-commercial tools. Zhang and Sousa [7] also reported the use of MOLMAP descriptors for bond properties which were used for training of Random Forest classifier. Error percentages, as low as 15% - 16% were achieved with an external validation set of 472 compounds against a training set of 4083 structures. Up to 91% sensitivity and 93% specificity were obtained from the test sets. Feng et al. [8] used four data sets NCI, Mut, Yeast and Tox and generated four different types of descriptors. Using statistical methods, models were built to link chemical descriptors to the biological activity. King et al. [9] reported different methods for establishing structure activity relationships (SARs). They represented chemical structures by atoms and bond connectivities in combination with inductive logic programming algorithm Progol. They tested 230 compounds which were divided in two sets of 188 compounds and 42 compounds. For 42 compounds Progol formed a SAR better than linear regression and back propagation. Judson et al. [10] used different classifiers to predict the accuracy of the model of complex chemical toxicology data sets. Neural networks and Support Vector Machines (SVM) were at the top of the list of classifiers, predicting with 96% and 99% specificity, respectively. They also mentioned that irrelevant features decreased the accuracy rate, with linear discriminant analysis suffering the maximum degradation. Ferarri and Gini [11] proposed the idea of a trained QSAR classifier supervised by a SAR layer that incorporates coded human knowledge. The model is implemented in the CAESAR project (http://www.caesar-project.eu) [12] where initially a classifier is trained on more than four thousand molecules based on Bursi data set by using molecular descriptors, then in the next step the relative knowledge to complement its practice is extracted from a collection of well-known structural alerts. Votano et al. [13] reported the application of three QSAR methods using artificial neural networks, k-nearest neighbors, and decision forest, to a data set of 3363 diverse compounds. They used molecular connectivity indices, electrotopological state indices, and binary indicators to obtain an accuracy of 82%.

Unlike many bioactivities, mutagenicity can be linked to very specific chemical structure fragments and functional groups, usually referred to as toxicophores, which interfere with DNA [1416]. These include aromatic amines, hydroxyl amines, nitroso compounds, epoxides, thiols, nitrogen mustards, aziridines, aromatic azo’s, propiolactones, aliphatic halides, thiophenes, heteroatom derivatives, polycylic planar compounds, hydrazine, hydrazide and hydroxylamine. It has also been found that detoxifying structures such as the CF3, SO2NH, SO2OH and aryl sulphonyl derivates render mutagenic compounds non-mutagenic [17].

In this paper, firstly, we have applied four classification algorithms - Naïve Bayes, J48, Random Forest and Sequential Minimal Optimizer (SMO) - to model the mutagenicity data of compounds. In particular, we were interested in discovering whether such “whole molecule” algorithms are appropriate for mutagenicity prediction, or whether this is better done using simple alerts based on toxicophores. We were also interested in whether we would replicate previous work indicating that Random Forest is a better classifier than other Base and Ensemble classifiers [18]. We tested the model with validation sets (PubChem data sets AID1189 and AID1194, DrugBank [19] approved, and withdrawn drugs and Zinc metabolites data (zinc.docking.org/browse/subsets/special.php) [20] all of which indicate that the Random Forest model performs well.

Methods

Data sets

This work included 3 training data sets: Set 1 (Bursi mutagenicity data set) having a total of 4337 compounds, Set 2 (Benchmark data set) with 6512 compounds and Set 3 which was a combination of Set 1 and Set 2 containing 8208 compounds after removing the duplicate structures based on the canonical smiles of the Set 1 and Set 2 using Pipeline Pilot [21]. The data sets were divided into training (80%) and testing (20%). The datasets are given in the Additional file 1 and Additional file 2. Table 1 shows the distribution of compounds on the training and test sets of the three sets (Set 1, Set 2 and Set 3). For using the datasets in Weka we performed the remove useless feature option which removes the unnecessary variables from the data. We converted the data to the ARFF format for further classification. For set 1 “remove useless” operation in Weka, of the initial 179 descriptors, 151 descriptors which contained 24 weighted burden number descriptors, 8 properties descriptor and 120 pharmacophore fingerprints were obtained. For set2 and 3 the remove useless operation resulted in 154 descriptors (of the initial 179 descriptors) which contained 24 weighted burden number descriptors, 8 properties descriptors and 123 pharmacophore fingerprints.

Table 1 Distribution of different data sets and it compounds (mutagens and non-mutagens) in test and train sets

For validation of the generated model, external test sets were used. External data sets, AID1189 and AID1194, were taken from EPA DSSTOX data set in the CPDB [22]. AID1189 contained 1477 compounds with 788 mutagens and 689 non-mutagens and AID1194 contained 832 compounds with 396 mutagens and 436 non-mutagens. The toxicity models were tested against the 1410 approved drugs and 66 withdrawn drugs from the DrugBank database and as well as with the 22080 metabolite data which were taken from the recently published ZINC Data sets. The metabolites may be toxic or non-toxic the idea here is to check whether the compounds formed after metabolism has some mutagenicity or not using our predictive models.

Chemical descriptors

For each data set, descriptors were calculated by PowerMV [23]. PowerMV calculates a total of 6122 descriptors classified as 546 atom pair descriptors, 4662 Carhart descriptors, 735 fragment pair descriptors, 147 pharmacophore fingerprints, 24 Weighted Burden Number descriptor and 8 properties descriptors. Among those we used:

Property descriptors including XlogP (a measure of the propensity of a molecule to partition into water or oil), polar surface area (PSA), number of rotatable bonds, H-bond donors, H-bond acceptors, molecular weight, blood–brain indicator (0 indicating a compound does not pass the BBB, and 1 indicating that a compound passes the BBB) and bad group indicator (the molecule contains a chemically reactive or toxic group).

Pharmacophore Fingerprint descriptors based on bioisosteric principles. They are divided in to six classes totaling to 147 descriptors.

Weighted Burden number descriptors, a set of continuous descriptors and are also a variation of the Burden number [24]. One of the three properties, namely, electronegativity, Gasteiger partial charge or atomic lipophilicity and XLogP is placed on the diagonal of the Burden connectivity matrix. The off-diagonal elements are weighted by one of the following values: 2.5, 5.0, 7.5 or 10.0. Then the largest and the smallest eigenvalues are used as descriptors.

Machine learning classifiers

Machine learning has been widely used in classifying molecules as active or inactive, mutagen or non-mutagen against a protein target [25]. In this work we used Weka [26] open source software which is a collection of different classifiers for data mining and machine learning. It is licensed under GNU GPL. It includes tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Of the many data mining approaches that have been explored, four have evolved to largely dominate other classification methods at present. These are a) Bayesian methods [27] b) Support Vector Machines [28] c) Decision trees [29] and d) Random Forest [30, 31].

Workflow

The data sets were downloaded in SD File format. The PowerMV descriptor calculation tool was used to generate chemical descriptors. A total of 179 descriptors were generated for all the three data sets used. Bioassay data was appended as an outcome column to each of the data sets in the comma separated values (CSV) file format. The compounds were labeled mutagens and non-mutagens based on the respective bioassay data. After merging of Set1 and Set 2 compounds, the duplicate compounds are removed from the data and it resulted in 8292 compounds representing Set 3 data set. Useless descriptors were removed among the 179 descriptors which resulted in 155 descriptors for Set 2 and Set 3 and 152 descriptors for Set 1. Each data set was trained with 10 fold cross validation with default parameters for all the four classifiers mentioned earlier. The models generated were tested with remaining 20% test data and also validated using external data sets from PubChem AID1189 and AID1194, DrugBank drugs and Zinc metabolites data. Additional file 3 contains the csv formatted file of descriptors for external datasets. Using the knowledge flow provided by Weka, a workflow represented in Figure 1 was prepared which loads the data sets, applies the classifiers to generate the models which are tested using the test compounds.

Figure 1
figure 1

The diagram above represents the knowledge workflow model of Weka environment software.

Results

The results are discussed for each of the data sets for which the models were developed using the four classifiers. The Random Forest was parameterized with 100 trees because we did not find much difference in the out of bag error rates for 500 trees (which was around less than 0.5%).

Set 1: The results given in Table 2 show that Random Forest outperformed the other classifiers. For Set 1 the Random Forest classifier classified internal 20% test data with 79.81% accuracy, 79.5% average precision and 89.2% AUC ROC which is the best model for the Set 1. For the external set, AID1189, it classified with 64.65% accuracy, 66.4% Average precision and 67.3% AUC ROC as shown in Table 3. For AID1194 it predicted 84.85% accuracy, 84.9% Average precision and 93.1% AUC ROC as shown in Table 4. Figure 2 depicts the number of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) compounds predicted in Set 1.

Table 2 Result table for Set 1 with four classifier algorithms Naïve Bayes, Random Forest, J48 and SMO
Table 3 Result table for AID1189 taken as test set for the models prepared by different sets i.e. Set 1, Set 2 and Set 3
Table 4 Result table for AID1194 taken as validation set for the models generated on different sets i.e. Set 1, Set 2 and Set 3
Figure 2
figure 2

The graph represents number of Set 1 compounds classified by TP, FN, FP and TN by Naive Bayes, Random Forest, J48 and SMO classifiers.

Set 2: The results provided in Table 5 portray the predictive power of the Random Forest algorithm. Here it was observed that for Set 2 the Random Forest classified the internal 20% test data with 78.18% accuracy, 77.3% precision and 85% AUC ROC. J48 also performed well with 73.6% accuracy but was not better than Random Forest. With external test set AID1189 it classified with 61.6% accuracy, 66.6% precision and 64.5% ROC area as given in Table 3. With AID1194 it classified 87.86% accuracy, 87.7% precision and 94.3% AUC ROC as given in Table 4. Figure 3 depicts the number of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) compounds predicted in Set 2.

Table 5 Result table for Set 2 with four classifier algorithms Naïve Bayes, Random Forest, J48, and SMO
Figure 3
figure 3

The graph represents number of Set 3 compounds classified by TP, FN, FP and TN by Naive Bayes, Random Forest, J48 and SMO classifiers.

Set 3: 154 descriptors were taken from initial 179 descriptors ,which contained 24 272 weighted burden number descriptors, 8 properties 273 descriptors and 124 pharmacophore fingerprints. The results are given in Table 6 classifiers. After merging compounds from the two sets it was observed that Random Forest was better in classifying compounds and for external test sets gave more accurate results than for the other two set (Tables 3 and 4). For the internal 20% test set it gave an accuracy of almost 90% and AUC ROC of 95.3%. For AID1194 it showed an accuracy of 91.9% with ROC area of 96.3%. The other classifier J48 gave an accuracy of 87%. Figure 4 mentions the number of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) compounds predicted in Set 3.

Table 6 Result table for Set 3 with four classifier algorithms Naïve Bayes, Random Forest, J48, and SMO
Figure 4
figure 4

The graph represents number of Set 3 compounds classified by TP, FN, FP and TN by Naive Bayes, Random Forest, J48 and SMO classifiers.

For each of the data sets modeled with Random Forest performance was much better than the other classifiers. The Random Forest model performs an implicit feature selection, using a small subset of "strong variables" for the classification only, leading to its superior performance on high dimensional data. The outcome of this implicit feature selection of the Random Forest can be visualized by the "Gini importance". In the Figure 5, important variables used in Random forest model generation are represented. We also used the important variables based on Gini Importance i.e. 30 listed in the diagram to model our data sets. For all the test sets of the three sets the accuracy was in range of 79% to 84%. The variable selection using Gini importance resulted in a decrease of accuracy rate to 1% to 1.5%. Descriptor optimization is an important step while making learning models. Descriptors are often selected based on the correlation methodology [32] for example in Weka, a cfs subset Eval attribute evaluator is present which selects the most uncorrelated descriptors for model generation. The Gini importance showed in the Figure 5 is another approach of variable selection which is based on inequality among values of a frequency distribution on each split of the tree [33]. It is defined as a ratio with values between 0 and 1: the numerator is the area between the Lorenz curve of the distribution and the uniform distribution line; the denominator is the area under the uniform distribution line.

Figure 5
figure 5

Set 3 Variable Importance Graph.

We use the models to test the 1410 approved drugs compounds and 66 withdrawn drugs and also 22080 metabolites in the ZINC database. It was assumed that the approved drug compounds would not show mutagenicity and hence, were labeled as non-mutagens. The withdrawn compounds show various pharmacological side effects and among them mutagenicity could also be an important side effect. So, the withdrawn compounds were labeled as mutagens. Among the metabolites 9523 compounds were labeled as mutagens and remaining as non-mutagens arbitrarily. We tested the compounds on the three sets with Random Forest of 100 trees. The Table 7 displays the tested compounds results.

Table 7 The drug and the metabolites data tested with Set 1, Set 2, Set 3 with random forest

Each model was tested with the drug data and the metabolites data. It was found that every model predicted the drug data with almost the same specificity i.e. the true negatives which were labeled as non- mutagen. Every model predicted with almost more than 84% specificity. To improve the model of prediction of true negatives we also implemented the classification with cost matrix in Weka and tested our data sets. We set the cost of false positive to 2.5 for misclassifying every non-mutagenic compound. Every data set was classified with more than 90% as true negative. The models predicted the withdrawn drugs data with low sensitivity and it predicted most of the compounds as false positives (non-mutagen). The compounds from Zinc metabolites database show very low mutagenic effects to the living systems and after testing with each model it was observed that Set 3 gave the best classification of the compounds. From 9523 mutagen compounds labeled arbitrarily, it predicted 8037 compounds as false negatives (mutagens compounds labeled predicted as non- mutagens) and 10774 compounds as True negatives (non-mutagens compounds labeled predicted as non-mutagens) from 12557 compounds. This indicates that 85% of the compounds in the zinc metabolite dataset are non-mutagenic.

Analysis of false positives and false negatives results

Erroneous compounds i.e. the false positives, false negatives were observed for the test set of Set 3, drug data sets, and metabolites. Each data set is described below.

Set 3: The test set contained 1698 compounds of which 910 compounds were classified as mutagens and 788 as non-mutagens. False Negatives (Mutagenic compounds incorrectly classified as Non-Mutagens) appeared in the test sets which resulted in 121 compounds. It was observed that compounds containing toxicophores were being classified as non-mutagens. From 121 false negatives, derivatives of 18 aromatic nitro groups, 9 quinoline, 7 butyl acetate, 5 cresol, 4 phenanthrene, 4 acetanilide, 3 carbinol, 3 methyl aminoethanol, 3 azo compounds were observed and the remaining were singletons. Some of the compounds are given in Figure 6. 131 false positives (non mutagenic compounds incorrectly classified as mutagenic) compounds were also predicted by the Random Forest classifier. It was observed that 14 aromatic nitro groups, 21 styrene groups, 4 anisoles, 4 benzylamines, 4 dimethylaniline, containing compounds were predicted as mutagenic due to presence of aromatic nitro group, 3 quinolines. Additional file 4 contains the smiles and the predicted results of false positives and false negatives of the test set. Figure 7 shows some false positive compounds.

Figure 6
figure 6

Represents some compounds which are Mutagenic but predicted as Non Mutagen(False Negative) by Random Forest in the test set.

Figure 7
figure 7

Shows false positive compounds of the test sets.

DrugBank data set: It was observed that 207 compounds where predicted as mutagens among 1410 approved drugs. It was found that compounds containing some essential toxicophores were classified as mutagenic. The structures present in the Figure 8 are drugs which are predicted as false positives. For example, Tacrine which was used to treat Alzheimer’s disease is a centrally active acetyl cholinesterase inhibitor but it was also observed that it induces reversible increase in transaminase activity leading to hepatic injury to 30-50% of the patients [34]. Ciprofloxacin a flourinated quinoline belonging to the class of antibiotics which includes other drugs such as the enoxacin, fleroxacin, norfloxacin, ofloxacin etc. Ciprofloxacin is associated with fatal liver failure [35] and also it has been associated with cases of renal failure. This drug has been predicted as false positive which indicates that the model predicted the compound which was labeled non mutagen is originally mutagenic.

Figure 8
figure 8

Shows some drugs predicted as false positives.

Of the 66 withdrawn drugs labeled as mutagens, only 14 compounds were identified as mutagens. The drugs were withdrawn from the market due to signs of toxicity and adverse effects to humans. Side effects include hepatotoxicity, hepatitis, teratogenicity (study of human birth defects), myocardial infarction, mutagenicity and others. In the withdrawn data 52 compounds were predicted by Random Forest as non mutagens. Figure 9 shows some of the compounds which are predicted as false negatives. Additional file 5 contains smiles and the predicted results of approved and withdrawn compounds.

Figure 9
figure 9

Shows the withdrawn drug compounds predicted as false negative.

Metabolites data set: This data set contained 22080 compounds and around 3269 compounds were predicted as mutagens. The Additional file 6 contains the ZINC ids and smiles along with predictions of the Random Forest Set 3 classifier.

Comparison of the random forest with CAESAR

The results of the Random Forest classifier were compared with the standalone CAESAR mutagenicity software (v.2.0). The results are provided in the Table 8. It was observed that CAESAR was unable to predict certain compounds that contained ions in their structure. For the validation sets AID1194, AID 1189 and the test sets of Set 3, the total number of predicted mutagens were 394,788 and 910 respectively. The non-mutagens predicted to be were 438,697 and 788 for the above sets. The CAESAR tool is based on the structural alerts described by Ashby, Kazius et al. [36]. The tool was unable to predict correctly 163,322,124 compounds for AID 1194, AID1189 and 20% of test set respectively in the validated datasets. The results in the table show number of classified compounds with respect to the total number of compounds in each of the data sets. The comparison clearly shows that our Random Forest model performed much better than CAESAR and could even classify compounds which are not classified by the tool.

Table 8 Comparison of Caesar with Random Forest (rf) with the validation sets depicting True Positives (TP), False Negatives (FN), True Negatives (TN), False Positives (FP) and Accuracy

Conclusion

Previously the Benchmark data set was the largest mutagenicity data set containing more than 6000 molecules classified as mutagens and non-mutagens. In this work we were able to create a new mutagenicity data set (Set 3) containing more than 8000 compounds.

The models generated using Random Forest classifier was observed to have a high performance rate. This was proved by a higher sensitivity and specificity results for the validation sets AID1189, AID 1194. Descriptor optimization is important criteria for model generation, the use of Gini importance could play an important role in descriptor space optimization. Other than that the comparative results of descriptor based Random Forest with CAESAR (which is based on the structural alerts) clearly shows that Random Forest has the better predictive ability to classify mutagenic from non-mutagenic. Classification of the Drug data and the metabolite datasets gave us a clear view the impact of predictive models in drug design and discovery. The mutagenic predictive models could make a great impact in classifying compounds in large repositories such as PubChem and ZINC which could help to accelerate the pipeline of drug discovery.