Introduction

Food flavor chemicals are used in, or are present in, foods at very low levels. Human exposure to these flavor chemicals through foods is too low to raise concerns about general toxicity. Mutagenicity, however, raises health concerns even at trace amounts, because there is no threshold for mutagenicity and even very low levels of exposure to mutagenic chemicals do not result in zero carcinogenic risk [1]. Therefore, the presence or absence of mutagenicity is an important point in the risk assessment of flavor chemicals.

The bacterial reverse mutation test (Ames test) is an important mutagenicity test, but it requires approximately 2 g of sample for a dose-finding study and a main study [2]. The amount of a flavor chemical produced industrially, however, is often extremely small, which can make testing impossible. Additionally, the peculiar odor of some flavors sometimes makes it difficult to perform the test in the laboratory. Recently, quantitative structure–activity relationship (QSAR) approaches have frequently been used in place of the Ames test to assess the mutagenicity of chemicals [3]. Ono et al. assessed the viability of this approach by using three QSAR tools to calculate the Ames mutagenicity of 367 flavor chemicals for which Ames test results were available [4]. The highest sensitivity (the ability of a QSAR tool to detect Ames-positive chemicals correctly) was 38.9% with a single tool and only 47.2% with the combination of three tools, which indicated that the application of QSAR tools to assess the Ames mutagenicity of flavor chemicals was still premature. Therefore, it is necessary to improve or develop QSAR tools for predicting the Ames mutagenicity of flavor chemicals.

Flavor chemicals are relatively low molecular weight substances composed mainly of carbon, hydrogen, oxygen, nitrogen, and sulfur, and they often have specific functional groups. In Japan, most food flavors are classified into 18 types according to their chemical structure [5]. Focusing on this characteristic chemical space, we reasoned that a local QSAR model customized for flavor chemicals could increase predictive performance. In recent years, computational software has become available to assist with the development of QSAR models by machine learning. We attempted to develop a QSAR model specialized for flavor chemicals using StarDrop™ software, which has a module (Auto-Modeller™) that can generate predictive models automatically.

Before developing the QSAR model, we developed a new, robust Ames database of 406 food flavor chemicals based on Ono’s database [4]. We re-evaluated the ambiguous data judged as “equivocal” in Ono’s database via literature review and incorporated Ames test data for flavor chemicals from other publicly available databases. In parallel, we performed the Ames test on key flavor chemicals for which Ames data were unavailable and incorporated the results into the new database. This benchmark food flavor chemical database is useful for developing QSAR models and evaluating their performance.

Materials & methods

Ames test database of food flavor chemicals

We used the Ames test database of food flavor chemicals reported by Ono et al. [4]. Because that database includes 14 “equivocal” judgments (Table 1), we re-evaluated these chemicals by reviewing the reference literature and re-classified each as positive, negative, or inconclusive. The inconclusive chemicals were excluded from the database. Flavor chemicals found in another publicly available Ames test database (the Hansen database [6]) were also added.

Table 1 Re-evaluation of Ames test data, which were categorized as “equivocal” by Ono et al. [4]

Ames test

Ames tests were performed for 45 flavor chemicals. The purities and suppliers of the test chemicals are shown in Table 2. The tests were conducted by contract research organizations in compliance with Good Laboratory Practice and according to the Industrial Safety and Health Act test guideline, using the preincubation method [7]. The test guideline requires five strains (Salmonella typhimurium TA100, TA98, TA1535, and TA1537, and Escherichia coli WP2 uvrA) tested in both the presence and absence of metabolic activation (S9 mix prepared from the livers of rats induced with phenobarbital and 5,6-benzoflavone), which is similar to the Organisation for Economic Co-operation and Development guideline TG471 [8]. A result was judged positive when the number of revertant colonies increased to more than twice that of the control in at least one Ames test strain in the presence or absence of S9 mix. Dose dependency and reproducibility were also considered in the final judgment. The relative activity value (RAV), defined as the number of induced revertant colonies per mg, was calculated for each positive result.
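The twofold criterion and the RAV can be illustrated with a minimal sketch. The function names, the plate counts, and the choice to compute the RAV from a single dose are assumptions made for illustration only; the final judgments in this study also weighed dose dependency and reproducibility, which the sketch does not model.

```python
def is_twofold_positive(control_count, treated_counts):
    """Return True if any dose gives more than twice the control revertant count."""
    return any(count > 2 * control_count for count in treated_counts)


def relative_activity_value(control_count, treated_count, dose_mg):
    """Induced revertant colonies per mg: (treated - control) / dose in mg."""
    if dose_mg <= 0:
        raise ValueError("dose_mg must be positive")
    return (treated_count - control_count) / dose_mg


# Hypothetical plate counts for one strain (not data from this study).
control = 100
treated = {0.5: 130, 1.0: 180, 2.0: 260}  # dose in mg/plate -> revertants/plate

positive = is_twofold_positive(control, treated.values())
rav_at_2mg = relative_activity_value(control, treated[2.0], 2.0)
```

In this hypothetical series only the 2.0 mg dose exceeds twice the control, so the strain would meet the numerical criterion, and the RAV at that dose is (260 - 100) / 2.0 induced revertants per mg.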

Table 2 Flavor chemicals for which the Ames test was newly conducted

Commercial QSAR tools

DEREK Nexus™ is knowledge-based commercial software developed by Lhasa Limited, UK [9, 10]. The software includes knowledge rules created by considering insights related to structural alerts, chemical compound examples, and metabolic activation and mechanisms. We used DEREK Nexus™ version 6.1.0 in this study. DEREK Nexus™ ranks the likelihood of mutagenicity (certain, probable, plausible, equivocal, doubted, improbable, impossible, open, contradicted, nothing to report) by applying reasoning rules. When the rank is “certain,” “probable,” “plausible,” or “equivocal,” the query chemical is predicted to be positive in the Ames test.

CASE Ultra is QSAR-based toxicity prediction software developed by MultiCASE Inc. (USA). CASE Ultra uses a statistical method to automatically extract alerts from training data by machine learning [11, 12]. The structural characteristics surrounding an alert are called modulators, and these are also learned automatically from the training data. To construct a QSAR model with continuous toxicity endpoints, the algorithm uses various physicochemical parameters and descriptors. We used CASE Ultra version 1.8.0.2 with the GT1_BMUT module in this study. The prediction result of each module is ranked as “known positive,” “positive,” “negative,” “known negative,” “inconclusive,” or “out of domain.” A query chemical ranked “known positive,” “positive,” or “inconclusive” is predicted to be positive in the Ames test.
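The conversion of each tool's ranked output into a binary Ames call can be sketched as below. The function names are hypothetical, and two points are assumptions rather than statements from the tools' documentation: every DEREK Nexus™ rank not listed as positive is treated as negative, and a CASE Ultra “out of domain” result yields no call.

```python
DEREK_POSITIVE = {"certain", "probable", "plausible", "equivocal"}
CASE_ULTRA_POSITIVE = {"known positive", "positive", "inconclusive"}
CASE_ULTRA_NEGATIVE = {"negative", "known negative"}


def derek_call(rank):
    """Map a DEREK Nexus rank to a binary Ames call."""
    # Assumption: any rank outside the positive set counts as negative.
    return "positive" if rank.lower() in DEREK_POSITIVE else "negative"


def case_ultra_call(rank):
    """Map a CASE Ultra rank to a binary call, or None when out of domain."""
    rank = rank.lower()
    if rank in CASE_ULTRA_POSITIVE:
        return "positive"
    if rank in CASE_ULTRA_NEGATIVE:
        return "negative"
    return None  # Assumption: "out of domain" gives no prediction.
```

Chemicals that receive no call would be excluded from the 2 × 2 matrix and would lower the applicability of the tool.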

Software for developing a new QSAR model

StarDrop™, developed by Optibrium Ltd. (UK), is integrated software for drug discovery that includes a statistics-based QSAR model generation tool, Auto-Modeller™. Using multiple modeling techniques and a suite of built-in descriptors, Auto-Modeller™ automatically generates predictive models tailored to the study dataset and to the domain that needs to be predicted.

Analysis of QSAR tool performance

Because Ames test results are binary (positive or negative), the predictive power of a QSAR tool can be objectively quantified from the agreement between its predictions and the test results. The 2 × 2 prediction matrix comprising true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) is given in Table 3. Sensitivity (the ability to detect positive substances) is calculated as TP / (TP + FN), specificity (the ability to detect negative substances) as TN / (TN + FP), and accuracy (the overall rate of correct positive and negative predictions) as (TP + TN) / (TP + TN + FP + FN). Applicability is given by (TP + TN + FP + FN) / total number of chemicals.
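These four measures can be computed directly from the matrix entries, as in the following minimal sketch. The function name and the example counts are hypothetical and are not results from this study.

```python
def qsar_performance(tp, fp, fn, tn, total):
    """Return sensitivity, specificity, accuracy, and applicability in percent."""
    predicted = tp + fp + fn + tn  # chemicals that received a prediction
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "accuracy": 100 * (tp + tn) / predicted,
        "applicability": 100 * predicted / total,
    }


# Hypothetical counts: 100 chemicals, 5 of which received no prediction.
metrics = qsar_performance(tp=8, fp=4, fn=2, tn=81, total=100)
```

Note that accuracy is computed over the predicted chemicals only, so a tool with low applicability can show high accuracy while leaving many chemicals unassessed.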

Table 3 2 × 2 contingency matrix for Ames mutagenicity classification

Results

Development of a new Ames test database of food flavor chemicals

We developed a new Ames test database consisting of 406 food flavor chemicals (Table 4). The data sources are described below.

Table 4 406 food flavor chemicals assessed by Ames test and QSARs

Ono et al. reported an Ames test database consisting of 367 food flavor chemicals (positive: 24, equivocal: 12, negative: 331) [4]. However, it actually contained 369 chemicals (positive: 24, equivocal: 14, negative: 331). Table 1 shows the 14 equivocal chemicals. We reviewed the key references that led to each equivocal judgment and re-evaluated whether there was evidence of positivity or negativity in view of current testing criteria. Our final judgments and the supporting reasons are described in Table 1 [13,14,15,16,17,18,19,20,21,22,23]. If there was insufficient evidence or no detailed information available for a judgment, we concluded that the chemical was “inconclusive.” Among the 14 equivocal flavor chemicals, four were positive, six were negative, and four were inconclusive. In total, 365 flavor chemicals (positive: 28, negative: 337), excluding the four inconclusive chemicals, were added to the new database.

Two flavor chemicals, quinoline (91–22–5) and 4-methylquinoline (491–35–0), were added to the new database. Their Ames test data were found in the Hansen data set [6].

We newly performed Ames tests for 45 flavor chemicals. Information on the tested samples and the Ames test results are shown in Table 2. Ten of the 45 Ames test results were previously reported [24]. The raw Ames test data are available in the Additional files. Among the 45 flavor chemicals, 15 were positive and 30 were negative. Six chemicals, indole (120–72–9), 5-methylfurfural (620–02–0), 2,3-pentanedione (600–14–6), allyl isothiocyanate (57–06–7), skatole (83–34–1), and gamma-terpinene (p-Mentha-1,4-diene) (99–85–4), are also present in Ono’s database. In Ono’s database [4], 2,3-pentanedione was judged as negative, but in our test it clearly increased the mutant frequency in TA100 in the absence of S9 mix (Additional file (6)). The results of these Ames tests are reflected in the new database. As a result, 39 new food flavor chemicals were added to the database.
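The composition of the 406-chemical database follows from the counts given above, as the short tally below shows. The variable names are illustrative; every number is taken from the text.

```python
# Counts taken from the text above.
ono_total = 369          # chemicals actually present in Ono's database
ono_inconclusive = 4     # equivocal entries re-evaluated as inconclusive
hansen_added = 2         # quinoline and 4-methylquinoline
newly_tested = 45        # Ames tests performed in this study
overlap_with_ono = 6     # newly tested chemicals already in Ono's database

from_ono = ono_total - ono_inconclusive
new_from_tests = newly_tested - overlap_with_ono
database_size = from_ono + hansen_added + new_from_tests

# Re-evaluation of the 14 equivocal entries: 4 positive, 6 negative.
ono_positive = 24 + 4
ono_negative = 331 + 6
```

The tally gives 365 chemicals retained from Ono's database, 39 newly tested chemicals, and a total of 406.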

Development of a new QSAR model for predicting Ames mutagenicity

We developed a new QSAR model for predicting Ames mutagenicity by using StarDrop™ Auto-Modeller™. Developing a QSAR model requires a study dataset of Ames test results. We used the data for the 406 flavor chemicals in the new Ames test database. To further increase the size of the dataset (especially the positive data), we added Ames test data for chemicals structurally similar to flavor chemicals. We previously developed a large Ames test database consisting of > 12,000 industrial chemicals [25]. From this database, we selected 428 chemicals (positive: 255; negative: 173) that have molecular weights < 500 and possess a characteristic substructure of flavor chemicals as defined in the Food Sanitation Law in Japan [5]. The Ames test data of the resulting 834 chemicals (positive: 299, negative: 535) were integrated as the study dataset for the development of the QSAR model.

Prototypes of predictive models were built by an automatic process. The study dataset was divided into training (70%) and validation (30%) data by the cluster method, which uses an unsupervised non-hierarchical clustering algorithm developed by Butina [26]. Auto-Modeller™ offers three modeling methods (Gaussian process, random forest, and decision tree) for category models. In a pretest, the random forest model gave the best performance for our target. The descriptors, including whole-molecule descriptors (e.g., molecular weight, logP, and polar surface area) and 2D structural descriptors, were generated automatically from the training set. Because the accuracy of a prototype depends on the training data set and the data-splitting process is not replicable, 80 prototypes were built to search for the best model. The prototypes that earned favorable prediction scores were selected for further performance evaluation by using the Ames test data of flavor chemicals, and their performances were compared with those of the benchmarks. Finally, a new QSAR model, StarDrop NIHS 834_67, was developed. The prediction result is ranked as “positive” or “negative.”
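A cluster-based split of the kind described above can be sketched as follows. This is not StarDrop™ code: the fingerprints are toy sets of integers, the similarity threshold is arbitrary, and the rule of assigning whole clusters to the training set until about 70% is reached is an assumption made for illustration, since the internal procedure of Auto-Modeller™ is not described here.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_clusters(fingerprints, threshold=0.5):
    """Butina-style clustering: most-connected items become centroids first."""
    n = len(fingerprints)
    # Neighbors of each item (itself included) at or above the threshold.
    neighbors = [
        {j for j in range(n)
         if tanimoto(fingerprints[i], fingerprints[j]) >= threshold}
        for i in range(n)
    ]
    # Rank candidate centroids by decreasing neighbor count (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = neighbors[i] - assigned
        clusters.append(sorted(members))
        assigned |= members
    return clusters


def cluster_split(clusters, train_fraction=0.7, seed=0):
    """Assign whole clusters to training without exceeding train_fraction."""
    rng = random.Random(seed)
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    limit = train_fraction * sum(len(c) for c in shuffled)
    train, validation = [], []
    for cluster in shuffled:
        if len(train) + len(cluster) <= limit:
            train.extend(cluster)
        else:
            validation.extend(cluster)
    return sorted(train), sorted(validation)


# Toy fingerprints (hypothetical): two groups of similar items and a singleton.
fingerprints = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
    {10, 11, 12}, {10, 11, 12, 13},
    {20, 21},
]
clusters = butina_clusters(fingerprints, threshold=0.5)
train_idx, validation_idx = cluster_split(clusters, train_fraction=0.7, seed=1)
```

Changing `seed` changes which clusters land in the training set, which is one way to see why repeated splits can yield prototypes of differing accuracy.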

Performance of the QSAR model

We evaluated the performance of StarDrop NIHS 834_67 in predicting Ames mutagenicity. We calculated the Ames mutagenicity of the 406 food flavors listed in the new Ames test database by using StarDrop NIHS 834_67, DEREK Nexus™, and CASE Ultra. Table 4 shows the results of the QSAR calculations. Table 5 is a 2 × 2 prediction matrix, and Table 6 shows the performance (sensitivity, specificity, accuracy, and applicability) of the three (Q)SAR tools. StarDrop NIHS 834_67 showed the best performance. Table 7 shows the nine FN chemicals that were positive in the Ames test but were predicted as negative by StarDrop NIHS 834_67. Table 8 shows the 13 FP chemicals that were negative in the Ames test but were predicted as positive by StarDrop NIHS 834_67.
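The FN and FP counts above, combined with the class sizes of the database, determine the 2 × 2 matrix for StarDrop NIHS 834_67. In the sketch below, the numbers of Ames-positive and Ames-negative flavor chemicals are not stated directly in the text; they are inferred by subtracting the 428 added chemicals from the 834-chemical study dataset. The calculation also assumes that all 406 chemicals received a prediction. The values in Table 6 take precedence over this consistency check.

```python
# Class sizes inferred from the study dataset description (an assumption).
positives = 299 - 255   # Ames-positive flavor chemicals
negatives = 535 - 173   # Ames-negative flavor chemicals

fn, fp = 9, 13          # counts stated for Tables 7 and 8
tp = positives - fn
tn = negatives - fp

sensitivity = 100 * tp / (tp + fn)
specificity = 100 * tn / (tn + fp)
accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
```

Under these assumptions the matrix is TP = 35, FN = 9, FP = 13, and TN = 349.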

Table 5 Results of QSAR calculation of 406 flavor chemicals in 2 × 2 contingency matrix
Table 6 Performance of three QSARs for predicting Ames mutagenicity of 406 flavor chemicals
Table 7 Ames positive chemicals, but predicted as negative by StarDrop NIHS 834_67 (False negative)
Table 8 Ames negative chemicals, but predicted as positive by StarDrop NIHS 834_67 (False positive)

Discussion

We have developed a new Ames database consisting of 406 food flavor chemicals. This benchmark database is open to the public and is useful for the risk assessment of food additives and for developing QSAR models that predict the Ames mutagenicity of food flavor chemicals and other low molecular weight chemicals. The main body of the database is derived from the database reported by Ono et al. [4]. We re-assessed the 14 equivocal chemicals and classified them as negative, positive, or inconclusive. However, the positive and negative chemicals remaining in Ono’s database were not re-assessed, and some of them may also have been misjudged. In fact, 2,3-pentanedione (600–14–6), which was negative in Ono’s database, was clearly positive in the present Ames test (Additional file (6)). To ensure the robustness of the database, the test results reported as positive or negative also need to be re-assessed. In particular, as described below, Ames test results that differ from the QSAR predictions deserve scrutiny.

In 2012, Ono et al. reported the performance of three commercial QSAR tools (Derek for Windows, MultiCASE, and ADMEWorks) for predicting the Ames mutagenicity of 367 food flavor chemicals [4]. Derek for Windows and MultiCASE are earlier versions of DEREK Nexus™ and CASE Ultra, respectively. The sensitivity, specificity, and accuracy were 38.9, 93.4, and 88.0% for Derek for Windows and 25.0, 94.3, and 87.5% for MultiCASE, respectively. In this study, we evaluated the performance of DEREK Nexus™ and CASE Ultra for the 406 food flavors in the new Ames database. The sensitivity, specificity, and accuracy were 70.5, 96.1, and 93.3% for DEREK Nexus™ and 70.5, 90.3, and 88.2% for CASE Ultra, respectively. These results indicate that the performance of QSAR prediction has improved substantially over the last decade. The improvement in sensitivity was particularly remarkable. Improvement of the QSAR models and the accumulation of newly acquired Ames test training data may have contributed to the high performance. In particular, the NIHS-sponsored Ames/QSAR International Challenge Project has contributed significantly to improving the performance of commercial QSAR tools such as DEREK Nexus™ and CASE Ultra, which have acquired Ames data for over 12,000 unique chemicals [24]. The newly developed StarDrop NIHS 834_67 outperformed DEREK Nexus™ and CASE Ultra. StarDrop NIHS 834_67 also incorporated 428 chemicals (positive: 255, negative: 173) selected from the 12,000-chemical Ames dataset. Although it drew on the same source of training data, StarDrop NIHS 834_67 provided higher predictive performance, probably because of differences in the target chemical space. Flavor chemicals have relatively low molecular weights and characteristic functional groups, so a model can focus on the chemical space of interest and achieve high predictive performance with a relatively small training dataset. Our attempt to develop a local QSAR model focused on flavor chemicals has been somewhat successful. However, it is not surprising that StarDrop NIHS 834_67 showed higher performance than the other QSAR tools. This may be because StarDrop NIHS 834_67 used the 39 new flavor chemical data and the revised existing flavor chemical data as training and validation data.

Considering that the interlaboratory reproducibility of the Ames test has been estimated at approximately 85% [27, 28], the predictive performance may be approaching its upper limit. Nonetheless, analysis of the FN and FP chemicals points to possible improvements in the database and the QSAR models. Of the nine chemicals that were FN with StarDrop NIHS 834_67, menthone (89–80–5), raspberry ketone (54–51–2), and cadinene (29350–73–0) were also predicted as negative by DEREK Nexus™ and CASE Ultra (Table 7). These chemicals, which were predicted to be negative by all three QSAR tools, may actually be Ames negative. Actual Ames tests are needed to confirm this.

In this study, we performed the Ames test on raspberry ketone (54–51–2), and the result was positive (Table 4). However, the mutagenic activity was very weak (RAV: 10) (Additional file (12)). A structural feature found among the FN chemicals is the α,β-unsaturated carbonyl structure: trans-cinnamaldehyde (104–55–2), 4-phenyl-3-buten-2-one (122–57–6), 4-methyl-2-pentenal (5362–56–1), and 2-furyl methyl ketone (1192–62–7) were predicted to be positive by DEREK Nexus™ and/or CASE Ultra. The α,β-unsaturated carbonyl structure is a typical alert for Ames mutagenicity [29,30,31]. These predictions indicate that the alert is incorporated in DEREK Nexus™ and CASE Ultra but not in StarDrop NIHS 834_67. Incorporating α,β-unsaturated carbonyl chemicals into the training data is expected to reduce the FN rate of StarDrop NIHS 834_67 and improve its predictive performance.

On the other hand, of the 13 FP chemicals, 3,4-hexanedione (4437–51–8) was also predicted as positive by DEREK Nexus™ and CASE Ultra. This chemical may actually be Ames positive. Interestingly, the other 12 FP flavor chemicals were correctly predicted as negative by DEREK Nexus™ and CASE Ultra, which highlights the differences in characteristics between StarDrop NIHS 834_67 and the other QSAR tools and indicates the potential for further improvement.
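The reasoning used in the FN and FP analysis, namely that an Ames result contradicted by all three tools is a candidate for re-testing, can be expressed as a simple filter. The function names and data layout are illustrative. The four named chemicals carry the calls described in this Discussion, and the final entry is hypothetical, included only to show a case that is not flagged.

```python
TOOLS = ("StarDrop NIHS 834_67", "DEREK Nexus", "CASE Ultra")


def all_tools_disagree(ames_result, predictions):
    """True when every tool's prediction differs from the Ames test result."""
    return all(predictions[tool] != ames_result for tool in TOOLS)


def uniform(call):
    """Helper: the same call from all three tools."""
    return {tool: call for tool in TOOLS}


records = {
    "menthone": ("positive", uniform("negative")),
    "raspberry ketone": ("positive", uniform("negative")),
    "cadinene": ("positive", uniform("negative")),
    "3,4-hexanedione": ("negative", uniform("positive")),
    # Hypothetical entry (not from the study): one tool agrees with the test.
    "hypothetical chemical": (
        "positive",
        {**uniform("negative"), "DEREK Nexus": "positive"},
    ),
}

flagged = sorted(
    name for name, (ames, preds) in records.items()
    if all_tools_disagree(ames, preds)
)
```

A flag of this kind only prioritizes chemicals for experimental follow-up. Raspberry ketone, for example, was flagged by all three tools yet tested weakly positive in this study.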

Conclusions

We developed a new Ames database of 406 food flavor chemicals. Using this database and other Ames datasets of chemicals that are structurally similar to flavor chemicals, we also developed a new QSAR model for predicting Ames mutagenicity. The local QSAR model, StarDrop NIHS 834_67, is customized to efficiently predict the mutagenicity of food flavors and other low molecular weight chemicals, delivering performance superior to that of other commercial QSAR tools. With further improvement, the model could be used to assess the mutagenicity of food flavors without actual testing.