1 Introduction

Xanthine oxidase (XO) catalyzes the oxidation of xanthine to uric acid, the last reaction in the metabolic pathway of purine breakdown in the body. Inhibiting XO, and thereby reducing the formation of uric acid, has been shown to be one of the most effective treatments for gout. However, XO inhibitors such as allopurinol and febuxostat still have undesirable effects, including skin rash, nausea, vomiting, kidney failure and Stevens-Johnson syndrome [16]. Therefore, the search for compounds of natural origin with XO inhibitory activity and low toxicity has recently been considered an alternative route in the discovery and development of new drugs for gout treatment.

Celery (Apium graveolens L.) has been widely used in folk medicine for the treatment of rheumatoid arthritis [19]. The hydroalcoholic extract of celery has also been shown to lower uric acid levels in vivo by inhibiting XO activity [6]. Flavonoids, the main active constituents of celery extract, have been confirmed to inhibit XO in many studies [4, 14]. The activity of celery extract may therefore result from a combination of several flavonoids rather than from a single compound [27].

In the traditional approach to quality control of herbal medicines, one or several chemical compounds, known as markers, are chosen and analyzed. However, in most cases these compounds have not been tested to confirm that they share the pharmacological activity of the herbal medicine [3]. Moreover, because of the complexity of the chemical composition and mechanism of action, the quality of an herbal product cannot be accurately reflected by this approach [24]. To overcome this shortcoming, the spectrum-effect relationship approach was proposed; it is considered an interdisciplinary and innovative approach that allows plant materials to be studied more accurately [26]. The number of publications in this field has risen dramatically in the past two decades. Some notable results are listed in Table 1.

Table 1 Some recent spectrum-effect relationship studies

Spectral data can be obtained by a variety of methods, including high-performance liquid chromatography (HPLC), gas chromatography (GC) and infrared spectroscopy (IR) [26]. However, the main disadvantage of these methods is that they require sophisticated and expensive equipment, which is not always affordable for an ordinary laboratory. Ultraviolet–visible (UV–Vis) spectroscopy is a much simpler and more cost-effective option. Our study was therefore conducted with the aim of developing a quantitative model that describes the relationship between the UV–Vis spectrum and the xanthine oxidase inhibitory effect of celery seed extract.

2 Materials and methods

2.1 Materials

Celery seeds were collected in June 2018 in Hai Hau, Nam Dinh, Vietnam. The plant material was identified and authenticated by Prof. Hang Nguyen Thu, Department of Pharmacognosy, Hanoi University of Pharmacy. The voucher specimen (ID: HNIP/18,542/19) was deposited at the Herbarium of the Department of Botany, Hanoi University of Pharmacy. After harvest, the seeds were dried and stored in a cool, dry place.

2.2 Extraction

Celery seeds were ground, accurately weighed (about 15.00 g) and transferred into a round-bottom flask. The solvent (ethanol/water) was then added, and the mixture was heated for 1 h and filtered. The process was repeated once more. The extracts were combined, 1.50 g of solid paraffin was added, and the mixture was stirred for 10 min at 70 °C. The mixture was allowed to cool, the paraffin was removed and the solvent was evaporated. By varying three extraction conditions (ethanol concentration, temperature and solvent:solid ratio), 17 different celery seed extracts were prepared.

2.3 Ultraviolet–visible (UV–Vis) spectroscopy

A mixture of 0.1500 g of celery seed extract and 20 ml of methanol was sonicated for 10 min and then centrifuged at 3000 rpm for 10 min. After the supernatant was collected, the residue was extracted by sonication and centrifuged under the same conditions three more times. The stock solution was prepared by combining all the supernatants, transferring them into a 100 ml volumetric flask, making up to the mark with methanol and mixing. To obtain the test solution, exactly 0.5 ml of the stock solution was transferred into a 10 ml volumetric flask, 5 ml of 1% triethylamine in methanol was added, and the flask was made up to the mark with methanol. The absorbance of the test solution (A) was scanned from 190 to 600 nm in steps of 5 nm using a Hitachi U-1900 UV–Vis spectrophotometer.
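The dilution scheme and scan grid described above can be checked with a few lines of arithmetic. The sketch below assumes that the whole 0.1500 g of extract ends up in the stock solution (a nominal concentration, not a measured one); all variable names are illustrative.

```python
# Worked arithmetic for the sample preparation and scan grid in Section 2.3.
# Assumption: the full 0.1500 g of extract is recovered in the stock solution.
extract_mass_mg = 150.0      # 0.1500 g of extract
stock_volume_ml = 100.0      # stock volumetric flask
aliquot_ml = 0.5             # stock taken for the test solution
test_volume_ml = 10.0        # test volumetric flask

stock_conc = extract_mass_mg / stock_volume_ml           # mg/ml
test_conc = stock_conc * aliquot_ml / test_volume_ml     # mg/ml

# Scan from 190 to 600 nm inclusive in 5 nm steps.
wavelengths = list(range(190, 601, 5))

print(f"stock: {stock_conc:.3f} mg/ml, test: {test_conc * 1000:.1f} µg/ml")
print(f"{len(wavelengths)} wavelengths from {wavelengths[0]} to {wavelengths[-1]} nm")
```

Under this assumption the test solution has a nominal concentration of 75 µg/ml, and the scan grid contains 83 wavelengths, which matches the 83 independent variables used for model building.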

2.4 Xanthine oxidase inhibitory activity assay

The XO inhibitory effect of celery seed extract was assessed in Corning 96-well UV plates using the method described by Nguyen et al. [15] with minor modifications. Each well contained 50 µl of test solution, 35 µl of 70 mM phosphate buffer (pH 7.5) and 30 µl of 0.01 U/ml enzyme solution (in 70 mM phosphate buffer), which was prepared immediately before use. After incubation at 25 °C for 15 min, 60 µl of 150 µM xanthine solution was added. The mixture was incubated at 25 °C for a further 30 min, and then 25 µl of 1 N HCl was added. The absorbance was measured at 290 nm. Each experiment was repeated three times. The half maximal inhibitory concentration (IC50) of the extracts was determined using GraphPad Prism 8.0 software. Quercetin was used as the positive control.

2.5 Model building

2.5.1 Data preprocessing and feature selection

The response variable (logIC50) was obtained by taking the common logarithm of the half maximal inhibitory concentration (IC50), while the absorbances were used as independent variables (features). The initial dataset was then randomly divided, using SPSS 22.0 software, into a training set (12 samples, 70%) for feature selection and model building and a test set (5 samples, 30%) for model evaluation. The features with the greatest influence on logIC50 were selected with the correlation-based feature selection algorithm in Weka 3.8 software.
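The transformation and split can be sketched as follows. This is an illustration only: Python's `random` module stands in for the SPSS procedure used in the study, the IC50 values are hypothetical, and the seed is arbitrary.

```python
import math
import random

# Hypothetical IC50 values for 17 extracts; these are NOT the study's data.
ic50 = {f"S{i}": 20.0 + 5.0 * i for i in range(1, 18)}

# Response variable: common (base-10) logarithm of IC50.
log_ic50 = {name: math.log10(value) for name, value in ic50.items()}

# Random 12/5 split; the seed is arbitrary and only makes the sketch repeatable.
rng = random.Random(0)
names = sorted(ic50, key=lambda s: int(s[1:]))
test_names = set(rng.sample(names, 5))
train_names = [n for n in names if n not in test_names]

print("training:", train_names)
print("test:", sorted(test_names, key=lambda s: int(s[1:])))
```

Because the split is random, a different seed (or a different software package) generally yields a different test set from the one reported in the study.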

2.5.2 Model building

Five models were constructed in Weka 3.8 software using five methods: multiple linear regression (MLR), artificial neural network (ANN), support vector regression (SVR), random forest (RF) and partial least squares (PLS) regression.

2.5.2.1 Multiple linear regression (MLR)

The MLR model is expressed by the following equation:

$$Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \cdots + \beta_{n} X_{n}$$

where the dependent variable Y is logIC50, the independent variable Xi is the absorbance at the i-th selected wavelength, βi is the corresponding coefficient and β0 is the intercept [10].

2.5.2.2 Artificial neural network (ANN)

An artificial neural network is a method that simulates the learning and information processing of the human brain [17]. A typical artificial neural network consists of artificial neurons grouped into three types of layers: one input layer, one output layer and one or more hidden layers. The input layer represents the independent variables (absorbances), the output layer represents the dependent variable (logIC50), and the hidden layer carries out the information processing of the network. During learning, the network weights are adjusted, based on the difference between the predicted and observed values of the output variable in the previous iteration, to minimize this error [2]. In this study, three parameters that affect the quality of the ANN model were investigated: the number of neurons in the hidden layer (1–7), the learning rate (0.1–0.9 in increments of 0.2) and the momentum (0.1–0.9 in increments of 0.2).
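The parameter levels listed above can be enumerated as below. This sketch assumes a one-factor-at-a-time design, as described in the results, and only lists (factor, level) pairs because the baseline values held fixed for the other parameters are not stated in the text.

```python
# Levels investigated for each ANN parameter (Section 2.5.2.2).
hidden_neurons = list(range(1, 8))                              # 1 to 7
learning_rates = [round(0.1 + 0.2 * i, 1) for i in range(5)]    # 0.1 to 0.9
momenta = [round(0.1 + 0.2 * i, 1) for i in range(5)]           # 0.1 to 0.9

# One-factor-at-a-time: each run varies one parameter and holds the others fixed.
# The fixed baseline values are not given in the text, so they are omitted here.
runs = (
    [("hidden_neurons", n) for n in hidden_neurons]
    + [("learning_rate", lr) for lr in learning_rates]
    + [("momentum", m) for m in momenta]
)
print(len(runs), "candidate settings")
```

The count of 7 + 5 + 5 settings agrees with the 17 ANN models reported in Section 3.2.2.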

2.5.2.3 Support vector regression (SVR)

The support vector machine algorithm was first proposed by V. Vapnik [5] and can be applied to both classification and regression problems. Support vector regression, the form of support vector machine that deals with regression tasks, aims to find a function y = f(x) that is as flat as possible while the error between the value yi predicted by the model and the observed value does not exceed a given value ε. The SVR model can be either linear or nonlinear [23]. In this study, nonlinear SVR with the RBF kernel function was used. Two parameters that affect the quality of the SVR model were investigated: C (1–500 in increments of 50) and gamma γ (0.1–0.9 in increments of 0.2).
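Two ingredients of the SVR description can be written out explicitly: the RBF kernel, which γ controls, and the ε-insensitive loss, which ignores errors that stay within ε. The sketch below uses the common textbook forms of both; the absorbance vectors and logIC50 values are hypothetical, and the function names are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def epsilon_insensitive_loss(y_obs, y_pred, epsilon):
    """Zero while the error stays within epsilon, linear beyond it."""
    return max(0.0, abs(y_obs - y_pred) - epsilon)

# Hypothetical absorbance vectors for two extracts.
x1 = [0.52, 0.31, 0.88]
x2 = [0.50, 0.35, 0.80]
print(rbf_kernel(x1, x2, gamma=0.1))
print(epsilon_insensitive_loss(2.10, 2.14, epsilon=0.1))  # error within epsilon
print(epsilon_insensitive_loss(2.10, 2.40, epsilon=0.1))  # error beyond epsilon
```

When absorbance vectors lie close together, as in this toy example, the kernel value stays near 1 for small γ, which is one possible reason a model can be insensitive to γ over a limited range.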

2.5.2.4 Random forest (RF)

Random forest is an ensemble learning method which was first proposed by L. Breiman and has become one of the best-performing learning algorithms [20, 22]. In this study, different numbers of trees (from 10 to 100) were investigated to obtain the best model.

2.5.2.5 Partial least squares regression (PLS)

With high-dimensional data, feature selection needs to be performed before MLR or other algorithms such as ANN, SVR and RF can be used. An alternative is to reduce the dimension of the data by creating new variables with PLS regression. This algorithm is particularly suitable when there are more variables than observations or when multicollinearity exists among the independent variables [8]. In this study, the number of latent variables created from the 83 independent variables was investigated to obtain the best model.
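How latent variables are built can be illustrated with a minimal single-response PLS (PLS1) fitted by the NIPALS procedure. This is a sketch in plain Python, not the Weka implementation used in the study; the function names and the toy data are hypothetical.

```python
def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS with deflation; returns what pls1_predict needs."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]   # centered X
    f = [v - y_mean for v in y]                                 # centered y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:        # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]   # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # X loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt                         # y loading
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict y for one sample by repeating the deflation steps on it."""
    x_mean, y_mean, components = model
    e = [a - b for a, b in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


# Toy data (hypothetical): y is an exact linear function of two "absorbances".
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
y = [1 + 2 * a - 3 * b for a, b in X]
model = pls1_fit(X, y, n_components=2)
print([round(pls1_predict(model, row), 6) for row in X])
```

Each latent variable is a weighted combination of all original variables, which is why no prior feature selection is required. The number of components plays the role of the tuning parameter investigated in the study.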

2.5.3 Validation

The models were selected based on the coefficient of determination R2, the leave-one-out (LOO) cross-validated correlation coefficient Q2, the LOO mean absolute deviation (MAD), the LOO root mean square error (RMSE) and the accuracy of the predicted values of the dependent variable for the training set and the test set.
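The error statistics can be written out as below. The definitions are assumptions, since the text does not give formulas: MAD is taken as the mean absolute difference between observed and predicted values, and Q2 as 1 − PRESS/SS_tot computed from leave-one-out predictions. The "accuracy" measure is not defined in the text and is therefore omitted. The numbers are hypothetical.

```python
def mad(observed, predicted):
    """Mean absolute deviation between observed and predicted values (assumed definition)."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5

def q2(observed, loo_predicted):
    """Q2 = 1 - PRESS / SS_tot, with predictions from leave-one-out refits."""
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, loo_predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / ss_tot

# Hypothetical observed logIC50 values and leave-one-out predictions.
observed = [2.0, 2.5, 3.0, 3.5]
loo_predicted = [2.1, 2.4, 3.2, 3.4]
print(f"MAD = {mad(observed, loo_predicted):.4f}")
print(f"RMSE = {rmse(observed, loo_predicted):.4f}")
print(f"Q2 = {q2(observed, loo_predicted):.4f}")
```

The same `q2` formula applied to predictions from the model fitted on all training samples gives R2.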

2.5.4 Y randomization

Y randomization is performed by randomly shuffling the dependent variable while keeping the remaining variables unchanged, to confirm that the developed model was not obtained by chance. The established model is considered reliable if the Q2LOO values of the new models are significantly lower than the Q2 of the initial model [1].
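The procedure can be sketched with a stand-in model. The example below uses a one-variable least-squares line instead of the study's models, hypothetical data and an arbitrary seed; only the shuffling logic is the point.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (stand-in for the real model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def q2_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS / SS_tot."""
    press = 0.0
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    my = sum(y) / len(y)
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)

# Hypothetical training data: 12 samples with a clear linear trend.
x = [i / 10 for i in range(12)]
y = [2.0 + 1.5 * xi + 0.02 * (-1) ** i for i, xi in enumerate(x)]

q2_original = q2_loo(x, y)

rng = random.Random(42)          # arbitrary seed, for repeatability only
q2_shuffled = []
for _ in range(10):
    y_perm = y[:]
    rng.shuffle(y_perm)          # break the X-Y pairing, keep X unchanged
    q2_shuffled.append(q2_loo(x, y_perm))

print(f"original Q2 = {q2_original:.3f}")
print(f"mean shuffled Q2 = {sum(q2_shuffled) / len(q2_shuffled):.3f}")
```

A model that captures a real relationship should lose most of its Q2 once the response is shuffled; shuffled Q2 values can even be negative.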

2.6 Application domain

The application domain of the best model was determined using the leverage approach. An observation is considered an outlier if its leverage value hi is larger than the warning leverage h* and its standardized residual is greater than 2.0. The warning leverage h* is calculated using the following equation:

$$h^{*} = \frac{3\left( k + 1 \right)}{n}$$

where k is the number of features and n is the number of observations [21].
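A short sketch of the warning leverage and the outlier rule follows. The values k = 4 and n = 12 are assumptions chosen for illustration (four latent variables, twelve training samples); they reproduce the h* = 1.25 reported in Section 3.3. Taking the absolute value of the standardized residual is also an assumption.

```python
def warning_leverage(k, n):
    """h* = 3(k + 1) / n for k features and n observations."""
    return 3 * (k + 1) / n

def is_outlier(h_i, std_residual, h_star, residual_limit=2.0):
    """Flag an observation only when both criteria in the text are exceeded.

    Assumption: the residual criterion is applied to its absolute value.
    """
    return h_i > h_star and abs(std_residual) > residual_limit

h_star = warning_leverage(k=4, n=12)   # illustrative values, see lead-in
print(h_star)
print(is_outlier(h_i=0.40, std_residual=1.3, h_star=h_star))
```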

2.7 Sensitivity analysis

Sensitivity analysis studies how the output (Y) is affected when a feature (X) changes within a certain range. The influence of the independent variables on the XO inhibitory activity of celery seed extract was determined based on the sensitivity coefficient (SC), which was calculated as follows:

$$SC = \frac{{{\Delta }Y/Y}}{{{\Delta }X/X}}$$

where ΔX/X and ΔY/Y are the relative changes in the independent and dependent variables, respectively [7].
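The sensitivity coefficient can be computed for any fitted model by perturbing one input at a time. The sketch below uses a hypothetical linear model and hypothetical absorbances; the perturbation steps of 10% to 50% follow those used in Section 3.4.

```python
def sensitivity_coefficient(model, x, index, fraction):
    """SC = (ΔY / Y) / (ΔX / X) when feature `index` is increased by `fraction`."""
    y0 = model(x)
    x_new = list(x)
    x_new[index] = x[index] * (1.0 + fraction)
    y1 = model(x_new)
    return ((y1 - y0) / y0) / fraction

# Hypothetical linear model and absorbances, for illustration only.
def toy_model(x):
    return 1.0 + 2.0 * x[0] - 0.5 * x[1]

x = [0.5, 1.0]
steps = [0.1, 0.2, 0.3, 0.4, 0.5]
for i in range(len(x)):
    avg_sc = sum(sensitivity_coefficient(toy_model, x, i, s) for s in steps) / len(steps)
    print(f"feature {i}: average SC = {avg_sc:.3f}")
```

For a linear model the coefficient does not depend on the step size, so averaging over the steps changes nothing; for nonlinear models the average summarizes the response over the tested range.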

3 Results and discussion

3.1 Database preparation

Seventeen different celery seed extracts were obtained by varying three extraction conditions: ethanol concentration, temperature and solvent:solid ratio. The absorbances of the 17 celery seed extracts over the wavelength range from 190 to 600 nm and their half maximal inhibitory concentration (IC50) values were determined using the methods described in Sections 2.3 and 2.4. The statistical properties of the input and output variables are presented in the Supplementary Material. The UV–Vis spectra of the 17 celery extracts are presented in Fig. 1.

Fig. 1 The UV–Vis spectra of 17 celery extracts

Using SPSS 22.0 software, 12 samples were randomly selected for feature selection and for building the models relating the XO inhibitory activity to the absorbance of celery seed extract, while the remaining 5 samples (S2, S3, S8, S12, S15) formed the test set.

Feature selection is a critical step in applying machine learning to model building. Too many features lengthen training, increase the risk of overfitting and sometimes reduce the accuracy of the resulting model. Feature selection overcomes these problems by keeping the important variables and reducing the number of features without losing necessary information. In this study, the correlation-based feature selection (Cfs) algorithm was used to identify which variables (absorbances at which wavelengths) had the greatest impact on logIC50. The absorbances of celery seed extracts at 6 wavelengths (500, 495, 410, 405, 230 and 210 nm) were found to be the most important features. These six absorbances are presented in Table 2.
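The idea behind correlation-based feature selection is that a good subset contains features that correlate strongly with the response but weakly with each other. A minimal sketch of the usual merit score follows; it assumes Pearson correlation as the correlation measure and does not reproduce the subset search performed by Weka. All data and names are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def cfs_merit(features, target):
    """Merit = k * r_cf / sqrt(k + k (k - 1) * r_ff).

    r_cf: mean absolute feature-target correlation,
    r_ff: mean absolute feature-feature correlation.
    """
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / (k + k * (k - 1) * r_ff) ** 0.5

# Hypothetical example: two uncorrelated features that jointly determine the target.
a = [1.0, -1.0, 1.0, -1.0]
b = [1.0, 1.0, -1.0, -1.0]
target = [ai + bi for ai, bi in zip(a, b)]
print(cfs_merit([a], target), cfs_merit([a, b], target))

# A perfectly redundant copy of a feature does not raise the merit.
a_copy = [2.0 * v for v in a]
print(cfs_merit([a, a_copy], target))
```

In this toy case the complementary pair scores higher than either feature alone, whereas the redundant pair scores the same as the single feature. This is the behavior that lets the method discard absorbances at neighboring, highly correlated wavelengths.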

Table 2 Absorbance at 6 wavelengths and IC50 values of 17 samples

3.2 Model building

3.2.1 Multiple linear regression (MLR)

Model M1 was built from the absorbances at the six selected wavelengths and logIC50 using the MLR algorithm; the fitted equation is:

$$\text{logIC}_{50} = 4.7945 - 0.9872 \times A_{405} + 3.6022 \times A_{230} - 4.6717 \times A_{210}$$

The M1 model had a low coefficient of determination (R2 = 0.5570) and a very low leave-one-out correlation coefficient (Q2 = 0.0406), which indicated that the model was not stable (Table 3).
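Applying the M1 equation to a sample is a direct substitution, sketched below. The coefficients are those reported for M1; the absorbance values are hypothetical and are not taken from Table 2.

```python
# Coefficients of model M1 as reported in the equation above.
INTERCEPT = 4.7945
COEFFS = {405: -0.9872, 230: 3.6022, 210: -4.6717}   # wavelength (nm) -> coefficient

def predict_log_ic50(absorbance):
    """absorbance: dict mapping wavelength (nm) to the measured absorbance."""
    return INTERCEPT + sum(c * absorbance[wl] for wl, c in COEFFS.items())

# Hypothetical absorbances, not taken from Table 2.
sample = {405: 0.20, 230: 0.90, 210: 1.10}
log_ic50 = predict_log_ic50(sample)
print(f"logIC50 = {log_ic50:.4f}, IC50 = {10 ** log_ic50:.2f}")
```

Given the low Q2 of M1, such predictions would not be expected to be reliable; the sketch only shows how the equation is used.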

Table 3 Evaluation of five models: MLR, ANN, SVR, RF and PLS

3.2.2 Artificial neural network (ANN)

In this study, a backpropagation neural network with a sigmoid activation function and 500 iterations was used to establish the relationship between the input and output variables. The effects of three parameters (number of nodes in the hidden layer, learning rate and momentum) on the quality of the ANN model were investigated using the one-factor-at-a-time method. Among the 17 models obtained, model M2 (number of hidden neurons = 1, learning rate = 0.5, momentum = 0.3) showed the best quality, with the lowest RMSE. However, its low leave-one-out Q2 indicated that this model might not be suitable for predicting the XO inhibitory activity of celery seed extract from UV–Vis spectra. Q2, MAD and RMSE are plotted against the model parameters in Fig. 2.

Fig. 2 Q2, MAD and RMSE plotted against (a) number of hidden neurons, (b) learning rate and (c) momentum

3.2.3 Support vector regression (SVR)

As for the ANN model, two parameters, C (from 1 to 500) and gamma γ (0.1–0.9), were investigated using the one-factor-at-a-time method in order to construct the best SVR model. However, these two parameters did not have any effect on the performance of the SVR model. Therefore, model M3 with C = 1 and γ = 0.1 was selected. The high values of the determination coefficient (R2 = 0.9347) and the leave-one-out correlation coefficient (Q2 = 0.9366) indicated a strong relationship between the input and output variables. The model also predicted the XO inhibitory effect of celery seed extract accurately, with small MAD and RMSE values and high accuracy (MAD = 0.1875, RMSE = 0.2284 and accuracy = 98.43% on the training set; MAD = 0.1317, RMSE = 0.1499 and accuracy = 97.45% on the test set). The observed and predicted values of logIC50 are plotted in Fig. 5.

3.2.4 Random forest (RF)

To identify the optimal number of trees, 10 models were constructed and the change in Q2, MAD and RMSE with the number of trees (from 10 to 100) was investigated. The results are presented in Fig. 3.

Fig. 3 Q2, MAD and RMSE versus number of trees of RF

As can be seen from the graph, among the 10 models, model M4 with 100 trees gave the best quality, with the RMSE reaching its lowest value of 0.2600.

3.2.5 Partial least squares regression (PLS)

Unlike MLR, ANN, SVR and RF, PLS is not hampered by high-dimensional data. Therefore, latent variables were created from the 83 initial independent variables without any prior feature selection. In this study, 9 models were established with 9 different numbers of latent variables. The results are presented in Fig. 4.

Fig. 4 Q2, MAD and RMSE versus number of latent variables of PLS

As can be seen from the graph, using 3 or more latent variables gave acceptable results, with a high leave-one-out Q2 and low MAD and RMSE. However, when the number of latent variables was 5 or greater, overfitting appeared to occur, as shown by the very low predicted Q2LOO of the test set. Therefore, the PLS model with 4 latent variables (model M5) was chosen for predicting the xanthine oxidase inhibitory activity of celery seed extract from UV–Vis spectra.

3.2.6 Selection of the optimal model

In order to predict the XO inhibitory activity of celery seed extract, five models (M1, M2, M3, M4 and M5) were built using the MLR, ANN, SVR, RF and PLS algorithms, respectively. The statistical parameters of these five models are shown in Table 3, and the scatter plots of observed versus predicted values of logIC50 are presented in Fig. 5. The M5 model constructed by the PLS method had the best quality and predictive ability, with high correlation coefficients (R2 = 0.9618, Q2 = 0.8746), small errors (MAD = 0.0671, RMSE = 0.0761) and high accuracy (99.29%). The SVR model (M3) also performed well; its high leave-one-out correlation coefficient (Q2 = 0.9366) suggested that this model was also able to predict the biological activity of celery seed extract with high accuracy. ANN, MLR and RF did not appear to be good choices in this case. Therefore, the PLS model was chosen for further analysis. Our results are consistent with other studies in which the relationship between spectral data and biological activity was established well using partial least squares regression [12, 13, 18]. These results also support the existence of a quantitative relationship between the UV–Vis spectra and the XO inhibitory effect of celery seed extract. However, because of the small sample size, more investigations are required before this model can be applied to predict the XO inhibitory effect of celery extract from UV–Vis absorbances at 6 different wavelengths.

Fig. 5 The scatter plots of observed and predicted values

3.2.7 Y randomization

To confirm that the established PLS model was not obtained by chance, the Y-randomization test was applied. The dependent variable was randomly shuffled ten times and 10 new PLS models were obtained. The average Q2 of these 10 models was 0.1976, which was much lower than the Q2 of the initial model (0.8746), supporting the robustness of the PLS model.

3.3 Application domain

The Williams plot of the PLS model is shown in Fig. 6. All 17 extracts (12 in the training set and 5 in the test set) had hi values lower than the warning leverage h* = 1.25 and standardized residuals of less than 2.0. In other words, there were no outliers in the dataset and the biological activities predicted by the established model were reliable.

Fig. 6 Application domain of PLS model

3.4 Sensitivity analysis

Using the PLS transformation, absorbances at 16 wavelengths (210, 230, 235, 240, 245, 250, 255, 320, 325, 330, 335, 340, 400, 405, 410 and 415 nm) were identified as potentially playing an important role in the biological activity of celery seed extract. Sensitivity analysis was performed on all 16 variables in order to explain the influence of the input parameters on the output. Each variable was increased by 10%, 20%, 30%, 40% and 50%, and the average sensitivity coefficient (SC) of each independent variable was then determined. The results are shown in Fig. 7. Among the 16 parameters, the absorbance at 230 nm had the highest SC value. Therefore, the influence of this variable on the XO inhibitory activity of celery seed extract should be taken into account.

Fig. 7 Sensitivity coefficient of 16 input variables

To our knowledge, this study is the first to establish a quantitative relationship between UV–Vis spectra and the XO inhibitory effect of celery seed extract. Compared with chromatographic methods such as HPLC and GC, UV–Vis spectroscopy has the advantages of being simple, easy to implement and low-cost. However, one limitation of this method is that the UV–Vis spectrum reflects the chemical components only indirectly, so it is hard to identify exactly which compound or compounds are responsible for the biological effect of the extract. Despite this, UV–Vis spectroscopy remains an efficient method for the quality control of herbal medicines.

4 Conclusion

In this study, the xanthine oxidase inhibitory effect of celery extract was predicted from UV–Vis absorbances at 6 different wavelengths using 5 methods: multiple linear regression (MLR), artificial neural network (ANN), support vector regression (SVR), random forest (RF) and partial least squares (PLS) regression. The best model was obtained with the PLS algorithm, with a determination coefficient R2 = 0.9618, a leave-one-out correlation coefficient Q2 = 0.8746, MAD = 0.0671, RMSE = 0.0716 and accuracies of 99.29% and 96.69% on the training set and test set, respectively. Our results suggest that a PLS model built from UV–Vis spectra could be an effective method for the rapid determination of the xanthine oxidase inhibitory activity of herbal medicines, particularly celery seed extract.