Medical & Biological Engineering & Computing, Volume 46, Issue 6, pp 605–611

Discovering active compounds from mixture of natural products by data mining approach

  • Yi Wang
  • Yecheng Jin
  • Chenguang Zhou
  • Haibin Qu
  • Yiyu Cheng
Original Article

DOI: 10.1007/s11517-008-0323-1

Cite this article as:
Wang, Y., Jin, Y., Zhou, C. et al. Med Biol Eng Comput (2008) 46: 605. doi:10.1007/s11517-008-0323-1

Abstract

Traditionally, active compounds have been discovered from natural products by repeated isolation and bioassays, which can be highly time consuming. Here, we have developed a data mining approach that uses a causal discovery algorithm to identify active compounds from mixtures by investigating the correlation between their chemical composition and bioactivity. The efficacy of our algorithm was validated using the cytotoxic effect of Panax ginseng extracts on MCF-7 cells and compared with previous reports. It was demonstrated that our method could successfully pick out active compounds from a mixture in the absence of separation processes. The presented algorithm is expected to accelerate the process of discovering new drugs.

Keywords

Quantitative composition–activity relationship; Causality; Bioassay-guided isolation; Drug discovery; Traditional Chinese medicine

1 Introduction

Natural products such as alkaloids and terpenes from medicinal plants and some bacteria, steroids from marine animals, and macromolecular products have been important sources for drug discovery [9]. It has been reported that over 70% of new anti-cancer drugs are isolated from natural products or derived from synthetic molecules with naturally occurring molecular scaffolds [16]. Plant extracts that are traditionally used in China and India for memory enhancement, as nerve tonics, and for anxiolytic, anti-inflammatory and immunopotentiating effects have recently been screened for new leads of psychotherapeutic compounds using bioassay-guided screening [10, 11, 23]. Since naturally occurring compounds offer more chemical diversity than synthetic compound libraries, there is tremendous potential for identifying new active compounds from natural sources.

Traditionally, the discovery of active components from natural products has relied chiefly on chemical experiments and biological assays carried out as a sequential process [8]. In this process, the products of microbial fermentation or plant extraction are first separated into many fractions. Bioassays are then performed on each fraction to select the most active samples for further separation. After repeated rounds of this separation and bioassay loop, active substances (“leads”) are purified by chemical methods [15]. The obvious disadvantages of this method are its high cost, its labour intensity and its low efficiency in drug discovery. Therefore, new methods are needed to accelerate the identification of active substances from natural products and to reduce the cost of drug discovery.

Drug discovery and development is a rapidly developing area that increasingly depends on information technology to store, exchange, and mine data and information [6]. Methods of multi-component analysis have been widely used in the modern drug discovery process, for example in target discovery, combinatorial chemistry and chemometrics [3, 13]. Since the biological activity of a mixture might result from synergistic, additive or antagonistic effects of its active compounds [2], its bioactivity can be changed by varying the proportions of its active substances. Thus, novel approaches are needed to understand the relationship between chemical components and their pharmacological effects. Quantitative composition–activity relationship (QCAR) is a new data-mining method for understanding the correlation between the chemical composition of a mixture and its biological activity [20]. Essentially, the identification of active compounds in a mixture is a variable selection problem, which can be solved by computational approaches.

The objective of this study was to develop a bioinformatics approach for intelligently analyzing chemical and biological data to discover active compounds. The application and experimental validation of the proposed approach for discovering cytotoxic compounds in extracts of medicinal plants are presented and discussed.

2 Methods

2.1 Theory and algorithm

Typically, natural product extracts are composed of numerous substances. Although it is hard to isolate and purify each substance from the mixture, the chemical composition of natural products can be represented and characterized by spectrometric or spectral information. With the aid of modern analytical techniques such as chemical fingerprinting [21], quantification or semi-quantification of each constituent in the mixture can be readily achieved. Suppose there are n detected compounds Ci (i = 1, 2,..., n) in a mixture X extracted from a certain natural product; its chemical composition can then be represented by a vector [X1, X2,..., Xn]. As a different derivation or preparation procedure may alter some constituents, another extract X′ can be represented by \( [X'_{1}, X'_{2}, \ldots, X'_{n}] \). If there is a series of mixtures that contain the same compounds in different proportions, their activities Y may vary over a wide range because of the variation in the proportions of the active compounds. Thus, the relationship between the biological activity and the chemical composition of the natural product can be represented by a mathematical function Y = f(X).

For a mixture with one active compound CA, there is a dose-dependent relationship between the activity of the mixture and the proportion of CA. In such a situation, f(X) can be simplified to a linear or log-linear function. However, when there are several active compounds, additive or synergistic effects between the compounds should be taken into consideration. Computational approaches to variable selection can be used to investigate such chemical and biological datasets and to identify active compounds.
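
As an illustration of the single-compound case, the two simplified forms of f(X) might be written as follows, where \( X_{A} \) denotes the proportion of CA in the mixture. The coefficients \( \beta_{0} \) and \( \beta_{1} \) are symbols introduced here for the sketch and do not appear in the original derivation.

$$ Y = \beta_{0} + \beta_{1} X_{A} \quad \text{(linear)}, \qquad Y = \beta_{0} + \beta_{1} \log X_{A} \quad \text{(log-linear)}. $$

Under this reading, a compound with no effect on activity would have \( \beta_{1} = 0 \).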

In this study, a causal discovery method was developed to solve the variable selection problem. A causal relationship is a relation between particular events: something happens and causes something else to happen [14]. The basic idea of the causal analysis algorithm is to use conditional dependency tests to determine which explanatory variables are causally adjacent to the response variable.

Assume a dataset consists of the independent variables X and the dependent variable y. For xi ∈ X, the conditional dependence coefficient of xi and y given the variable subset S is \( r_{{i|S}} \). The statistic of the conditional independence test on xi and y given the variable subset S is
$$ F_{{i|S}} = \frac{{r^{2}_{{i|S}} }} {{1 - r^{2}_{{i|S}} }} \times (N - q - 1) \sim {}F_{{1,N - q - 1}} , $$
(1)
where N is the number of variables in X and q is the number of variables in subset S.
So, the corresponding probability (p value) can be calculated by
$$ p_{{i|S}} = F^{{ - 1}}_{{1,N - q - 1}} (F_{{i|S}} ). $$
(2)
If the threshold of significance level is set as α, then when
$$ p_{{i|S}} < \alpha , $$
(3)
xi and y are considered conditionally dependent; otherwise they are considered conditionally independent.
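
The statistic of Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the conditional dependence coefficient is taken to be the partial correlation, computed with the standard recursion on the conditioning set, and `n` is passed in explicitly and treated as the number of mixtures (observations). The function names are hypothetical. The probability of Eq. 2 is not computed here; an F-distribution routine such as `scipy.stats.f.sf` could supply it.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, conditioning):
    """Partial correlation of x and y given a list of conditioning columns.

    Assumption: this stands in for the conditional dependence coefficient
    r_{i|S}; it is computed by removing one conditioning variable at a time.
    """
    if not conditioning:
        return pearson(x, y)
    z, rest = conditioning[0], conditioning[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def f_statistic(x, y, conditioning, n):
    """Eq. 1 as written: F = r^2 / (1 - r^2) * (n - q - 1), with q = |S|."""
    q = len(conditioning)
    r = partial_corr(x, y, conditioning)
    return r ** 2 / (1 - r ** 2) * (n - q - 1)


# Rb1, Re and proliferation rate (%) of mixtures 1-6 in Table 1 (unnormalized)
rb1 = [0.92, 0.23, 0.20, 0.23, 1.05, 0.20]
re_ = [0.52, 0.28, 0.31, 0.43, 1.20, 0.27]
rate = [45.5, 83.0, 73.5, 87.3, 61.3, 67.1]

# F statistic for Rb1 versus proliferation rate, conditioning on Re
f_rb1_given_re = f_statistic(rb1, rate, [re_], n=len(rate))
```

Six mixtures are far too few for a meaningful test; the subset only keeps the example short.
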
To ensure that the selected variables are conditionally independent of the other variables, a forward and backward search is performed in the implementation of the algorithm, paralleling the procedure of stepwise regression. The detailed algorithm is listed as follows:

3 Experimental methods

Panax ginseng (PG) is among the most commonly used medicinal plants and grows chiefly in the northeast of China, Korea and some other East Asian regions. A large number of saponins, called ginsenosides, are considered to be the major active components of PG. Nine ginsenosides, namely Rb1, Rb2, Rb3, Rc, Rd, Re, Rf, Rg1, and Rg2, are commonly detected markers in ginseng extracts. Various pharmacological studies indicate that ginseng extracts have effects on the CNS and on the cardiovascular, endocrine and immune systems [1]. It has also been reported that ginseng extract can inhibit the proliferation of human breast cancer cells [7]. However, the anticancer effect of ginsenosides has not been clearly established. To evaluate and illustrate the capability of the presented approach in screening active compounds from mixtures, a dataset on the anti-proliferative effect of PG extracts was produced in our lab and used in the calculation.

Twenty-eight cultivated PG samples were collected from the major producing areas in three GAP cultivation gardens in Jilin and Liaoning Provinces of China. All ginseng samples were identified morphologically by the National Institute for the Control of Pharmaceutical and Biological Products (Beijing, China). Voucher specimens of PG have been deposited in the herbarium (Department of Chinese Medicine Sciences and Engineering, Zhejiang University, China) under the acquisition numbers RS001–RS036. The reproducible extraction procedure and the simultaneous determination of ginsenosides by HPLC/MS reported in our previous investigation [19] were used in this study. Nine ginsenosides (shown in Fig. 1) were detected and quantified as listed in Table 1. The proportions of the 9 ginsenosides in the 28 samples varied with producing area and growth year. To eliminate the influence of the different content levels of the ginsenosides in the mixtures, the chemical data were normalized. Briefly, the relative proportion of the ith compound Ci was normalized on a linear scale based on its maximum amount (ximax) and minimum amount (ximin) in all mixtures using x′i = (xi − ximin)/(ximax − ximin). The interactions between the nine components were calculated by correlation analysis.
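
The linear scaling described above can be sketched as follows. The function name is hypothetical, and returning zeros for a constant column is an assumption of this sketch, since that case is not covered in the text. Only the Rg1 contents of mixtures 1–5 from Table 1 are used to keep the example short, whereas the study scaled each compound over all 28 mixtures.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1] using their minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Rg1 contents of mixtures 1-5 in Table 1
rg1 = [0.36, 0.26, 0.20, 0.18, 0.25]
normalized = min_max_normalize(rg1)
```

After scaling, the mixture with the highest content of a compound maps to 1 and the one with the lowest maps to 0, so compounds with very different absolute contents become comparable.
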
Fig. 1 Chemical structures of nine ginsenosides

Table 1 The content of 9 ginsenosides in 28 extracted mixtures of P. ginseng and their effects on proliferation rate of human breast carcinoma cells, MCF-7

Mixture number  Rg1   Rb1   Rf    Re    Rg2   Rc    Rb2   Rb3   Rd    Proliferation rate (%)
1               0.36  0.92  0.12  0.52  0.11  0.37  0.56  0.14  0.25  45.5 ± 5.6
2               0.26  0.23  0.11  0.28  0.05  0.19  0.27  0.04  0.07  83.0 ± 3.0
3               0.20  0.20  0.14  0.31  0.05  0.18  0.29  0.05  0.08  73.5 ± 2.1
4               0.18  0.23  0.17  0.43  0.06  0.24  0.35  0.07  0.11  87.3 ± 3.6
5               0.25  1.05  0.21  1.20  0.04  1.02  1.36  0.29  0.36  61.3 ± 4.1
6               0.20  0.20  0.12  0.27  0.03  0.17  0.24  0.04  0.05  67.1 ± 5.8
7               0.21  0.18  0.11  0.29  0.02  0.17  0.25  0.04  0.06  69.1 ± 4.1
8               0.26  1.43  0.43  1.24  0.09  1.50  1.87  0.44  0.23  57.5 ± 5.7
9               0.26  0.98  0.11  1.05  0.04  0.37  0.45  0.23  0.35  71.0 ± 1.2
10              0.26  0.94  0.28  0.54  0.02  0.66  0.98  0.26  0.38  54.9 ± 2.2
11              0.24  0.54  0.21  1.26  0.06  0.27  0.37  0.07  0.06  104.9 ± 8.5
12              0.26  1.15  0.27  1.21  0.05  0.96  1.28  0.31  0.45  62.5 ± 0.8
13              0.15  0.17  0.07  0.25  0.03  0.15  0.20  0.03  0.04  85.1 ± 6.5
14              0.16  0.28  0.09  0.22  0.05  0.17  0.23  0.05  0.06  64.6 ± 3.0
15              0.16  0.29  0.13  0.27  0.07  0.21  0.29  0.05  0.05  69.7 ± 2.9
16              0.23  0.78  0.13  0.62  0.03  0.47  0.46  0.13  0.46  77.4 ± 7.9
17              0.22  1.04  0.20  0.92  0.04  0.74  1.00  0.18  0.66  55.1 ± 6.3
18              0.27  0.73  0.24  0.88  0.03  0.73  0.71  0.32  0.40  84.0 ± 4.0
19              0.40  1.35  0.19  1.25  0.09  0.88  1.38  0.33  0.30  65.9 ± 3.7
20              0.32  0.21  0.17  0.39  0.04  0.27  0.30  0.04  0.06  67.3 ± 1.5
21              0.21  0.21  0.11  0.28  0.04  0.23  0.30  0.05  0.07  72.3 ± 0.9
22              0.28  0.16  0.13  0.32  0.02  0.17  0.28  0.05  0.06  80.7 ± 2.4
23              0.40  0.60  0.21  1.15  0.03  0.92  1.42  0.28  0.45  81.1 ± 3.1
24              0.30  0.91  0.26  1.08  0.03  0.75  1.00  0.22  0.57  54.1 ± 1.9
25              0.25  1.04  0.22  1.44  0.05  1.36  1.82  0.42  0.62  76.3 ± 3.5
26              0.22  0.37  0.11  0.31  0.03  0.20  0.26  0.04  0.04  73.6 ± 1.8
27              0.20  0.43  0.18  1.04  0.05  0.21  0.32  0.03  0.08  118.6 ± 14.2
28              0.22  1.22  0.15  1.20  0.04  0.54  0.51  0.15  0.42  42.2 ± 1.3

The bioactivity assay of each sample was performed according to the protocol of Hu et al. [22] with minor modifications. Briefly, human breast carcinoma MCF-7 cells were grown in DMEM containing 10% fetal calf serum and 300 mg/L glutamine and cultured at 37°C in a humidified CO2 incubator. For the cytotoxicity assay, cells were seeded into 96-well plates at a density of 2 × 10³/well. The effects of ginseng extracts on proliferation were determined by an MTT assay after incubation of the cells with ginseng extracts at an approximate concentration of 500 μg/L for 48 h. Control cultures were incubated in medium alone. The cytotoxicity of adriamycin, a clinically used anticancer agent, toward MCF-7 cells was also determined as a positive control. All assays were performed at least four times. The percentage of MCF-7 proliferation was calculated as follows: cell proliferation rate (%) = 100 × (cell number in the experimental well/cell number in the control well). A proliferation rate below 100 indicates potential anti-cancer activity, whilst a proliferation rate above 100 suggests a growth-promoting effect on cancer cells.
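
The proliferation rate formula can be illustrated with a short Python sketch. The per-well readings below are made up for the example and are not data from the study, and the function name is hypothetical. Summarizing the replicates as a mean and a sample standard deviation is an assumption about how a "value ± spread" entry might be produced.

```python
from statistics import mean, stdev


def proliferation_rate(experimental, control):
    """Cell proliferation rate (%) = 100 x experimental / control."""
    return 100.0 * experimental / control


# Hypothetical readings for four replicate wells (illustration only)
control_wells = [1.00, 1.04, 0.96, 1.00]
treated_wells = [0.45, 0.50, 0.40, 0.47]

control_mean = mean(control_wells)
rates = [proliferation_rate(w, control_mean) for w in treated_wells]

# Mean and sample standard deviation across the replicates
summary = (mean(rates), stdev(rates))
```
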

3.1 Determination of parameter

The threshold α is an important parameter of the algorithm. Too low a threshold makes the significance test too strict, which may result in no compound being selected. Usually, the result is considered reliable when α ≤ 0.2, corresponding to a confidence level of at least 80%. In the present study, a series of thresholds in the range 0.01–0.2 was used to calculate the causality between the components and bioactivity. An appropriate threshold was then selected using the following evaluation criterion in cross-validation.

3.2 Evaluation criteria

To assess the reliability of the computational results, a statistic (instability factor, IF) is defined to evaluate the consistency of the selected compounds in leave-one-out (LOO) cross validation:
$$ {\text{IF}} = \frac{{{\sum\nolimits_{S_{j} \subset S} {(N_{{U \not\subset U_{j} }} + N_{{U_{j} \not\subset U}} )} }}} {{N_{S} }}, $$
(4)
where S is the sample set and \( N_{S} \) is the number of instances in S. Sj is the sample set excluding the jth instance in the LOO cross validation. U and Uj are the sets of variables selected by the variable selection algorithm on S and Sj, respectively. \( N_{U \not\subset U_{j}} \) is the number of variables in U that are not in Uj, and \( N_{U_{j} \not\subset U} \) is the number of variables in Uj that are not in U. In the current study, the IF values of the causal discovery algorithm and stepwise regression were compared, and the threshold that caused the maximal change in IF was selected for use in our study.
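
A minimal Python sketch of Eq. 4 is given below. The instability factor itself follows the definition above. The selector `toy_selector` and the sample values are hypothetical stand-ins, since any variable selection routine that returns a set of variable names could be passed in.

```python
def instability_factor(samples, select):
    """Eq. 4: sum over leave-one-out subsets S_j of the number of variables
    selected on S but not on S_j, plus those selected on S_j but not on S,
    divided by the number of samples."""
    full_selection = set(select(samples))
    total = 0
    for j in range(len(samples)):
        loo_samples = samples[:j] + samples[j + 1:]
        loo_selection = set(select(loo_samples))
        total += len(full_selection - loo_selection)
        total += len(loo_selection - full_selection)
    return total / len(samples)


def toy_selector(samples, cutoff=0.5):
    """Hypothetical stand-in for a variable selection algorithm: keeps every
    variable whose mean value over the samples exceeds `cutoff`."""
    names = samples[0].keys()
    return {name for name in names
            if sum(s[name] for s in samples) / len(samples) > cutoff}


# Made-up samples with two variables, for illustration only
samples = [
    {"A": 0.9, "B": 0.9},
    {"A": 0.8, "B": 0.1},
    {"A": 0.7, "B": 0.6},
    {"A": 0.9, "B": 0.6},
]
if_value = instability_factor(samples, toy_selector)
```

A selector that returns the same set on every leave-one-out subset gives IF = 0. Each variable that enters or leaves the selection when one sample is removed adds 1/N_S to the value.
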

3.3 Implementation

Our algorithm was implemented in Matlab 6.5 (MathWorks, Inc.). With the parameter specified above, active compounds were identified in less than 1 min using a PC with a 1.1 GHz CPU under Windows environment with 256 MB RAM.

4 Results

4.1 Selection of active ginsenoside by data mining approach

The ginseng dataset described above was used to test the capability of the proposed algorithm. Leave-one-out cross validation was performed to compare the reliability of the proposed approach with that of stepwise regression. The number of times each ginsenoside was selected is listed in Table 2. When the threshold was set to 0.05 or 0.2, only one ginsenoside, Rb1, was selected by our algorithm; it was regarded as causally adjacent to the decrease in cancer cell proliferation rate. In contrast, two ginsenosides, Rb1 and Re, were identified by stepwise regression with a threshold of 0.05, and when α was increased to 0.2 the result of stepwise regression changed and more compounds were selected. The results show that the number of active compounds selected by stepwise regression increases with the threshold, whilst the result of our algorithm is relatively stable. Here, the threshold is the significance level against which the probability of the experimental results under the null hypothesis is compared. A larger threshold results in lower reliability of the computational outcome. Traditionally, experimenters have used either the 0.05 level (sometimes called the 5% level) or the 0.01 level (1% level) as the significance level for statistical tests. However, when seeking active compounds in natural mixtures, the 0.05 level seems so strict that it may miss compounds with weak or potential activity. Thus, a relatively high, less stringent threshold, such as 0.2, may give better performance.
Table 2 Selected compounds of stepwise regression and causal discovery algorithm in leave-one-out cross validation

Frequency of variable (ginsenoside) selection

Threshold α  Algorithm         Rb1  Re  Rd  Rb3  Rc  Rg1  Rg2  Rb2  Rf
0.05         STEPWISE          28   28  0   0    0   0    0    0    0
0.05         Causal discovery  28   0   0   0    0   0    0    0    0
0.2          STEPWISE          28   28  16  1    0   0    0    0    0
0.2          Causal discovery  28   0   0   1    1   0    0    0    0

The variation of IF with the threshold (Fig. 2) indicates that the instability of the results obtained by stepwise regression is much higher than that of the causal discovery algorithm. These results demonstrate that our algorithm is more robust than the stepwise regression method in identifying active components. When the threshold was set at 0.2, the variation of IF in LOO cross validation by stepwise regression was maximal. This suggests that the variation of IF can be used as a criterion for selecting an appropriate threshold, just as the root mean square error of cross-validation (RMSECV) is used to select the number of hidden nodes in an artificial neural network (ANN). Although stepwise regression is the method most commonly used in empirical studies to relate an outcome variable to regressors, it rests on the distinct hypothesis that the possible probability distributions among a set of variables are restricted and unaltered. In some cases, such as nonlinear or manipulated systems, regression methods may lead to false results. In contrast, the causal discovery algorithm described above uses conditional independence tests in place of randomized controlled trials and provides a trustworthy way of establishing causal relationships from data. Causal discovery methods have been successfully applied in the social sciences, economics and medicine [12]. The major difference between stepwise regression and the proposed algorithm lies in the number of variable subsets used for the conditional independence test. In stepwise regression, all selected variables are used in a single statistical test, whereas our algorithm performs the test repeatedly on all subsets of the selected variables. The maximal or minimal statistic is then compared with the significance threshold. This stricter condition for variable selection therefore brings higher robustness.
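
The difference between a single test and a test over all subsets can be sketched as follows. This is only an illustration of the subset enumeration: `statistic` is a hypothetical callable that returns a test statistic for a given conditioning subset, and both extremes are returned because the text does not specify which one applies in a given step.

```python
from itertools import combinations


def all_subsets(items):
    """Yield every subset of `items`, from the empty set to the full set."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def extreme_statistics(selected, statistic):
    """Evaluate `statistic` on every subset of the selected variables and
    return the (minimum, maximum) over all of them."""
    values = [statistic(subset) for subset in all_subsets(selected)]
    return min(values), max(values)


# With two selected variables there are four conditioning subsets
subsets = list(all_subsets(["Rb1", "Re"]))

# Placeholder statistic (subset size), used only to exercise the enumeration
lowest, highest = extreme_statistics(["Rb1", "Re"], len)
```

The number of subsets doubles with each additional selected variable. That cost stays small here, where only a few of the nine ginsenosides are selected at any step.
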
Fig. 2 Variation of instability factor (IF) with the increase of threshold (α). The line with solid squares represents the IF of stepwise regression, whereas the line with diamonds represents the IF of the causal discovery algorithm in LOO cross validation

4.2 Verification of computational result by MTT assay

To further validate the computational results, we used the MTT assay to test the bioactivity of all nine components. The cytotoxicity of the nine ginsenosides toward MCF-7 cells was evaluated at concentrations from 15.6 to 500 μM. As shown in Fig. 3, Rb1 at concentrations above 15.6 μM moderately inhibited the proliferation of MCF-7 cells. In contrast, Re and the other ginsenosides showed little concentration-dependent inhibition of MCF-7 proliferation. These experimental results confirm that our algorithm can select the most active compound from a mixture of natural products.
Fig. 3 Dose-dependent proliferation rate of MCF-7 induced by nine ginsenosides. Rb1 inhibits the proliferation of cancer cells in a significant dose-dependent manner. Bars represent standard deviations, n = 4

An inter-correlation matrix of the nine components was constructed and is listed in Table 3. The inter-correlations among the proportions of the 9 ginsenosides in the 28 samples were not too high (less than 0.9), except for those among Rc, Rb2 and Rb3. However, because the proportions of Rc, Rb2 and Rb3 correlate only weakly with bioactivity, the high inter-correlations among them have little effect on the causality calculation. Our complementary experiments also confirmed that Rc, Rb2 and Rb3 have little effect on the growth of MCF-7 cells.
Table 3 Intercorrelations among proportions of 9 ginsenosides in 28 samples

      Rg1    Rb1    Rf     Re     Rg2     Rc     Rb2    Rb3
Rb1   0.446  1
Rf    0.326  0.653  1
Re    0.433  0.814  0.647  1
Rg2   0.260  0.359  0.213  0.224  1
Rc    0.436  0.823  0.811  0.783  0.229   1
Rb2   0.500  0.792  0.784  0.766  0.260   0.981  1
Rb3   0.494  0.850  0.764  0.775  0.228   0.957  0.938  1
Rd    0.380  0.763  0.476  0.688  −0.073  0.732  0.703  0.740

It has been suggested that ginseng extracts can show considerable variability in their bioactivity [24]. This variation might be attributed to differences in the chemical composition of ginseng extracts prepared by different researchers in different labs. King et al. [7] reported that altering the preparation method, including the solvent, temperature and time of the extraction process, led to opposite activity. The computational and experimental results in our work demonstrate that the variability in the inhibitory effect of ginseng extracts on MCF-7 proliferation may be caused by differences in ginsenoside composition.

5 Discussion and conclusion

The current study is a preliminary attempt to use an informatics approach to discover active compounds from natural products. Only a limited set of data has been tested, and more studies should be performed to validate the efficacy of the proposed algorithm. A major problem in the computational search for active compounds is whether the experimental error and uncertainty of the bioassay are small enough to yield reliable results. In our study, the relative standard deviation (RSD) of the MTT assay was mostly less than 10%, which is an acceptable level of experimental error for in vitro bioassays. In our previous study, the RSD of a bioassay in an animal model was much higher [5]. Although the accuracy of the informatics approach needs further improvement, interpretation of the computational results can give new insight into the chemical–biological space and allow the screening range to be narrowed. Thus, the presented data mining approach can guide the focused screening of active compounds and improve the effectiveness and efficiency of new drug discovery.

Besides traditional bioassay-guided isolation, many new strategies have been developed to identify active compounds from natural products. Butterweck and her colleagues [4] designed a step-by-step removal strategy to investigate the synergistic or antagonistic effects of different compounds in St. John’s wort (SJW), a natural product widely used in the treatment of mild to moderate forms of depression. In their study, several extracts of SJW containing different proportions of three major active components showed different antidepressant activity. It is expected that when the sample size of mixtures produced is properly increased, our algorithm can be extended to handle this type of chemical and biological data. Furthermore, the correlation between the chemical composition and bioactivity of natural products can be characterized and interpreted by modeling approaches such as ANN and support vector regression [18]. The resulting models will then help to determine the optimum proportions of active components. Similar to computer-aided drug design for single-entity Western drugs, the development of new botanical drugs may be achieved by computer-aided screening and rational combination.

One of the major developments in modern biology has been a shift from experimental description to an attempted “physicalistic” explanation of biological processes, including chemical–biological interactions [17]. Mathematical, statistical, and data mining approaches have been employed in biological studies to address a universal problem, i.e. that some factors cause temporal and spatial changes in a biological system. The factors (variables) can be gene frequencies in population genetics, the specific regulation of proteins in disease-related proteomics, or, in this case, the chemical composition of mixtures of natural products. Compared with traditional statistical tools that rely on measures of linear association, causal discovery approaches perform better in qualitatively delineating nonlinear systems. It can therefore be expected that our algorithm can be applied to complex chemical biology systems with numerous variables.

Acknowledgments

This project was financially supported by the Chinese National Basic Research Priorities Program (No. 2005CB523402), the Program for New Century Excellent Talents in University of China (No. NCET-06-0515) and the Science and Technology Development Program of Zhejiang Province (No. 2006C33024).

Copyright information

© International Federation for Medical and Biological Engineering 2008

Authors and Affiliations

  • Yi Wang
    • 1
  • Yecheng Jin
    • 1
  • Chenguang Zhou
    • 1
  • Haibin Qu
    • 1
  • Yiyu Cheng
    • 1
  1. Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
