
Analytical and Bioanalytical Chemistry, Volume 399, Issue 2, pp 635–649

Determination of galactosamine impurities in heparin samples by multivariate regression analysis of their 1H NMR spectra

  • Qingda Zang
  • David A. Keire
  • Richard D. Wood
  • Lucinda F. Buhse
  • Christine M. V. Moore
  • Moheb Nasr
  • Ali Al-Hakim
  • Michael L. Trehy
  • William J. Welsh
Original Paper

Abstract

Heparin, a widely used anticoagulant primarily extracted from animal sources, contains varying amounts of galactosamine impurities. Currently, the United States Pharmacopeia (USP) monograph for heparin purity specifies that the weight percent of galactosamine (%Gal) may not exceed 1%. In the present study, multivariate regression (MVR) analysis of 1H NMR spectral data obtained from heparin samples was employed to build quantitative models for the prediction of %Gal. MVR analysis was conducted using four separate methods: multiple linear regression, ridge regression, partial least squares regression, and support vector regression (SVR). Genetic algorithms and stepwise selection methods were applied for variable selection. In each case, two separate prediction models were constructed: a global model based on dataset A which contained the full range (0–10%) of galactosamine in the samples and a local model based on the subset dataset B for which the galactosamine level (0–2%) spanned the 1% USP limit. All four regression methods performed equally well for dataset A with low prediction errors under optimal conditions, whereas SVR was clearly superior among the four methods for dataset B. The results from this study show that 1H NMR spectroscopy, already a USP requirement for the screening of contaminants in heparin, may offer utility as a rapid method for quantitative determination of %Gal in heparin samples when used in conjunction with MVR approaches.

Keywords

Heparin · Galactosamine impurities · Proton nuclear magnetic resonance (1H NMR) · Multivariate regression (MVR) · Variable selection

Introduction

Heparin is a naturally occurring polydisperse mixture of linear, highly sulfated carbohydrates composed of repeating disaccharide units, which generally comprise a 6-O-sulfated, N-sulfated glucosamine alternating with a 2-O-sulfated iduronic acid [1, 2, 3]. As a member of the glycosaminoglycan family, heparin has the highest negative charge density among known biological molecules. During heparin biosynthesis, the polysaccharide chains are incompletely modified and variably elongated, leading to heterogeneity in chemical structure, diversity in sulfation patterns, and polydispersity in molecular mass [4]. As one of the oldest drugs still in widespread clinical use, heparin is highly effective in kidney dialysis and cardiac surgery. Heparin is the most widely used anticoagulant for preventing or treating thromboembolic disorders and for inhibiting coagulation during hemodialysis and extracorporeal blood circulation [5, 6, 7, 8].

Pharmaceutical heparin is usually obtained by extracting animal tissues, such as porcine intestines or bovine lung after proteolytic digestion, and then precipitating the preparations as quaternary ammonium complexes or barium salts [9, 10, 11, 12]. To ensure the appropriate biological activity, chemical parameters, including purity, molecular mass distribution, degree of sulfation, as well as the presence of specific oligosaccharide sequences, must be strictly controlled. Due to the heterogeneity of heparin preparations, it is difficult to accurately determine the precise chemical structure and to measure the performance of purification protocols [13, 14, 15].

Heparin always contains varying amounts of undesirable impurities. Among these, chondroitin sulfate A and dermatan sulfate (DS) have been identified. These chondroitin derivatives differ from heparin in that they contain galactosamine, and the level of these galactosamine-containing impurities in heparin is used as an indication of the purity of the drug substance [16, 17, 18, 19, 20]. DS is the most common chondroitin sulfate impurity in heparin with concentrations up to a few percent [21, 22]. DS is composed of alternating iduronic acid–galactosamine disaccharide units, and due to their similarity with the iduronic–glucosamine disaccharide units of heparin, commercial heparin preparations usually contain small amounts of DS.

Currently, the United States Pharmacopeia (USP) monograph for heparin purity specifies that the weight percent of galactosamine (%Gal) may not exceed 1% in total hexosamine content. The revised Stage 3 USP monograph has been proposed by the FDA to specify that %Gal may not exceed 1.0%. Therefore, for this work, we applied the 1.0% %Gal specification to delineate heparin samples that pass or fail this criterion.

The accurate measurement of %Gal in heparin is an important parameter to assure the safety and efficacy of the drug. The experimental determination of %Gal by acid digestion and high-performance liquid chromatography (HPLC) with a pulsed amperometric detector requires expert operators, expensive equipment, and careful sample preparation. In contrast, although the NMR approach requires more expensive equipment than the HPLC method, the sample preparation is minimal and the data are already required for other aspects of USP testing. Therefore, the development of simple computational methods for the prediction of %Gal values from NMR data is of particular interest.

1H NMR spectroscopy is very sensitive to minor structural variations, so the repeating disaccharide units of heparin can be easily identified in 1H NMR spectra by specific signals [4, 9, 15]. 1H NMR spectroscopy has been widely applied for the characterization of the chemical composition of heparin and its derivatives, as well as for the identification of contaminants from various sources [16, 20, 23, 24]. With the help of chemometric techniques, useful chemical information from complex NMR spectra can be extracted and the characterization and quantification of analytes can be accomplished using the NMR signals in much the same way as fingerprints. Chemometric models have been successfully applied to the study of 1H NMR spectra of several heparin samples [25, 26, 27, 28] in which spectral data were transformed into discrete variables for subsequent multivariate regression (MVR) analysis.

The objective of this study was to employ MVR to build quantitative models for the prediction of %Gal in laboratory heparin samples based on analysis of their 1H NMR spectral data. Several multivariate regression methods were implemented and compared, including multiple linear regression (MLR), ridge regression (RR), partial least squares regression (PLSR), and support vector regression (SVR). To obtain stable and robust models with high predictive ability, two variable selection techniques, viz., genetic algorithms (GAs) and stepwise methods, were employed to choose only the most information-rich subset of variables. The present results show that NMR spectroscopy together with chemometric techniques are useful for quantifying the %Gal content in heparin, thereby potentially obviating labor-intensive and costly chemical analysis.

Methods

NMR spectroscopy measurement

All samples were analyzed using a Varian Inova 500 instrument at the Washington University Chemistry Department NMR Facility operating at 499.893 MHz for 1H nuclei. Samples were run with the probe air temperature regulated at 25 °C. Spectral parameters include: a spectral window of 8,000 Hz centered on residual water at 4.77 ppm, 16 transients co-added, a 90° pulse width, acquisition time of 1.892 s, and a relaxation delay of 20 s. The total acquisition time per sample was 5.84 min. These acquisition parameters typically gave S/N values measured around the N-acetyl methyl proton signals at 2.045 ppm of approximately 1,000–2,000:1 for the heparin samples. The concentration of the heparin in the NMR tube was 27 mg/mL (20 mg/700 μL). All samples were made approx. 3 mM in 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) as an internal chemical shift reference.

Galactosamine content analysis

A detailed description of the methods employed for the experimental determination of percent galactosamine in total hexosamine can be found in [29].

Data processing

1H NMR analytical data of over 100 heparin sodium active pharmaceutical ingredient (API) samples from different suppliers with varying levels of chondroitins (primarily in the form of DS) were obtained from the chromatographic and spectroscopic experiments. These samples contained up to 10% by weight of chondroitins in the API by the %Gal HPLC assay. 1H NMR data were processed with MestRe-C (version 5.3.0) software. Phase and baseline corrections were applied. Chemical shifts were referenced to internal DSS. Each 1H NMR spectrum was divided into segments of 0.03 ppm width spanning the interval from 1.95 to 5.70 ppm, and peak integration was performed for each spectral region. All the heparin NMR spectra contain a water signal (at 4.77 ppm) from residual H2O in the D2O used. In addition, a number of batches contain other solvents and reagents in varying amounts, including methanol (3.35 ppm, singlet), ethanol (1.18 and 3.66 ppm; triplet and quartet, respectively), and acetate (1.92 ppm, singlet) [20]. These regions were excluded from the data matrix, which reduced the total dataset to 74 regions (variables). To compensate for concentration differences between the heparin samples, the integrals within each spectrum were normalized to their sum.
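A minimal base-R sketch of this bucketing and normalization step is given below. It is illustrative only: the exclusion windows are rough approximations of the solvent and reagent regions named above, and the function and variable names are assumptions, not the authors' processing script (the actual integration was performed in MestRe-C).

```r
# Illustrative sketch (not the authors' script): integrate one processed 1H NMR
# spectrum into 0.03-ppm buckets, drop solvent/reagent regions, and normalize.
# `ppm` and `intensity` are assumed numeric vectors for a single spectrum.
bucket_spectrum <- function(ppm, intensity, lo = 1.95, hi = 5.70, width = 0.03,
                            exclude = list(c(4.70, 4.85),                 # residual HDO
                                           c(3.33, 3.37),                 # methanol
                                           c(1.16, 1.20), c(3.64, 3.68),  # ethanol
                                           c(1.90, 1.94))) {              # acetate
  edges <- seq(lo, hi, by = width)
  mids  <- edges[-length(edges)] + width / 2
  # sum the intensities falling in each 0.03-ppm bucket
  buckets <- vapply(seq_along(mids), function(k)
    sum(intensity[ppm >= edges[k] & ppm < edges[k + 1]]), numeric(1))
  # keep only buckets whose centre lies outside every exclusion window
  keep <- vapply(mids, function(m)
    !any(vapply(exclude, function(w) m >= w[1] && m <= w[2], logical(1))),
    logical(1))
  x <- buckets[keep]
  x / sum(x)  # normalize to the summed integral to offset concentration differences
}
```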

Prior to chemometric analysis, the spectra were converted to ASCII files representing data in n × m-dimensional space (n and m equal to the number of samples and the number of variables, respectively), and the resulting data matrix was imported into Microsoft Excel 2003. The data were preprocessed by autoscaling, also known as unit variance scaling (i.e., each of the variables is mean-centered and then divided by its standard deviation) [30]. Based on the range of %Gal, the NMR spectral data were classified into two datasets, dataset A and dataset B, which correspond to 0–10% and 0–2% galactosamine, respectively. Dataset B is a subgroup of dataset A. For each dataset, heparin samples were randomly split into two subsets: the training set for model construction and calibration and the test set for model validation and assessment of predictive ability. The relevant statistical parameters for dataset A (model A) and dataset B (model B) are summarized in Table 1.
Table 1 Summary properties of the range of samples for which %Gal was measured by HPLC analysis

| Dataset | Subset | Number of samples | %Gal minimum | %Gal maximum | %Gal median | %Gal mean |
|---|---|---|---|---|---|---|
| Dataset A | Training set | 76 | 0.01 | 9.68 | 0.86 | 1.74 |
| Dataset A | Test set | 25 | 0.11 | 8.05 | 0.87 | 1.76 |
| Dataset B | Training set | 57 | 0.01 | 1.86 | 0.66 | 0.71 |
| Dataset B | Test set | 19 | 0.11 | 1.74 | 0.72 | 0.73 |
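The short R sketch below illustrates the preprocessing and random split just described. The objects `X` (the n × 74 bucket matrix) and `gal` (the %Gal values from HPLC) are assumed to be already assembled, and the seed and split fraction are illustrative rather than the authors' settings.

```r
# Sketch of the preprocessing: autoscale (unit-variance scale) the bucket matrix
# and split the samples at random into training and test sets (roughly 3:1,
# as in Table 1). `X` and `gal` are assumed inputs; the seed is arbitrary.
set.seed(1)
X_scaled <- scale(X, center = TRUE, scale = TRUE)   # mean-center, divide by SD
n        <- nrow(X_scaled)
train_id <- sample(n, size = round(0.75 * n))
X_train  <- X_scaled[train_id, , drop = FALSE];  y_train <- gal[train_id]
X_test   <- X_scaled[-train_id, , drop = FALSE]; y_test  <- gal[-train_id]
```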

Computation programs

Mathematical treatments for data standardization, multivariate analysis, and statistical model building were performed using the R statistical analysis software for Windows (version 2.8.1) [31]. Stepwise variable selection, genetic algorithms, multiple linear regression, ridge regression, partial least squares regression, and support vector regression were implemented with the packages chemometrics, subselect, stats, MASS, pls, and e1071, respectively [32, 33].

Results and discussion

NMR spectra of heparin samples

Heparin has a basic repeating disaccharide unit of iduronic acid and glucosamine, whereas the basic repeating disaccharide unit for DS is iduronic acid and galactosamine. Roughly every fifth amino group is acetylated in heparin, but each and every amino group is acetylated in the case of DS [20, 23]. The NMR signals of the acetyl methyl groups at approx. 2.0 ppm are well separated from the other NMR signals in the 3.0- to 6.0-ppm range. Figure 1 illustrates the 500-MHz 1H NMR spectra of a USP grade heparin and a heparin sample that contains DS, plotted in the range from 1.95 to 6.0 ppm. DS displays resonances at chemical shifts distinct from those of heparin (e.g., 2.08 and 3.54 ppm). DS also has protons which resonate at frequencies close to or overlapping those of certain heparin protons (e.g., 3.87, 4.03, 4.68, and 4.87 ppm). The calibration analysis is strengthened by the fact that the acetamido groups of heparin and DS resonate at different frequencies [27].
Fig. 1

1H NMR spectra of heparin samples with different %Gal. a In the 2.20- to 1.95-ppm region. b In the 6.00- to 3.00-ppm region. Brown %Gal = 9.59; Blue %Gal = 0.09

Variable selection

Variable selection is a crucial step in regression analysis as it controls both the number of variables and the mathematical complexity of the model. The presence of variables not related to the response can produce background noise, and redundant variables may confound regression models, thereby reducing their predictive ability. The selection of variables for multivariate calibration is an optimization procedure whose goal is to select the subset of variables that produce simple and robust regression models with high prediction performance. Two well-established methods, the stepwise selection technique and the GA, were used here for variable selection from the original NMR spectral matrix.

Stepwise procedure

In stepwise multiple regression, variables are added one at a time and can be deleted later if they fail to make a significant contribution to the model. The variable most correlated with the response enters the model first, and then forward selection continues. Each time a new variable is added, the significance of the regression terms is tested. If the contribution of a variable existing in the model is decreased and made no longer significant by a new variable, then the insignificant variable is removed from the model. Any variables that entered the model in the earlier stages can be discarded at the later stages. The process of forward addition and backward elimination is repeated until the inclusion of any other variables cannot further improve the model, and finally each variable included in the model is significant [34]. The Bayes information criterion (BIC) was used as a measure of the model fit, which can be expressed as [32]:
$$ {\rm BIC} = n\log ({\rm RSS} /n) + m\log n $$
(1)
where RSS is the residual sum of squares, n is the number of samples, and m is the number of regression variables. Each variable is added to or removed from the model in order to achieve the largest reduction of the BIC. When the BIC value can be reduced no longer, the model selection process is stopped, resulting in the optimal subset of variables.
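A hedged sketch of this procedure with base R's step() is shown below; setting the penalty k = log(n) makes the stepwise criterion equivalent to BIC up to an additive constant. The data-frame and object names are assumptions carried over from the earlier sketches, not the authors' code.

```r
# BIC-guided stepwise selection with base R's step(); k = log(n) gives a
# BIC-type penalty. Column names of `dat` are the ppm buckets.
dat  <- data.frame(gal = y_train, X_train)
null <- lm(gal ~ 1, data = dat)      # start from the intercept-only model
full <- lm(gal ~ ., data = dat)      # scope: all 74 bucket variables
sel  <- step(null, scope = list(lower = null, upper = full),
             direction = "both", k = log(nrow(dat)), trace = 0)
names(coef(sel))[-1]                 # retained variables, cf. the 11 buckets of Table 2
```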
The variation of BIC values with the model size for all steps of the stepwise procedure is summarized in Tables 2 and 3. For dataset A, the most highly correlated variable, i.e., 2.08 ppm, entered the model first, followed by 2.02, 2.11, 4.31, 3.53, 3.50, 5.61, and 5.34 ppm. The variable 2.11 ppm was then discarded, after which variables 5.43, 4.25, 3.59, and 2.14 ppm were added sequentially to yield the final set of 11 variables for the model (Table 2). A similar process for dataset B led to a final set of five variables (Table 3). Comparing the final variable subsets for datasets A and B, the only variable in common is 2.08 ppm. This outcome suggests that the differences in DS content (%Gal) between datasets A and B greatly influence the selection of variables. The selected variables can be used directly for the MLR and ridge regression models and, optionally, for the PLSR and SVR models.
Table 2 Stepwise variable selection procedure for dataset A

| Model size | BIC | Selected variables (ppm) | Add(+)/Drop(−) |
|---|---|---|---|
| 1 | 190.97 | 2.08 | +2.08 |
| 2 | 124.81 | 2.02, 2.08 | +2.02 |
| 3 | 99.64 | 2.02, 2.08, 2.11 | +2.11 |
| 4 | 84.84 | 2.02, 2.08, 2.11, 4.31 | +4.31 |
| 5 | 76.98 | 2.02, 2.08, 2.11, 3.53, 4.31 | +3.53 |
| 6 | 56.00 | 2.02, 2.08, 2.11, 3.50, 3.53, 4.31 | +3.50 |
| 7 | 54.23 | 2.02, 2.08, 2.11, 3.50, 3.53, 4.31, 5.61 | +5.61 |
| 8 | 50.01 | 2.02, 2.08, 2.11, 3.50, 3.53, 4.31, 5.34, 5.61 | +5.34 |
| 7′ | 45.47 | 2.02, 2.08, 3.50, 3.53, 4.31, 5.34, 5.61 | −2.11 |
| 8′ | 45.05 | 2.02, 2.08, 3.50, 3.53, 4.31, 5.34, 5.43, 5.61 | +5.43 |
| 9 | 42.17 | 2.02, 2.08, 3.50, 3.53, 4.25, 4.31, 5.34, 5.43, 5.61 | +4.25 |
| 10 | 40.48 | 2.02, 2.08, 3.50, 3.53, 3.59, 4.25, 4.31, 5.34, 5.43, 5.61 | +3.59 |
| 11 | 37.49 | 2.02, 2.08, 2.14, 3.50, 3.53, 3.59, 4.25, 4.31, 5.34, 5.43, 5.61 | +2.14 |

Table 3 Stepwise variable selection procedure for dataset B

| Model size | BIC | Selected variables (ppm) | Add(+)/Drop(−) |
|---|---|---|---|
| 1 | 69.71 | 2.08 | +2.08 |
| 2 | 42.48 | 2.02, 2.08 | +2.02 |
| 3 | 30.73 | 2.02, 2.08, 2.11 | +2.11 |
| 4 | 27.61 | 1.99, 2.02, 2.08, 2.11 | +1.99 |
| 5 | 24.13 | 1.99, 2.02, 2.08, 2.11, 4.37 | +4.37 |
| 6 | 20.93 | 1.99, 2.02, 2.08, 2.11, 4.22, 4.37 | +4.22 |
| 5′ | 17.17 | 1.99, 2.08, 2.11, 4.22, 4.37 | −2.02 |
| 6′ | 15.50 | 1.99, 2.08, 2.11, 2.20, 4.22, 4.37 | +2.20 |
| 5″ | 14.45 | 1.99, 2.08, 2.20, 4.22, 4.37 | −2.11 |

Genetic algorithms

GAs are numerical optimization tools and randomized search techniques which simulate biological evolution based on Darwin’s theory of natural selection. The basic operation of GAs consists of five steps: encoding the variables as chromosomes; generating the initial population of chromosomes; evaluating the fitness function; creating the next generation of chromosomes; and terminating the process [35, 36]. GAs have demonstrated their utility for selecting optimal variables in multivariate calibration [37, 38, 39, 40, 41, 42] and are especially suitable for datasets with a large number (~200) of variables, such as the present case for the heparin NMR datasets. GA training requires the selection of several parameters, i.e., the number of chromosomes, initial population, selection mode, crossover parameters, mutation rate, and convergence criteria, all of which can influence the final results. In the present study, the entire set of 74 variables was used as input to the GA for the selection of the optimal subset of variables for predicting %Gal.

The initial population was set to 200 chromosomes, and the number of selected variables in the model was maintained between 5 and 40. The chromosome with the maximum fitness value was retained in each generation. Depending on the fitness values, a subset of pairs of chromosomes was selected to undergo crossover (analogous to reproduction), in which two existing chromosomes exchange parts of their genetic content and two new chromosomes are formed. Following crossover, one or more mutations may occur, in which bits of an individual's string are randomly inverted such that the state of the gene is changed from "0" to "1" or vice versa. The crossover probability and mutation probability were set to 50% and 1%, respectively. The search was terminated after completing 100 generations.
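The study used the subselect package for the GA search; the self-contained sketch below illustrates the same idea in base R with the parameter values quoted above (population of 200, subset size 5–40, crossover probability 0.5, mutation probability 0.01, 100 generations). The fitness function (negative BIC of an MLR fit on the candidate subset) and all names are illustrative assumptions, not the authors' implementation.

```r
# Minimal GA for variable selection (illustrative only; the paper used subselect).
# Chromosomes are 0/1 vectors over the 74 buckets; fitness = -BIC of an MLR fit.
run_ga <- function(X, y, pop_size = 200, n_gen = 100, k_min = 5, k_max = 40,
                   p_cross = 0.5, p_mut = 0.01) {
  m <- ncol(X)
  fitness <- function(chrom) {
    k <- sum(chrom)
    if (k < k_min || k > k_max) return(-Inf)   # enforce subset-size bounds
    -BIC(lm(y ~ ., data = data.frame(y = y, X[, chrom == 1, drop = FALSE])))
  }
  # random initial population within the allowed subset sizes
  pop <- t(replicate(pop_size, {
    chrom <- integer(m); chrom[sample(m, sample(k_min:k_max, 1))] <- 1; chrom
  }))
  for (g in seq_len(n_gen)) {
    fit_vals <- apply(pop, 1, fitness)
    best     <- pop[which.max(fit_vals), ]      # elitism: keep the fittest chromosome
    parents  <- pop[sample(pop_size, pop_size, replace = TRUE,
                           prob = rank(fit_vals)), , drop = FALSE]
    children <- parents
    for (i in seq(1, pop_size - 1, by = 2)) {
      if (runif(1) < p_cross) {                 # single-point crossover
        cp <- sample(m - 1, 1)
        children[i,     (cp + 1):m] <- parents[i + 1, (cp + 1):m]
        children[i + 1, (cp + 1):m] <- parents[i,     (cp + 1):m]
      }
    }
    flip <- matrix(runif(pop_size * m) < p_mut, pop_size, m)   # bit-flip mutation
    children[flip] <- 1 - children[flip]
    pop <- children
    pop[1, ] <- best
  }
  colSums(pop)   # how often each variable appears in the final population
}
```

Repeating such a run many times (500 in the study) and tallying how often each variable is selected yields the frequency ranking from which the 5- to 40-variable subsets of Table 4 can be retained.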

Because the GA is a stochastic search, the results depend on the randomly generated initial population. Consequently, the variables selected in different runs can be substantially different. Following accepted procedures, the GA was run 500 times, from which subsets of the 5, 10, 20, 30, and 40 most frequently selected variables were retained to build calibration models from the training set of samples (Table 4). Overall, the three most frequently chosen variables (2.08, 3.50, and 3.53 ppm) correspond to the characteristic peaks of DS, which reinforced our confidence in using the GA with our specific settings for variable selection.
Table 4 Variables (ppm) selected by the GA method

| Dataset | No. of variables | Selected variables |
|---|---|---|
| A | 5 | 2.08, 2.11, 3.50, 3.53, 4.46 |
| A | 10 | 2.02, 2.08, 3.50, 3.53, 3.56, 3.71, 3.80, 5.49, 5.55, 5.67 |
| A | 20 | 2.08, 2.11, 2.17, 2.20, 3.50, 3.53, 3.56, 3.71, 3.74, 3.92, 4.01, 4.04, 4.40, 4.46, 4.52, 4.92, 5.01, 5.46, 5.58, 5.67 |
| A | 30 | 2.02, 2.08, 2.11, 2.14, 2.20, 3.53, 3.71, 3.74, 3.89, 3.98, 4.04, 4.13, 4.19, 4.34, 4.40, 4.46, 4.52, 4.92, 4.95, 4.98, 5.01, 5.04, 5.07, 5.22, 5.25, 5.37, 5.40, 5.58, 5.61, 5.67 |
| A | 40 | 1.96, 2.02, 2.05, 2.08, 2.11, 2.14, 2.20, 3.50, 3.56, 3.59, 3.62, 3.68, 3.71, 3.74, 3.83, 3.92, 3.95, 3.98, 4.01, 4.07, 4.10, 4.31, 4.34, 4.40, 4.43, 4.49, 4.58, 4.64, 5.04, 5.07, 5.13, 5.16, 5.22, 5.31, 5.34, 5.37, 5.40, 5.46, 5.61, 5.67 |
| B | 5 | 2.08, 3.50, 3.56, 3.71, 4.46 |
| B | 10 | 2.02, 2.08, 2.14, 3.50, 3.56, 3.71, 4.46, 5.19, 5.49, 5.64 |
| B | 20 | 2.02, 2.08, 2.14, 2.20, 3.50, 3.56, 3.71, 3.77, 4.07, 4.13, 4.37, 4.43, 4.46, 4.49, 4.58, 5.04, 5.10, 5.19, 5.49, 5.61 |
| B | 30 | 1.96, 2.02, 2.08, 2.14, 2.20, 3.50, 3.56, 3.62, 3.71, 3.92, 3.95, 3.98, 4.07, 4.13, 4.37, 4.43, 4.46, 4.49, 4.58, 4.64, 5.04, 5.07, 5.10, 5.13, 5.16, 5.19, 5.22, 5.31, 5.49, 5.52 |
| B | 40 | 1.96, 2.02, 2.05, 2.08, 2.11, 2.14, 2.20, 3.50, 3.56, 3.59, 3.62, 3.68, 3.71, 3.74, 3.83, 3.92, 3.95, 3.98, 4.01, 4.07, 4.10, 4.31, 4.34, 4.40, 4.43, 4.49, 4.58, 4.64, 5.04, 5.07, 5.13, 5.16, 5.22, 5.31, 5.34, 5.37, 5.40, 5.46, 5.61, 5.67 |

Multiple linear regression

MLR produces a linear model describing the relationship between a dependent (response) variable and independent variables [36, 43]:
$$ y = Xb + e $$
(2)
where y is the measured response vector \( ({y_1},{y_2}, \ldots ,{y_n}) \), and X is a matrix of size n × (m + 1) in which the first column is assigned the value 1 as the intercept term and the remaining columns are assigned the values \( {x_{ij}} \). The parameters n, m, i, and j correspond respectively to the number of samples, the number of variables, the index for samples, and the index for variables. The parameter b is the vector of the estimated regression coefficients and e is the vector of the y residuals resulting from systematic modeling errors and random measurement errors, assumed to follow a normal distribution with expected value E(e) = 0. By minimizing the sum of the squared residuals, the regression coefficients can be approximated as [44, 45]:
$$ b = {({X^T}X)^{ - 1}}{X^T}y. $$
(3)
Each variable \( {x_j} \) is then multiplied by its regression coefficient \( {b_j} \) to obtain the predicted value for y, noted as ŷ:
$$ \hat{y} = {b_0} + {b_1}{x_1} + {b_2}{x_2} + ... + {b_m}{x_m}. $$
(4)
The quality of the calibration model is evaluated by building a regression between the experimental values and the predicted values. Statistical parameters typically used to measure the model’s performance, viz., the coefficient of determination (\( R^2 \)), root mean squared error (RMSE), and relative standard deviation (RSD), are given by the following equations [46, 47]:
$$ {R^2} = 1 - \frac{{\sum\limits_{i = 1}^n {{{({y_i} - {{\hat{y}}_i})}^2}} }}{{\sum\limits_{i = 1}^n {{{({y_i} - \bar{y})}^2}} }} $$
(5)
$$ {\rm RMSE} = \sqrt {{\frac{1}{{n - 1}}\sum\limits_{i = 1}^n {{{({y_i} - {{\hat{y}}_i})}^2}} }} $$
(6)
$$ {\rm RSD} = \frac{{{\rm RMSE} }}{{\bar{y}}} \times 100\% $$
(7)
where \( {y_i} \) is the actual %Gal of sample i measured by HPLC, \( {\hat{y}_i} \) is the %Gal predicted by the model, and \( \bar{y} \) is the mean of all samples in a dataset. \( R^2 \) is probably the most familiar measure of the model’s ability to fit the data. A value of \( R^2 \) near zero suggests no linear relationship, while a value approaching unity indicates a near perfect linear fit. An acceptable model should have a large \( R^2 \), a small RMSE, and a small RSD. The value of \( R^2 \) will increase as the model increases in complexity (i.e., more independent variables), so the number of variables in the model must be considered. An alternative to \( R^2 \) is the adjusted coefficient, \( R_{{\rm adj} }^2 \), which favors models with a small number of variables, as shown by the equation [32]:
$$ R_{{\rm adj} }^2 = 1 - \frac{{n - 1}}{{n - m - 1}}(1 - {R^2}). $$
(8)
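For illustration, the sketch below computes the regression coefficients of Eq. 3 and the fit statistics of Eqs. 5–8 in base R; in practice lm() would normally be used, and the function name and inputs are assumptions rather than the authors' code.

```r
# Base-R sketch of Eqs. 3-8: normal-equation MLR coefficients plus fit statistics.
mlr_fit_stats <- function(X, y) {
  Xd    <- cbind(1, as.matrix(X))                   # prepend intercept column
  b     <- solve(t(Xd) %*% Xd, t(Xd) %*% y)         # b = (X'X)^-1 X'y   (Eq. 3)
  yhat  <- as.vector(Xd %*% b)                      # predicted %Gal     (Eq. 4)
  r2    <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)           # Eq. 5
  rmse  <- sqrt(sum((y - yhat)^2) / (length(y) - 1))              # Eq. 6
  rsd   <- rmse / mean(y) * 100                                   # Eq. 7 (in %)
  m     <- ncol(as.matrix(X))
  r2adj <- 1 - (length(y) - 1) / (length(y) - m - 1) * (1 - r2)   # Eq. 8
  list(coef = b, R2 = r2, RMSE = rmse, RSD = rsd, R2adj = r2adj)
}
```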

MLR is a simple calibration method that avoids the need for adjustable parameters such as the factor number in partial least squares regression, the regularization parameter λ in ridge regression, and the kernel parameters in SVR. Consequently, MLR is among the most common approaches used to build multivariate regression models. However, overly complex MVR models with large numbers of independent variables may actually lose their predictive ability. This common problem occurs when too many variables are used to fit the calibration (training) set and can be avoided by cross-validation and further external validation of the model using test samples reserved for this purpose.

The performance was compared for MVR models that varied with respect to the number of variables using either stepwise or GA methods for variable selection (Table 5). For dataset A, when all 74 variables were employed for the regression analysis, the model yielded \( R_{{\rm adj} }^2 \) values of 1.0 for the training dataset, but only 0.62 for the test set. Figure 2a depicts the plot of experimental %Gal by HPLC versus that predicted by MVR from the NMR data. All the training sample points are located on a straight line through the origin and with a slope equal to 1. However, many test samples deviate from the diagonal in the plot. When MLR is trained using all 74 variables, some of the variables are unrelated to the variation of the response, i.e., the %Gal. Such cases produce models that are overfitted to the training set and, invariably, produce poor results for the test set.
Table 5 Statistical parameters obtained from the MLR models using stepwise and GA variable selection methods

| Model | Dataset, subset | Parameter | All (74) | Stepwise (11) | GA (5) | GA (10) | GA (20) | GA (30) | GA (40) |
|---|---|---|---|---|---|---|---|---|---|
| Model A | Dataset A, training | RMSE | 0.01 | 0.26 | 0.35 | 0.27 | 0.26 | 0.22 | 0.17 |
| | | RSD | 0.01 | 0.15 | 0.20 | 0.15 | 0.15 | 0.13 | 0.10 |
| | | \( R_{\rm adj}^2 \) | 1.00 | 0.98 | 0.97 | 0.98 | 0.98 | 0.99 | 0.99 |
| | Dataset A, test | RMSE | 1.34 | 0.33 | 0.29 | 0.23 | 0.29 | 0.31 | 0.55 |
| | | RSD | 0.76 | 0.19 | 0.17 | 0.13 | 0.16 | 0.18 | 0.31 |
| | | \( R_{\rm adj}^2 \) | 0.62 | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.93 |
| | Dataset B, training | RMSE | 0.01 | 0.19 | 0.26 | 0.19 | 0.18 | 0.16 | 0.14 |
| | | RSD | 0.01 | 0.27 | 0.39 | 0.27 | 0.25 | 0.22 | 0.20 |
| | | \( R_{\rm adj}^2 \) | 1.00 | 0.86 | 0.78 | 0.86 | 0.89 | 0.90 | 0.92 |
| | Dataset B, test | RMSE | 1.47 | 0.29 | 0.29 | 0.20 | 0.27 | 0.28 | 0.55 |
| | | RSD | 1.99 | 0.40 | 0.39 | 0.27 | 0.36 | 0.38 | 0.75 |
| | | \( R_{\rm adj}^2 \) | 0.11 | 0.66 | 0.70 | 0.85 | 0.76 | 0.72 | 0.59 |
| Model B | Dataset B, training | RMSE | NA | 0.21 | 0.18 | 0.13 | 0.10 | 0.07 | 0.03 |
| | | RSD | NA | 0.30 | 0.25 | 0.18 | 0.14 | 0.10 | 0.04 |
| | | \( R_{\rm adj}^2 \) | NA | 0.80 | 0.85 | 0.92 | 0.95 | 0.98 | 1.00 |
| | Dataset B, test | RMSE | NA | 0.26 | 0.25 | 0.18 | 0.15 | 0.10 | 0.14 |
| | | RSD | NA | 0.36 | 0.34 | 0.24 | 0.20 | 0.13 | 0.19 |
| | | \( R_{\rm adj}^2 \) | NA | 0.69 | 0.73 | 0.86 | 0.92 | 0.96 | 0.94 |

Fig. 2

Predicted (from NMR data) versus measured (from HPLC) %Gal for dataset A (%Gal = 0–10). a Predicted by model A using all 74 variables. b Predicted by model A using ten variables selected from GA

When the most information-rich variables were selected and variables that are redundant or uncorrelated to the response were discarded, the performance of the model was enhanced significantly. The predictive ability was remarkably improved for the MVR models containing up to 11 variables based on stepwise selection. Compared with the all-variable model, the \( R_{{\rm adj} }^2 \) for the test set increased from 0.62 to 0.98, even though the \( R_{{\rm adj} }^2 \) value for the training set dropped slightly from 1.0 to 0.98. Taken together, these results reflect the excellent agreement between the measured and predicted values after appropriate variable selection.

With GA variable selection, the model’s quality depended somewhat on the number of selected variables. Table 5 shows that \( R_{{\rm adj} }^2 \) for the training set improved continuously from 0.97 to 0.99 between 5 and 40 variables. In contrast, the test set followed a different pattern, i.e., the \( R_{{\rm adj} }^2 \) value initially increased to a maximum of 0.99 at ten variables, after which it gradually decreased to 0.93 at 40 variables. Therefore, the prediction error was smallest when the model was of moderate complexity. In the present case, the resulting model demonstrated good performance in estimating the %Gal concentrations using ten variables. As shown in Fig. 2b, the measured and predicted values were highly correlated over the entire concentration range for both the training and test datasets. Comparing the GA and stepwise selection methods, the statistical parameters \( R_{{\rm adj} }^2 \) and RMSE revealed a slight advantage for the former over the latter.

As the FDA has proposed that the Stage 3 USP monograph specify the upper acceptable limit for %Gal as 1.0%, we checked the predictive performance of our models at low %Gal concentration. When only dataset B (0.0–2.0%Gal) is considered, the results predicted by the global model A are only mediocre, as expected. Using the all-variable model, \( R_{{\rm adj} }^2 \) approached 1.00 for the training set, but was unacceptable at 0.11 for the test set. Variable selection did improve the predictive ability of model A, e.g., the \( R_{{\rm adj} }^2 \) value for the test set was 0.85 using ten variables (Fig. 3a).
Fig. 3

Predicted (from NMR data) versus measured (from HPLC) %Gal for dataset B (%Gal = 0–2). a Predicted by model A using ten variables selected from GA. b Predicted by model B using 30 variables selected from GA

Dataset B was employed to construct the local MLR models with enhanced predictive ability in the lower range of 0.0–2.0%Gal. When building MLR models, the number of samples must equal or exceed the number of independent variables. The training set for dataset B contained only 57 samples, fewer than the 74 independent variables extracted from the NMR data; consequently, the full-variable model was not feasible. The results, summarized in Table 5, reveal that the top model performance was attained (\( R_{{\rm adj} }^2 = 0.96 \)) using a subset of 30 variables selected by GA. The excellent agreement between the predicted and experimental values (Fig. 3b and Table 6) confirms the high predictive ability of the local model B in the lower range of 0.0–2.0%Gal. Stepwise variable selection yielded only marginally satisfactory results in terms of the predictive ability of model B. The \( R_{{\rm adj} }^2 \) values for the training and test sets were 0.80 and 0.69, respectively, which, while acceptable, were inferior to those obtained from the corresponding GA models for any number of variables. A possible explanation is that stepwise variable selection is limited in its ability to explore possible combinations of variables.
Table 6 %Gal values for the 19 test-set samples measured by HPLC analysis and predicted by the MLR regression model using 30 variables selected by GA

| Test sample | Measured | Predicted |
|---|---|---|
| 1 | 0.11 | 0.22 |
| 2 | 0.16 | 0.10 |
| 3 | 0.19 | 0.24 |
| 4 | 0.25 | 0.27 |
| 5 | 0.31 | 0.44 |
| 6 | 0.39 | 0.30 |
| 7 | 0.42 | 0.46 |
| 8 | 0.51 | 0.64 |
| 9 | 0.59 | 0.42 |
| 10 | 0.72 | 0.77 |
| 11 | 0.75 | 0.83 |
| 12 | 0.83 | 0.81 |
| 13 | 0.87 | 0.92 |
| 14 | 1.01 | 0.97 |
| 15 | 1.07 | 0.95 |
| 16 | 1.17 | 1.30 |
| 17 | 1.23 | 1.43 |
| 18 | 1.63 | 1.52 |
| 19 | 1.74 | 1.77 |

Ridge regression

MLR is particularly sensitive to highly correlated (co-linear) variables, which can result in highly unreliable model predictions. In addition, MLR is inappropriate when there are fewer samples than variables. As a shrinkage method, RR limits the range of the regression coefficients and thereby stabilizes their estimation [32]. The RR technique aims to resolve the co-linearity problem associated with MLR by modifying the \( {X^T}X \) matrix so that its determinant can be appreciably different from 0. The objective of RR is to minimize:
$$ \sum\limits_{i = 1}^n {({y_i} - {{\hat{y}}_i}} {)^2} + \lambda \sum\limits_{j = 1}^m {b_j^2} $$
(9)
where the first term is the RSS and the second term is a regularizer which penalizes a large norm of the regression coefficients. The ridge parameter or complexity parameter λ determines the deviation between the ridge regression and the MLR regression and thereby controls the amount of shrinkage [44]. Inspection of Eq. 9 reveals that the expressions for ridge regression and MLR are identical when the regularization parameter λ = 0. The larger the value of λ, the greater the penalty (shrinkage) that is applied to the regression coefficients. The ridge regression coefficient b ridge can be estimated by solving the minimization problem in Eq. 9 [43, 44]:
$$ {b_{ridge}} = {({X^T}X + \lambda I)^{ - 1}}{X^T}y. $$
(10)

Equation 10 is a linear function of the response variable y. The coefficient b ridge is similar to the regression coefficient of MLR in Eq. 3, but the inverse is stabilized by the ridge parameter λ. The performance of ridge regression depends heavily on proper choice of the parameter λ, which is achieved using cross-validation procedures.

In ridge regression, the first step is to find the optimal value of the parameter λ which yields the smallest prediction error. By estimating the prediction error as the mean squared error of prediction (MSEP) with the generalized cross-validation (GCV) procedure, an optimal λ value was obtained for each variable subset (Table 7). The dependence of the MSEP on the ridge parameter λ for the 40-variable model selected using GA is illustrated in Fig. 4a. The optimal value of λ is 0.267, which yielded the smallest prediction error. The relationship between the regression coefficients and the parameter λ is shown in Fig. 4b, where each curve represents the regression coefficient of one variable and its magnitude changes as a function of λ. It is clear that larger values of the ridge parameter lead to greater shrinkage of the coefficients, which approach zero as λ approaches infinity. The optimal choice of λ = 0.267 is depicted by the vertical line in Fig. 4b, which intersects the curves at the optimized regression coefficients.
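A hedged sketch of this step with MASS::lm.ridge is shown below; the λ grid, data-frame names, and manual prediction are illustrative assumptions rather than the authors' script.

```r
# Ridge regression with MASS::lm.ridge; lambda is chosen by generalized
# cross-validation (GCV), as described above.
library(MASS)
dat     <- data.frame(gal = y_train, X_train)
lambdas <- seq(0, 1, by = 0.001)
rr      <- lm.ridge(gal ~ ., data = dat, lambda = lambdas)
lam_opt <- lambdas[which.min(rr$GCV)]        # smallest GCV error, cf. 0.267 in Fig. 4a
rr_opt  <- lm.ridge(gal ~ ., data = dat, lambda = lam_opt)
b       <- coef(rr_opt)                      # intercept followed by the slopes
y_pred  <- as.matrix(X_test) %*% b[-1] + b[1]   # predict the test set manually
```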
Table 7 Statistical parameters obtained from RR models using stepwise and GA variable selection

| Model | Dataset, subset | Parameter | All (74) | Stepwise (11) | GA (5) | GA (10) | GA (20) | GA (30) | GA (40) |
|---|---|---|---|---|---|---|---|---|---|
| Model A (global) | | λ | 0.01 | 0.28 | 0.18 | 0.56 | 0.64 | 0.34 | 0.27 |
| | Dataset A, training | RMSE | 0.02 | 0.26 | 0.35 | 0.27 | 0.26 | 0.23 | 0.17 |
| | | RSD | 0.01 | 0.15 | 0.20 | 0.16 | 0.15 | 0.13 | 0.10 |
| | | \( R_{\rm adj}^2 \) | 1.00 | 0.98 | 0.97 | 0.98 | 0.98 | 0.99 | 0.99 |
| | Dataset A, test | RMSE | 0.93 | 0.32 | 0.28 | 0.23 | 0.29 | 0.33 | 0.64 |
| | | RSD | 0.53 | 0.18 | 0.16 | 0.13 | 0.17 | 0.19 | 0.36 |
| | | \( R_{\rm adj}^2 \) | 0.80 | 0.98 | 0.98 | 0.99 | 0.98 | 0.97 | 0.90 |
| | Dataset B, training | RMSE | 0.02 | 0.18 | 0.27 | 0.18 | 0.17 | 0.15 | 0.14 |
| | | RSD | 0.03 | 0.26 | 0.38 | 0.26 | 0.24 | 0.22 | 0.20 |
| | | \( R_{\rm adj}^2 \) | 1.00 | 0.86 | 0.78 | 0.86 | 0.90 | 0.90 | 0.92 |
| | Dataset B, test | RMSE | 0.97 | 0.29 | 0.28 | 0.20 | 0.26 | 0.27 | 0.54 |
| | | RSD | 1.30 | 0.38 | 0.37 | 0.27 | 0.35 | 0.36 | 0.73 |
| | | \( R_{\rm adj}^2 \) | 0.31 | 0.69 | 0.69 | 0.85 | 0.77 | 0.75 | 0.60 |
| Model B (local) | | λ | 0.01 | 0.27 | 0.06 | 0.02 | 0.05 | 0.03 | 0.01 |
| | Dataset B, training | RMSE | 0.01 | 0.21 | 0.17 | 0.13 | 0.10 | 0.07 | 0.03 |
| | | RSD | 0.01 | 0.30 | 0.25 | 0.19 | 0.14 | 0.10 | 0.04 |
| | | \( R_{\rm adj}^2 \) | 1.00 | 0.80 | 0.85 | 0.92 | 0.95 | 0.98 | 0.99 |
| | Dataset B, test | RMSE | 0.23 | 0.26 | 0.25 | 0.18 | 0.15 | 0.11 | 0.14 |
| | | RSD | 0.31 | 0.35 | 0.34 | 0.24 | 0.20 | 0.14 | 0.19 |
| | | \( R_{\rm adj}^2 \) | 0.78 | 0.69 | 0.73 | 0.86 | 0.91 | 0.95 | 0.95 |

Fig. 4

Ridge regression for the heparin 1H NMR data at 40 variables selected from GA. a The optimal ridge parameter λ = 0.267 is determined by generalized cross-validation (GCV). b The corresponding regression coefficients are the intersections of the curves of the regression coefficients with the vertical line at λ = 0.267

Prediction of the test data was achieved using the optimized regression coefficients. The statistical parameters calculated for the ridge regression models, including the adjusted coefficient \( R_{{\rm adj} }^2 \), RMSE, and RSD for both training and test sets, are presented in Table 7. For the all-variable model, the coefficient of determination \( R_{{\rm adj} }^2 \) for the test set increases from 0.62 for the MLR model to 0.80 for the ridge regression model for dataset A (%Gal = 0.0–10.0). The all-variable MLR model is unavailable for dataset B (%Gal = 0.0–2.0) since the number of variables exceeds the number of samples. Ridge regression is unconstrained by this condition, and the all-variable model yielded \( R_{{\rm adj} }^2 = {1}.00 \) for the training set and 0.78 for the test set (Table 7). However, the large errors (RSD = 0.31) for the test set are indicative of model overfitting and poor predictive ability. When variable selection was applied using either stepwise or GA methods, the predictive ability of the RR models approached that of the MLR models. Like the MLR models, the RR model showed poor predictive ability when the number of variables is too few (underfitting) or too many (overfitting). Therefore, selecting the appropriate number of variables was a key factor in achieving highly predictive models by ridge regression.

Partial least squares regression

PLSR is perhaps the most widely used multivariate regression method in chemometrics [32]. The aim of PLSR is to construct predictive models between two blocks of variables, the latent variables (principal components, or PCs) and the response variables, so that the covariance between them is maximized. The advantage of this method over MLR is its capacity to build a regression model based on highly correlated (co-linear) variables. In PLSR, the X data are first transformed into a set of orthogonal PCs, a linear combination of the original variables, which serve as new variables for regression with a dependent variable y.

The PCs are chosen in such a way as to provide maximum correlation with the dependent variable; thus, the PLSR model contains the smallest necessary number of PCs. This number of PCs, which determines the complexity of the model, is typically obtained using the leave-one-out cross-validation (CV) procedure on the training set [41, 45]. The optimal model size corresponds to that with the lowest uncertainty estimates obtained from the predictive error sum of squares. The black lines in Fig. 5 depict the standard error of prediction (SEP) values from a single cross-validation with ten segments, while the gray lines are produced by repeating this procedure 100 times [32]. The dashed horizontal line represents the SEP value for the test set at the optimal number of components depicted by the dashed vertical line. By repeating this CV procedure 100 times, the SEP was much larger for the all-variable model than for the corresponding 20-variable model with variables selected by GA, which speaks to the latter’s greater stability. The optimal number of PCs was 12 and 15 for the all-variable and 20-variable models, respectively.
Fig. 5

Relationship between the number of components (PCs) and the standard error of prediction (SEP) of the PLSR model for dataset A. The black lines were produced from a single tenfold CV, while the gray lines correspond to 100 repetitions of the tenfold CV. a Plot of SEP versus number of components for the all-variable model. b Plot of SEP versus number of components for the 20-variable model selected by GA
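A hedged sketch of this component selection with the pls package is given below; the upper bound on components, the segment count, and the object names are assumptions for illustration, not the authors' script.

```r
# PLSR with the pls package: 10-segment cross-validation on the training set,
# choose the number of components with the lowest CV error, then predict.
library(pls)
dat_tr <- data.frame(gal = y_train, X = I(as.matrix(X_train)))
fit    <- plsr(gal ~ X, ncomp = min(25, ncol(X_train)), data = dat_tr,
               validation = "CV", segments = 10)
cv_err <- RMSEP(fit, estimate = "CV")$val["CV", 1, -1]   # drop the 0-component entry
n_opt  <- which.min(cv_err)                              # optimal number of PCs
y_hat  <- predict(fit, newdata = data.frame(X = I(as.matrix(X_test))), ncomp = n_opt)
```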

Training set models were constructed using variables selected by either GA or stepwise methods. The number of PCs previously judged to be optimal was employed, and the computed models were applied to the test set. The optimal number of PCs for each model, along with corresponding values of \( R_{{\rm adj} }^2 \), RMSE, and RSD, are summarized in Table 8. As mentioned above, the all-variable model using 12 PCs gave the minimal cross-validation error. PLSR models built using 11 variables selected by the stepwise method yielded \( R_{{\rm adj} }^2 = 0.98 \) for both the training set and the test set. The performance in predicting %Gal was better for the models using 5–20 GA-selected variables than for the all-variable model. The ten-variable model, which gave a near-perfect \( R_{{\rm adj} }^2 \) of 0.99 and a low RSD of 0.12 (Table 8), was therefore chosen as the optimal model. When the %Gal of dataset B (%Gal = 0.0–2.0) was predicted by model B, the all-variable PLSR model yielded \( R_{{\rm adj} }^2 = 0.85 \) and RMSE = 0.20 (Table 8). Variable selection by GAs on dataset B greatly enhanced the predictive ability of the models. The optimal model was obtained using 30 variables with \( R_{{\rm adj} }^2 = 0.96 \) for the test set.
Table 8 Statistical parameters obtained from PLSR models using stepwise and GA variable selection methods

| Model | Dataset, subset | Parameter | All (74) | Stepwise (11) | GA (5) | GA (10) | GA (20) | GA (30) | GA (40) |
|---|---|---|---|---|---|---|---|---|---|
| Model A | | Optimal PCs | 12 | 8 | 5 | 8 | 15 | 18 | 22 |
| | Dataset A, training | RMSE | 0.16 | 0.26 | 0.35 | 0.27 | 0.26 | 0.26 | 0.23 |
| | | RSD | 0.09 | 0.15 | 0.20 | 0.16 | 0.15 | 0.15 | 0.13 |
| | | \( R_{\rm adj}^2 \) | 0.99 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 | 0.99 |
| | Dataset A, test | RMSE | 0.39 | 0.31 | 0.29 | 0.22 | 0.28 | 0.33 | 0.37 |
| | | RSD | 0.22 | 0.18 | 0.17 | 0.12 | 0.16 | 0.19 | 0.21 |
| | | \( R_{\rm adj}^2 \) | 0.96 | 0.98 | 0.98 | 0.99 | 0.98 | 0.97 | 0.97 |
| | Dataset B, training | RMSE | 0.14 | 0.17 | 0.26 | 0.23 | 0.19 | 0.18 | 0.16 |
| | | RSD | 0.20 | 0.25 | 0.38 | 0.33 | 0.27 | 0.25 | 0.24 |
| | | \( R_{\rm adj}^2 \) | 0.91 | 0.87 | 0.78 | 0.82 | 0.86 | 0.87 | 0.90 |
| | Dataset B, test | RMSE | 0.29 | 0.27 | 0.29 | 0.20 | 0.26 | 0.27 | 0.28 |
| | | RSD | 0.39 | 0.36 | 0.39 | 0.26 | 0.35 | 0.36 | 0.38 |
| | | \( R_{\rm adj}^2 \) | 0.70 | 0.74 | 0.69 | 0.85 | 0.75 | 0.73 | 0.72 |
| Model B | | Optimal PCs | 28 | 5 | 5 | 9 | 19 | 23 | 34 |
| | Dataset B, training | RMSE | 0.03 | 0.20 | 0.17 | 0.13 | 0.10 | 0.06 | 0.04 |
| | | RSD | 0.04 | 0.28 | 0.25 | 0.18 | 0.13 | 0.09 | 0.05 |
| | | \( R_{\rm adj}^2 \) | 0.99 | 0.80 | 0.85 | 0.92 | 0.96 | 0.98 | 0.99 |
| | Dataset B, test | RMSE | 0.20 | 0.26 | 0.25 | 0.18 | 0.15 | 0.09 | 0.14 |
| | | RSD | 0.27 | 0.34 | 0.33 | 0.24 | 0.20 | 0.12 | 0.19 |
| | | \( R_{\rm adj}^2 \) | 0.85 | 0.70 | 0.73 | 0.86 | 0.92 | 0.96 | 0.95 |

Support vector regression

In multivariate regression models such as MLR and PLSR, a linear relationship is assumed between the NMR spectral variables and the %Gal. Consequently, the predictive ability of the resultant model will suffer if the actual relationship between the dependent and independent variables is nonlinear rather than linear. In these cases, regression methods that encompass both linear and nonlinear relationships between the dependent and independent variables offer a more effective strategy. SVR models handle both linear and nonlinear relationships by using an appropriate kernel function that maps the input matrix X onto a higher dimensional feature space and transforms the nonlinear relationships into linear forms [43, 44]. The regression problem is then solved in this new feature space [48]. By introducing Vapnik’s ε-insensitive loss function, the support vector machine approach was extended beyond classification to regression [49, 50]. In this method, the regression function is fitted within a tube of radius ε around the training data. If a data point lies inside the tube, the loss function is equal to 0, whereas for a data point outside the tube the loss function increases linearly with its distance from the tube boundary [42]. Thus, the ε-insensitive loss function can be expressed as [51]:
$$ L({y_i},{\hat{y}_i},\varepsilon ) = \begin{cases} 0, & \left| {{y_i} - {{\hat{y}}_i}} \right| \le \varepsilon \\ \left| {{y_i} - {{\hat{y}}_i}} \right| - \varepsilon, & {\rm otherwise} \end{cases} $$
(11)
A cost function is defined by [44]:
$$ I = \frac{1}{2}\sum\limits_{j = 1}^m {b_j^2} + C\sum\limits_{i = 1}^n {L({y_i},{{\hat{y}}_i},\varepsilon )} $$
(12)
which combines a two-norm term of the regression coefficients and an error term multiplied by the error weight, C, a regularizing parameter which determines the trade-off between the training error and model complexity [52]. Through Lagrange optimization, the regression model can be found as:
$$ \hat{y} = \sum\limits_{i = 1}^n {({\alpha_i} - \alpha_i^*)} K({x_i},{x_j}) $$
(13)
$$ K({x_i},{x_j}) = \left\langle {\Phi ({x_i}),\Phi ({x_j})} \right\rangle $$
(14)
where α i and \( \alpha_i^* \) are the Lagrange multipliers, K(x i , x j) is the kernel function, and Φ is the mapping function from data X to feature space. In SVR, the radial basis function (RBF) is a commonly used kernel which is usually presented in the Gaussian form:
$$ K({x_i},{x_j}) = \exp ( - \gamma {\left\| {{x_i} - {x_j}} \right\|^2}). $$
(15)

Unlike the Lagrange multipliers, which are optimized automatically by the program, SVR requires the user to adjust the kernel parameters, the radius of the tube ε, and the regularizing parameter C. When applying the RBF kernel, the generalization property depends on the parameter γ, which controls the amplitude of the kernel function. If γ is too large, all training objects are used as support vectors, leading to overfitting. If γ is too small, all data points are regarded as one object, resulting in poor ability to generalize (i.e., predict beyond the training set) [44]. In addition, the penalty weight C and the tube size ε also require optimization. As the regularization parameter, C controls the trade-off between minimizing the training error and maximizing the margin. Generally, values of C that are too large or too small lead to regression models with poor prediction ability. When C is very low, the predictive ability of the model is exclusively determined by the weights of the regression coefficients [49]. When C is large, the cost function controls the performance while the regression coefficients have little bearing even if their values are very high. Data points with prediction errors larger than ±ε are the support vectors, which determine the predictive ability of the SVR model. A large number of support vectors occur at low ε, while sparse models are obtained when the value of ε is high. The optimal value of ε depends heavily on the individual dataset: small values of ε should be used for low levels of noise, whereas higher values of ε are appropriate for large experimental errors. Thus, in order to find the optimal combination of the parameters γ, C, and ε, cross-validation via a parallel grid search was performed.
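A hedged sketch of such a grid search with the e1071 package is shown below; the candidate grids and object names are illustrative and only loosely mirror the parameter ranges reported in Table 9.

```r
# eps-SVR with an RBF kernel (e1071), tuning gamma, cost and epsilon by
# 10-fold cross-validated grid search; the grids below are illustrative.
library(e1071)
tuned <- tune.svm(x = as.matrix(X_train), y = y_train,
                  type = "eps-regression", kernel = "radial",
                  gamma = 10^(-6:-3), cost = 10^(2:6),
                  epsilon = c(0.01, 0.05, 0.1, 0.2),
                  tunecontrol = tune.control(cross = 10))
best  <- tuned$best.model            # SVR refit at the best (gamma, cost, epsilon)
y_hat <- predict(best, as.matrix(X_test))
```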

The values of the optimal parameters γ, C, and ε as well as the predicted results of the optimal SVR models are shown in Table 9. For dataset A, the coefficient of determination \( R_{adj}^2 \) between the measured and predicted %Gal for the test set was 0.96 for the all-variable model. The predictive ability of the model with variables selected by GA gradually increased starting with five variables, reached a maximum at 30 variables, and then receded beyond this number. The \( R_{adj}^2 \) values for the test set were 0.98, 0.99, 0.99, 0.99, and 0.96, corresponding to 5, 10, 20, 30, and 40 variables.
Table 9 Statistical parameters obtained from the SVR models (RBF kernel) using stepwise and GA variable selection

| Model | Dataset, subset | Parameter | All (74) | Stepwise (11) | GA (5) | GA (10) | GA (20) | GA (30) | GA (40) |
|---|---|---|---|---|---|---|---|---|---|
| Model A | SVR parameters | ε | 0.14 | 0.01 | 0.18 | 0.10 | 0.07 | 0.05 | 0.10 |
| | | C | 1.0 × 10^6 | 1.0 × 10^4 | 1.0 × 10^5 | 1.0 × 10^6 | 1.0 × 10^5 | 1.0 × 10^5 | 1.0 × 10^4 |
| | | γ | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 |
| | | No. of vectors | 28 | 71 | 21 | 43 | 39 | 59 | 37 |
| | Dataset A, training | RMSE | 0.22 | 0.28 | 0.36 | 0.28 | 0.27 | 0.25 | 0.24 |
| | | RSD | 0.13 | 0.16 | 0.21 | 0.16 | 0.16 | 0.14 | 0.14 |
| | | \( R_{\rm adj}^2 \) | 0.99 | 0.98 | 0.97 | 0.98 | 0.98 | 0.99 | 0.99 |
| | Dataset A, test | RMSE | 0.43 | 0.25 | 0.28 | 0.23 | 0.22 | 0.21 | 0.41 |
| | | RSD | 0.24 | 0.14 | 0.16 | 0.13 | 0.13 | 0.12 | 0.23 |
| | | \( R_{\rm adj}^2 \) | 0.96 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.96 |
| | Dataset B, training | RMSE | 0.21 | 0.18 | 0.27 | 0.17 | 0.17 | 0.16 | 0.15 |
| | | RSD | 0.31 | 0.25 | 0.39 | 0.25 | 0.24 | 0.23 | 0.22 |
| | | \( R_{\rm adj}^2 \) | 0.82 | 0.86 | 0.77 | 0.88 | 0.88 | 0.90 | 0.90 |
| | Dataset B, test | RMSE | 0.39 | 0.23 | 0.25 | 0.20 | 0.18 | 0.16 | 0.36 |
| | | RSD | 0.53 | 0.31 | 0.34 | 0.26 | 0.24 | 0.22 | 0.49 |
| | | \( R_{\rm adj}^2 \) | 0.66 | 0.78 | 0.76 | 0.84 | 0.87 | 0.89 | 0.70 |
| Model B | SVR parameters | ε | 0 | 0.60 | 0.15 | 0.40 | 0.03 | 0.05 | 0.07 |
| | | C | 1.0 × 10^6 | 1.0 × 10^6 | 1.0 × 10^5 | 1.0 × 10^6 | 1.0 × 10^6 | 1.0 × 10^6 | 1.0 × 10^6 |
| | | γ | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−3 | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 | 1.0 × 10^−5 |
| | | No. of vectors | 57 | 16 | 39 | 15 | 53 | 51 | 49 |
| | Dataset B, training | RMSE | 0.02 | 0.21 | 0.14 | 0.14 | 0.10 | 0.07 | 0.04 |
| | | RSD | 0.02 | 0.30 | 0.21 | 0.19 | 0.14 | 0.10 | 0.05 |
| | | \( R_{\rm adj}^2 \) | 1.00 | 0.79 | 0.90 | 0.91 | 0.96 | 0.98 | 0.99 |
| | Dataset B, test | RMSE | 0.20 | 0.24 | 0.23 | 0.18 | 0.16 | 0.10 | 0.15 |
| | | RSD | 0.27 | 0.33 | 0.32 | 0.24 | 0.21 | 0.13 | 0.20 |
| | | \( R_{\rm adj}^2 \) | 0.82 | 0.74 | 0.76 | 0.87 | 0.91 | 0.96 | 0.92 |

As with RR and PLSR, SVR model performance was poorer for dataset B than for dataset A. For the all-variable models, the RBF kernel yielded \( R_{{\rm adj} }^2 = {1}.00 \) for the training set, but only 0.82 for the test set, suggesting overfitting. The predictive ability of the models improved considerably using GA for variable selection with an appropriate number of variables. A maximum \( R_{adj}^2 \) of 0.96 for the test set was achieved at 30 variables.

Conclusions

In this study, the %Gal in heparin (primarily originating from the DS impurity) was predicted from 1H NMR spectral data by means of four multivariate analysis approaches, i.e., MLR, RR, PLSR, and SVR. Variable selection was performed by GAs or stepwise methods in order to build robust and reliable models. The results demonstrated that excellent prediction performance was achieved in the determination of %Gal by all four regression models under optimal conditions. Variable selection substantially enhanced the predictive ability of all models, particularly the MLR model. Simple models were obtained using a subset of selected variables that predicted %Gal with high coefficients of determination and low prediction errors.

In general, GA was superior to the stepwise method for variable selection. Because GAs can choose any number of variables, a series of subsets containing 5 to 40 variables was selected to build predictive models. Models overfitted to the training sets through the use of excessive variables showed poor predictive ability on the test sets; similarly, an insufficient number of variables led to underfitted, statistically unstable models. The optimal subsets for datasets A and B were 10 and 30 variables, respectively. After variable selection, the four regression models considered in this study produced very similar results.

The range of %Gal in the samples influences many factors, including the selection of the regression approach, the choice of variable selection method and number of variables, and the interpretation of the models. Dataset A covered the full range of 0–10%Gal, while dataset B was the subset covering 0–2%Gal. As expected, the global model A performed best for dataset A while the local model B was preferred for dataset B, indicating that a multistage modeling approach may provide the best accuracy and range. Variable selection influenced the PLSR and SVR models only slightly for dataset A, but was required to achieve optimal results for dataset B. All four MVR approaches (MLR, RR, PLSR, and SVR) performed equally well and were robust under optimal conditions. However, SVR was slightly superior to the other three regression approaches when building models with dataset B.

The present study offers guidance in selecting the appropriate MVR approach to predict the %Gal in heparin based on the analysis of 1D 1H NMR data. Our results demonstrate that the combination of 1H NMR spectroscopy and chemometric techniques provides a rapid and efficient way to quantitatively determine the galactosamine content (as %Gal) in heparin. More generally, the present study underscores the importance of choosing the appropriate regression method, variable selection approach, and fitting parameters to build robust and highly predictive regression models for the rapid screening of heparin samples that may contain impurities and contaminants. Ongoing and future efforts will be directed toward the development of consensus or hierarchical frameworks in which multiple predictive techniques are pooled or tiered to augment predictive ability and to evaluate measures of the confidence of prediction.

Notes

FDA disclaimer

The findings and conclusions in this article have not been formally disseminated by the Food and Drug Administration and should not be construed to represent any agency determination or policy.

References

  1. 1.
    Ampofo SA, Wang HM, Linhardt RJ (1991) Disaccharide compositional analysis of heparin and heparan sulfate using capillary zone electrophoresis. Anal Biochem 199:249–255CrossRefGoogle Scholar
  2. 2.
    Rabenstein DL (2002) Heparin and heparan sulfate: structure and function. Nat Prod Rep 19:312–331CrossRefGoogle Scholar
  3. 3.
    Casu B (1990) Heparin structure. Haemostasis 20:62–73Google Scholar
  4. 4.
    Sudo M, Sato K, Chaidedgumjorn A, Toyoda H, Toida T, Imanari T (2001) 1H nuclear magnetic resonance spectroscopic analysis for determination of glucuronic and iduronic acids in dermatan sulfate, heparin, and heparan sulfate. Anal Biochem 297:42–51CrossRefGoogle Scholar
  5. 5.
    Linhardt RJ (1991) Hepairn: an important drug enters its seventh decade. Chem Ind 2:45–50Google Scholar
  6. 6.
    Lepor NE (2007) Anticoagulation for acute coronary syndromes: from heparin to direct thrombin inhibitors. Rev Cardiovasc Med 8(suppl 3):S9–S17Google Scholar
  7. Fischer KG (2007) Essentials of anticoagulation in hemodialysis. Hemodial Int 11:178–189
  8. Maruyama T, Toida T, Imanari T, Yu G, Linhardt RJ (1998) Conformational changes and anticoagulant activity of chondroitin sulfate following its O-sulfonation. Carbohydr Res 306:35–43
  9. Guerrini M, Bisio A, Torri G (2001) Combined quantitative 1H and 13C nuclear magnetic resonance spectroscopy for characterization of heparin preparations. Semin Thromb Hemost 27:473–482
  10. Toida T, Maruyama T, Ogita Y, Suzuki A, Toyoda H, Imanari T, Linhardt RJ (1999) Preparation and anticoagulant activity of fully O-sulphonated glycosaminoglycans. Int J Biol Macromol 26:233–241
  11. Griffin CC, Linhardt RJ, Van Gorp CL, Toida T, Hileman RE, Schubert RL II, Brown SE (1995) Isolation and characterization of heparan sulfate from crude porcine intestinal mucosal peptidoglycan heparin. Carbohydr Res 276:183–197
  12. Pervin A, Gallo C, Jandik KA, Han XJ, Linhardt RJ (1995) Preparation and structural characterization of large heparin-derived oligosaccharides. Glycobiology 5:83–95
  13. Korir AK, Larive CK (2009) Advances in the separation, sensitive detection, and characterization of heparin and heparan sulfate. Anal Bioanal Chem 393:155–169
  14. Eldridge SL, Korir AK, Gutierrez SM, Campos F, Limtiaco JFK, Larive CK (2008) Heterogeneity of depolymerized heparin SEC fractions: to pool or not to pool? Carbohydr Res 343:2963–2970
  15. Casu B, Guerrini M, Naggi A, Torri G, De-Ambrosi L, Boveri G, Gonella S, Cedro A, Ferró L, Lanzarotti E, Paterno M, Attolini M, Valle MG (1996) Characterization of sulfation patterns of beef and pig mucosal heparins by nuclear magnetic resonance spectroscopy. Arzneimittelforschung 46:472–477
  16. Guerrini M, Zhang Z, Shriver Z, Naggi A, Masuko S, Langer R, Casu B, Linhardt RJ, Torri G, Sasisekharan R (2009) Orthogonal analytical approaches to detect potential contaminants in heparin. PNAS 106:16956–16961
  17. Wielgos T, Havel K, Ivanova N, Weinberger R (2009) Determination of impurities in heparin by capillary electrophoresis using high molarity phosphate buffers. J Pharm Biomed Anal 49:319–326
  18. Limtiaco JF, Jones CJ, Larive CK (2009) Characterization of heparin impurities with HPLC-NMR using weak anion exchange chromatography. Anal Chem 81:10116–10123
  19. Trehy ML, Reepmeyer JC, Kolinski RE, Westenberger BJ, Buhse LF (2009) Analysis of heparin sodium by SAX/HPLC for contaminants and impurities. J Pharm Biomed Anal 49:670–673
  20. Beyer T, Diehl B, Randel G, Humpfer E, Schäfer H, Spraul M, Schollmayer C, Holzgrabe U (2008) Quality assessment of unfractionated heparin using 1H nuclear magnetic resonance spectroscopy. J Pharm Biomed Anal 48:13–19
  21. Domanig R, Jöbstl W, Gruber S, Freudemann T (2009) One-dimensional cellulose acetate plate electrophoresis – a feasible method for analysis of dermatan sulfate and other glycosaminoglycan impurities in pharmaceutical heparin. J Pharm Biomed Anal 49:151–155
  22. Perlin AS, Sauriol F, Cooper B, Folkman J (1987) Dermatan sulfate in pharmaceutical heparins. Thromb Haemost 58:792–793
  23. Guerrini M, Beccati D, Shriver Z, Naggi A, Viswanathan K, Bisio A, Capila I, Lansing JC, Guglieri S, Fraser B, Al-Hakim A, Gunay NS, Zhang Z, Robinson L, Buhse L, Nasr M, Woodcock J, Langer R, Venkataraman G, Linhardt RJ, Casu B, Torri G, Sasisekharan R (2008) Oversulfated chondroitin sulfate is a contaminant in heparin associated with adverse clinical events. Nat Biotechnol 26:669–675
  24. Sitkowski J, Bednarek E, Bocian W, Kozerski L (2008) Assessment of oversulfated chondroitin sulfate in low molecular weight and unfractioned heparins diffusion ordered nuclear magnetic resonance spectroscopy method. J Med Chem 51:7663–7665
  25. Rudd TR, Skidmore MA, Guimond SE, Cosentino C, Torri G, Fernig DG, Lauder RM, Guerrini M, Yates EA (2009) Glycosaminoglycan origin and structure revealed by multivariate analysis of NMR and CD spectra. Glycobiology 19:52–67
  26. Ruiz-Calero V, Saurina J, Galceran MT, Hernández-Cassou S, Puignou L (2002) Estimation of the composition of heparin mixtures from various origins using proton nuclear magnetic resonance and multivariate calibration methods. Anal Bioanal Chem 373:259–265
  27. Ruiz-Calero V, Saurina J, Galceran MT, Hernández-Cassou S, Puignou L (2000) Potentiality of proton nuclear magnetic resonance and multivariate calibration methods for the determination of dermatan sulfate contamination in heparin samples. Analyst 125:933–938
  28. Ruiz-Calero V, Saurina J, Hernández-Cassou S, Galceran MT, Puignou L (2002) Proton nuclear magnetic resonance characterization of glycosaminoglycans using chemometric techniques. Analyst 127:407–415
  29. Keire DA, Ye H, Trehy ML, Ye W, Kolinski RE, Westenberger BJ, Buhse LF, Nasr M, Al-Hakim A (2010) Characterization of currently marketed heparin products: key tests for quality assurance. Anal Bioanal Chem (in press)
  30. Weljie AM, Newton J, Mercier P, Carlson E, Slupsky CM (2006) Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Anal Chem 78:4430–4442
  31. R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing. www.r-project.org
  32. Varmuza K, Filzmoser P (2009) Introduction to multivariate statistical analysis in chemometrics. CRC, Boca Raton
  33. Maindonald J, Braun J (2003) Data analysis and graphics using R. Cambridge University Press, Cambridge
  34. Estienne F, Massart DL, Zanier-Szydlowski N, Marteau P (2000) Multivariate calibration with Raman spectroscopic data: a case study. Anal Chim Acta 424:185–201
  35. Carneiro RL, Braga JWB, Bottoli CBG, Poppi RJ (2007) Application of genetic algorithm for selection of variables for the BLLS method applied to determination of pesticides and metabolites in wine. Anal Chim Acta 595:51–58
  36. Broadhurst D, Goodacre R, Jones A, Rowland JJ, Kell DB (1997) Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry. Anal Chim Acta 348:71–86
  37. Leardi R (2001) Genetic algorithms in chemometrics and chemistry: a review. J Chemom 15:559–569
  38. Jouan-Rimbaud D, Massart D, Leardi R, De Noord OE (1995) Genetic algorithms as a tool for wavelength selection in multivariate calibration. Anal Chem 67:4295–4301
  39. Liebmann B, Friedl A, Varmuza K (2009) Determination of glucose and ethanol in bioethanol production by near infrared spectroscopy and chemometrics. Anal Chim Acta 642:171–178
  40. Gourvénec S, Capron X, Massart DL (2004) Genetic algorithms (GA) applied to the orthogonal projection approach (OPA) for variable selection. Anal Chim Acta 519:11–21
  41. Forshed J, Schuppe-Koistinen I, Jacobsson SP (2003) Peak alignment of NMR signals by means of a genetic algorithm. Anal Chim Acta 487:189–199
  42. Üstün B, Melssen WJ, Oudenhuijzen M, Buydens LMC (2005) Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization. Anal Chim Acta 544:292–305
  43. Huang J, Brennan D, Sattler L, Alderman J, Lane B, O’Mathuna C (2002) A comparison of calibration methods based on calibration data size and robustness. Chemom Intell Lab Syst 62:25–35
  44. Czekaj T, Wu W, Walczak B (2005) About kernel latent variable approaches and SVM. J Chemom 19:341–354
  45. Tistaert C, Dejaegher B, Nguyen Hoai N, Chataigné G, Riviere C, Nguyen Thi Hong V, Van Chau M, Quetin-Leclercq J, Vander Heyden Y (2009) Potential antioxidant compounds in Mallotus species fingerprints. Part I: indication, using linear multivariate calibration techniques. Anal Chim Acta 649:24–32
  46. Sun M, Zheng Y, Wei H, Chen J, Cai J, Ji M (2009) Enhanced replacement method-based quantitative structure–activity relationship modeling and support vector classification of 4-anilino-3-quinolinecarbonitriles as Src kinase inhibitors. QSAR Comb Sci 28:312–324
  47. Zhu D, Ji B, Meng C, Shi B, Tu Z, Qing Z (2007) The performance of ν-support vector regression on determination of soluble solids content of apple by acousto-optic tunable filter near-infrared spectroscopy. Anal Chim Acta 598:227–234
  48. Liu H, Zhang R, Yao X, Liu M, Hu Z, Fan B (2004) Prediction of electrophoretic mobility of substituted aromatic acids in different aqueous-alcoholic solvents by capillary zone electrophoresis based on support vector machine. Anal Chim Acta 525:31–41
  49. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
  50. Vapnik V (1998) Statistical learning theory. Wiley, New York
  51. Li H, Liang Y, Xu Q (2009) Support vector machines and its applications in chemistry. Chemom Intell Lab Syst 95:188–198
  52. Thissen U, Pepers M, Üstün B, Melssen WJ, Buydens LMC (2004) Comparing support vector machines to PLS for spectral regression applications. Chemom Intell Lab Syst 73:169–179

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Qingda Zang (1, 2, 3)
  • David A. Keire (4)
  • Richard D. Wood (2)
  • Lucinda F. Buhse (4)
  • Christine M. V. Moore (5)
  • Moheb Nasr (5)
  • Ali Al-Hakim (5)
  • Michael L. Trehy (4)
  • William J. Welsh (1)

  1. Department of Pharmacology, Robert Wood Johnson Medical School, University of Medicine & Dentistry of New Jersey, Piscataway, USA
  2. Snowdon, Inc., Monmouth Junction, USA
  3. Department of Health Informatics, School of Health Related Professions, University of Medicine & Dentistry of New Jersey, Newark, USA
  4. Division of Pharmaceutical Analysis, Food and Drug Administration, CDER, St Louis, USA
  5. Office of New Drug Quality Assessment, Food and Drug Administration, CDER, Silver Spring, USA
