Determination of galactosamine impurities in heparin samples by multivariate regression analysis of their ^{1}H NMR spectra
Abstract
Heparin, a widely used anticoagulant primarily extracted from animal sources, contains varying amounts of galactosamine impurities. Currently, the United States Pharmacopeia (USP) monograph for heparin purity specifies that the weight percent of galactosamine (%Gal) may not exceed 1%. In the present study, multivariate regression (MVR) analysis of ^{1}H NMR spectral data obtained from heparin samples was employed to build quantitative models for the prediction of %Gal. MVR analysis was conducted using four separate methods: multiple linear regression, ridge regression, partial least squares regression, and support vector regression (SVR). Genetic algorithms and stepwise selection methods were applied for variable selection. In each case, two separate prediction models were constructed: a global model based on dataset A, which contained the full range (0–10%) of galactosamine in the samples, and a local model based on the subset dataset B, for which the galactosamine level (0–2%) spanned the 1% USP limit. All four regression methods performed equally well for dataset A with low prediction errors under optimal conditions, whereas SVR was clearly superior among the four methods for dataset B. The results from this study show that ^{1}H NMR spectroscopy, already a USP requirement for the screening of contaminants in heparin, may offer utility as a rapid method for quantitative determination of %Gal in heparin samples when used in conjunction with MVR approaches.
Keywords
Heparin  Galactosamine impurities  Proton nuclear magnetic resonance (^{1}H NMR)  Multivariate regression (MVR)  Variable selection
Introduction
Heparin is a naturally occurring polydisperse mixture of linear, highly sulfated carbohydrates composed of repeating disaccharide units, which generally comprise a 6-O-sulfated, N-sulfated glucosamine alternating with a 2-O-sulfated iduronic acid [1, 2, 3]. As a member of the glycosaminoglycan family, heparin has the highest negative charge density among known biological molecules. During heparin biosynthesis, the polysaccharide chains are incompletely modified and variably elongated, leading to heterogeneity in chemical structure, diversity in sulfation patterns, and polydispersity in molecular mass [4]. As one of the oldest drugs still in widespread clinical use, heparin is highly effective in kidney dialysis and cardiac surgery. Heparin is the most widely used anticoagulant for preventing or treating thromboembolic disorders and for inhibiting coagulation during hemodialysis and extracorporeal blood circulation [5, 6, 7, 8].
Pharmaceutical heparin is usually obtained by extracting animal tissues, such as porcine intestines or bovine lung after proteolytic digestion, and then precipitating the preparations as quaternary ammonium complexes or barium salts [9, 10, 11, 12]. To ensure the appropriate biological activity, chemical parameters, including purity, molecular mass distribution, degree of sulfation, as well as the presence of specific oligosaccharide sequences, must be strictly controlled. Due to the heterogeneity of heparin preparations, it is difficult to accurately determine the precise chemical structure and to measure the performance of purification protocols [13, 14, 15].
Heparin always contains varying amounts of undesirable impurities. Among these, chondroitin sulfate A and dermatan sulfate (DS) have been identified. These chondroitin derivatives differ from heparin in that they contain galactosamine, and the level of these galactosamine-containing impurities in heparin is used as an indication of the purity of the drug substance [16, 17, 18, 19, 20]. DS is the most common chondroitin sulfate impurity in heparin, with concentrations up to a few percent [21, 22]. DS is composed of alternating iduronic acid–galactosamine disaccharide units, and due to their similarity with the iduronic–glucosamine disaccharide units of heparin, commercial heparin preparations usually contain small amounts of DS.
Currently, the United States Pharmacopeia (USP) monograph for heparin purity specifies that the weight percent of galactosamine (%Gal) may not exceed 1% of the total hexosamine content. The revised Stage 3 USP monograph proposed by the FDA specifies that %Gal may not exceed 1.0%. Therefore, for this work, we applied the 1.0% %Gal specification to delineate heparin samples that pass or fail this criterion.
The accurate measurement of %Gal in heparin is an important parameter to assure the safety and efficacy of the drug. The experimental determination of %Gal by acid digestion and high-performance liquid chromatography (HPLC) with a pulsed amperometric detector requires expert operators, expensive equipment, and careful sample preparation. In contrast, although the NMR approach requires more expensive equipment than the HPLC method, the sample preparation is minimal and the data are already required for other aspects of USP testing. Therefore, the development of simple computational methods for the prediction of %Gal values from NMR data is of particular interest.
^{1}H NMR spectroscopy is very sensitive to minor structural variations, so the repeating disaccharide units of heparin can be easily identified in ^{1}H NMR spectra by specific signals [4, 9, 15]. ^{1}H NMR spectroscopy has been widely applied for the characterization of the chemical composition of heparin and its derivatives, as well as for the identification of contaminants from various sources [16, 20, 23, 24]. With the help of chemometric techniques, useful chemical information from complex NMR spectra can be extracted and the characterization and quantification of analytes can be accomplished using the NMR signals in much the same way as fingerprints. Chemometric models have been successfully applied to the study of ^{1}H NMR spectra of several heparin samples [25, 26, 27, 28] in which spectral data were transformed into discrete variables for subsequent multivariate regression analysis (MVA).
The objective of this study was to employ MVA to build quantitative models for the prediction of %Gal in laboratory heparin samples based on analysis of their ^{1}H NMR spectral data. Several multivariate regression methods were implemented and compared, including multiple linear regression (MLR), ridge regression (RR), partial least squares regression (PLSR), and support vector regression (SVR). To obtain stable and robust models with high predictive ability, two variable selection techniques, viz., genetic algorithms (GAs) and stepwise methods, were employed to choose only the most information-rich subset of variables. The present results show that NMR spectroscopy together with chemometric techniques are useful for quantifying the %Gal content in heparin, thereby potentially obviating labor-intensive and costly chemical analysis.
Methods
NMR spectroscopy measurement
All samples were analyzed using a Varian Inova 500 instrument at the Washington University Chemistry Department NMR Facility operating at 499.893 MHz for ^{1}H nuclei. Samples were run with the probe air temperature regulated at 25 °C. Spectral parameters included a spectral window of 8,000 Hz centered on the residual water signal at 4.77 ppm, 16 co-added transients, a 90° pulse width, an acquisition time of 1.892 s, and a relaxation delay of 20 s. The total acquisition time per sample was 5.84 min. These acquisition parameters typically gave S/N values, measured around the N-acetyl methyl proton signal at 2.045 ppm, of approximately 1,000–2,000:1 for the heparin samples. The concentration of heparin in the NMR tube was 27 mg/mL (20 mg/700 μL). All samples were made approx. 3 mM in 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) as an internal chemical shift reference.
Galactosamine content analysis
A detailed description of the methods employed for the experimental determination of percent galactosamine in total hexosamine can be found in [29].
Data processing
^{1}H NMR analytical data of over 100 heparin sodium active pharmaceutical ingredient (API) samples from different suppliers with varying levels of chondroitins (primarily in the form of DS) were obtained from the chromatographic and spectroscopic experiments. These samples contained up to 10% by weight of chondroitins in the API by the %Gal HPLC assay. ^{1}H NMR data were processed with MestReC (version 5.3.0) software. Phase and baseline corrections were applied. Chemical shifts were referenced to internal DSS. Each ^{1}H NMR spectrum was reduced by dividing it into segments with a width of 0.03 ppm spanning the interval from 1.95 to 5.70 ppm, and peak integration was performed for each spectral region. All the heparin NMR spectra contain water from residual H_{2}O in the D_{2}O used (at 4.77 ppm). In addition, a number of batches contain other solvents and reagents in varying amounts, including methanol (3.35 ppm, singlet), ethanol (1.18 and 3.66 ppm, triplet and quartet), and acetate (1.92 ppm, singlet) [20]. These regions were excluded from the analysis, and the total dataset was reduced to 74 regions or variables. To compensate for concentration differences between the heparin samples, each integral was normalized to the sum of the integrals over the whole spectrum.
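The binning and normalization steps described above can be sketched as follows. This is a hypothetical illustration (the study used MestReC for processing): the function names, the point-wise integration, and the example solvent exclusion windows are our assumptions, not the authors' code.

```python
# Sketch of the spectral preprocessing: 0.03-ppm bins over 1.95-5.70 ppm,
# exclusion of solvent regions, and normalization to unit total integral.

def bin_spectrum(ppm, intensity, lo=1.95, hi=5.70, width=0.03):
    """Integrate a 1D spectrum into fixed-width chemical-shift bins."""
    edges, bins = [], []
    x = lo
    while x < hi - 1e-9:
        edges.append((x, x + width))
        x += width
    for a, b in edges:
        # crude point-sum integration over each bin (illustrative only)
        bins.append(sum(I for p, I in zip(ppm, intensity) if a <= p < b))
    return edges, bins

def exclude_and_normalize(edges, bins, excluded=((4.70, 4.85), (3.33, 3.37))):
    """Drop bins overlapping excluded regions (e.g. residual HDO, methanol;
    windows here are examples) and normalize to the total integral."""
    kept = [(e, v) for e, v in zip(edges, bins)
            if not any(a < e[1] and e[0] < b for a, b in excluded)]
    total = sum(v for _, v in kept) or 1.0
    return [(e, v / total) for e, v in kept]
```

With the 0.03-ppm width, the 1.95–5.70 ppm interval yields 125 bins before exclusion, consistent with a reduced set of 74 variables once all solvent and reagent windows are removed.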
Summary properties of the range of samples for which %Gal was measured by HPLC analysis
                Number of samples   Minimum   Maximum   Median   Mean
Dataset A
  Training set         76             0.01      9.68     0.86    1.74
  Test set             25             0.11      8.05     0.87    1.76
Dataset B
  Training set         57             0.01      1.86     0.66    0.71
  Test set             19             0.11      1.74     0.72    0.73
Computation programs
Mathematical treatments for data standardization, multivariate analysis, and statistical model building were performed using the R statistical analysis software for Windows (version 2.8.1) [31]. Stepwise variable selection, genetic algorithms, multiple linear regression, ridge regression, partial least squares regression, and support vector regression were implemented using the R packages chemometrics, subselect, stats, MASS, pls, and e1071, respectively [32, 33].
Results and discussion
NMR spectra of heparin samples
Variable selection
Variable selection is a crucial step in regression analysis as it controls both the number of variables and the mathematical complexity of the model. The presence of variables not related to the response can produce background noise, and redundant variables may confound regression models, thereby reducing their predictive ability. The selection of variables for multivariate calibration is an optimization procedure whose goal is to select the subset of variables that produce simple and robust regression models with high prediction performance. Two wellestablished methods, the stepwise selection technique and the GA, were used here for variable selection from the original NMR spectral matrix.
Stepwise procedure
Stepwise variable selection procedure for dataset A
Model size  BIC  Selected variables (ppm)  Add(+)/Drop(−) 

1  190.97  2.08  +2.08 
2  124.81  2.02, 2.08  +2.02 
3  99.64  2.02, 2.08, 2.11  +2.11 
4  84.84  2.02, 2.08, 2.11, 4.31  +4.31 
5  76.98  2.02, 2.08, 2.11, 3.53, 4.31  +3.53 
6  56.00  2.02, 2.08, 2.11, 3.50, 3.53, 4.31  +3.50 
7  54.23  2.02, 2.08, 2.11, 3.50, 3.53, 4.31, 5.61  +5.61 
8  50.01  2.02, 2.08, 2.11, 3.50, 3.53, 4.31, 5.34, 5.61  +5.34 
7′  45.47  2.02, 2.08, 3.50, 3.53, 4.31, 5.34, 5.61  −2.11 
8′  45.05  2.02, 2.08, 3.50, 3.53, 4.31, 5.34, 5.43, 5.61  +5.43 
9  42.17  2.02, 2.08, 3.50, 3.53, 4.25, 4.31, 5.34, 5.43, 5.61  +4.25 
10  40.48  2.02, 2.08, 3.50, 3.53, 3.59, 4.25, 4.31, 5.34, 5.43, 5.61  +3.59 
11  37.49  2.02, 2.08, 2.14, 3.50, 3.53, 3.59, 4.25, 4.31, 5.34, 5.43, 5.61  +2.14 
Stepwise variable selection procedure for dataset B
Model size  BIC  Selected variables (ppm)  Add(+)/Drop(−) 

1  69.71  2.08  +2.08 
2  42.48  2.02, 2.08  +2.02 
3  30.73  2.02, 2.08, 2.11  +2.11 
4  27.61  1.99, 2.02, 2.08, 2.11  +1.99 
5  24.13  1.99, 2.02, 2.08, 2.11, 4.37  +4.37 
6  20.93  1.99, 2.02, 2.08, 2.11, 4.22, 4.37  +4.22 
5′  17.17  1.99, 2.08, 2.11, 4.22, 4.37  −2.02 
6′  15.50  1.99, 2.08, 2.11, 2.20, 4.22, 4.37  +2.20 
5″  14.45  1.99, 2.08, 2.20, 4.22, 4.37  −2.11 
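The add/drop search traced in the tables above can be sketched as a greedy minimization of the Bayesian information criterion (BIC). This is a hypothetical Python illustration, not the R chemometrics code used in the study; the tiny OLS helper and the 1e-12 residual floor are our assumptions.

```python
import math

def ols_rss(X_cols, y):
    """Residual sum of squares for OLS with intercept (normal equations
    solved by Gauss-Jordan elimination; fine for small models)."""
    n = len(y)
    A = [[1.0] + [c[i] for c in X_cols] for i in range(n)]
    p = len(A[0])
    M = [[sum(A[i][j] * A[i][k] for i in range(n)) for k in range(p)]
         + [sum(A[i][j] * y[i] for i in range(n))] for j in range(p)]
    for j in range(p):
        piv = max(range(j, p), key=lambda r: abs(M[r][j]))
        M[j], M[piv] = M[piv], M[j]
        for r in range(p):
            if r != j and M[j][j]:
                f = M[r][j] / M[j][j]
                M[r] = [a - f * b for a, b in zip(M[r], M[j])]
    b = [M[j][p] / M[j][j] for j in range(p)]
    return sum((y[i] - sum(b[k] * A[i][k] for k in range(p))) ** 2
               for i in range(n))

def bic(rss, n, k):
    # BIC for a Gaussian model with k predictors plus intercept
    return n * math.log(max(rss, 1e-12) / n) + (k + 1) * math.log(n)

def stepwise(X_cols, y):
    """Greedy search: at each step take the single add or drop move
    that most lowers BIC; stop when no move improves it."""
    n, sel = len(y), []
    best = bic(ols_rss([], y), n, 0)
    while True:
        cands = [sorted(sel + [j]) for j in range(len(X_cols)) if j not in sel]
        cands += [[k for k in sel if k != j] for j in sel]
        if not cands:
            return sel, best
        score, c = min((bic(ols_rss([X_cols[j] for j in cand], y), n,
                            len(cand)), cand) for cand in cands)
        if score < best - 1e-9:
            best, sel = score, c
        else:
            return sel, best
```

The drop moves matter: as in the tables (steps 7′ and 5′/5″), a variable added early can become redundant once later variables enter and is then removed.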
Genetic algorithms
GAs are numerical optimization tools and randomized search techniques which simulate biological evolution based on Darwin’s theory of natural selection. The basic operation of GAs consists of five steps: encoding the variables as chromosomes; generating the initial population of chromosomes; evaluating the fitness function; creating the next generation of chromosomes; and terminating the process [35, 36]. GAs have demonstrated their utility for selecting optimal variables in multivariate calibration [37, 38, 39, 40, 41, 42] and are especially suitable for datasets with a large number (~200) of variables, such as the present case for the heparin NMR datasets. GA training requires the selection of several parameters, i.e., the number of chromosomes, initial population, selection mode, crossover parameters, mutation rate, and convergence criteria, all of which can influence the final results. In the present study, the entire set of 74 variables was used as input to the GA for the selection of the optimal subset of variables for predicting %Gal.
The initial population was set to 200 chromosomes, and the number of selected variables in each model was constrained to between 5 and 40. The chromosome with the maximum fitness value was retained from each generation. Depending on the fitness values, a subset of pairs of chromosomes was selected to undergo crossover (analogous to reproduction), in which two existing chromosomes exchange parts of their genetic content and two new chromosomes are formed. Following crossover, one or more mutations may occur, in which individual bits of a chromosome's string are randomly inverted such that the state of the gene is changed from “0” to “1” or vice versa. The crossover probability and mutation probability were set to 50% and 1%, respectively. The search was terminated after 100 generations.
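The scheme above can be sketched as a bit-string GA with the stated parameters (population 200, crossover probability 0.5, mutation probability 0.01, 100 generations, subset size kept in [5, 40]). This is a hypothetical illustration, not the R subselect code used in the study; the truncation selection with elitism and the repair step are our assumptions.

```python
import random

def ga_select(n_vars, fitness, pop_size=200, gens=100,
              p_cross=0.5, p_mut=0.01, kmin=5, kmax=40, rng=None):
    """Bit-string GA for variable selection: each chromosome is a 0/1
    vector over the spectral bins; returns the fittest chromosome seen."""
    rng = rng or random.Random(0)

    def repair(ch):
        # keep the number of selected variables within [kmin, kmax]
        on = [i for i, b in enumerate(ch) if b]
        off = [i for i, b in enumerate(ch) if not b]
        while len(on) < kmin:
            i = off.pop(rng.randrange(len(off))); ch[i] = 1; on.append(i)
        while len(on) > kmax:
            i = on.pop(rng.randrange(len(on))); ch[i] = 0
        return ch

    pop = [repair([rng.randint(0, 1) for _ in range(n_vars)])
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(gens):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]                                # elitism
        while len(nxt) < pop_size:
            a, b = rng.sample(scored[:pop_size // 2], 2)  # fitter half breeds
            child = a[:]
            if rng.random() < p_cross:                  # single-point crossover
                cut = rng.randrange(1, n_vars)
                child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            nxt.append(repair(child))
        pop = nxt
        best = max(pop + [best], key=fitness)
    return best
```

In the study the fitness function would score each candidate subset by the cross-validated error of the downstream regression model; any cheap surrogate can be plugged in for experimentation.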
Variables (ppm) selected by the GA method
No. of variables  Selected variables 

Dataset A  
5 variables  2.08, 2.11, 3.50, 3.53, 4.46 
10 variables  2.02, 2.08, 3.50, 3.53, 3.56, 3.71, 3.80, 5.49, 5.55, 5.67 
20 variables  2.08, 2.11, 2.17, 2.20, 3.50, 3.53, 3.56, 3.71, 3.74, 3.92, 4.01, 4.04, 4.40, 4.46, 4.52, 4.92, 5.01, 5.46, 5.58, 5.67 
30 variables  2.02, 2.08, 2.11, 2.14, 2.20, 3.53, 3.71, 3.74, 3.89, 3.98, 4.04, 4.13, 4.19, 4.34, 4.40, 4.46, 4.52, 4.92, 4.95, 4.98, 5.01, 5.04, 5.07, 5.22, 5.25, 5.37, 5.40, 5.58, 5.61, 5.67 
40 variables  1.96, 2.02, 2.05, 2.08, 2.11, 2.14, 2.20, 3.50, 3.56, 3.59, 3.62, 3.68, 3.71, 3.74, 3.83, 3.92, 3.95, 3.98, 4.01, 4.07, 4.10, 4.31, 4.34, 4.40, 4.43, 4.49, 4.58, 4.64, 5.04, 5.07, 5.13, 5.16, 5.22, 5.31, 5.34, 5.37, 5.40, 5.46, 5.61, 5.67 
Dataset B  
5 variables  2.08, 3.50, 3.56, 3.71, 4.46 
10 variables  2.02, 2.08, 2.14, 3.50, 3.56, 3.71, 4.46, 5.19, 5.49, 5.64 
20 variables  2.02, 2.08, 2.14, 2.20, 3.50, 3.56, 3.71, 3.77, 4.07, 4.13, 4.37, 4.43, 4.46, 4.49, 4.58, 5.04, 5.10, 5.19, 5.49, 5.61 
30 variables  1.96, 2.02, 2.08, 2.14, 2.20, 3.50, 3.56, 3.62, 3.71, 3.92, 3.95, 3.98, 4.07, 4.13, 4.37, 4.43, 4.46, 4.49, 4.58, 4.64, 5.04, 5.07, 5.10, 5.13, 5.16, 5.19, 5.22, 5.31, 5.49, 5.52 
40 variables  1.96, 2.02, 2.05, 2.08, 2.11, 2.14, 2.20, 3.50, 3.56, 3.59, 3.62, 3.68, 3.71, 3.74, 3.83, 3.92, 3.95, 3.98, 4.01, 4.07, 4.10, 4.31, 4.34, 4.40, 4.43, 4.49, 4.58, 4.64, 5.04, 5.07, 5.13, 5.16, 5.22, 5.31, 5.34, 5.37, 5.40, 5.46, 5.61, 5.67 
Multiple linear regression
MLR is a simple calibration method that avoids the need for adjustable parameters such as the factor number in partial least squares regression, the regularization parameter λ in ridge regression, and the kernel parameters in SVR. Consequently, MLR is among the most common approaches used to build multivariate regression models. However, overly complex MVR models with large numbers of independent variables may actually lose predictive ability. This common problem occurs when too many variables are used to fit the calibration (training) set; it can be mitigated by cross-validation and by further external validation of the model using test samples reserved for this purpose.
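The tables in this section report RMSE, RSD, and adjusted R² for the training and test sets. A minimal sketch of these statistics, assuming RSD is RMSE divided by the mean response (consistent with the tabulated values, e.g. 0.26/1.74 ≈ 0.15 for the dataset A training set) and the usual adjusted-R² formula; these definitions are our assumptions, not stated in this section:

```python
import math

def regression_metrics(y, yhat, n_vars):
    """RMSE, RSD (taken here as RMSE / mean(y)), and adjusted R^2 for a
    model with n_vars predictors."""
    n = len(y)
    ybar = sum(y) / n
    rss = sum((a - b) ** 2 for a, b in zip(y, yhat))   # residual sum of squares
    tss = sum((a - ybar) ** 2 for a in y)              # total sum of squares
    rmse = math.sqrt(rss / n)
    rsd = rmse / ybar
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_vars - 1)
    return rmse, rsd, r2_adj
```

The adjustment term (n − 1)/(n − p − 1) is what penalizes the all-variable models in the tables: a near-perfect training fit with 74 variables can still yield a poor test-set adjusted R².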
Statistical parameters obtained from the MLR models using stepwise and GA variable selection methods
 All  Stepwise  Genetic algorithms  

No. of variables  74  11  5  10  20  30  40  
Model A  
Dataset A  Training  
RMSE  0.01  0.26  0.35  0.27  0.26  0.22  0.17  
RSD  0.01  0.15  0.20  0.15  0.15  0.13  0.10  
\( R_{{\rm adj} }^2 \)  1.00  0.98  0.97  0.98  0.98  0.99  0.99  
Test  
RMSE  1.34  0.33  0.29  0.23  0.29  0.31  0.55  
RSD  0.76  0.19  0.17  0.13  0.16  0.18  0.31  
\( R_{{\rm adj} }^2 \)  0.62  0.98  0.98  0.99  0.98  0.98  0.93  
Dataset B  Training  
RMSE  0.01  0.19  0.26  0.19  0.18  0.16  0.14  
RSD  0.01  0.27  0.39  0.27  0.25  0.22  0.20  
\( R_{{\rm adj} }^2 \)  1.00  0.86  0.78  0.86  0.89  0.90  0.92  
Test  
RMSE  1.47  0.29  0.29  0.20  0.27  0.28  0.55  
RSD  1.99  0.40  0.39  0.27  0.36  0.38  0.75  
\( R_{{\rm adj} }^2 \)  0.11  0.66  0.70  0.85  0.76  0.72  0.59  
Model B  
Dataset B  Training  
RMSE  NA  0.21  0.18  0.13  0.10  0.07  0.03  
RSD  NA  0.30  0.25  0.18  0.14  0.10  0.04  
\( R_{{\rm adj} }^2 \)  NA  0.80  0.85  0.92  0.95  0.98  1.00  
Test  
RMSE  NA  0.26  0.25  0.18  0.15  0.10  0.14  
RSD  NA  0.36  0.34  0.24  0.20  0.13  0.19  
\( R_{{\rm adj} }^2 \)  NA  0.69  0.73  0.86  0.92  0.96  0.94 
When the most information-rich variables were selected and variables that were redundant or uncorrelated with the response were discarded, the performance of the model was enhanced significantly. The predictive ability was markedly improved for the MVR models containing up to 11 variables chosen by stepwise selection. Compared with the all-variable model, the \( R_{{\rm adj} }^2 \) for the test set increased from 0.62 to 0.98, even though the \( R_{{\rm adj} }^2 \) value for the training set dropped slightly from 1.00 to 0.98. Taken together, these results reflect the excellent agreement between the measured and predicted values after appropriate variable selection.
Using GA for variable selection, the model's quality depended somewhat on the number of selected variables. Table 5 shows that \( R_{{\rm adj} }^2 \) for the training set improved continuously from 0.97 to 0.99 as the number of variables increased from 5 to 40. In contrast, the test set followed a different pattern: the \( R_{{\rm adj} }^2 \) value initially increased to a maximum of 0.99 at ten variables, after which it gradually decreased to 0.93 at 40 variables. The minimum prediction error therefore occurred when the model was of moderate complexity. In the present case, the resulting model demonstrated good performance in estimating the %Gal concentrations using ten variables. As shown in Fig. 2b, the measured and predicted values were highly correlated over the entire concentration range for both the training and test datasets. Comparing the GA and stepwise selection methods, the statistical parameters \( R_{{\rm adj} }^2 \) and RMSE revealed a slight advantage for the former.
%Gal values for the 19 test-set samples measured by HPLC analysis and predicted by the MLR model using 30 variables selected by GA
Test sample  Measured  Predicted  Test sample  Measured  Predicted 

1  0.11  0.22  11  0.75  0.83 
2  0.16  0.10  12  0.83  0.81 
3  0.19  0.24  13  0.87  0.92 
4  0.25  0.27  14  1.01  0.97 
5  0.31  0.44  15  1.07  0.95 
6  0.39  0.30  16  1.17  1.30 
7  0.42  0.46  17  1.23  1.43 
8  0.51  0.64  18  1.63  1.52 
9  0.59  0.42  19  1.74  1.77 
10  0.72  0.77 
Ridge regression
Equation 10 is a linear function of the response variable y. The coefficient b_{ridge} is similar to the regression coefficient of MLR in Eq. 3, but the inverse is stabilized by the ridge parameter λ. The performance of ridge regression depends heavily on the proper choice of λ, which is achieved using cross-validation procedures.
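The stabilized closed form can be sketched directly: on centered data, b_ridge = (X'X + λI)⁻¹ X'y, so the penalty λ on the diagonal keeps the inverse well-conditioned even with collinear columns. A hypothetical pure-Python illustration (the study used the R MASS package), under the assumption of mean-centering before penalization:

```python
def ridge_fit(X_cols, y, lam):
    """Ridge coefficients b = (X'X + lam*I)^(-1) X'y on centered data,
    via penalized normal equations and Gauss-Jordan elimination."""
    n, p = len(y), len(X_cols)
    xm = [sum(c) / n for c in X_cols]
    ym = sum(y) / n
    Xc = [[c[i] - m for i in range(n)] for c, m in zip(X_cols, xm)]
    yc = [v - ym for v in y]
    # normal equations with the ridge penalty lam on the diagonal
    M = [[sum(Xc[j][i] * Xc[k][i] for i in range(n))
          + (lam if j == k else 0.0) for k in range(p)]
         + [sum(Xc[j][i] * yc[i] for i in range(n))] for j in range(p)]
    for j in range(p):
        piv = max(range(j, p), key=lambda r: abs(M[r][j]))
        M[j], M[piv] = M[piv], M[j]
        for r in range(p):
            if r != j and M[j][j]:
                f = M[r][j] / M[j][j]
                M[r] = [a - f * b for a, b in zip(M[r], M[j])]
    b = [M[j][p] / M[j][j] for j in range(p)]
    intercept = ym - sum(bj * m for bj, m in zip(b, xm))
    return intercept, b
```

At λ = 0 this reduces to OLS; increasing λ shrinks the coefficients toward zero, which is why ridge can fit the all-variable dataset B model that plain MLR cannot.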
Statistical parameters obtained from RR models using stepwise and GA variable selection
 All  Stepwise  Genetic algorithms  

No. of variables  74  11  5  10  20  30  40  
Model A (global)  
λ  0.01  0.28  0.18  0.56  0.64  0.34  0.27  
Dataset A  Training  
RMSE  0.02  0.26  0.35  0.27  0.26  0.23  0.17  
RSD  0.01  0.15  0.20  0.16  0.15  0.13  0.10  
\( R_{{\rm adj} }^2 \)  1.00  0.98  0.97  0.98  0.98  0.99  0.99  
Test  
RMSE  0.93  0.32  0.28  0.23  0.29  0.33  0.64  
RSD  0.53  0.18  0.16  0.13  0.17  0.19  0.36  
\( R_{{\rm adj} }^2 \)  0.80  0.98  0.98  0.99  0.98  0.97  0.90  
Dataset B  Training  
RMSE  0.02  0.18  0.27  0.18  0.17  0.15  0.14  
RSD  0.03  0.26  0.38  0.26  0.24  0.22  0.20  
\( R_{{\rm adj} }^2 \)  1.00  0.86  0.78  0.86  0.90  0.90  0.92  
Test  
RMSE  0.97  0.29  0.28  0.20  0.26  0.27  0.54  
RSD  1.30  0.38  0.37  0.27  0.35  0.36  0.73  
\( R_{{\rm adj} }^2 \)  0.31  0.69  0.69  0.85  0.77  0.75  0.60  
Model B (local)  
λ  0.01  0.27  0.06  0.02  0.05  0.03  0.01  
Dataset B  Training  
RMSE  0.01  0.21  0.17  0.13  0.10  0.07  0.03  
RSD  0.01  0.30  0.25  0.19  0.14  0.10  0.04  
\( R_{{\rm adj} }^2 \)  1.00  0.80  0.85  0.92  0.95  0.98  0.99  
Test  
RMSE  0.23  0.26  0.25  0.18  0.15  0.11  0.14  
RSD  0.31  0.35  0.34  0.24  0.20  0.14  0.19  
\( R_{{\rm adj} }^2 \)  0.78  0.69  0.73  0.86  0.91  0.95  0.95 
Prediction of the test data was achieved using the optimized regression coefficients. The statistical parameters calculated for the ridge regression models, including the adjusted coefficient \( R_{{\rm adj} }^2 \), RMSE, and RSD for both training and test sets, are presented in Table 7. For the all-variable model, the coefficient of determination \( R_{{\rm adj} }^2 \) for the test set increases from 0.62 for the MLR model to 0.80 for the ridge regression model for dataset A (%Gal = 0.0–10.0). The all-variable MLR model is unavailable for dataset B (%Gal = 0.0–2.0) since the number of variables exceeds the number of samples. Ridge regression is unconstrained by this condition, and the all-variable model yielded \( R_{{\rm adj} }^2 = {1}.00 \) for the training set and 0.78 for the test set (Table 7). However, the large errors (RSD = 0.31) for the test set are indicative of model overfitting and poor predictive ability. When variable selection was applied using either stepwise or GA methods, the predictive ability of the RR models approached that of the MLR models. Like the MLR models, the RR models showed poor predictive ability when the number of variables was too small (underfitting) or too large (overfitting). Therefore, selecting the appropriate number of variables was a key factor in achieving highly predictive models by ridge regression.
Partial least squares regression
PLSR is perhaps the most widely used multivariate regression method in chemometrics [32]. PLSR constructs predictive models between two blocks of variables by extracting latent variables (referred to here as principal components, or PCs) from the predictors such that their covariance with the response variables is maximized. The advantage of this method over MLR is its capacity to build a regression model from highly correlated (collinear) variables. In PLSR, the X data are first transformed into a set of orthogonal PCs, linear combinations of the original variables, which serve as new variables for regression against the dependent variable y.
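The component extraction can be sketched with the classical NIPALS algorithm for a single response (PLS1). This is a hypothetical illustration, not the R pls package code used in the study; it assumes mean-centered data and sequential deflation of X and y by each extracted component:

```python
def pls1(X_cols, y, ncomp):
    """PLS1 via NIPALS: extract up to ncomp latent components that
    maximize covariance with y; return a prediction function."""
    n, p = len(y), len(X_cols)
    xm = [sum(c) / n for c in X_cols]
    ym = sum(y) / n
    E = [[X_cols[j][i] - xm[j] for j in range(p)] for i in range(n)]  # n x p
    f = [v - ym for v in y]
    W, P, Q = [], [], []
    for _ in range(ncomp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        nrm = sum(v * v for v in w) ** 0.5
        if nrm == 0:
            break                      # y residual fully explained
        w = [v / nrm for v in w]       # weight vector
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        pv = [sum(t[i] * E[i][j] for i in range(n)) / tt for j in range(p)]
        q = sum(t[i] * f[i] for i in range(n)) / tt
        for i in range(n):             # deflate X and y
            for j in range(p):
                E[i][j] -= t[i] * pv[j]
            f[i] -= t[i] * q
        W.append(w); P.append(pv); Q.append(q)

    def predict(x_row):
        e = [x_row[j] - xm[j] for j in range(p)]
        yhat = ym
        for w, pv, q in zip(W, P, Q):
            t = sum(e[j] * w[j] for j in range(p))
            e = [e[j] - t * pv[j] for j in range(p)]
            yhat += t * q
        return yhat
    return predict
```

The "Optimal PCs" rows in the tables correspond to choosing ncomp by cross-validation: too few components underfit, while using every component reproduces the (overfit-prone) least-squares solution.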
Statistical parameters obtained from PLSR models using stepwise and GA variable selection methods
 All  Stepwise  Genetic algorithms  

No. of variables  74  11  5  10  20  30  40  
Model A  
Optimal PCs  12  8  5  8  15  18  22  
Dataset A  Training  
RMSE  0.16  0.26  0.35  0.27  0.26  0.26  0.23  
RSD  0.09  0.15  0.20  0.16  0.15  0.15  0.13  
\( R_{{\rm adj} }^2 \)  0.99  0.98  0.97  0.98  0.98  0.98  0.99  
Test  
RMSE  0.39  0.31  0.29  0.22  0.28  0.33  0.37  
RSD  0.22  0.18  0.17  0.12  0.16  0.19  0.21  
\( R_{{\rm adj} }^2 \)  0.96  0.98  0.98  0.99  0.98  0.97  0.97  
Dataset B  Training  
RMSE  0.14  0.17  0.26  0.23  0.19  0.18  0.16  
RSD  0.20  0.25  0.38  0.33  0.27  0.25  0.24  
\( R_{{\rm adj} }^2 \)  0.91  0.87  0.78  0.82  0.86  0.87  0.90  
Test  
RMSE  0.29  0.27  0.29  0.20  0.26  0.27  0.28  
RSD  0.39  0.36  0.39  0.26  0.35  0.36  0.38  
\( R_{{\rm adj} }^2 \)  0.70  0.74  0.69  0.85  0.75  0.73  0.72  
Model B  
Optimal PCs  28  5  5  9  19  23  34  
Dataset B  Training  
RMSE  0.03  0.20  0.17  0.13  0.10  0.06  0.04  
RSD  0.04  0.28  0.25  0.18  0.13  0.09  0.05  
\( R_{{\rm adj} }^2 \)  0.99  0.80  0.85  0.92  0.96  0.98  0.99  
Test  
RMSE  0.20  0.26  0.25  0.18  0.15  0.09  0.14  
RSD  0.27  0.34  0.33  0.24  0.20  0.12  0.19  
\( R_{{\rm adj} }^2 \)  0.85  0.70  0.73  0.86  0.92  0.96  0.95 
Support vector regression
Unlike the Lagrange multipliers, which are optimized automatically by the program, the kernel parameters, the tube radius ε, and the regularization parameter C must be adjusted by the user. When the RBF kernel is applied, the generalization ability depends on the parameter γ, which controls the amplitude of the kernel function. If γ is too large, all training objects are used as support vectors, leading to overfitting. If γ is too small, all data points are regarded as one object, resulting in poor ability to generalize (i.e., predict beyond the training set) [44]. In addition, the penalty weight C and the tube size ε also require optimization. As the regularization parameter, C controls the trade-off between minimizing the training error and maximizing the margin. Generally, values of C that are too large or too small lead to regression models with poor prediction ability. When C is very low, the predictive ability of the model is determined exclusively by the weights of the regression coefficients [49]. When C is large, the cost function controls the performance, and the regression coefficients have little bearing even if their values are very high. Data points with prediction errors larger than ±ε are the support vectors, which determine the predictive ability of the SVR model. A large number of support vectors occur at low ε, while sparse models are obtained when ε is high. The optimal value of ε depends heavily on the individual dataset: small values of ε should be used for low noise levels, whereas higher values are appropriate for large experimental errors. Thus, to find the optimal combination of the parameters γ, C, and ε, cross-validation via parallel grid search was performed.
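The grid-search pattern itself is generic and can be sketched independently of the SVR solver. This hypothetical illustration takes a user-supplied fit/predict routine (standing in for an RBF-kernel SVR such as that in the e1071 package) and returns the (γ, C, ε) combination with the lowest cross-validated RMSE; the interleaved fold assignment is our assumption.

```python
import itertools

def grid_search(fit_predict, X, y, grid, k=5):
    """Exhaustive grid search with k-fold cross-validation.
    fit_predict(params, X_tr, y_tr, X_te) -> list of predictions for X_te.
    grid maps parameter names (e.g. 'gamma', 'C', 'epsilon') to value lists.
    Returns the parameter dict with the lowest CV RMSE."""
    n = len(y)
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved folds
    best = (float("inf"), None)
    keys = sorted(grid)
    for values in itertools.product(*(grid[kk] for kk in keys)):
        params = dict(zip(keys, values))
        sq_err, m = 0.0, 0
        for te in folds:
            tr = [i for i in range(n) if i not in te]
            preds = fit_predict(params,
                                [X[i] for i in tr], [y[i] for i in tr],
                                [X[i] for i in te])
            sq_err += sum((preds[j] - y[i]) ** 2 for j, i in enumerate(te))
            m += len(te)
        rmse = (sq_err / m) ** 0.5
        if rmse < best[0]:
            best = (rmse, params)
    return best[1]
```

Because each grid point is evaluated independently, the outer loop parallelizes trivially, which is what makes the "parallel grid search" mentioned above practical for three-parameter grids.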
Statistical parameters obtained from the SVR models (RBF kernel) using stepwise and GA variable selection
 All  Stepwise  Genetic algorithms  

No. of variables  74  11  5  10  20  30  40  
Model A  
SVR parameters  ε  0.14  0.01  0.18  0.10  0.07  0.05  0.10 
C  1.0 × 10^{6}  1.0 × 10^{4}  1.0 × 10^{5}  1.0 × 10^{6}  1.0 × 10^{5}  1.0 × 10^{5}  1.0 × 10^{4}  
γ  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  
No. of vectors  28  71  21  43  39  59  37  
Dataset A  Training  
RMSE  0.22  0.28  0.36  0.28  0.27  0.25  0.24  
RSD  0.13  0.16  0.21  0.16  0.16  0.14  0.14  
\( R_{{\rm adj} }^2 \)  0.99  0.98  0.97  0.98  0.98  0.99  0.99  
Test  
RMSE  0.43  0.25  0.28  0.23  0.22  0.21  0.41  
RSD  0.24  0.14  0.16  0.13  0.13  0.12  0.23  
\( R_{{\rm adj} }^2 \)  0.96  0.98  0.98  0.99  0.99  0.99  0.96  
Dataset B  Training  
RMSE  0.21  0.18  0.27  0.17  0.17  0.16  0.15  
RSD  0.31  0.25  0.39  0.25  0.24  0.23  0.22  
\( R_{{\rm adj} }^2 \)  0.82  0.86  0.77  0.88  0.88  0.90  0.90  
Test  
RMSE  0.39  0.23  0.25  0.20  0.18  0.16  0.36  
RSD  0.53  0.31  0.34  0.26  0.24  0.22  0.49  
\( R_{{\rm adj} }^2 \)  0.66  0.78  0.76  0.84  0.87  0.89  0.70  
Model B  
SVR parameters  ε  0  0.60  0.15  0.40  0.03  0.05  0.07 
C  1.0 × 10^{6}  1.0 × 10^{6}  1.0 × 10^{5}  1.0 × 10^{6}  1.0 × 10^{6}  1.0 × 10^{6}  1.0 × 10^{6}  
γ  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−3}  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  1.0 × 10^{−5}  
No. of vectors  57  16  39  15  53  51  49  
Dataset B  Training  
RMSE  0.02  0.21  0.14  0.14  0.10  0.07  0.04  
RSD  0.02  0.30  0.21  0.19  0.14  0.10  0.05  
\( R_{{\rm adj} }^2 \)  1.00  0.79  0.90  0.91  0.96  0.98  0.99  
Test  
RMSE  0.20  0.24  0.23  0.18  0.16  0.10  0.15  
RSD  0.27  0.33  0.32  0.24  0.21  0.13  0.20  
\( R_{{\rm adj} }^2 \)  0.82  0.74  0.76  0.87  0.91  0.96  0.92 
As with RR and PLSR, SVR model performance was poorer for dataset B than for dataset A. For the all-variable models, the RBF kernel yielded \( R_{{\rm adj} }^2 = {1}.00 \) for the training set, but only 0.82 for the test set, suggesting overfitting. The predictive ability of the models improved considerably using GA for variable selection with an appropriate number of variables. A maximum \( R_{{\rm adj} }^2 \) of 0.96 for the test set was achieved at 30 variables.
Conclusions
In this study, the %Gal in heparin (primarily originating from the DS impurity) was predicted from ^{1}H NMR spectral data by means of four multivariate analysis approaches, i.e., MLR, RR, PLSR, and SVR. Variable selection was performed by GAs or stepwise methods in order to build robust and reliable models. The results demonstrated that excellent prediction performance was achieved in the determination of %Gal by all four regression models under optimal conditions. Variable selection substantially enhanced the predictive ability of all models, particularly the MLR model. Simple models were obtained using a subset of selected variables that predicted %Gal with high coefficients of determination and low prediction errors.
In general, GA was superior to the stepwise method for variable selection. Because GAs can choose any number of variables, subsets of 5 to 40 variables were selected to build predictive models. Models overfitted to the training sets through the use of too many variables showed poor predictive ability on the test sets. Similarly, underfitted models built from too few variables were statistically unstable. The optimal subsets for datasets A and B were 10 and 30 variables, respectively. After variable selection, the four regression models considered in this study produced very similar results.
The range of %Gal in the samples influences many factors: the choice of regression approach, the variable selection method and number of variables, and the interpretation of the models. Dataset A covered the full range 0–10%Gal, while dataset B was the subset covering 0–2%Gal. As expected, the global model A performed best for dataset A while the local model B was preferred for dataset B, indicating that a multistage modeling approach may provide the best accuracy and range. Variable selection influenced the PLSR and SVR models only slightly for dataset A, but was required to achieve optimal results for dataset B. All four MVR approaches (MLR, RR, PLSR, and SVR) performed equally well and were robust under optimal conditions. However, SVR was slightly superior to the other three regression approaches when building models with dataset B.
The present study offers assistance in selecting the appropriate MVR approach to predict the %Gal in heparin based on the analysis of 1D ^{1}H NMR data. Our results demonstrate that the combination of ^{1}H NMR spectroscopy and chemometric techniques provides a rapid and efficient way to quantitatively determine the galactosamine content (as %Gal) in heparin. More generally, the present study underscores the importance of choosing the appropriate regression method, variable selection approach, and fitting parameters to build robust and highly predictive regression models for the rapid screening of heparin samples that may contain impurities and contaminants. Ongoing and future efforts will be directed toward the development of consensus or hierarchical frameworks in which multiple predictive techniques are pooled or tiered to augment predictive ability and to evaluate measures of the confidence of prediction.
Notes
FDA disclaimer
The findings and conclusions in this article have not been formally disseminated by the Food and Drug Administration and should not be construed to represent any agency determination or policy.
References
1. Ampofo SA, Wang HM, Linhardt RJ (1991) Disaccharide compositional analysis of heparin and heparan sulfate using capillary zone electrophoresis. Anal Biochem 199:249–255
2. Rabenstein DL (2002) Heparin and heparan sulfate: structure and function. Nat Prod Rep 19:312–331
3. Casu B (1990) Heparin structure. Haemostasis 20:62–73
4. Sudo M, Sato K, Chaidedgumjorn A, Toyoda H, Toida T, Imanari T (2001) ^{1}H nuclear magnetic resonance spectroscopic analysis for determination of glucuronic and iduronic acids in dermatan sulfate, heparin, and heparan sulfate. Anal Biochem 297:42–51
5. Linhardt RJ (1991) Heparin: an important drug enters its seventh decade. Chem Ind 2:45–50
6. Lepor NE (2007) Anticoagulation for acute coronary syndromes: from heparin to direct thrombin inhibitors. Rev Cardiovasc Med 8(suppl 3):S9–S17
7. Fischer KG (2007) Essentials of anticoagulation in hemodialysis. Hemodial Int 11:178–189
8. Maruyama T, Toida T, Imanari T, Yu G, Linhardt RJ (1998) Conformational changes and anticoagulant activity of chondroitin sulfate following its O-sulfonation. Carbohydr Res 306:35–43
9. Guerrini M, Bisio A, Torri G (2001) Combined quantitative ^{1}H and ^{13}C nuclear magnetic resonance spectroscopy for characterization of heparin preparations. Semin Thromb Hemost 27:473–482
10. Toida T, Maruyama T, Ogita Y, Suzuki A, Toyoda H, Imanari T, Linhardt RJ (1999) Preparation and anticoagulant activity of fully O-sulphonated glycosaminoglycans. Int J Biol Macromol 26:233–241
11. Griffin CC, Linhardt RJ, Van Gorp CL, Toida T, Hileman RE, Schubert RL II, Brown SE (1995) Isolation and characterization of heparan sulfate from crude porcine intestinal mucosal peptidoglycan heparin. Carbohydr Res 276:183–197
12. Pervin A, Gallo C, Jandik KA, Han XJ, Linhardt RJ (1995) Preparation and structural characterization of large heparin-derived oligosaccharides. Glycobiology 5:83–95
13. Korir AK, Larive CK (2009) Advances in the separation, sensitive detection, and characterization of heparin and heparan sulfate. Anal Bioanal Chem 393:155–169
14. Eldridge SL, Korir AK, Gutierrez SM, Campos F, Limtiaco JFK, Larive CK (2008) Heterogeneity of depolymerized heparin SEC fractions: to pool or not to pool? Carbohydr Res 343:2963–2970
15. Casu B, Guerrini M, Naggi A, Torri G, De Ambrosi L, Boveri G, Gonella S, Cedro A, Ferró L, Lanzarotti E, Paterno M, Attolini M, Valle MG (1996) Characterization of sulfation patterns of beef and pig mucosal heparins by nuclear magnetic resonance spectroscopy. Arzneimittelforschung 46:472–477
16. Guerrini M, Zhang Z, Shriver Z, Naggi A, Masuko S, Langer R, Casu B, Linhardt RJ, Torri G, Sasisekharan R (2009) Orthogonal analytical approaches to detect potential contaminants in heparin. PNAS 106:16956–16961
17. Wielgos T, Havel K, Ivanova N, Weinberger R (2009) Determination of impurities in heparin by capillary electrophoresis using high molarity phosphate buffers. J Pharm Biomed Anal 49:319–326
18. Limtiaco JF, Jones CJ, Larive CK (2009) Characterization of heparin impurities with HPLC-NMR using weak anion exchange chromatography. Anal Chem 81:10116–10123
19. Trehy ML, Reepmeyer JC, Kolinski RE, Westenberger BJ, Buhse LF (2009) Analysis of heparin sodium by SAX/HPLC for contaminants and impurities. J Pharm Biomed Anal 49:670–673
20. Beyer T, Diehl B, Randel G, Humpfer E, Schäfer H, Spraul M, Schollmayer C, Holzgrabe U (2008) Quality assessment of unfractionated heparin using ^{1}H nuclear magnetic resonance spectroscopy. J Pharm Biomed Anal 48:13–19
21. Domanig R, Jöbstl W, Gruber S, Freudemann T (2009) One-dimensional cellulose acetate plate electrophoresis – a feasible method for analysis of dermatan sulfate and other glycosaminoglycan impurities in pharmaceutical heparin. J Pharm Biomed Anal 49:151–155
22. Perlin AS, Sauriol F, Cooper B, Folkman J (1987) Dermatan sulfate in pharmaceutical heparins. Thromb Haemost 58:792–793
23. Guerrini M, Beccati D, Shriver Z, Naggi A, Viswanathan K, Bisio A, Capila I, Lansing JC, Guglieri S, Fraser B, Al-Hakim A, Gunay NS, Zhang Z, Robinson L, Buhse L, Nasr M, Woodcock J, Langer R, Venkataraman G, Linhardt RJ, Casu B, Torri G, Sasisekharan R (2008) Oversulfated chondroitin sulfate is a contaminant in heparin associated with adverse clinical events. Nat Biotechnol 26:669–675
24. Sitkowski J, Bednarek E, Bocian W, Kozerski L (2008) Assessment of oversulfated chondroitin sulfate in low molecular weight and unfractioned heparins diffusion ordered nuclear magnetic resonance spectroscopy method. J Med Chem 51:7663–7665
25. Rudd TR, Skidmore MA, Guimond SE, Cosentino C, Torri G, Fernig DG, Lauder RM, Guerrini M, Yates EA (2009) Glycosaminoglycan origin and structure revealed by multivariate analysis of NMR and CD spectra. Glycobiology 19:52–67
26. Ruiz-Calero V, Saurina J, Galceran MT, Hernández-Cassou S, Puignou L (2002) Estimation of the composition of heparin mixtures from various origins using proton nuclear magnetic resonance and multivariate calibration methods. Anal Bioanal Chem 373:259–265
27. Ruiz-Calero V, Saurina J, Galceran MT, Hernández-Cassou S, Puignou L (2000) Potentiality of proton nuclear magnetic resonance and multivariate calibration methods for the determination of dermatan sulfate contamination in heparin samples. Analyst 125:933–938
28. Ruiz-Calero V, Saurina J, Hernández-Cassou S, Galceran MT, Puignou L (2002) Proton nuclear magnetic resonance characterization of glycosaminoglycans using chemometric techniques. Analyst 127:407–415
29. Keire DA, Ye H, Trehy ML, Ye W, Kolinski RE, Westenberger BJ, Buhse LF, Nasr M, Al-Hakim A (2010) Characterization of currently marketed heparin products: key tests for quality assurance. Anal Bioanal Chem (in press)
30. Weljie AM, Newton J, Mercier P, Carlson E, Slupsky CM (2006) Targeted profiling: quantitative analysis of ^{1}H NMR metabolomics data. Anal Chem 78:4430–4442
31. R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing. www.r-project.org
32. Varmuza K, Filzmoser P (2009) Introduction to multivariate statistical analysis in chemometrics. CRC Press, Boca Raton
33. Maindonald J, Braun J (2003) Data analysis and graphics using R. Cambridge University Press, Cambridge
34. Estienne F, Massart DL, Zanier-Szydlowski N, Marteau P (2000) Multivariate calibration with Raman spectroscopic data: a case study. Anal Chim Acta 424:185–201
35. Carneiro RL, Braga JWB, Bottoli CBG, Poppi RJ (2007) Application of genetic algorithm for selection of variables for the BLLS method applied to determination of pesticides and metabolites in wine. Anal Chim Acta 595:51–58
36. Broadhurst D, Goodacre R, Jones A, Rowland JJ, Kell DB (1997) Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry. Anal Chim Acta 348:71–86
37. Leardi R (2001) Genetic algorithms in chemometrics and chemistry: a review. J Chemom 15:559–569
38. Jouan-Rimbaud D, Massart D, Leardi R, De Noord OE (1995) Genetic algorithms as a tool for wavelength selection in multivariate calibration. Anal Chem 67:4295–4301
39. Liebmann B, Friedl A, Varmuza K (2009) Determination of glucose and ethanol in bioethanol production by near infrared spectroscopy and chemometrics. Anal Chim Acta 642:171–178
40. Gourvénec S, Capron X, Massart DL (2004) Genetic algorithms (GA) applied to the orthogonal projection approach (OPA) for variable selection. Anal Chim Acta 519:11–21
41. Forshed J, Schuppe-Koistinen I, Jacobsson SP (2003) Peak alignment of NMR signals by means of a genetic algorithm. Anal Chim Acta 487:189–199
42. Üstün B, Melssen WJ, Oudenhuijzen M, Buydens LMC (2005) Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization. Anal Chim Acta 544:292–305
43. Huang J, Brennan D, Sattler L, Alderman J, Lane B, O’Mathuna C (2002) A comparison of calibration methods based on calibration data size and robustness. Chemom Intell Lab Syst 62:25–35
44. Czekaj T, Wu W, Walczak B (2005) About kernel latent variable approaches and SVM. J Chemom 19:341–354
45. Tistaert C, Dejaegher B, Nguyen Hoai N, Chataigné G, Riviere C, Nguyen Thi Hong V, Van Chau M, Quetin-Leclercq J, Vander Heyden Y (2009) Potential antioxidant compounds in Mallotus species fingerprints. Part I: indication, using linear multivariate calibration techniques. Anal Chim Acta 649:24–32
46. Sun M, Zheng Y, Wei H, Chen J, Cai J, Ji M (2009) Enhanced replacement method-based quantitative structure–activity relationship modeling and support vector classification of 4-anilino-3-quinolinecarbonitriles as Src kinase inhibitors. QSAR Comb Sci 28:312–324
47. Zhu D, Ji B, Meng C, Shi B, Tu Z, Qing Z (2007) The performance of ν-support vector regression on determination of soluble solids content of apple by acousto-optic tunable filter near-infrared spectroscopy. Anal Chim Acta 598:227–234
48. Liu H, Zhang R, Yao X, Liu M, Hu Z, Fan B (2004) Prediction of electrophoretic mobility of substituted aromatic acids in different aqueous-alcoholic solvents by capillary zone electrophoresis based on support vector machine. Anal Chim Acta 525:31–41
49. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
50. Vapnik V (1998) Statistical learning theory. Wiley, New York
51. Li H, Liang Y, Xu Q (2009) Support vector machines and its applications in chemistry. Chemom Intell Lab Syst 95:188–198
52. Thissen U, Pepers M, Üstün B, Melssen WJ, Buydens LMC (2004) Comparing support vector machines to PLS for spectral regression applications. Chemom Intell Lab Syst 73:169–179