INTRODUCTION

A protein’s interaction with itself and with other proteins affects important characteristics such as its solubility (1), aggregation (2) and ability to crystallize (3). Measurement of second virial coefficients, B 22 (4), provides one method to quantify protein interactions at the molecular level. B 22 is a measure of the entirety of two-body protein self-interactions and includes contributions from excluded volume, electrostatic factors (attractive and repulsive) and hydrophobic interactions. In terms of McMillan–Mayer solution theory (5), B 22 is related to a potential of mean force which describes all of the interaction forces between protein molecules in a dilute solution. Positive B 22 values correspond to net repulsive forces between protein molecules and are correlated with increased protein solubility in solution (1,6), whereas values in the negative range correspond to the net attractive forces required for protein insolubility (i.e. precipitation or crystallization conditions (3)). Identified as one indicator of the physical stability of proteins in solution (7), the second virial coefficient depends on a variety of solution formulation parameters including temperature, pH and the type and concentration of salts and excipients (additives). As these additives interact with a protein’s surface, they alter its shape and the other surface properties that govern protein interactions.

The second virial coefficient can provide functional insight at various stages of the drug discovery process. The initial evaluation of a protein’s function in human pathology is often facilitated by study of the protein’s structure by means of X-ray diffraction. George and Wilson have shown (3) that proteins generally crystallize when their B 22 values are in a “crystallization slot” ranging from approximately −0.2 to −8 (×10⁻⁴ mol ml/g²). This B 22 range, confirmed by several research groups, represents slightly to moderately attractive forces between proteins, a condition that appears to be important for nucleation and subsequent crystal formation (2,8–10).

The determination of solution conditions yielding diffraction-quality crystals, as well as high protein solubility and/or low nonspecific aggregation of proteins expressed in prokaryotic and eukaryotic systems, represents a major bottleneck in high-throughput protein structure determination (11,12). Although there have been advances in the ability to recover bioactive protein from the inclusion bodies of various expression systems (13), these techniques require customization to the protein of interest, a requirement that is not conducive to high-throughput methods. The mathematical relationship between the B 22 value and solubility, derived by Haas et al. (1), indicates a marked increase in solubility with increasing B 22 value. This relationship has been validated experimentally for a variety of proteins (1,6,8). Thus, a second application of the second virial coefficient involves its use as a diagnostic for protein solubility.

Protein solubility and stability are as important in the evaluation of therapeutic proteins as they are in the study of proteins involved in disease pathology. The Food and Drug Administration’s (FDA) evaluation of a drug candidate includes two primary criteria: solubility and membrane permeability (17). In a recent overview of pharmaceutical drug screening techniques (18), three methods of solubility screening were identified: UV absorption, nephelometry and flow cytometry. These methods, developed for analysis of small molecules, are used to calculate current or potential solubility of a specific drug formulation and can be performed in a high throughput manner. However, they do not directly quantify the protein self-interactions that influence solubility and aggregation of protein therapeutic molecules.

Measurement of the second virial coefficient, performed using static light scattering (SLS) (3), consumes a significant amount of protein and time (multiple light scattering readings are necessary to calculate one B 22 value) and requires careful attention to solution clarity. In contrast, a second method for determining the second virial coefficient, known as self-interaction chromatography (SIC), eases each of these constraints of the SLS method, as noted by Tessier et al. (20). SIC initially requires chemical coupling of protein to a solid support followed by careful packing of the support in a small chromatography column. Once prepared, however, the column is stable and can be used repeatedly to measure B 22 values, making it more applicable to high-throughput techniques. Each measurement consists of injecting a microgram quantity of mobile protein and flowing it across the immobilized protein particles using an HPLC system. The retention time of the mobile protein is directly related to its interaction with immobilized column protein (19), thereby providing a direct measurement of how two proteins (bound and injected) interact with one another. Formulas relating the chromatographic retention time to B 22 values can be found in the paper of Tessier et al. (20). This technique has been successfully used with low-throughput screens (16 conditions) to measure the interactive effects of two formulation parameters on B 22 (21).

In this study we used self-interaction chromatography to rapidly measure the B 22 value of hen egg-white lysozyme in 81 solution formulations. The screen measures the pair-wise effects of nine different additives on the self-interaction of the lysozyme protein. The well known incomplete factorial experimental design technique, applied to crystallization screening by Carter and Carter (22), is used to ensure wide coverage of the search space with a reduced number of test conditions. The incomplete factorial design is accomplished by mapping the parameters of interest (pH, salts, additives, concentrations) onto an orthogonal array (23,24). Mapping parameters to an orthogonal array allows equal representation of parameter levels throughout the search space while reducing the 12,636 possible parameter combinations down to a reasonable screen size of 81 conditions. The B 22 values are measured to quantify the degree of lysozyme self-interaction in each of the formulations.

The results of the screen are first analyzed by manually examining the linear and quadratic trends of each formulation parameter on B 22 value. Parameters with the most statistically significant effect on protein-protein interaction (B 22 value) of lysozyme are identified within the screen. These parameters with strong influence on protein interactions (such as NaCl) are shown to have an effect on B 22 value regardless of the presence of other additives in varying formulations. This allows for the rapid identification of additives that could be used to modify protein–protein interactions.

While a manual examination of parameter effects can identify the strong correlations of single parameters, this initial analysis does not examine the effect of parameter interactions. To analyze the effect of additive combinations on protein–protein interaction we modeled the results of the B 22 value screen using an artificial neural network (ANN). Artificial neural networks have utility when the effect of specific combinations of a large number of variables/parameters, as well as each variable’s level (i.e. concentration of various chemicals), must be analyzed to determine the optimal combination to yield a desired outcome. The large number of potential additive combinations and their possible levels defines a search space that precludes manual inspection of the data as a reasonable method for finding the optimum parameters and parameter concentrations. Artificial neural networks are able to utilize an incomplete factorial subset of parameter combinations to determine correlations between discrete variable combinations and their respective levels. Neural network modeling has been used to predict novel crystallization conditions (25) and to confirm theoretical calculations of B 22 for very small molecules (26).

An artificial neural network is essentially a set of non-linear weighted functions which map input variables (screen parameters) into output variables (B 22 value) (27). The weights are initialized to random values resulting in a random mapping of the screen parameters onto B 22 values. The subsequent training process to determine optimal weights is performed by iteratively updating the weights to reduce error between the ANN output and observed values (B 22 screen). For each iteration, the ANN attempts to produce B 22 values closer to the observed B 22 values for the given input parameters. After the training process is complete, the neural network model is used to produce B 22 value predictions of lysozyme for all possible formulations of one or two additives.

ANN B 22 value predictions of lysozyme for 20 different formulation conditions were experimentally validated via SIC B 22 values of lysozyme dissolved in each condition. The chosen conditions included 10 from the most positive and negative B 22 value predictions combined with 10 spread throughout the range of predicted B 22 values. The results demonstrate that an artificial neural network, trained using an incomplete factorial additive screen, can accurately predict the second virial coefficient of the protein in previously untested formulations.

Finally, the ANN model is compared with a more traditional generalized linear model (GLM). Identical parameters used as inputs for the artificial neural network are included for consideration by the GLM. In the stepwise procedure the GLM uses an iterative process to determine which parameters significantly influence the second virial coefficient. For GLM analysis, the gradual process of ANN weight determination during each iteration is replaced by linear regression to calculate optimal linear model coefficients. The significance of each parameter is considered during each iteration, with new parameters added or removed based on a predetermined alpha value threshold. After significant parameters are identified by this stepwise process, linear model coefficients are calculated by linear regression. The GLM, like the trained ANN, can be used to predict the B 22 values of the protein for untested formulation conditions. Comparison of the GLM predictions to the ANN predictions indicates that, for this application, the ANN produces a more accurate and robust model than the GLM.

MATERIALS AND METHODS

Screen Conditions

Hen egg-white lysozyme was purchased from Calbiochem. The chromatography particles, Toyopearl AF-Formyl-650M (65 μm diameter particle, 0.1 μm diameter pore), were purchased from Tosoh Bioscience. Buffer formulation chemicals include glycerol, glycine, glutamic acid, mannitol, sodium citrate, sodium acetate and acetic acid; all purchased from Fisher Scientific. Additional formulation chemicals PEG4000, MPD and trehalose were purchased from Sigma-Aldrich under the Fluka brand name. Sigma-Aldrich was also the source for chromatography bead capping agent, ethanolamine, as well as the formulation chemicals Na2SO4, Na HEPES, HEPES acid and citric acid. The final two formulation chemicals, succinic acid and arginine, were purchased from Acros Organics.

Each of the 81 solution formulations contains buffer, salt and one or two co-solvents listed in Table I. The salt and each co-solvent can appear in low, medium or high concentration, which varies depending on the solubility of the individual salt or co-solvent added. If all combinations of two solvents and four salts at three individual levels of concentration are combined with four pH levels represented by the buffers, then the full factorial comprises 12,636 conditions. To reduce the number of conditions in which the B 22 of lysozyme is measured, the identity of each screen formulation was determined by mapping the parameters onto an orthogonal array design as described by Sloane (28). This mapping produces formulation targets in which each pair of variables is equally represented throughout the screen (thereby producing a balanced screen with respect to the influence of individual parameters).
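The balance property of an orthogonal array can be illustrated on a much smaller case. The sketch below is a toy example, not the array used for the 81-condition screen: it builds a standard 9-run array for four hypothetical three-level factors (function and variable names are illustrative) and checks that every pair of columns shows each combination of levels equally often.

```python
from collections import Counter
from itertools import combinations, product

LEVELS = 3  # e.g. low, medium and high concentration


def l9_array():
    """Build a 9-run orthogonal array for four 3-level factors.

    Two base factors (a, b) take every combination of levels; the other
    two columns are linear combinations of a and b modulo 3.
    """
    return [(a, b, (a + b) % LEVELS, (a + 2 * b) % LEVELS)
            for a, b in product(range(LEVELS), repeat=2)]


def is_pairwise_balanced(array):
    """Return True if every pair of columns shows each level pair equally often."""
    n_cols = len(array[0])
    for i, j in combinations(range(n_cols), 2):
        counts = Counter((row[i], row[j]) for row in array)
        all_pairs_present = len(counts) == LEVELS ** 2
        equal_frequency = len(set(counts.values())) == 1
        if not (all_pairs_present and equal_frequency):
            return False
    return True


design = l9_array()
full_factorial_size = LEVELS ** 4  # runs needed to test every combination
print(len(design), "runs instead of", full_factorial_size)
print("pairwise balanced:", is_pairwise_balanced(design))
```

In this toy case nine runs stand in for a full factorial of 81, which mirrors in miniature the reduction from 12,636 combinations to 81 screen conditions.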

Table I Formulation Parameters

The water source for formulations was pre-filtered at 18 MΩ by a Millipore MilliQ system, with trace sodium azide added to retard bacterial growth. Sodium and acid forms of 0.1 M buffers were mixed at their pKa in the presence of co-solvents (except in the case of the succinic buffer, which was adjusted to pH with NaOH). The pH of each solution was confirmed with a Corning 430 pH meter, and the final pH-adjusted solutions were filtered (0.22 μm syringe filter, Fisher Scientific) and stored at room temperature.

Protein Immobilization

Lysozyme (LYZ) was immobilized to AF-Formyl-650M beads as described by Valente et al. (29) with only slight modification. One ml of 1 M K2HPO4 at pH 7.0 was added to 350 μl of AF-Formyl-650M beads followed by centrifugation (bench-top, 30 s 7k rpm). The wash was performed an additional two times to remove excess packing buffer. LYZ (5 mg) was dissolved in the phosphate buffer and incubated with the beads. Fifteen mg of sodium cyanoborohydride was added to the bead mixture to activate the binding chemistry and mixed via rotary mixer at room temperature for 90 min. A 5 μl sample of the supernatant containing unbound LYZ was diluted with 45 μl of 0.1 M sodium acetate buffer pH 4.7 and assayed via a bicinchoninic acid (BCA) assay (Thermo Scientific). The beads were centrifuged and washed twice with phosphate buffer plus 5% (w/v) NaCl and twice with phosphate buffer sans NaCl to remove any remaining LYZ. After binding and washing, unreacted formyl groups were capped by adding 1 ml of 1 M ethanolamine at pH 8.0 and 10 mg sodium cyanoborohydride, followed by additional rotary mixing for 90 min. After this final step of immobilization the beads were washed twice with 1 mL of the sodium acetate buffer.

Self-Interaction Chromatography

Immobilized beads were packed into a micro-column consisting of Teflon FEP tubing (i.d. 0.03″, o.d. 1/16″) and were blocked at one end by a 2 μm stainless steel screen (Valco). Two 1.1 cm lengths of packed tubing (∼5 μl each) were cut from the packing end, diluted with 45 μl of sodium acetate buffer and assayed using the BCA assay (Pierce Biotechnology) to determine protein binding density on the column. The packing end of the column was then cut to 18 cm length and sealed with an additional screen. When not in use the column was stored with 0.1 M sodium acetate buffer pH 4.7 at 4°C. A second column, referred to as the dead column, was packed with beads that had been subjected to only the capping portion of the immobilization procedure. Acetone was used as a non-interacting void-volume marker and was dissolved in water at 3% (v/v) for injections. The protein injection solution consisted of 5 mg of lysozyme dissolved in 1 ml of each of the four separate 0.1 M buffers (Table I).

All chromatograms were generated using a high performance liquid chromatography (Shimadzu) system consisting of two pumps, an auto-sampler for sample injection, column oven, 280 nm UV detector and software for automatic retention-time calculation. Each screen formulation was run through the column at 60 μl/min and the auto-sampler was used to inject 1 μl of the 5 mg/ml LYZ solution in buffer identical to the formulation buffer applied to the column. Column temperature was maintained at 23°C. Injections were performed in triplicate over the same column, and B 22 values were measured for the entire 81-condition screen on two columns, with the final B 22 values averaged over both columns. Solutions with outlying (1.5*IQR) variance between two columns (N = 9) were measured on a third column. If the B 22 values of two columns were within the average standard deviation between two columns (1.7 B 22 units), the disagreeing measurement was excluded. Sample chromatograms shown in Fig. 1 demonstrate the influence of NaCl on retention time measured at peak elution. In the primary equation used to calculate B 22 values, Eq. 1, N A is Avogadro’s number, MW is the molecular weight of the protein and B HS is the excluded-volume (hard-sphere) contribution to the second virial coefficient.

$$B_{22} = \frac{{N_{\text{A}} }}{{{\text{MW}}^2 }}\left( {B_{{\text{HS}}} - \frac{{k'}}{{\phi \rho }}} \right)$$
(1)

The phase ratio, ϕ, is the ratio of the available surface area per unit of null volume and has been calculated for a variety of different chromatography particles (30). The density of protein immobilized on the column is ρ. In this study binding density varied from 17.5 to 22.4 mg/ml (measured by Pierce BCA assay). The variation in protein binding determines the magnitude with which variations in protein retention time affect B 22 value. The variable k′ is the chromatographic retention factor calculated from the protein retention time (t r) and acetone retention time (t 0) given by the equation:

$$k' = \frac{{t_{\text{r}} - t_0 }}{{t_0 }}$$
(2)

In this equation, Eq. 2, the acetone retention time (t 0) acts as a non-interacting marker to establish the relationship between non-interacting molecules with bound protein compared to interacting molecules with bound protein.
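As a worked illustration of Eqs. 1 and 2, the following minimal sketch converts a pair of retention times into a B 22 value. The function names, the unit conventions listed in the docstring and the numbers in the usage lines are assumptions made for illustration only; they are not taken from the measurements reported here.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def retention_factor(t_r, t_0):
    """Eq. 2: retention factor k' from protein (t_r) and acetone (t_0) times."""
    if t_0 <= 0:
        raise ValueError("t_0 must be positive")
    return (t_r - t_0) / t_0


def b22_from_retention(t_r, t_0, b_hs, phi, rho, mw):
    """Eq. 1: second virial coefficient in mol ml/g^2.

    Assumed units (chosen here so the result is dimensionally consistent):
      b_hs : excluded-volume term, ml per molecule
      phi  : phase ratio, accessible surface area (cm^2) per ml of mobile phase
      rho  : immobilized protein molecules per cm^2 of surface
      mw   : molecular weight, g/mol
    """
    k_prime = retention_factor(t_r, t_0)
    return (AVOGADRO / mw ** 2) * (b_hs - k_prime / (phi * rho))


# Hypothetical inputs, for illustration only.
example = b22_from_retention(t_r=12.0, t_0=10.0, b_hs=1.0e-19,
                             phi=1.0e5, rho=1.0e13, mw=14300.0)
print(f"B22 = {example:.2e} mol ml/g^2")
```

Because k′ enters with a negative sign, a longer protein retention time (stronger attraction to the immobilized protein) lowers the computed B 22, which is consistent with the sign convention described in the text.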

Fig. 1

Retention times for lysozyme in 5% NaCl and 0% NaCl in 0.1 M sodium acetate buffer demonstrate the effect of NaCl on lysozyme self-interaction. The retention time for 3% acetone in the same buffer with 5% NaCl provides a reference point for conversion of retention times to B 22 values.

The chromatograms shown in Fig. 1 hold additional importance as the method by which column integrity is verified throughout the screening process. The B 22 value of lysozyme in NaCl concentrations of 5% and 0% (w/v) was measured after every eight formulation conditions, consisting of three injections each, or every 24 protein injections. The column was expected to be fairly stable because protein was covalently bound to chromatography media and unbound active groups were rendered relatively inert by the capping process. Regular validation of the column ensured that the addition or loss of protein from the column did not significantly alter B 22 measurements throughout the screening process. The standard deviation of lysozyme B 22 value for NaCl conditions was only 1.1 B 22 units throughout the lifetime of the column (81 screen formulation conditions). This gave assurance that the chromatography column does not experience a significant change in activity due to the addition or subtraction of protein to the column. To ensure the dead volume of the column was not significantly altered from packing of column material, acetone retention time was also measured after every eight formulation conditions. At a fixed protein retention time the standard deviation of B 22 measurements due to variation in acetone retention time (including effects from column packing) was 0.8 B 22 units.

Static Light Scattering

The traditional static light scattering (SLS) experiment requires measurement of the scattered light intensity from a protein solution in excess of background as a function of protein concentration. The traditional SLS experiment was modified in two important ways in order to minimize both time and protein required for a single B 22 measurement (31). The first modification is the incorporation of a low volume (∼1 μL) scattering cell. The second modification is a configuration allowing the simultaneous measurement of scattering intensity and protein concentration. This is accomplished by using a bifurcated fiber to deliver both the incident laser beam for scattering and the incident UV beam for absorption (protein concentration) measurements. The advantage of this configuration is that the simultaneous measurement of light scattering intensity and protein concentration allows the determination of the second virial coefficient from a single injection of protein sample into a flow system. Typically, 5–10 μl of protein solution at 1–2 mg/ml protein concentration were required for a single B 22 measurement.

The intensity and concentration data were treated according to the SLS working equation (32):

$$\frac{{Kc}}{{R_{90} }} = \frac{1}{M} + 2B_{22} c$$
(3)

where K is an optical constant (cm² mol g⁻²) given by K = 4π²n₀²(dn/dc)²/(N A λ⁴), c is the protein concentration (g cm⁻³), R 90 is the Rayleigh factor (cm⁻¹) at angle 90°, M is the molecular weight of the protein (g mol⁻¹), B 22 is the second virial coefficient (mol ml g⁻²), dn/dc is the refractive index increment (cm³ g⁻¹), n₀ is the solvent refractive index, N A is Avogadro’s number (mol⁻¹), and λ is the wavelength (cm) of the incident light in a vacuum. According to Eq. 3, a plot of Kc/R 90 vs c (often called a single angle Zimm plot) linearizes the SLS data and B 22 is determined from the limiting slope.
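The linearization in Eq. 3 can be sketched as an ordinary least-squares line fit: the slope equals 2B 22 and the intercept equals 1/M. The code below is an illustrative sketch with hypothetical function names and synthetic, noise-free data; it is not the analysis software used for the SLS measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def b22_from_sls(concentrations, kc_over_r90):
    """Eq. 3: return (B22, M) from Kc/R90 measured at several concentrations.

    concentrations : protein concentration, g/cm^3
    kc_over_r90    : Kc/R90 at each concentration, mol/g
    """
    intercept, slope = fit_line(concentrations, kc_over_r90)
    return slope / 2.0, 1.0 / intercept


# Synthetic data generated from assumed values M = 14,300 g/mol and
# B22 = -3e-4 mol ml/g^2 (illustration only).
conc = [0.0005, 0.0010, 0.0015, 0.0020]
signal = [1.0 / 14300.0 + 2.0 * (-3.0e-4) * c for c in conc]
b22_fit, m_fit = b22_from_sls(conc, signal)
```

With real data the points scatter about the line, so the fit would normally be restricted to the dilute range where Eq. 3 holds.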

Artificial Neural Network (ANN)

Artificial neural network modeling was performed using the Java Object Oriented Neural Engine (JOONE) (33). Fig. 2a shows the overall network topology of the neural network used in this study including inputs, node configuration and B 22 value output. Each node represents a nonlinear transformation of inputs and is grouped into one of two layers according to distance from the input parameters. Regardless of position in the topology, the output of each node is calculated by two steps shown in Fig. 2b. First, a weighted sum of inputs to the node is calculated, z. The hyperbolic tangent is taken of this weighted sum to calculate node output. Each node in layer 1 takes as input all formulation parameters while each node in layer 2 takes all outputs from layer 1 as input. The final B 22 value output is calculated as a simple weighted sum of layer 2 outputs without a nonlinear transformation. This permits the range of output values to match the range of screened B 22 values rather than the (−1,1) range of the hyperbolic tangent function. Through calculation of each layer’s outputs in sequence this architecture is able to estimate the protein B 22 value for a given set of condition formulation parameters. The weights associated with each input node are the variables subject to training, thereby creating a network function that most accurately represents protein B 22 values over all given formulation parameters.

Fig. 2

The artificial neural network topology (a) uses parameters of a single formulation as input to each node in Layer 1. Each node’s output (b) is calculated by an activation function (tanh) whose input is a weighted sum of the node inputs. The outputs of nodes in Layer 1 are forwarded as the inputs to Layer 2. The outputs of nodes in Layer 2 are weighted and summed to produce a B 22 value prediction based on the input formulation.
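The forward calculation through such a network can be sketched in a few lines. This is an illustrative stand-in, not the JOONE implementation: the dictionary layout, the bias terms, the four-element input vector and all weight values are assumptions, and only the 3 × 2 tanh structure with a linear output follows the description of the network.

```python
import math


def layer_tanh(inputs, weights, biases):
    """For each node: weighted sum z of its inputs, then tanh(z)."""
    return [math.tanh(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
            for node_weights, bias in zip(weights, biases)]


def predict_b22(params, net):
    """Feed formulation parameters through two tanh layers and a linear output."""
    layer1 = layer_tanh(params, net["w1"], net["b1"])   # 3 nodes
    layer2 = layer_tanh(layer1, net["w2"], net["b2"])   # 2 nodes
    # Linear output: a weighted sum with no tanh, so the prediction is not
    # confined to the (-1, 1) range of the activation function.
    return sum(w * h for w, h in zip(net["w_out"], layer2)) + net["b_out"]


# Hypothetical weights for a network with four scaled input parameters.
example_net = {
    "w1": [[0.5, -0.2, 0.1, 0.0], [0.3, 0.8, -0.5, 0.2], [-0.7, 0.1, 0.4, 0.6]],
    "b1": [0.0, 0.1, -0.1],
    "w2": [[1.0, -1.0, 0.5], [0.2, 0.3, -0.8]],
    "b2": [0.0, 0.0],
    "w_out": [4.0, -3.0],
    "b_out": 1.0,
}
prediction = predict_b22([1.0, 0.0, 0.5, -1.0], example_net)
```

Training amounts to adjusting the entries of `example_net`; the forward calculation itself stays the same.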

This architecture (input vector, layers and output) is generally referred to as a feed-forward multilayer perceptron and is capable of modeling a continuous function to arbitrary accuracy given a sufficient number of nodes (27). Arbitrary accuracy is apparent if one considers a network topology containing one node for each formulation condition (N = 81). After training the weight parameters, the response of each node could represent the measured B 22 value for each specific formulation condition. Such an exact fit to the screen would result in over-fitting to the error inherent to the screen and would not provide a good, generalized response to formulation conditions outside those on which it was trained.

When the training algorithm adjusts the neural network weights so that the output over-fits a specific training set, the network is said to be over-trained. To address the problem of over-training of the neural network, we split the set of screen conditions into a training set (90%) and a validation set (10%) and used a technique called early-termination to determine when to stop the training procedure. During training, the weights are iteratively adjusted using the gradient descent algorithm of back-propagation. This algorithm assigns an error contribution and updates each weight based on the root mean square error (RMSE) between the neural network output and the measured protein second virial coefficient for each formulation condition in the training set. RMSE is also calculated between the neural network output and measured B 22 values in the validation set for each iteration. The validation set RMSE is not used to improve weight values, but instead acts as the basis for deciding when to terminate the training procedure. The network weights are fixed at the minimum validation RMSE over a set number of iterations (1,000). Validation set RMSE is also used as a measure of how well a network topology is able to generalize to untested formulation conditions. All network topologies from 1 × 1 to 6 × 6 nodes were evaluated by validation set RMSE to determine the 3 × 2 network topology used for this study. Further details about neural network algorithms and methods can be found in Bishop’s review (27) of the subject as well as in the JOONE software documentation (33).
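The early-termination logic can be sketched independently of the network itself. In the minimal example below a one-input linear model stands in for the neural network, and the gradient is taken on the mean squared error; both are simplifying assumptions, as are the function names, learning rate and data. What the sketch preserves is the bookkeeping: the validation RMSE never drives a weight update, and the weights kept are those giving the lowest validation RMSE over a fixed number of iterations.

```python
import math


def rmse(model, data):
    """Root mean square error of a linear model (w, b) on (x, y) pairs."""
    w, b = model
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in data) / len(data))


def gradient_step(model, data, learning_rate):
    """One gradient-descent update on the mean squared training error."""
    w, b = model
    n = len(data)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return (w - learning_rate * grad_w, b - learning_rate * grad_b)


def train_with_early_termination(train, valid, iterations=1000, learning_rate=0.05):
    """Train on `train`; return the weights with the lowest RMSE on `valid`."""
    model = (0.0, 0.0)
    best_model, best_rmse = model, rmse(model, valid)
    for _ in range(iterations):
        model = gradient_step(model, train, learning_rate)  # training set only
        valid_rmse = rmse(model, valid)                      # monitoring only
        if valid_rmse < best_rmse:
            best_model, best_rmse = model, valid_rmse
    return best_model, best_rmse


# Synthetic data from y = 2x + 1 (illustration only).
train_set = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0), (1.5, 4.0)]
valid_set = [(0.25, 1.5), (1.25, 3.5)]
best_model, best_valid_rmse = train_with_early_termination(train_set, valid_set)
```

On noisy data the validation RMSE typically reaches a minimum and then rises as the model begins to fit noise in the training set, which is the point at which the stored weights stop being replaced.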

Stepwise Generalized Linear Model (GLM)

The stepwise generalized linear model was performed using the JMP (34) statistical software package. The neural network inputs shown in Fig. 2 were also the parameters used for the GLM. The GLM algorithm requires explicit identification of interaction and high order terms for consideration. In addition to the neural network inputs, all pairwise interactions and square terms of the formulation screen were included for consideration. The stepwise algorithm was configured to include terms with a significance of alpha <0.20 with higher order and interaction terms restricted to only those whose lower order terms were also significant.
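The candidate terms offered to a stepwise procedure, and the restriction on higher-order terms, can be sketched as follows. This is not JMP code: the parameter names are hypothetical, and the sketch only enumerates main effects, square terms and pairwise interactions and checks the hierarchy rule; the significance test itself is left out.

```python
from itertools import combinations


def candidate_terms(parameters):
    """List main effects, square terms and pairwise interactions as tuples."""
    main_effects = [(p,) for p in parameters]
    squares = [(p, p) for p in parameters]
    interactions = list(combinations(parameters, 2))
    return main_effects + squares + interactions


def respects_hierarchy(term, selected):
    """A higher-order term is eligible only if its lower-order terms are selected."""
    if len(term) == 1:
        return True
    return all((p,) in selected for p in set(term))


# Hypothetical parameter names, for illustration only.
terms = candidate_terms(["pH", "NaCl", "MPD"])
selected = {("NaCl",)}
eligible = [t for t in terms if respects_hierarchy(t, selected)]
```

For n parameters this produces n main effects, n square terms and n(n − 1)/2 interactions, so the candidate list grows quadratically with the number of screen parameters.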

Prediction Verification

The second virial coefficients for all combinations of buffer, salt and a maximum of two excipients (12,636 conditions) were predicted by the trained ANN. Five conditions from the most positive B 22 values and five from the most negative B 22 values, as well as ten equally spaced throughout the range of predicted B 22 values, were selected for experimental confirmation. These 20 verification formulations (not included in the training process) were prepared, and the B 22 value of lysozyme in each was measured experimentally using the same method as for the original 81 screen conditions.

The question of whether 81 screen conditions are necessary or if a smaller subset would suffice was addressed by evaluating the ability of the neural network to predict the verification B 22 values while training on a reduced set of the initial screen. First a condition was randomly removed from the original training set of the neural network. The training process described above was repeated on the reduced training set with the same validation set size remaining constant (deemed a valid indication of the overall population). Then the neural network, trained on a reduced set of the original 81 condition screen, was used to calculate predictions for the verification B 22 values. Progressively reducing the sample size, followed by training and prediction, allows error as a function of sample size to be evaluated.

To determine how sample size affects neural network B 22 value predictions, the validation set was kept constant while iteratively removing a random condition from the training set. As there is no consensus in the literature as to how this type of analysis should be performed, a constant validation set was chosen as a good measure of the ability for the network to generalize. We were able to see how available training data influences accuracy by keeping the same validation set through repeated reductions in the training set size. After iterative removal of a random condition from the training set, the ANN is re-trained and then asked to predict the same 20 verification formulations chosen for the initial evaluation of ANN performance. This process was performed three times (with a different sequence of random removals each time) and the error for a given training set size was taken as the average of all three series of removals.
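The reduction procedure can be sketched as a generic loop. In the example below `train_fn` and `predict_fn` are hypothetical callables supplied by the caller (the training function is assumed to hold the fixed validation set internally), and the usage lines substitute a trivial mean predictor for the neural network so that the loop can be run on its own.

```python
import random


def learning_curve(train_set, verification, train_fn, predict_fn,
                   min_size, n_series=3, seed=0):
    """Average verification RMSE as a function of training-set size.

    One randomly chosen condition is removed per step; the whole series of
    removals is repeated n_series times and errors are averaged per size.
    """
    if min_size < 1:
        raise ValueError("min_size must be at least 1")
    rng = random.Random(seed)
    errors_by_size = {}
    for _ in range(n_series):
        remaining = list(train_set)
        while len(remaining) >= min_size:
            model = train_fn(remaining)
            squared = [(predict_fn(model, x) - y) ** 2 for x, y in verification]
            error = (sum(squared) / len(squared)) ** 0.5
            errors_by_size.setdefault(len(remaining), []).append(error)
            remaining.pop(rng.randrange(len(remaining)))  # drop one at random
    return {size: sum(errs) / len(errs) for size, errs in errors_by_size.items()}


# Stand-in model for illustration: predict the mean of the training responses.
def mean_trainer(data):
    return sum(y for _, y in data) / len(data)


def mean_predictor(model, x):
    return model


toy_train = [(i, float(i)) for i in range(10)]
toy_verification = [(0, 4.5)]
curve = learning_curve(toy_train, toy_verification,
                       mean_trainer, mean_predictor, min_size=5)
```

Each key of the returned dictionary is a training-set size and each value is the verification error averaged over the independent series of removals.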

RESULTS AND DISCUSSION

Confirmation by Static Light Scattering

A strong correlation (r = 0.97) between static light scattering and self-interaction chromatography was observed (Table II) for ten test conditions as has been previously reported by other laboratories (20,29). The primary differences between the two measurements were found at two of the most positive B 22 values. B 22 values in this range were expected to exhibit greater error since large positive B 22 values have been shown to correspond to very high levels of solubility (1,6). Thus a small difference in B 22 value represents a larger difference in solubility. Therefore, from a practical perspective, all high positive B 22 values represent regions of high protein solubility even though individual B 22 value errors are larger in this region.

Table II Comparison Between B 22 Values Measured by Static Light Scattering (SLS) and Self-Interaction Chromatography (SIC)

Screen Results

The B 22 value results of lysozyme for the 81 formulation conditions demonstrate some characteristics expected of the protein. For example, the mean B 22 of the screen is positive (1.1 × 10⁻⁴ mol ml/g²), which reflects the generally soluble nature of lysozyme. Additionally, a majority of the formulation conditions (55%) reside in the crystallization slot identified by George and Wilson (3), which is approximately [−8,−0.2] × 10⁻⁴ mol ml/g². This is indicative of the ease with which lysozyme crystals are formed. It is also of interest to note that the average standard deviation between measurements was 1.7 × 10⁻⁴ mol ml/g². This suggests that B 22 measurements produced using self-interaction chromatography are reproducible throughout a large range of different solution conditions.

Interesting trends are also observed when viewing the influence of a single parameter throughout the screen. Fig. 3 shows a graph of B 22 value versus three individual parameter concentrations (NaCl, MPD, Glycine). The variation between plotted B 22 values at a fixed concentration is due to the fact that other additives change with each condition. Error bars around each point indicate the error from measurement to measurement for each specific formulation. The increasing lysozyme self-interaction (decreasing B 22) with increased concentration of sodium chloride (Fig. 3a) is expected and has been demonstrated in other studies by both SIC and SLS (29). At the mid and high concentrations of NaCl, four of the five conditions with positive B 22 values contain MPD. This combined with the fact that MPD shows a trend (Fig. 3b) of decreasing lysozyme self-interaction (increasing B 22) with increasing concentration identifies MPD as a potential solubilizing agent for lysozyme. Quadratic relationships between additive concentration and B 22 value, such as that apparent in glycine (Fig. 3c) could also indicate an additive which might help stabilize protein self-interaction at a specific level. These single factor cross sections are useful for identifying individual additives which have a strong influence on B 22 value. However, the prediction capability of single variable linear and quadratic regression models is obviously not sufficient to capture the variability in protein-protein interactions caused by formulations with multiple additives.

Fig. 3

Response of B 22 value for lysozyme to a NaCl (F test; df = 1; p = 0.0006), b MPD (F test; df = 1; p = 0.001) and c glycine (F test; df = 2; p = 0.006) throughout all screen conditions containing the additive of interest. Error bars represent standard error between SIC measurements, whereas variability between points at a fixed additive concentration is attributed to changes in formulation parameters outside the additive of interest. Scatter along the abscissa is added to prevent overlapping of error bars.

Modeling and Prediction Results

The neural network trained on all conditions except nine (10%) reserved for validation produces a model which predicts the original screen with an RMSE of 1.7 × 10−4 mol ml/g2. This is equal to the observed average standard deviation between measured B 22 values and supports the notion that early termination of training based on the validation set error prevents over-training of the neural network to the screen results. Upon completion of training, the neural network was used to predict B 22 values for all possible variable combinations with one or two additives (12,636 formulation conditions). From these predictions, 20 were chosen for verification. These 20 conditions were chosen to represent the entire solubility range, including some of the most positive and most negative predicted B 22 values. These formulation conditions and their predicted second virial coefficients are shown in Table III. The formulations in Table III were prepared and their effect on lysozyme’s second virial coefficient was measured via SIC. A plot of measured B 22 values versus ANN predicted values is shown in Fig. 4. This figure demonstrates that the neural network is able to predict second virial coefficients with an RMSE of 2.6 × 10−4 mol ml/g2.

Fig. 4

ANN predicted B 22 value vs measured B 22 values of the 20 verification formulations (F test; df = 1; p < 0.0001; RMSE = 2.6 × 10−4 mol ml/g2).

Table III ANN Predictions of 20 Formulations Selected for Verification
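The early-termination strategy described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the network architecture or software used in the study: it assumes a one-hidden-layer tanh network trained by stochastic gradient descent, a hypothetical two-input toy response (toy_b22) with invented noise, and a hypothetical patience parameter. Only the 81-condition screen size and the nine-condition validation split follow the text.

```python
import math
import random

def forward(x, w1, w2):
    """Return hidden activations and the scalar output of a tanh network."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w1]
    return hidden, sum(w * h for w, h in zip(w2, hidden)) + w2[-1]

def rmse(data, w1, w2):
    """Root-mean-square error of the network over (inputs, target) pairs."""
    return math.sqrt(sum((forward(x, w1, w2)[1] - y) ** 2 for x, y in data) / len(data))

def train_early_stopping(train, valid, n_hidden=4, lr=0.05,
                         max_epochs=500, patience=50, seed=0):
    """Train by SGD and keep the weights with the lowest validation RMSE."""
    rng = random.Random(seed)
    n_in = len(train[0][0])
    # The last entry of each weight vector is the bias term.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    best_err, best_epoch = rmse(valid, w1, w2), 0
    best_w1, best_w2 = [r[:] for r in w1], w2[:]
    for epoch in range(1, max_epochs + 1):
        for x, y in train:
            hidden, out = forward(x, w1, w2)
            err = out - y
            for j in range(n_hidden):
                # Back-propagate through tanh before updating the output weight.
                grad_h = err * w2[j] * (1.0 - hidden[j] ** 2)
                w2[j] -= lr * err * hidden[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                w1[j][-1] -= lr * grad_h
            w2[-1] -= lr * err
        val_err = rmse(valid, w1, w2)
        if val_err < best_err:
            best_err, best_epoch = val_err, epoch
            best_w1, best_w2 = [r[:] for r in w1], w2[:]
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_w1, best_w2, best_err, best_epoch

# Hypothetical screen: 81 conditions, two scaled additive concentrations each.
data_rng = random.Random(1)

def toy_b22(x):
    return 2.0 * x[0] - 3.0 * x[1] + x[0] * x[1]

screen = []
for _ in range(81):
    x = (data_rng.random(), data_rng.random())
    screen.append((x, toy_b22(x) + data_rng.gauss(0.0, 0.1)))

valid, train = screen[:9], screen[9:]  # nine conditions reserved for validation
w1, w2, val_rmse, stop_epoch = train_early_stopping(train, valid)
print("best validation RMSE:", val_rmse, "at epoch", stop_epoch)
```

Returning the weights from the best validation epoch, rather than the final epoch, is what keeps the model from fitting noise in the training conditions.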

Screen sample size affects how accurately the ANN model can predict untested formulation conditions. Fig. 5 shows the relationship between screen sample size and the prediction error of the ANN. As the size of the training set decreases, the prediction error of the ANN increases. However, a B 22 value prediction error of 3 B 22 units is still attainable with a training set size of 45 screen conditions. Adding formulation conditions to the training set provides a diminishing improvement in the prediction error. It is interesting to note in Fig. 5 that the error curve does not completely flatten at a screen sample size of 81 formulations. Extrapolating this curve suggests that a screen of over 100 formulation conditions could permit ANN B 22 predictions with an error close to 1.7 × 10−4 mol ml/g2, the variability between B 22 values measured on separate columns.

Fig. 5

ANN RMSE vs sample size. Incremental reduction in sample size shows an increase in error for artificial neural network predictions of the 20 verification formulations. Dashed line indicates the variability of B 22 values measured by SIC on separate columns (1.7 × 10−4 mol ml/g2).
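The sample-size analysis in Fig. 5 follows a general learning-curve procedure: refit the model on progressively smaller random subsets of the screen and score each fit on the same fixed verification set. The sketch below illustrates that procedure only. It substitutes a one-variable least-squares line for the ANN, uses invented data, and the function names and the repeats parameter are hypothetical.

```python
import math
import random

def fit_line(points):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

def rmse(model, points):
    """Root-mean-square prediction error of a fitted line."""
    slope, intercept = model
    return math.sqrt(sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points))

def learning_curve(screen, verification, sizes, repeats=200, seed=0):
    """Mean verification RMSE for models fitted to random subsets of each size."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        errors = [rmse(fit_line(rng.sample(screen, size)), verification)
                  for _ in range(repeats)]
        curve[size] = sum(errors) / repeats
    return curve

# Invented data: 81 "screen" points and 20 "verification" points.
data_rng = random.Random(2)

def noisy_point():
    x = data_rng.random()
    return x, -3.0 * x + 2.0 + data_rng.gauss(0.0, 1.0)

screen = [noisy_point() for _ in range(81)]
verification = [noisy_point() for _ in range(20)]

curve = learning_curve(screen, verification, sizes=[9, 27, 45, 63, 81])
for size, err in sorted(curve.items()):
    print(size, round(err, 3))
```

Averaging over many random subsets at each size reduces the dependence of the curve on which particular conditions happen to be dropped.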

The standard generalized linear model provides a comparison of the ANN with a standard linear regression technique used for data analysis and prediction. The terms of the GLM were determined by considering all single terms, interaction terms and square terms and incrementally adding the most significant remaining parameter until no remaining parameter had a significance level below alpha = 0.20. The GLM parameters and their significance levels generated by this method are listed in Table IV. This table demonstrates one benefit of the GLM over the ANN: incremental analysis of each parameter produces a list of factors and their p-values, which helps identify specific formulation parameters that could increase solubility. However, when predicting the second virial coefficient of protein in previously unformulated conditions, the GLM does not perform as well as the ANN. The plot in Fig. 6 shows the same 20 measured B 22 values used for ANN verification versus the GLM predictions. Although both sets of predictions are statistically significant (F test; df = 1; p < 0.0001), the GLM has an RMSE of 3.3 × 10−4 mol ml/g2, which implies the ANN is approximately 25% more accurate than the GLM. However, both techniques are useful for formulation prediction based on a small subset of conditions.

Fig. 6

GLM predicted B 22 values vs measured B 22 values of the 20 verification formulations (F test; df = 1; p < 0.0001; RMSE = 3.3 × 10−4 mol ml/g2).

Table IV Additives with Statistically Significant Influence as Determined by Stepwise GLM
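The ANN and GLM comparison rests on the two verification RMSE values quoted above. As a small worked example, the Python below shows how an RMSE is computed from paired predictions and measurements, and how the relative difference depends on which model is taken as the baseline. Only the values 2.6 and 3.3 come from the text; the function and variable names are illustrative.

```python
import math

def rmse(predicted, measured):
    """Root-mean-square error between paired predictions and measurements."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured))

# Verification RMSE values quoted in the text, in units of 1e-4 mol ml/g^2.
ann_rmse = 2.6
glm_rmse = 3.3

ann_reduction = (glm_rmse - ann_rmse) / glm_rmse  # difference relative to the GLM error
glm_excess = (glm_rmse - ann_rmse) / ann_rmse     # difference relative to the ANN error

print(f"ANN error is {ann_reduction:.0%} lower than the GLM error")   # 21%
print(f"GLM error is {glm_excess:.0%} higher than the ANN error")     # 27%
```

The two ratios, about 21% and 27%, bracket the approximately 25% figure quoted above.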

Limitation

A limitation of this screen and formulation prediction technique is its inability to predict formulation conditions with parameter concentrations well outside the screened range. The inability of statistical models to extrapolate outside their original input range is well known. This implies that the ranges of pH and salt/additive concentrations must be chosen based on an estimate of the effective range for each parameter. For example, the pH range of interest might be a region defined relative to the expected pI of the protein. It is important to note that once parameter ranges are determined, the screen and resulting statistical models will not be able to predict the B 22 value of formulations with parameters significantly outside these ranges. However, this does not diminish the fact that the statistical models can accurately predict the B 22 value of a large number of novel formulation conditions based on parameter combinations not measured in the original screen.

Conclusions

As hypothesized in previous publications (9,21,25), high-throughput screening of second virial coefficients shows promise for evaluating the interactions of proteins in solution. We have demonstrated that an incomplete factorial screen combined with a neural network model can be used to accurately predict second virial coefficients for untested formulations. A B 22 value screen of only 81 formulation conditions was used to predict the B 22 values for 12,636 possible formulations with an RMSE of 2.6 × 10−4 mol ml/g2 on the 20 verification formulations. These preliminary studies suggest that a high-throughput chromatographic SIC system with increased automation may enhance and accelerate determination of the optimum conditions that improve the physical solubility/stability of drug formulations. They also suggest that this same strategy may be useful for predicting the formulation adjustments required for optimized protein expression and/or crystallization.

The strong correlation between SIC and SLS measurements of B 22 value provides further evidence that SIC may be useful as a replacement for the SLS method. The use of SIC in lieu of SLS offers several significant advantages: (1) SIC requires less protein per experiment, (2) SIC is easily performed with aqueous or membrane proteins whereas SLS is difficult or impossible to use with membrane proteins, (3) SIC is much faster than SLS, (4) SIC is useful with a wider variety of additives because some additives interfere with the SLS signal, and (5) SIC can be miniaturized and performed in a high-throughput manner, thereby enabling studies on a large sample set (i.e. incomplete factorial).

The current time required to run self-interaction chromatography in triplicate is approximately 30 min. While 30 min per experiment by SIC is much faster than previous SLS methods (20), the use of B 22 values for these applications would benefit significantly from increased throughput via parallelization, robotic automation and integration of analysis techniques into a single platform.