An essential part of the OCHEM platform is the modeling framework. Its main purpose is to provide facilities for the development of predictive computational models for physicochemical and biological properties of compounds. The framework is integrated with the database of experimental data and includes all the steps required to build a computational model: data preparation, calculation and filtering of molecular descriptors, application of machine learning methods and analysis of model performance. This section gives an overview of these features and of the steps required to build a computational model in OCHEM.
Concisely, the main features of the modeling framework within the OCHEM include:
Support of regression and classification models
Calculation of various molecular descriptors ranging from molecular fragments to quantum chemical descriptors. Both whole-molecule and per-atom descriptors are supported.
Tracking of each compound from the training and validation sets
Basic and detailed model statistics and evaluation of model performance on training and validation sets
Assessment of applicability domain of the models and their prediction accuracy
Pre-filtering of descriptors: manual selection, decorrelation filter, principal component analysis (PCA) based selection
Various machine learning methods including both linear and non-linear approaches
N-fold cross-validation and bagging validation of models
Multi-learning: models can predict several properties simultaneously
Combining data with different conditions of measurements and the data in different measurement units
Distribution of calculations to an internal cluster of Linux and Mac computers
Scalability and extensibility for new descriptors and machine learning methods
The steps of a typical QSAR study in the OCHEM system and the corresponding features are summarized in a diagram in Fig. 5.
To create a new QSAR model in OCHEM the user must prepare the training and (optional) validation sets, configure the preprocessing of molecules (standardization and 3D optimization), choose and configure the molecular descriptors and the machine learning method, select the validation protocol (N-fold cross-validation or bagging) and, when the model has been calculated, review the predictive statistics and save or discard the model.
The following sections describe each of the aforementioned steps in detail.
Training and validation sets, machine learning method and validation
Training and validation datasets. One of the most important steps in model development is the preparation of input data, i.e., a training set that contains experimentally measured values of the predicted property.
The property that will be predicted by the model is identified automatically based on the contents of the training set. If the training set contains multiple properties, they will be predicted by the model simultaneously. This allows knowledge about different (but related) properties to be combined into a single model, so-called multi-learning. Multi-learning was shown to significantly increase the overall performance in comparison to models developed for each property separately.
The OCHEM system allows a user to combine heterogeneous data reported in different units of measurements into a single unit set. For every property, the user must select a unit; all the input data will be automatically converted to this unit and, therefore, the final model will give predictions in this unit.
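The unit harmonization described above can be illustrated with a minimal sketch. The conversion table, the function names and the choice of concentration units are assumptions made for illustration only; they do not describe the OCHEM implementation.

```python
# Hypothetical conversion factors from each unit to a common base unit (mol/L).
TO_MOL_PER_L = {"mol/L": 1.0, "mmol/L": 1e-3, "umol/L": 1e-6}

def convert(value, from_unit, to_unit, table=TO_MOL_PER_L):
    """Convert a value between two units that share the same base unit."""
    return value * table[from_unit] / table[to_unit]

def harmonize(records, target_unit):
    """records: list of (value, unit) pairs; returns all values in target_unit."""
    return [convert(value, unit, target_unit) for value, unit in records]
```

For example, `harmonize([(1.0, "mol/L"), (500.0, "mmol/L")], "mmol/L")` expresses both measurements in mmol/L, so a model trained on the result also predicts in mmol/L.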
Machine learning method. After assigning the training and validation sets (see Fig. 6), the user selects a machine learning method and a validation protocol. Currently OCHEM supports linear and Kernel Ridge Regression (KRR), ASsociative Neural Networks (ASNN), Kernel Partial Least Squares (KPLS), a correction-based LogP-LIBRARY model, Support Vector Machines (SVM), Fast Stagewise Multivariate Linear Regression (FSMLR) and k Nearest Neighbors (kNN).
Validation of models. OCHEM offers two ways to validate a model: N-fold cross-validation and bagging. We recommend always using one of the two validation options to avoid the common pitfall of model over-fitting, which results in misleading predictions [27, 28]. The validation procedure is also used to calibrate the estimation of prediction accuracy.
If cross-validation is chosen, the whole process of model development, including the filtering of descriptors, is repeated N (by default 5) times with a different split of the initial set into training and validation sets. Only the respective training set is used in each step for model development.
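The protocol above can be sketched as follows. This is a generic illustration, not the OCHEM code: the function names are hypothetical, and `fit` stands for any user-supplied routine that performs descriptor filtering and training on the training part only and returns a prediction function.

```python
import random

def n_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each item is validated exactly once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid

def cross_validate(X, y, fit, n_folds=5, seed=0):
    """fit(X_train, y_train) must return a predict(row) function.
    Descriptor filtering belongs inside fit, so it never sees validation data."""
    preds = [None] * len(y)
    for train, valid in n_fold_splits(len(y), n_folds, seed):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in valid:
            preds[i] = predict(X[i])
    return preds
```

Keeping the filtering step inside `fit` is the point of the design: a filter applied to the whole set before splitting would leak information from the validation compounds into the model.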
In the case of bagging validation, the system generates N (by default 100) training sets and builds N models, based on these sets. The N sets are generated from the initial training set by random sampling with replacement. The compounds not included in the training set are used to validate the performance of this model; the final prediction for each compound is the mean prediction over all the models where this compound was in the validation set.
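A minimal sketch of this bagging scheme is given below, under the same assumption as before that `fit` is a user-supplied training routine returning a prediction function; the names are hypothetical.

```python
import random

def bagging_validate(X, y, fit, n_models=100, seed=0):
    """Return the mean out-of-bag prediction per compound
    (None for a compound that was never left out of a bag)."""
    rng = random.Random(seed)
    n = len(y)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_models):
        bag = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        in_bag = set(bag)
        predict = fit([X[i] for i in bag], [y[i] for i in bag])
        for i in range(n):
            if i not in in_bag:  # compound validates this particular model
                sums[i] += predict(X[i])
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Each compound is predicted only by the models that did not see it during training, which is what makes the averaged value a validation result rather than a fitted one.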
No matter what validation method is chosen, duplicated molecules (regardless of stereochemistry) are used either in training or validation sets but never in both simultaneously, which ensures the proper assessment of the model predictive ability.
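One way to honor this constraint is to split by structure rather than by record. The sketch below assumes that a stereochemistry-free structure key has already been computed for every record; how such a key is derived is outside the scope of the example, and the function name is hypothetical.

```python
from collections import defaultdict

def grouped_folds(keys, n_folds=5):
    """Assign records to folds so that records sharing a key land in the same fold.
    keys: one stereochemistry-free structure key per record (assumed precomputed)."""
    groups = defaultdict(list)
    for i, key in enumerate(keys):
        groups[key].append(i)
    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, always into the currently smallest fold.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds
```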
Before chemical compounds are passed to the further steps of the modeling, they undergo a user-specified preprocessing procedure. Currently OCHEM standardizes different forms of the same molecule, i.e., mesomers and tautomers, by replacing them with a single unique representation. Since most descriptor generating software cannot work with salts (when two or more disconnected parts are present) all salts are automatically replaced with the largest component of the compound. For the same reason charged compounds are neutralized by adding/deleting hydrogens.
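The salt-stripping step can be illustrated on SMILES strings, where disconnected components are separated by a dot. This is a deliberately crude sketch: the atom-counting pattern covers only bracket atoms and the common organic subset, it does not neutralize charges or standardize tautomers, and it is not the procedure used by OCHEM.

```python
import re

# Crude atom matcher: bracket atoms, two-letter halogens, then one-letter atoms.
_ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def largest_component(smiles):
    """Return the dot-separated SMILES component with the most atoms."""
    parts = smiles.split(".")
    return max(parts, key=lambda part: len(_ATOM.findall(part)))
```

For sodium acetate written as `CC(=O)[O-].[Na+]`, the function keeps `CC(=O)[O-]`; the remaining charge would still have to be neutralized in a separate step.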
Furthermore, preprocessing steps are performed only during the modeling stage. In the database, all compounds are stored “as is”, i.e., original representations are kept as they were uploaded or provided by users.
The descriptors available in OCHEM are grouped by the software package that provides them: ADRIANA.Code, CDK descriptors, ChemAxon descriptors, Chirality codes [34–38], Dragon descriptors, E-State indices, ETM descriptors [41, 42], GSFrag molecular fragments, inductive descriptors, ISIDA molecular fragments, quantum chemical MOPAC 7.1 descriptors, MERA and MerSy descriptors [47–50], MolPrint 2D descriptors, ShapeSignatures, and logP and aqueous solubility calculated with the ALOGPS program. The descriptor selection screen is shown in Fig. 7.
The following section briefly describes available descriptors. If at least one descriptor in the block requires 3D structures, the block is marked as 3D and as 2D otherwise.
ADRIANA.Code (3D) comprises a unique combination of methods for calculating molecular descriptors on a sound geometric and physicochemical basis [31, 53]. Thus, they are all amenable to interpretation and help in understanding the influence of various structural and physicochemical effects on the property under investigation. ADRIANA.Code performs calculations with user-supplied 3D structures or applies built-in methods to generate 3D structures based on rapid empirical models. In addition, it contains a hierarchy of increasing levels of sophistication in representing chemical compounds from the constitution through the 3D structure to the surface of a molecule. At each level, a wide range of physicochemical effects can be included in the molecular descriptors.
ALOGPS descriptors (2D) predict logP and the aqueous solubility of chemicals. This program was recently top-ranked among 18 competitors for logP prediction using >96,000 in-house molecules from Pfizer and Nycomed. It was also reported to be “the best available ‘off-the-shelf package’ for intrinsic aqueous solubility prediction” at F. Hoffmann-La Roche. ALOGPS does not have additional configuration options.
CDK descriptors (3D) are calculated by the DescriptorsEngine, which is a part of the Chemistry Development Kit (CDK). The CDK descriptors include 204 molecular descriptors of 5 types: topological, geometrical, constitutional, electronic and hybrid descriptors. The CDK also provides local atomic and bond-based descriptors, which will be included in OCHEM in the future.
ChemAxon descriptors (also known as calculator plugins) calculate a range of physicochemical and life-science related properties from chemical structures and are developed by ChemAxon. These calculators are usually part of the Marvin and JChem cheminformatics platforms. The descriptors are divided into 7 groups: elemental analysis, charge, geometry, partitioning, protonation, isomers and “other” descriptors, a collective group for all heterogeneous descriptors that do not directly fall under any of the previous categories. For some descriptors, such as the distribution coefficient (logD), the pH value is essential for the calculation. By default, the descriptor value is calculated over the pH range from 0 to 14 at intervals of 1 pH unit. However, it is possible to explicitly designate a specific pH value or range of pH values.
Chirality codes (3D) are molecular descriptors that represent chirality using a spectrum-like, fixed-length code, and include information on geometry and atomic properties. Conformation-independent chirality codes (CICC) are derived from the configuration of chiral centers, properties of the atoms in their neighborhoods, and bond lengths. Conformation-dependent chirality codes (CDCC) characterize the chirality of a 3D structure considered as a rigid set of points (atoms) with properties (atomic properties), connected by bonds. Physicochemical atomic stereo-descriptors (PAS) were implemented to represent the chirality of an atomic chiral center on the basis of empirical physicochemical properties of its ligands: the ligands are ranked according to a specific property, and the chiral center takes an “S/R-like” descriptor relative to that property. The procedure is performed for a series of properties, yielding a chirality profile. All three types of chirality descriptors can distinguish between enantiomers. Examples of applications include the prediction of chromatographic elution order, the prediction of enantioselectivity in chemical reactions [34, 37], and the representation of metabolic reactions catalyzed by racemases and epimerases of E.C. subclass 5.1.
DRAGON (v. 5.4) descriptors (3D) include more than 1,600 descriptors organized into 20 different sub-types that can be selected separately. DRAGON is an application for the calculation of molecular descriptors developed by the Milano Chemometrics and QSAR Research Group. The DRAGON descriptors include not only the simplest atom type, functional group and fragment counts, but also several topological and geometrical descriptors; molecular properties such as logP, molar refractivity, number of rotatable bonds, H-donors, H-acceptors, and topological polar surface area (TPSA) are also calculated by using well-known published models.
E-State indices (2D) are separated by atom and bond type. In addition to indices it is also possible to select E-state counts, which correspond to counts of atom or bond types according to the respective indices. In some studies E-state counts were reported to produce better models than E-state indices.
ETM (Electronic-Topological Method [41, 42]) descriptors (3D) are based on the comparison of 3D structures of molecules. The molecules are represented as matrices where diagonal elements are atom charges and non-diagonal elements are distances between them. The molecules are compared with a template molecule and common fragments become ETM descriptors (i.e., 3D pharmacophores). There are usually two templates representing the most active and inactive molecules.
GSFRAG descriptors (2D) are the occurrence numbers of certain special fragments containing 2–10 non-hydrogen atoms; GSFRAG-L is an extension of GSFRAG that considers fragments that contain a labeled vertex, allowing one to capture the effect of heteroatoms. It was shown that the occurrence numbers of these fragments produce a unique code of a chemical structure for wide sets of compounds.
Inductive descriptors (3D) have been derived from the LFER (Linear Free Energy Relationships)-based equations for inductive and steric substituent constants and can be computed for bound atoms, groups and molecules using intra-molecular distances, atomic electronegativities and covalent radii.
ISIDA descriptors (2D) include two types of fragments: sequences and atom-centered fragments, each of which explicitly includes atoms, bonds or both. In the current version of OCHEM only sequences of atoms and bonds are used. The user can specify their minimum and maximum length.
MERA descriptors (3D) are calculated using the non-parametrical 3D MERA algorithm and include (a) geometrical MERA descriptors (linear and quadratic geometry descriptors, descriptors related to molecular volume, proportions of a molecule, ratios of molecular sizes, quantitative characteristics of molecular symmetry, dissymmetry, chirality), (b) energy characteristics (inter- and intra-molecular Van der Waals and Coulomb energies; decomposition of intermolecular energies) and (c) physicochemical characteristics (probabilities of association, heat capacity, entropy, pKa) [47–50, 59].
MerSy (MERA Symmetry, 3D) descriptors are calculated using 3D representation of molecules in the framework of the MERA algorithm (see above) and include the quantitative estimations of molecular symmetry with respect to symmetry axes from C2 to C6 and the inversion-rotational axis from S1 to S6 in the space of principal rotational invariants about each orthogonal component. Additionally, the molecular chirality is quantitatively evaluated in agreement with the negative criterion of chirality (the absence of inversion-rotational axes in the molecular point group) [47–50, 59].
MolPrint descriptors (2D) are circular fingerprints based on Sybyl mol2 atom types. They are very efficient and can be easily calculated even for libraries comprising millions of molecules. Circular fingerprints capture a lot of information that relates molecular structure to its bioactivity. It has been shown in large-scale comparative virtual screening studies that MolPrint descriptors often outperform other fingerprinting algorithms in enrichment [61, 62]. Given the binary nature of MolPrint 2D fingerprints, they are ideally suited for virtual screening and clustering of molecules, as well as for the generation of numerical bioactivity models that are able to accommodate the presence/absence nature of the descriptor.
MOPAC descriptors (3D) include whole-molecule and atom-type descriptors. The latter can be used to model local (atom-dependent) properties of molecules, such as pKa or the site of metabolism.
Shape Signatures (3D) can be viewed as a very compact descriptor that encodes molecular shape and electrostatics in a single entity. It reduces the dimensionality of 3D molecular shape and surface charge by representing complex 3D molecules as simple histograms. These signatures lend themselves to rapid comparison with each other for virtual screening of large chemical databases. Shape Signatures can be used by themselves or in conjunction with currently available computational modeling approaches commonly employed in drug discovery and predictive toxicology, such as traditional virtual screening, descriptor-based (e.g., QSAR) models, ligand-receptor docking, and structure-based drug design [52, 64–67].
The set of descriptors can be easily extended by incorporating new modules that could also be provided by external contributors. It is possible to use the output of previously created models as input for a new model: this option is sometimes referred to as a feature net.
For the descriptors that require 3D structures of molecules, users can either rely on 3D structures generated by CORINA or retrieve molecules optimized by MOPAC and the AM1 Hamiltonian calculated by the web services implemented within the CADASTER project (http://mopac.cadaster.eu). If additional parameters are required for the calculation of descriptors, e.g., pKa value for ChemAxon descriptors, they are specified explicitly in the interface of the corresponding descriptor block. In this case, the parameters are saved with descriptors and are then used exactly in the same form for new molecular sets during the model application.
ATOMIC descriptors (3D) are defined for a particular atom (active center) of a molecule. Atomic descriptors can be used to describe reactive centers (e.g., for pKa calculation, prediction of reactivity). These descriptors are applicable for the prediction of particular “local” properties of molecules that depend on the specified active center (currently, only macro pKa constants are supported). The currently available atomic descriptors are based on MOPAC descriptors and E-State indices.
New developments include descriptors that characterize ligand–protein interactions. These descriptors will make it possible to use information about the 3D structure of proteins in modeling. For example, a number of docking-derived descriptors based on the Vina software were added recently and are currently in use in an ongoing study on the prediction of CYP450 inhibitors. There is also a plan to extend OCHEM with other types of descriptors, e.g., those used in the COMBINE method.
Users can export most of the descriptor values (with an exception of commercial descriptors) for offline model development. Descriptor values can be exported as an Excel file or as a simple text file in a comma-separated values (CSV) format.
Before descriptors are passed to the machine-learning method, it is possible to filter part of them out by several criteria to avoid redundancy. Currently, OCHEM supports the following filtering options for descriptors: the number of unique descriptor values, the pairwise correlation of descriptors and the variance of principal components, obtained from the principal component analysis (PCA). Thus, it is possible to exclude highly correlated descriptors or descriptors that do not pass user specified thresholds.
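Two of the three filters above, the unique-value filter and the pairwise decorrelation filter, can be sketched in a few lines. The thresholds and names are illustrative assumptions, and the PCA-based selection is not covered.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def filter_descriptors(columns, min_unique=2, max_abs_r=0.95):
    """columns: dict mapping descriptor name to its list of values.
    Returns the names kept, in input order. min_unique must be at least 2,
    otherwise constant columns would reach the correlation step."""
    kept = []
    for name, values in columns.items():
        if len(set(values)) < min_unique:
            continue  # too few distinct values to be informative
        if any(abs(pearson(values, columns[k])) > max_abs_r for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept
```

The result depends on the order of the columns, because the first descriptor of a correlated pair is the one that is kept; this is a property of the greedy sketch, not a statement about how OCHEM resolves ties.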
Conditions of experiments
A unique feature of OCHEM is the possibility to use the conditions of experiments in modeling. Usually, chemical properties and biological activities depend on a number of conditions under which the experiment was carried out. Exemplary conditions are temperature, pressure, pH, measurement method, etc. OCHEM allows using these conditions in the modeling process as descriptors and as such permits combining data measured under different conditions into one modeling set. For example, boiling point data measured under different pressures can be combined into a single training set and used to develop a computational model. Another example is a combination of logP values measured in different buffers, e.g., pure water and 30% methanol.
The obligatory conditions of the experiments are selected in the same dialog as the molecular descriptors. For every selected condition, the user must provide (a) the default value that will be used for the records where the condition has not been specified, and (b) the unit to convert all the values to.
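How a condition becomes an additional descriptor column can be sketched as follows. The data layout, the function name and the unit converter passed in by the caller are hypothetical and serve only to illustrate the default-value and unit-conversion rules stated above.

```python
def add_condition(rows, records, name, default, to_target_unit):
    """Append one condition column to each descriptor row.
    rows: descriptor rows (lists of numbers), one per record.
    records: per-record dicts mapping condition name to a (value, unit) pair.
    default: value used when the condition is missing, already in the target unit.
    to_target_unit: function (value, unit) -> value in the user-selected unit."""
    out = []
    for row, rec in zip(rows, records):
        if name in rec:
            value = to_target_unit(*rec[name])
        else:
            value = default
        out.append(row + [value])
    return out
```

For boiling point data, the pressure of each measurement would be appended in this way, with standard atmospheric pressure as a plausible default for records that do not report it.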
Configuration of the machine learning method
There are a number of configuration options that are specific for every particular machine learning method. These options are configured in separate dialog windows. Here, we briefly provide an overview of the methods and their parameters.
k Nearest Neighbors (kNN) predicts the property using the average property value of those k compounds from the training set that are nearest (in the descriptor space) to the target compound. The configurable options are the distance metric (Euclidean distance or the Pearson correlation coefficient) and the number of neighbors. By default the number of neighbors is determined automatically by the method itself.
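A minimal sketch of the kNN prediction with the Euclidean metric and a fixed k follows; the automatic selection of k mentioned above is not reproduced, and the function name is hypothetical.

```python
from math import sqrt

def knn_predict(X_train, y_train, x, k=3):
    """Average the property of the k training compounds closest to x."""
    dist = [sqrt(sum((a - b) ** 2 for a, b in zip(row, x))) for row in X_train]
    nearest = sorted(range(len(X_train)), key=dist.__getitem__)[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)
```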
ASsociative Neural Network (ASNN). This method uses the correlation between ensemble responses (each molecule is represented in the space of neural network models as a vector of predictions of neural network models) as a measure of distance among the analyzed cases for the nearest neighbor technique [22, 72]. Thus ASNN performs kNN in the space of ensemble predictions. This provides an improved prediction by the bias correction of the neural network ensemble. The configurable options are: the number of neurons in the hidden layer, the number of iterations, the size of the model ensemble and the method of neural network training.
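The idea of a nearest-neighbor bias correction in the space of ensemble predictions can be sketched as below. The sketch assumes that the ensemble predictions are already available and that the correction is the mean residual of the k most correlated training compounds; it is a simplified reading of the description above, not the ASNN implementation, and all names are hypothetical.

```python
from math import sqrt

def _pearson(a, b):
    """Pearson correlation of two equally long, non-constant lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def asnn_correct(z_train, y_train, z_new, k=2):
    """z_train[j] and z_new are vectors of per-model predictions for one compound.
    Returns the ensemble mean for the new compound plus the mean residual
    of its k nearest training compounds (nearest = highest correlation)."""
    ranked = sorted(range(len(z_train)),
                    key=lambda j: _pearson(z_train[j], z_new),
                    reverse=True)[:k]
    bias = sum(y_train[j] - sum(z_train[j]) / len(z_train[j]) for j in ranked) / len(ranked)
    return sum(z_new) / len(z_new) + bias
```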
Fast Stagewise Multivariate Linear Regression (FSMLR) is a procedure for stagewise building of linear regression models by means of greedy descriptor selection. It can be viewed as a special case of the additive regression procedure (regression boosting) specially designed to be compatible with the three-set approach based on the use of three different sets for learning: training set, internal validation set and external test set. The main configurable parameters are: (1) shrinkage: lower values require a larger number of iterations but may provide higher generalization performance, and (2) the relative size of the internal validation set used for stopping the descriptor selection procedure.
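The roles of shrinkage and of the internal validation set can be illustrated with a generic greedy stagewise regression. This is not the FSMLR algorithm itself; it is a sketch under the assumptions that the descriptors are centered (no intercept is fitted) and that selection stops as soon as the squared error on the internal validation set no longer decreases.

```python
def stagewise_fit(X, y, X_val, y_val, shrinkage=0.5, max_steps=200):
    """Greedy stagewise linear regression with early stopping.
    Returns one coefficient per descriptor column."""
    n_feat = len(X[0])
    coef = [0.0] * n_feat
    resid = list(y)

    def val_error(c):
        return sum((yv - sum(ci * xi for ci, xi in zip(c, row))) ** 2
                   for row, yv in zip(X_val, y_val))

    best_err = val_error(coef)
    for _ in range(max_steps):
        # Find the descriptor whose univariate fit to the residual helps most.
        best_j, best_gain, best_slope = None, 0.0, 0.0
        for j in range(n_feat):
            sxx = sum(row[j] ** 2 for row in X)
            if sxx == 0.0:
                continue
            sxr = sum(row[j] * r for row, r in zip(X, resid))
            gain = sxr * sxr / sxx  # reduction of the training squared error
            if gain > best_gain:
                best_j, best_gain, best_slope = j, gain, sxr / sxx
        if best_j is None:
            break
        trial = list(coef)
        trial[best_j] += shrinkage * best_slope  # take only a fraction of the step
        err = val_error(trial)
        if err >= best_err:
            break  # the internal validation set stops the selection
        coef, best_err = trial, err
        resid = [r - shrinkage * best_slope * row[best_j] for row, r in zip(X, resid)]
    return coef
```

With a smaller shrinkage each stage moves the coefficients less, so more stages are needed to reach the same fit, which matches the trade-off described for parameter (1).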
Kernel Partial Least Squares (KPLS) and Kernel Ridge Regression (KRR) are modifications of partial least squares (PLS) and ridge regression (RR) that use a non-linear kernel (for an introduction to kernel methods see the book by Schölkopf and Smola). The most important parameter for kernel-based methods is the type of kernel, as it determines the non-linear relations that can be modeled. Available kernels are: linear, polynomial, and radial basis functions, as well as the iterative similarity optimal assignment kernel (ISOAK). The first three kernels are used with molecular descriptors. The ISOAK kernel is defined directly on the molecular structure graph. The individual parameters for every kernel can be either specified manually or configured to be selected automatically by the method itself in an inner loop of cross-validation.
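As an illustration of how a kernel enters the regression, the sketch below implements textbook kernel ridge regression with a radial basis function kernel: the coefficients solve (K + λI)α = y, and a prediction is the kernel-weighted sum over the training compounds. The parameter names `gamma` and `lam` are assumptions, and the inner cross-validation loop for parameter selection is omitted.

```python
from math import exp

def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel between two descriptor vectors."""
    return exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def krr_fit(X, y, kernel=rbf_kernel, lam=1e-3):
    """Return a predict(row) function for kernel ridge regression."""
    K = [[kernel(a, b) + (lam if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    alpha = solve(K, y)
    return lambda row: sum(a * kernel(xi, row) for a, xi in zip(alpha, X))
```

Swapping `rbf_kernel` for a linear or polynomial kernel function changes the class of relations the model can represent without touching the fitting code.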
The LOGP-LIBRARY model does not require any additional configuration options. This model is based on ASNN and corrects the ALOGPS logP model using the so-called LIBRARY correction with the training set. The idea of this method is to adjust the LogP model to predict other properties. The success of this methodology was shown for the prediction of logD of chemical compounds at pH 7.4 [76, 77] and it was extended to the prediction of arbitrary properties in the OCHEM database.
Multiple Linear Regression Analysis (MLRA) uses step-wise variable selection. At each step, the method eliminates one variable whose regression coefficient is not significantly different from zero (according to the t test). Thus, MLRA has only one parameter, which corresponds to the p value of variables to be kept for the regression.
Support Vector Machine (SVM) uses the LibSVM program. The SVM method has two important configurable options: the SVM type (ε-SVR or ν-SVR) and the kernel type (linear, polynomial, radial basis function or sigmoid). The other options can be optimized by the method automatically using grid search.
Monitoring of the model calculation
After assigning the training and validation sets and specifying the descriptors and configuration parameters, the user is forwarded to a screen that displays the current status of the model calculation. Because the calculation can take a long time (up to several days or even weeks if large datasets and/or a large number of descriptors are used), the results can be fetched later, so the user can continue working with OCHEM while the model is being calculated. The calculated results are stored until the user retrieves them for further inspection.
All completed models are stored until the user checks them and decides to save or discard them. However, all completed models are deleted automatically after 1 month. It is possible to check the status of the pending models and to continue working with them (see Fig. 8).
The “pending tasks” dialog shows:
models being calculated at the moment and the current status of the calculation
models that successfully finished the training process and are waiting for the user’s decision to save or to discard them. These models are denoted as “ready” in the model “status” column.
failed models (e.g., terminated by user or failed because of errors during calculation process). The corresponding error message is displayed in the “Details” column.
From the pending tasks dialog, it is possible to terminate, delete or recalculate the models.
Analysis of models
OCHEM provides a variety of statistical instruments to analyze the performance of models, to find outliers in the training and validation sets, to discover the reasons for the outliers and to assess the applicability domain of the model. In this section, we briefly overview these instruments.
Regression models. Commonly used measures of a regression model performance are the root mean square error (RMSE), the mean absolute error (MAE) and the squared correlation coefficient (r²). The OCHEM system calculates these statistical parameters for both the training and the validation sets.
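For reference, the three measures can be computed as in the sketch below. The squared correlation coefficient is taken here as the squared Pearson correlation between observed and predicted values, which is an assumption about the exact definition used; the function name is hypothetical.

```python
from math import sqrt

def regression_stats(observed, predicted):
    """Return RMSE, MAE and the squared Pearson correlation coefficient."""
    n = len(observed)
    err = [p - o for o, p in zip(observed, predicted)]
    rmse = sqrt(sum(e * e for e in err) / n)
    mae = sum(abs(e) for e in err) / n
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return {"rmse": rmse, "mae": mae, "r2": cov * cov / (var_o * var_p)}
```

A constant offset between observed and predicted values leaves the squared correlation at 1 while RMSE and MAE grow, which is one reason to report the measures together.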
For a convenient visual inspection of the results OCHEM is equipped with a graphical tool that allows the user to create an observed-versus-predicted chart. This type of chart is traditionally used to visualize the model performance and to discover outliers. Each compound from the input dataset is represented as a dot on this chart, where the x-coordinate of the dot corresponds to the value of the experimentally observed property and the y-coordinate is the value predicted by the model. Each dot on the chart is interactive; a click on the dot opens a window with the detailed information about the compound: name, measured and predicted property values, publication, conditions of experiment, etc. The possibility to track each compound to the reference source is a very important step for understanding the reasons why the compound is considered to be an outlier. A user can quickly check whether a bad prediction is due to an error in the dataset, to differences in the experimental conditions or to the failure of the model to predict the compound properly (see Fig. 9). This seemingly simple feature is a good example of the advantage of integrating the database with the modeling framework.
Classification models. The OCHEM system uses the average correct classification rate (as a percentage) as a measure of the performance of the classification models. The correct classification rate is complemented with a confusion matrix that shows the number of compounds classified correctly for every class as well as details of the misclassified compounds, e.g., how many compounds from class A are classified as belonging to class B (see Fig. 10).
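A confusion matrix and a correct classification rate can be derived from paired class labels as in the following sketch. The rate is computed here as the overall percentage of correctly classified compounds; whether OCHEM averages over compounds or over classes is not stated above, so this choice is an assumption.

```python
from collections import Counter

def confusion_matrix(observed, predicted):
    """Counter keyed by (observed_class, predicted_class)."""
    return Counter(zip(observed, predicted))

def correct_rate(matrix):
    """Percentage of compounds on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (obs, pred), n in matrix.items() if obs == pred)
    return 100.0 * correct / total
```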
Applicability domain assessment
A unique feature of OCHEM is the automatic assessment of the prediction accuracy. The estimation of the accuracy is based on the concept of “distance to a model” (DM), i.e., some numerical value estimated solely from molecular structures and experimental conditions, which correlates with the average model performance. Currently several DMs are supported: the standard deviation of an ensemble of models (STDEV), the correlation in the space of models (CORREL) and Mahalanobis distance (LEVERAGE). The DMs are calibrated against the accuracy of models for the training set using N-fold cross-validation as described elsewhere. The estimated accuracy of predictions as a function of the respective DM is visualized on the accuracy averaging plot (see Fig. 11), which shows the absolute values of the prediction residuals versus a respective DM. The DMs are used to estimate the prediction accuracies for new molecules. The same methodology has been recently extended for classification models. Currently, estimation of the prediction accuracy is readily available for the ASNN (CORREL, STDEV) and linear regression models (LEVERAGE). For other methods, e.g., kNN, KRR and KPLS, the estimation of the prediction accuracy can be performed using the bagging approach, which generates an ensemble of models and uses the standard deviation of the ensemble predictions (referred to as BAGGING-STDEV).
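The calibration idea can be sketched with the STDEV distance. The exact calibration procedure is described elsewhere and is not reproduced here; the sketch assumes a simple scheme in which validated predictions are sorted by DM, split into equally sized bins, and the mean absolute residual of each bin serves as the expected error for new compounds falling into that bin. All names are hypothetical.

```python
from bisect import bisect_left
from statistics import mean, stdev

def stdev_dm(ensemble_predictions):
    """Distance to model: sample standard deviation of the ensemble predictions."""
    return stdev(ensemble_predictions)

def calibrate(dms, abs_residuals, n_bins=3):
    """Return (upper_edges, mean_abs_error_per_bin) from validated predictions."""
    pairs = sorted(zip(dms, abs_residuals))
    size = -(-len(pairs) // n_bins)  # ceiling division
    bins = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    edges = [b[-1][0] for b in bins]          # largest DM seen in each bin
    errors = [mean(r for _, r in b) for b in bins]
    return edges, errors

def estimated_error(dm, edges, errors):
    """Expected absolute error for a new compound with the given DM."""
    i = min(bisect_left(edges, dm), len(errors) - 1)
    return errors[i]
```

A compound whose DM exceeds every value seen during calibration is assigned the error of the last bin, which is likely an optimistic estimate for compounds far outside the applicability domain.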
Comparison of models
Often it is useful to compare different models that are built using the same data but with different descriptors and machine learning methods. OCHEM supports a collective view of the models with the same training set. This screen is available from the basket browser as the model overview link.
Sustainability of models
Since OCHEM is a public database that is populated by users, the data may contain errors. Therefore, data may be changed during verification and correction by other users over time. This may lead to a significant alteration of the training sets and to invalidation of the previously developed models. To address this problem, OCHEM provides the possibility to recalculate an existing model while preserving the previous workflow parameters (e.g., by applying the same machine learning method with the same parameters and descriptors) and to compare the new results with the original model. This option is available solely for the user who has published the original model and for the OCHEM system administrators.
Application of models
List of available models
After the model has been successfully trained and saved, it can be applied to the prediction of new compounds (see the screen in Fig. 12).
To predict new compounds, the models are selected (by checking the rightmost checkbox) from previously trained and saved models. The model browser displays a brief summary of the model: the name, the predicted property, the training and validation sets and their sizes, the date of creation of the model. The following additional options are available from the model browser:
Download of model summary in Excel or CSV formats. Summary includes descriptors (where available), observed and predicted values and applicability domain values for training and validation sets.
Access to the model profile screen that displays complete information on the model and its statistics. In the model profile, it is possible to publish the model and recalculate it
Inspection of the training and validation sets by clicking on their names in the list
Deletion of models (only for non-published models)
Unpublished models are accessible only by the owner of a model and users of the same working group. Published models are available for revision for everyone. For security reasons, the OCHEM system assigns a random public identifier for each model. Any user who knows the identifier can access the model in a read-only mode. This is a convenient way to share a model with other users without publicly revealing it. It may be very helpful, for example, to provide an internal link for reviewers while submitting a paper for publication. A model becomes visible and accessible to all users once it is published on the OCHEM website. Published models cannot be deleted.
Predictions for new compounds
Any model in OCHEM can be applied to predict the target property for new compounds. A set of new compounds can either be provided in an SDF file, drawn manually in a molecule editor or selected as a basket, if the analyzed molecules are already present in the OCHEM database. After the molecules have been selected, the user is forwarded to the waiting screen. When the model calculation is completed the user is provided with the predictions by the selected models for all the target compounds. The predictions can be exported into an Excel file for further offline analysis.
For every prediction OCHEM estimates its accuracy (see the “Applicability domain assessment” section), which is very helpful for the users to decide whether the results of the given model are adequate for the purposes of their study.
Additionally, the predictions of a model are accessible via web services technology, which could be seamlessly integrated with other developing approaches in this area. The user can submit a molecule or a set of molecules using the web services and retrieve the predicted values. A tutorial and several examples on how to access predictions via the web services technology are provided on the OCHEM web site.