jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints
DOI: 10.1186/1758-2946-3-3
- Cite this article as:
- Hinselmann, G., Rosenbaum, L., Jahn, A. et al. J Cheminform (2011) 3: 3. doi:10.1186/1758-2946-3-3
Abstract
Background
The decomposition of a chemical graph is a convenient approach to encode information of the corresponding organic compound. While several commercial toolkits exist to encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. It provides several options such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it provides the functionality to combine, compare, or export the fingerprints into several formats.
Results
We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest path fingerprint that only includes the subset of shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an AUC ROC performance of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data set, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.
Conclusions
jCompoundMapper is a library for chemical graph fingerprints with several tweaking possibilities and exporting options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
Background
The decomposition of a chemical graph into a list of features is a convenient way to assess the similarity between chemical compounds by comparing the resulting lists of features. Such representations are also called chemical fingerprints [1]. These encodings are important for data mining applications like similarity-based machine learning approaches or similarity searches [2].
The goal of this work is to introduce an open source molecular fingerprinting library for data mining purposes which provides exact definitions of its fingerprinting algorithms. The algorithms can be parametrized with various options to adapt the encodings, for example, by applying a custom labeling function or by altering the search depth parameter. Additionally, the library can be used as a basis for new implementations. It is based on the Chemistry Development Kit [3], which also provides several fingerprints in its API. However, there are several differences. The first aim of jCompoundMapper is to focus on the exact definition of its encodings, which is crucial to describe the features in data mining experiments. The second aim is to provide the functionality to export the fingerprints or pairwise similarity matrices to formats of popular machine learning toolboxes. A label or property of an input compound, to be learned by a machine learning algorithm, can be included.
Most fingerprint algorithms rely on either the geometrical or the topological distance between the atoms of a structure. The topological information is stored in the all-shortest path matrix, which encodes the minimum topological distance between two atoms (vertices) by the shortest path using the bonds (edges). Organic compounds are usually weakly connected because the number of covalent bonds (vertex degree) of an organic molecule is limited. In contrast, the geometry of a structure can be interpreted as a fully connected graph. The complexity of both approaches can be reduced by limiting the search depth for topological fingerprints or by introducing a distance cut-off for geometrical fingerprints.
jCompoundMapper offers a variety of topological (e.g. radial atom environments [4], extended connectivity fingerprints [5], depth-first search fingerprints [6], or auto-correlation vectors [7]) and geometrical (e.g. two-point and three-point encodings [8, 9] or geometrical atom environments [10]) fingerprints. If applicable, it allows for a parameterization of an encoding, such as the search depth, the distance cut-off, the geometrical scaling factor, the atom typing scheme, or the hash space.
After the feature generation step, the list of features can be mapped to a vectorial format. One possibility is to encode a set of features as a hashed fingerprint. Here, a unique identifier of a feature is used to initialize a pseudo random number generator which produces numbers in $[0,h]\subset {\mathbb{N}}$, where h is the maximum size of the hash space. Thus, the dimensionality of the original feature space can be considerably reduced. For example, the Fingal fingerprint [11] uses the cyclic redundancy check algorithm to generate seeds for the hashing of chemical graph patterns. For an introduction into hashed fingerprints, please refer to the review by Brown [1]. Another strategy reserves fixed bit positions in a vector for specific feature types, like patterns obtained at a certain parameter (such as depth or distance) with a limited number of possible combinations. The definition of the CATS2D [7] vector is an example for this approach.
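The seeded-PRNG hashing scheme described above can be sketched in a few lines of Java. The class and method names are illustrative only and do not correspond to the jCompoundMapper API; the string hash code stands in for the feature identifier f.id.

```java
import java.util.List;
import java.util.Random;

// Minimal sketch of a hashed fingerprint: each feature's string form
// yields a seed, which drives a PRNG that picks a bit in [0, h).
public class HashedFingerprint {

    // Projects a list of feature strings onto a boolean vector of size h.
    public static boolean[] hash(List<String> features, int h) {
        boolean[] bits = new boolean[h];
        for (String f : features) {
            // The string hash code plays the role of the feature id f.id.
            Random rng = new Random(f.hashCode());
            bits[rng.nextInt(h)] = true; // collisions fold features together
        }
        return bits;
    }
}
```

Because the generator is seeded deterministically, the same feature always sets the same bit; distinct features may collide, which is the usual trade-off of hashed fingerprints.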
jCompoundMapper supports native formats of common open source machine learning libraries. The exporters can be used to write feature maps to comma-separated format, LIBSVM [12] format (sparse and matrix), and WEKA ARFF [13]. Therefore, various data mining libraries can be directly applied on the output files. Furthermore, the library provides efficient data structures to compare sets of features in the case that the computation of a similarity matrix is required.
The quality of the encodings was compared on QSAR and toxicity benchmark problems in the results section. First, we conducted experiments using the support vector regression of LIBSVM on the well-known Sutherland QSAR benchmark set [14]. Second, we used a large Ames toxicity classification benchmark [15] and LIBLINEAR [16] to evaluate the performance using binary hashed sparse fingerprints. On the Sutherland data sets, the averaged squared correlation of the all-shortest-path and the atom triplet fingerprint was at least 5% better on ACHE than the best encoding given by Sutherland et al. On BZR and DHFR, the all-shortest path fingerprint achieved a squared correlation of 0.57 and 0.76, respectively. The performance was comparable on two data sets. On the remaining three data sets, the best encoding was more than 5% worse than the results of the best encoding published by Sutherland et al. On the Ames toxicity data set, the implementation of the extended connectivity fingerprint achieved an AUC ROC performance of 0.87, which is comparable to the performance by a non-linear support vector machine trained on state-of-the-art descriptors. Nevertheless, the goal was not an exhaustive comparison but to show that the implementations are able to obtain similar results when compared against literature results. jCompoundMapper features a command-line interface but can also be used as a Java API. It depends solely on open source libraries and is licensed under the LGPL. The source code and an executable are available at Sourceforge.
The library originated from various implementations of literature fingerprints and descriptors used in comparison studies. The encodings were employed either as part of a new approach or as a reference method [17–21].
To sum up, jCompoundMapper is an open source library for the encoding of chemical graphs as fingerprints. It can be used from the command-line interface or as a Java API. Hence, a further use in applications, like in KNIME http://www.knime.org nodes, is possible. The overall performance of the fingerprints in machine learning experiments indicates that structure-based models of reasonable quality can be obtained.
Methods
Prerequisites
Notation
The binned geometrical distance matrix G_{ ij } encodes the spatial distances between two heavy atoms. The topological distance matrix T_{ ij } encodes the shortest topological distance between atoms i and j. The labeling function $l({a}_{i})\to {\widehat{a}}_{i}$ types an atom according to a specific labeling scheme. The maximum distance d allowed between two atoms (geometrical or topological) defines a distance cut-off for features: all features with g_{ ij } > d or t_{ ij } > d are omitted. A labeled path p is a sequence of atoms connected by bonds $p=({\widehat{a}}_{0},{b}_{0},{\widehat{a}}_{1},{b}_{1},\dots ,{b}_{d-1},{\widehat{a}}_{d})$, where bond b_{ i } connects ${\widehat{a}}_{i}$ with ${\widehat{a}}_{i+1}$. p_{ ij } denotes a path connecting the ith atom with the jth atom. The depth d for topological patterns is the maximum number of bonds allowed for connecting the first atom with the last atom. Analogously to the definition of topological paths, a geometrical pattern must consist of different atoms, i.e. for two atoms a_{ i }, a_{ j } it holds that i ≠ j. Finally, a ⊕ b is defined as the concatenation of alphanumerical string symbols separated by a unique delimiter. In the following, we assume a hydrogen-depleted molecular graph C with n atoms.
An encoding algorithm F maps a compound C to a set of m features,

$$F(C)=X=\{{f}_{1},{f}_{2},\dots ,{f}_{m}\},$$

where m depends, with the exception of fixed-vector fingerprints, on C. A feature f has a unique $id\in \mathbb{N}$ and a string representation f.nom. f.id does not necessarily depend on f.nom; however, in most cases it is convenient to use a hash code of the string representation.
Fundamental Matrices
The topological distance matrix T_{ ij } is defined as follows: the element i, j contains the length of the shortest path between the ith and the jth atom (t_{ ij }). t_{ ij } is computed by the Floyd-Warshall algorithm. Therefore, the computation time for the matrix is O(n^{3}).
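As a sketch, the matrix can be computed as follows from an adjacency matrix; the representation and class name are illustrative, not part of the library API.

```java
// Sketch of the Floyd-Warshall computation of the topological distance
// matrix T from an adjacency matrix (1 = bond, 0 = no bond). O(n^3) time.
public class TopologicalDistance {

    static final int INF = 1_000_000; // "no path" sentinel

    public static int[][] shortestDistances(int[][] adjacency) {
        int n = adjacency.length;
        int[][] t = new int[n][n];
        // Initialize: distance 0 to self, 1 across a bond, INF otherwise.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                t[i][j] = (i == j) ? 0 : (adjacency[i][j] == 1 ? 1 : INF);
        // Relax every pair over every intermediate vertex k.
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (t[i][k] + t[k][j] < t[i][j])
                        t[i][j] = t[i][k] + t[k][j];
        return t;
    }
}
```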
Feature Extraction
The molecular similarity is based on the numerical identifiers f_{ x }.id of the features. Two features are regarded as equal if f_{ x }.id = f_{ y }.id. In all implementations, the features of a compound C are counted by recurrence: a feature is included if its id differs from those of all previously extracted features; if a feature with the same id is generated again, the count for that feature is incremented. All atom pair encodings are extracted by regarding only the upper half of the distance matrix. For each atom pair, the string representation is generated in both reading directions, and only the version with the greater hash code is included in the final set of descriptors.
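The canonical reading direction of an atom-pair string can be sketched as follows; names are illustrative and not the library API.

```java
// Sketch of atom-pair canonicalization: generate the string in both
// reading directions and keep the one with the greater hash code, so
// that the same pair always yields the same descriptor string.
public class PairCanonicalizer {
    public static String canonical(String left, int dist, String right) {
        String a = left + "-" + dist + "-" + right;
        String b = right + "-" + dist + "-" + left;
        return a.hashCode() >= b.hashCode() ? a : b;
    }
}
```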
The modified depth-first search applied in this library generates all possible paths originating from a root atom. The feature space can therefore be approximated by an m-ary tree and is O(nm^{ d }), where n is the number of heavy atoms, m the number of children in the m-ary tree, and d the depth of the tree. In organic compounds, every atom has at most 4 neighbors, so the effective branching factor is m = 4 - 1 = 3 because the bond to the preceding atom has already been visited. Thus, the hypothetical worst case has a complexity of O(n3^{ d }) at a search depth of d. If we assume an average branching factor α, which is slightly above 1 for organic compounds [6], the depth-first search has a complexity of O(nα^{ d }). The average branching factor depends on the average degree of a vertex, which is about 2 in organic molecules. We define DFS(a_{ i }, d) as the set of all possible paths originating from a root atom a_{ i } with a depth of up to d.
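A minimal sketch of such a path enumeration is shown below; the adjacency-list graph representation and all names are illustrative, atom labels stand in for the typed atoms, and bond labels are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of DFS(a_i, d): all labeled paths of up to d bonds starting
// from a root atom of a hydrogen-depleted graph given as adjacency lists.
public class PathEnumerator {

    public static List<String> dfs(List<List<Integer>> adj, String[] labels,
                                   int root, int d) {
        List<String> paths = new ArrayList<>();
        boolean[] visited = new boolean[labels.length];
        extend(adj, labels, root, d, labels[root], visited, paths);
        return paths;
    }

    private static void extend(List<List<Integer>> adj, String[] labels,
                               int atom, int depthLeft, String path,
                               boolean[] visited, List<String> out) {
        visited[atom] = true;
        out.add(path); // every prefix is itself a path feature
        if (depthLeft > 0)
            for (int next : adj.get(atom))
                if (!visited[next])
                    extend(adj, labels, next, depthLeft - 1,
                           path + "-" + labels[next], visited, out);
        visited[atom] = false; // allow the atom on other branches
    }
}
```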
For some of the definitions, we define a can function that maps a set of features to a single canonical pattern. In an implementation, this function can be realized by first sorting the patterns, which is possible whenever a natural order can be defined on the features, and then merging the sorted list into a single canonical representation.
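A minimal sketch of the can function, assuming string features with their natural lexicographic order (the delimiter and bracket characters are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of can(): sort the patterns into their natural order and merge
// them into one canonical feature string, separated by a delimiter.
public class Canonicalizer {
    public static String can(List<String> patterns) {
        return patterns.stream().sorted()
                       .collect(Collectors.joining("|", "[", "]"));
    }
}
```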
Atom Types and Pharmacophore Types
- 1.
Element symbol (e.g. C, O, N, ...)
- 2.
CDK atom types (e.g. C.sp2, O.minus, N.amine, ...)
- 3.
Element plus the number of neighboring heavy atoms (e.g. C.2, O.1, N.2, ...)
- 4.
Element plus ring type plus the number of neighboring heavy atoms (e.g. C.r.2, C.a.2, O.1, N.2, ...) where r is an arbitrary ring, and a is an aromatic system. If a_{ i } is not contained in a ring, no ring type is set. The precedence is a > r.
- 5.
Daylight invariants (plus optional ring flag) have the following properties, separated by a dot: atomic number, number of heavy atom neighbors, valency minus the number of connected hydrogens, atomic mass, atomic charge, number of connected hydrogens, and a flag indicating whether the atom is a member of at least one ring (e.g. 6.2.3.12.0.1.1 for a carbon in a benzene ring)
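As a sketch, labeling scheme 5 amounts to joining the listed properties with a dot. In a real implementation the property values would be read from the CDK atom object; here they are passed in directly, and the class name is illustrative.

```java
// Sketch of the Daylight-invariant atom label: the seven properties are
// concatenated with "." in the order listed above.
public class DaylightInvariant {
    public static String label(int atomicNumber, int heavyNeighbors,
                               int valenceMinusH, int mass, int charge,
                               int hydrogens, boolean inRing) {
        return atomicNumber + "." + heavyNeighbors + "." + valenceMinusH
             + "." + mass + "." + charge + "." + hydrogens + "."
             + (inRing ? 1 : 0);
    }
}
```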
- 1.
Hydrogen-bond donor (D): [#8H] oxygen atom of an OH group; [#7H,#7H2] nitrogen atom of an NH or NH_{2} group
- 2.
Hydrogen-bond acceptor (A): [#8] oxygen atom; [#7H0] nitrogen atom not adjacent to a hydrogen atom
- 3.
Positive (P): [*+] atom with a positive charge; [#7H2] nitrogen atom of an NH_{2} group
- 4.
Negative (N): [*-] atom with a negative charge; [C&$(C(=O)[#8H1]), P&$(P(=O)O), S&$(S(=O)O)] carbon, sulfur, or phosphorus atom of a COOH, SOOH, or POOH group (SMARTS replaced by a direct graph search)
- 5.
Lipophilic (L): [Cl,Br,I] chlorine, bromine, or iodine atom; [S;D2;$(S(C)(C))] sulfur atom adjacent to exactly two carbon atoms; sulfur atom adjacent to only carbon atoms (SMARTS replaced by a direct graph search)
Encodings
Topological Fingerprints
All encodings described in the following section rely on the d parameter which constrains the maximum topological distance allowed between two atoms a_{ i }, a_{ j } in a feature.
All-Path Encoding (DFS)
All-Shortest Path Encoding (ASP)
Thus, the all-shortest path encoding is a subset of the paths contained in the DFS fingerprint. It is similar to topological atom pair approaches [8], with the exception that the shortest paths between two atoms are explicitly stored. Borgwardt et al. proposed a graph kernel based on the set of all shortest paths [22]; however, only the vertex pairs and their shortest-path distances were included in that work. The explicit generation of paths is necessary because the Floyd-Warshall algorithm computes only the shortest distances between two vertices.
Topological Atom Pairs (AP2D)
For 2-point patterns, this can easily be done by regarding only the upper half of the distance matrix T_{ ij } (i.e. i < j). O(n^{2}) is needed for the generation of the features and O(n^{3}) for the computation of T_{ ij }. Thus, the total computation time is cubic.
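The upper-triangle extraction can be sketched as follows; all names are illustrative, and the hash-based direction canonicalization described in the Feature Extraction section is omitted here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of AP2D feature generation: for every atom pair i < j within
// the distance cut-off d, emit "type_i - t_ij - type_j".
public class AtomPairs {
    public static List<String> encode(String[] types, int[][] t, int d) {
        List<String> features = new ArrayList<>();
        for (int i = 0; i < types.length; i++)
            for (int j = i + 1; j < types.length; j++) // upper triangle only
                if (t[i][j] <= d)
                    features.add(types[i] + "-" + t[i][j] + "-" + types[j]);
        return features;
    }
}
```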
Topological Atom Triplets (AT2D)
Topological Autocorrelation Keys (CATS2D)
where offset(p) returns the predefined start index for the pattern p, d_{2D}(p) returns the topological distance between two atoms in the PPP pair, and φ _{ p }(X) counts the number of occurrences of a pattern p in X.
Pharmacophore Pair and Triplet Encodings (PHAP2PT2D, PHAP3PT2D)
where t_{ ij }, t_{ jk }, t_{ ki } ≤ d. In fact, there are three additional inner loops over all valid pharmacophore points at atoms i, j, k. The complexity of these inner loops is theoretically 5^{3} because the cardinality of the set of PPPs is 5. However, this complexity is further reduced because the meaning of some PPP definitions is contradictory for some combinations, such as "atom is positively charged" and "atom is negatively charged". The overall complexity is O(n^{3}) because of the constant computation time of the inner loops.
SHED Key (SHED)
where PPP_{ i } (l) denotes the ith PPP pair that is separated by a topological distance l. If pattern i was not found, the value of the ith dimension is set to 0. The distribution of a PPP pair is calculated by regarding the different distances 1, 2, ...,l, ..., d. The resulting vector has 15 real-valued entries.
Extended Connectivity Fingerprints (ECFP)
We implemented a variant of the ECFP as described by Rogers and Hahn [5]. Each ECFP feature represents a circular substructure around a center atom. The algorithm starts with the initial atom identifier of the center atom and grows a circular substructure around this atom throughout a defined number of iterations (search depth). For each round, the current extended version of the feature is added to the final set of features. In contrast to other radial fingerprints, the bonding information is included. Therefore, a feature can be extracted, for example, as canonical SMILES.
The current implementation of the ECFP in jCompoundMapper differs slightly from the original implementation. In the original algorithm, the identifiers of the alpha atoms of a center atom are used to calculate an updated identifier for the center atom. The algorithm only includes the alpha atoms of a center atom in each iteration, and thus the connectivity information is completely discarded between the layers. However, the identifier of a center atom implicitly contains information from farther and farther away from the center atom in each iteration because the atom identifiers of the previous iteration are used. We explicitly model the growing substructure by using the initial atom identifiers in each iteration and keeping the connectivity information between the layers. After an iteration, new possible attachment points for a specific circular substructure are kept in memory, and those attachment points are extended in the next iteration.
Topological Molprint-like fingerprints (RAD2D)
Local Path Environments (LSTAR)
This fingerprint is a radial fingerprint similar to RAD2D. The major difference is that all paths up to depth d are stored in a shell. First, the tree of all paths originating from an atom a_{ i } is generated. Then, all paths of a certain length are assigned to a shell s(a_{ i }) _{ d } containing the paths of length d originating from root atom a_{ i }. This is equal to a canonical representation of DFS(a_{ i }, d) in a single canonical feature. The paths in a shell are sorted in lexicographical order to be comparable. The resulting fingerprint contains all shells of depth 1, 2, ..., l, ..., d. The major difference to the Molprint-like fingerprints is that the bond information is still included.
Geometrical Fingerprints
All geometrical encodings support the d parameter which defines the distance cut-off between two atoms. Another important parameter is the scaling factor s, as described at the beginning of this section.
Geometrical Atom Pairs and Atom Triplets (AP3D, AT3D)
where i ≠ j and g_{ ij } ≤ d.
where i ≠ j ≠ k and g_{ ij }, g_{ jk }, g_{ ki } ≤ d.
This is a standard encoding implemented in several toolkits; a kernel based on such patterns was published by Mahé et al. [9].
Geometrical CATS fingerprints (CATS3D)
where d_{3D}(p) returns the geometrical distance of the two atoms, which equals g_{ ij } between any atoms i, j contained in a feature.
Geometrical pharmacophore fingerprints (PHAP2PT3 D, PHAP3PT3D)
These fingerprints are derived from their topological variants PHAP2PT2D and PHAP3PT2D by replacing the T_{ ij } matrix with G_{ ij }. Let P_{ i } denote the set of valid PPPs for the ith atom; then P_{ i } ⊕ g_{ ij } ⊕ P_{ j } is a valid two-point pharmacophore and P_{ i } ⊕ g_{ ij } ⊕ P_{ j } ⊕ g_{ jk } ⊕ P_{ k } ⊕ g_{ ki } is a valid three-point pharmacophore, where g_{ ij }, g_{ jk }, g_{ ki } ≤ d.
Again, P_{ i } denotes the set of valid PPPs for the ith atom.
Geometrical Molprint-like fingerprints (RAD3D)
Examples of Encodings
| Encoding | Eq.^{ a } | c param^{ b } | Pattern produced by f |
|---|---|---|---|
| DFS | 4 | 0 | C.2-N.3-C.3:1, C.3-C.2-N.3:1, C.3-N.3-C.3:1, C.3-N.3:1, C.2-C.3-N.3:1, C.3-C.3-N.3:1, ... |
| ASP | 5 | 1 | N.3-C.3=O.1:1, C.1-C.3-N.3:1, C.2-N.3:1, C.2-N.3-C.3:1, C.3-C.2-N.3:1, C.3-N.3:1, ... |
| AP2D | 6 | 2 | N.3-1-C.2:1, N.3-1-C.3:1, N.3-2-C.2:1, N.3-2-C.1:1, N.3-2-C.3:1, O.1-2-N.3:1 |
| AT2D | 7 | 3 | C.2-2-N.3-1-C.3-1:1, N.3-2-C.2-2-C.2-1:1, N.3-2-C.2-2-C.3-2:1, ... |
| CATS2D | 8 | 6 | 0:5, 2:2, 3:4, ... |
| PHAP2POINT2D | 9 | 8 | A-2-A:1, L-2-A:1, N-2-A:1 |
| PHAP3POINT2D | 10 | 9 | A-2-A-2-L-2:1, A-2-A-2-L-2:1, ... |
| SHED | 11 | 14 | AA:2.8, AL:3.596, AN:2.872, ... |
| ECFP | - | 12 | [*]N([*])C(=O)C:1, [*]=C([*])N(C[*])C([*])[*]:1, [*]N([*])[*]:1, ... |
| RAD2D | 12 | 15 | 0[N]1[C C C]:1, 0[N]1[C C C]2[C C C C O]:1 |
| LSTAR | 13 | 13 | [N.3-C.2, N.3-C.3, N.3-C.3]:1, [N.3-C.2-C.3, N.3-C.3-C.1, N.3-C.3-C.2, N.3-C.3-C.3, N.3-C.3=O.1]:1 |
| AP3D | 14 | 4 | N.3-1-C.2:1, N.3-1-C.3:1, N.3-2-C.1:1, N.3-2-C.2:1, N.3-2-C.3:1, O.1-2-N.3:1 |
| AT3D | 15 | 5 | C.3-1-O.1-2-N.3-1:1, O.1-2-C.1-2-N.3-2:1, C.2-2-C.2-2-N.3-1:1, ... |
| CATS3D | 16 | 7 | 0:5, 2:2, 3:4, ... |
| PHAP2POINT3D | 17 | 10 | A-2-A:1, L-2-A:1, N-2-A:1 |
| PHAP3POINT3D | 18 | 11 | A-2-A-2-L-2:1, A-2-L-2-A-2:1, L-2-A-2-A-2:1 |
| RAD3D | 19 | 16 | 0[N.3]1[C.2 C.3 C.3]2[C.1 C.2 C.3 C.3 O.1]:1, 0[N.3]1[C.2 C.3 C.3]:1 |
Implementation
Third-party libraries
The underlying chemical expert system is the Chemistry Development Kit (CDK) [3, 26] in its current development version 1.3.5. It provides the basic functionality for parsing, typing, and graph algorithms for molecular data. For the command-line interface, we employed the Apache Commons command-line parser 1.2 http://commons.apache.org/cli/. Access via the API or via the command-line binary enables the user to employ the library for batch processing. The language level is Java 1.6.
Additional functionality
Import and Export of Data
For the command-line tool, the valid input format is the MDL SD format with attached hydrogens. Via the API, CDK molecule objects can be processed directly.
There are exporters for various formats of popular machine learning toolboxes. The ARFF format is the native WEKA [13] format, the support vector machine libraries LIBSVM and LIBLINEAR are supported by their sparse hashed format and precomputed matrix format. Alternatively, there are several comma-separated formats which support hashed or string features, which can be imported into toolboxes like R or MATLAB.
jCompoundMapper includes a buffered random access reader for parsing the input files. Thus, it can read files of the maximum size supported by the Java runtime environment. The memory requirements are low if the encodings are exported sequentially (such as the sparse LIBSVM format) because only a single encoding has to be stored at a time. If the computation of a similarity matrix is required, all encodings are kept in memory to ensure a fast computation of similarities. For this reason, a matrix computation requires additional memory for large data sets.
The label or class for learning tasks is read from the SD property and is integrated into the specific output format. As for the ARFF format, a nominal or numeric class label is created, depending on the distribution of labels in the input format. The user may overwrite the default threshold for the number of classes (currently, this is set to five).
Hashing
A hash function H is used to project the set of features to a binary vector of dimension h. The size depends on the expected number of features (see Table 1). H generates the hashed bit of a pattern depending on the numerical seed f.id assigned to each feature. In most cases, the seed equals a hash code of the string representation. The hashing step is also useful to obtain nominal features for fast comparisons. A nominal feature is a feature f with a finite set of possible values, like f ∈ {red, green, blue} or, convenient for chemical graphs, f ∈ {pattern included, pattern not included}.
Similarity Matrices
jCompoundMapper offers a FeatureMap data structure, which stores the features in a wrapped hash map. Different metrics are defined on this data structure, such as MinMax and Tanimoto [27]. Thus, it is possible to compute distance matrices within seconds on an average desktop computer. Now, we assume two mappings F (C_{ A } ) = A and F (C_{ B } ) = B. Further, let φ_{ p } (X) count the number of occurrences of pattern p ∈ X.
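The two similarity measures can be sketched on plain Java maps as follows; the FeatureMap wrapper itself is not reproduced, and all names are illustrative. A feature map stores φ_{ p }(X), the occurrence count of each pattern p; Tanimoto is computed on the binary (presence/absence) view of the maps.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the MinMax and Tanimoto similarities on feature count maps.
public class FeatureMapSimilarity {

    // MinMax: sum of per-pattern minimum counts over sum of maximum counts.
    public static double minMax(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> union = new HashSet<>(a.keySet());
        union.addAll(b.keySet());
        double mins = 0, maxs = 0;
        for (String p : union) {
            int ca = a.getOrDefault(p, 0), cb = b.getOrDefault(p, 0);
            mins += Math.min(ca, cb);
            maxs += Math.max(ca, cb);
        }
        return maxs == 0 ? 0 : mins / maxs;
    }

    // Tanimoto: |intersection| / |union| of the pattern sets.
    public static double tanimoto(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> inter = new HashSet<>(a.keySet());
        inter.retainAll(b.keySet());
        int union = a.size() + b.size() - inter.size();
        return union == 0 ? 0 : (double) inter.size() / union;
    }
}
```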
The feature maps make it possible to compute similarity matrices with jCompoundMapper on the full set of features of a compound C, without introducing noise by hashing. Nevertheless, it is also possible to generate hashed binary fingerprint objects for any of the encodings.
Results and Discussion
Computation Time Benchmarks
Conversion Time
| Encoding | param^{ b } | molecules/s | mean f.^{ a } | max f.^{ a } | median f.^{ a } | complex. Mem.^{ c } |
|---|---|---|---|---|---|---|
| DFS | d = 8, a = EN | 68.6 | 396 | 3,554 | 362 | O(nα^{ d }) |
| ASP | d = 8, a = EN | 112.1 | 216 | 1,198 | 204 | O(nα^{ d }) |
| AP2D | d = 8, a = EN | 339.8 | 95 | 256 | 96 | O(n^{2}) |
| AT2D | d = 5, a = EN | 91.6 | 1,848 | 7,922 | 1,811 | O(n^{3}) |
| CATS2D | d = 9, a = PPP | 6.9 | 150 | 150 | 150 | O(1) |
| PHAP2PT2D | d = 8, a = PPP | 6.8 | 35 | 132 | 34 | O(p^{2}) |
| PHAP3PT2D | d = 5, a = PPP | 6.7 | 300 | 1,664 | 277 | O(p^{3}) |
| SHED | d = 8, a = PPP | 8.0 | 15 | 15 | 15 | O(1) |
| ECFP | d = 4, a = DIR | 181.0 | 77 | 349 | 77 | O(nα^{ d }) |
| RAD2D | d = 3, a = EN | 232.5 | 55 | 192 | 55 | O(nd) |
| LSTAR | d = 6, a = EN | 136.6 | 144 | 884 | 143 | O(nd) |
| AP3D | d = 10, a = EN | 332.5 | 112 | 336 | 113 | O(n^{2}) |
| AT3D | d = 6, a = EN | 71.4 | 3,450 | 27,774 | 3,188 | O(n^{3}) |
| CATS3D | d = 9, a = PPP | 6.6 | 150 | 150 | 150 | O(1) |
| PHAP2PT3D | d = 10, a = PPP | 7.2 | 41 | 189 | 40 | O(p^{2}) |
| PHAP3PT3D | d = 6, a = PPP | 7.1 | 660 | 5,844 | 601 | O(p^{3}) |
| RAD3D | d = 4, a = EN | 227.1 | 85 | 611 | 85 | O(nd) |
Machine Learning Performance
A major application of molecular encodings is structure-based machine learning and data mining. The aim of such applications is either the prediction of molecular properties or the ranking of compounds according to a trained model. In the following experiments, we assessed the quality of the encodings implemented in jCompoundMapper on several established regression and classification benchmark problems. The encodings were used with the default parameters as given in Table 2. The compounds were prepared using CORINA [30] for initial coordinates and were refined using Schrödinger MacroModel [31] with the OPLS 2005 force field.
QSAR Regression Problems with LIBSVM
LIBSVM [12] is a library for support vector machines. For the experiments on the regression benchmarks, we decided to train ϵ-support vector regression on the benchmark compilation of eight pIC50 QSAR problems published by Sutherland et al. [14]. The Gram matrices were precomputed by the MinMax similarity, which is also a valid kernel function. We conducted these experiments to find out whether there are significant differences between the performances of the different encodings.
Nested Cross-validation MSE Performance on the Sutherland Data Sets
| Encoding | ACE | ACHE | BZR | COX2 | DHFR | GPB | THERM | THR |
|---|---|---|---|---|---|---|---|---|
| DFS | 1.73 ± 0.74 | 0.66 ± 0.28 | 0.61 ± 0.31 | 1.13 ± 0.30 | 0.57 ± 0.21 | 0.66 ± 0.54 | 2.10 ± 1.31 | 0.60 ± 0.38 |
| ASP | 1.70 ± 0.72 | 0.62 ± 0.26 | 0.53 ± 0.27 | 1.11 ± 0.30 | 0.58 ± 0.21 | 0.63 ± 0.48 | 2.09 ± 1.32 | 0.59 ± 0.38 |
| AP2D | 1.50 ± 0.70 | 0.85 ± 0.37 | 0.70 ± 0.37 | 1.03 ± 0.30 | 0.73 ± 0.29 | 0.61 ± 0.45 | 2.19 ± 1.20 | 0.50 ± 0.31 |
| AT2D | 1.57 ± 0.69 | 0.74 ± 0.34 | 0.69 ± 0.35 | 0.97 ± 0.27 | 0.66 ± 0.30 | 0.60 ± 0.47 | 1.97 ± 1.20 | 0.49 ± 0.32 |
| CATS2D | 1.76 ± 0.72 | 0.93 ± 0.33 | 0.89 ± 0.45 | 1.35 ± 0.43 | 0.69 ± 0.19 | 0.64 ± 0.45 | 2.28 ± 1.14 | 0.52 ± 0.32 |
| PHAP2PT2D | 1.77 ± 0.71 | 0.96 ± 0.33 | 0.91 ± 0.45 | 1.38 ± 0.44 | 0.72 ± 0.20 | 0.65 ± 0.48 | 2.18 ± 1.10 | 0.53 ± 0.31 |
| PHAP3PT2D | 1.81 ± 0.69 | 0.96 ± 0.33 | 0.82 ± 0.39 | 1.23 ± 0.41 | 0.67 ± 0.21 | 0.56 ± 0.49 | 1.89 ± 1.16 | 0.57 ± 0.37 |
| SHED | 2.08 ± 0.76 | 1.05 ± 0.50 | 1.09 ± 0.46 | 1.64 ± 0.48 | 1.49 ± 0.35 | 0.70 ± 0.33 | 2.71 ± 1.54 | 0.49 ± 0.28 |
| ECFP | 1.80 ± 0.77 | 0.72 ± 0.29 | 0.66 ± 0.32 | 1.01 ± 0.28 | 0.57 ± 0.20 | 0.68 ± 0.55 | 2.19 ± 1.36 | 0.51 ± 0.33 |
| RAD2D | 1.87 ± 0.75 | 0.77 ± 0.33 | 0.79 ± 0.37 | 1.08 ± 0.30 | 0.71 ± 0.27 | 0.72 ± 0.59 | 2.20 ± 1.33 | 0.50 ± 0.35 |
| LSTAR | 1.97 ± 0.79 | 0.72 ± 0.29 | 0.69 ± 0.30 | 1.04 ± 0.27 | 0.62 ± 0.19 | 0.76 ± 0.61 | 2.31 ± 1.39 | 0.50 ± 0.31 |
| AP3D | 1.60 ± 0.69 | 0.69 ± 0.32 | 0.59 ± 0.32 | 0.93 ± 0.27 | 0.67 ± 0.23 | 0.67 ± 0.51 | 2.73 ± 1.35 | 0.57 ± 0.33 |
| AT3D | 1.77 ± 0.68 | 0.64 ± 0.28 | 0.67 ± 0.36 | 0.99 ± 0.28 | 0.57 ± 0.18 | 0.74 ± 0.60 | 2.75 ± 1.37 | 0.60 ± 0.28 |
| CATS3D | 1.75 ± 0.70 | 0.90 ± 0.38 | 0.81 ± 0.36 | 1.31 ± 0.41 | 0.73 ± 0.20 | 0.79 ± 0.49 | 2.47 ± 1.32 | 0.62 ± 0.32 |
| PHAP2PT3D | 1.75 ± 0.70 | 0.87 ± 0.36 | 0.81 ± 0.40 | 1.32 ± 0.41 | 0.73 ± 0.20 | 0.83 ± 0.54 | 2.53 ± 1.30 | 0.65 ± 0.33 |
| PHAP3PT3D | 1.99 ± 0.77 | 0.82 ± 0.29 | 0.81 ± 0.36 | 1.14 ± 0.37 | 0.59 ± 0.17 | 0.86 ± 0.69 | 2.84 ± 1.46 | 0.69 ± 0.30 |
| RAD3D | 2.17 ± 0.78 | 0.78 ± 0.31 | 0.73 ± 0.35 | 1.10 ± 0.30 | 0.57 ± 0.17 | 0.82 ± 0.69 | 2.79 ± 1.37 | 0.58 ± 0.31 |
An analogue setup was used to compute nested leave-one-out cross-validation results to compare the predictive performance of the support vector machine in combination with the jCompoundMapper encodings to literature results. Again, we optimized the parameters C and ϵ (we used a 10-fold cross-validation repeated 2 times to select the best parameter combination in the inner loop) and trained a model for each of the n leave-one-out sets and predicted the external sample for each model.
Nested Leave-one-out MSE Performance on Sutherland Data Sets
| Encoding | ACE | ACHE | BZR | COX2 | DHFR | GPB | THERM | THR |
|---|---|---|---|---|---|---|---|---|
| DFS | 1.93 | 0.61 | 0.61 | 1.09 | 0.57 | 0.66 | 2.08 | 0.55 |
| ASP | 1.93 | 0.56 | 0.54 | 1.07 | 0.56 | 0.63 | 2.08 | 0.53 |
| AP2D | 1.59 | 0.79 | 0.68 | 1.04 | 0.72 | 0.64 | 2.06 | 0.45 |
| AT2D | 1.68 | 0.69 | 0.67 | 0.92 | 0.69 | 0.65 | 1.92 | 0.45 |
| CATS2D | 1.83 | 0.83 | 0.96 | 1.34 | 0.65 | 0.62 | 2.20 | 0.45 |
| PHAP2PT2D | 1.88 | 0.92 | 0.98 | 1.37 | 0.67 | 0.62 | 2.10 | 0.46 |
| PHAP3PT2D | 1.83 | 0.91 | 0.85 | 1.20 | 0.66 | 0.58 | 2.04 | 0.50 |
| SHED | 2.11 | 1.00 | 1.13 | 1.71 | 1.41 | 0.72 | 2.94 | 0.43 |
| ECFP | 2.01 | 0.66 | 0.66 | 0.96 | 0.57 | 0.65 | 2.17 | 0.47 |
| RAD2D | 1.99 | 0.73 | 0.79 | 1.03 | 0.72 | 0.67 | 2.18 | 0.43 |
| LSTAR | 2.29 | 0.66 | 0.68 | 1.00 | 0.60 | 0.71 | 2.31 | 0.46 |
| AP3D | 1.88 | 0.64 | 0.59 | 0.90 | 0.67 | 0.70 | 2.61 | 0.54 |
| AT3D | 2.04 | 0.60 | 0.65 | 0.97 | 0.58 | 0.71 | 2.70 | 0.59 |
| CATS3D | 1.91 | 0.85 | 0.84 | 1.26 | 0.70 | 0.74 | 2.62 | 0.58 |
| PHAP2PT3D | 1.92 | 0.81 | 0.85 | 1.30 | 0.70 | 0.76 | 2.72 | 0.62 |
| PHAP3PT3D | 2.40 | 0.73 | 0.81 | 1.11 | 0.59 | 0.82 | 2.82 | 0.67 |
| RAD3D | 2.43 | 0.73 | 0.73 | 1.04 | 0.57 | 0.75 | 2.75 | 0.55 |
Nested Leave-one-out Correlation Performance on Sutherland Data Sets

Encoding | ACE | ACHE | BZR | COX2 | DHFR | GPB | THERM | THR |
---|---|---|---|---|---|---|---|---|
DFS | 0.79 | 0.78 | 0.71 | 0.68 | 0.86 | 0.69 | 0.70 | 0.68 |
ASP | 0.79 | 0.80 | 0.75 | 0.69 | 0.87 | 0.71 | 0.70 | 0.69 |
AP2D | 0.83 | 0.70 | 0.67 | 0.71 | 0.82 | 0.71 | 0.70 | 0.75 |
AT2D | 0.82 | 0.74 | 0.67 | 0.74 | 0.83 | 0.70 | 0.73 | 0.75 |
CATS2D | 0.81 | 0.68 | 0.49 | 0.60 | 0.84 | 0.72 | 0.68 | 0.75 |
PHAP2PT2D | 0.80 | 0.64 | 0.48 | 0.59 | 0.83 | 0.72 | 0.70 | 0.74 |
PHAP3PT2D | 0.80 | 0.64 | 0.56 | 0.65 | 0.84 | 0.73 | 0.71 | 0.71 |
SHED | 0.78 | 0.61 | 0.34 | 0.46 | 0.61 | 0.66 | 0.57 | 0.76 |
ECFP | 0.78 | 0.76 | 0.68 | 0.73 | 0.86 | 0.70 | 0.68 | 0.74 |
RAD2D | 0.78 | 0.73 | 0.59 | 0.70 | 0.82 | 0.68 | 0.68 | 0.76 |
LSTAR | 0.75 | 0.77 | 0.68 | 0.72 | 0.86 | 0.66 | 0.66 | 0.74 |
AP3D | 0.80 | 0.77 | 0.72 | 0.74 | 0.83 | 0.67 | 0.60 | 0.68 |
AT3D | 0.78 | 0.80 | 0.68 | 0.72 | 0.86 | 0.66 | 0.58 | 0.65 |
CATS3D | 0.79 | 0.67 | 0.57 | 0.63 | 0.83 | 0.64 | 0.60 | 0.66 |
PHAP2PT3D | 0.79 | 0.69 | 0.56 | 0.62 | 0.83 | 0.63 | 0.58 | 0.62 |
PHAP3PT3D | 0.73 | 0.73 | 0.58 | 0.68 | 0.86 | 0.59 | 0.55 | 0.58 |
RAD3D | 0.73 | 0.73 | 0.63 | 0.70 | 0.86 | 0.63 | 0.57 | 0.69 |
Comparison with Literature Results on Sutherland Data Sets

Data set | Best Encoding | ${Q}_{loo}^{2}$ | Sutherland | ${Q}_{loo}^{2}$ |
---|---|---|---|---|
ACE | AP2D | 0.69 | HQSAR | 0.72 |
ACHE | ASP, AT3D | 0.57 | CoMSIA | 0.49 |
BZR | ASP | 0.56 | CoMSIA | 0.45 |
COX2 | AT2D | 0.55 | CoMSIA | 0.57 |
DHFR | ASP | 0.76 | CoMSIA | 0.69 |
GPB | PHAP3PT2D | 0.53 | HQSAR | 0.66 |
THERM | AT3D | 0.53 | 2.5D | 0.66 |
THR | RAD2D, RAD3D | 0.58 | CoMSIA | 0.72 |
Classification of Toxic Compounds with LIBLINEAR
Another increasingly important task is building models on large data sets of chemicals. A learner that can cope with such a setting is LIBLINEAR [16], a linear large-scale support vector machine. The large Ames data set was published by Hansen et al. [15] and contains 6512 compounds together with their measured toxicity in an Ames test. We skipped the encodings based on the PPP typer because several compounds have no pharmacophore point according to the PPP definition. The results were obtained by tuning the C parameter with log_{2} C ∈ {-8, -7, ..., 2} within a 2-fold cross-validation on the training set and evaluating the model performance on the five predefined splits described in [15].
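The grid search over C described above can be sketched as follows. The `Evaluator` interface and `evaluateTwoFoldAuc` are hypothetical placeholders for training LIBLINEAR on one half of the training set and scoring AUC on the other half (and vice versa); only the grid itself, log_{2} C from -8 to 2, is taken from the text.

```java
// Sketch of the C-parameter grid search: log2(C) runs from -8 to 2, and the
// value with the best inner 2-fold cross-validation score is selected.
public class CGridSearch {

    // Placeholder for the 2-fold cross-validation of a linear SVM at cost C.
    public interface Evaluator {
        double evaluateTwoFoldAuc(double c);
    }

    static double selectBestC(Evaluator eval) {
        double bestC = Double.NaN;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int log2c = -8; log2c <= 2; log2c++) {
            double c = Math.pow(2, log2c);
            double score = eval.evaluateTwoFoldAuc(c);
            if (score > bestScore) {
                bestScore = score;
                bestC = c;
            }
        }
        return bestC;
    }

    public static void main(String[] args) {
        // Toy evaluator whose AUC peaks at C = 1 (log2 C = 0).
        Evaluator toy = new Evaluator() {
            public double evaluateTwoFoldAuc(double c) {
                return 1.0 - Math.abs(Math.log(c) / Math.log(2)) * 0.01;
            }
        };
        System.out.println("selected C = " + selectBestC(toy));
    }
}
```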
Comparison with Literature Results on the Ames Toxicity Benchmark

Encoding | AUC ROC^{ a } |
---|---|
ECFP | 0.87 ± 0.01 |
LSTAR | 0.86 ± 0.01 |
SVM-dragonX^{ b } | 0.86 ± 0.01 |
RAD2D | 0.85 ± 0.02 |
RAD3D | 0.85 ± 0.01 |
ASP | 0.85 ± 0.02 |
GP-dragonX^{ b } | 0.84 ± 0.01 |
AT2D | 0.83 ± 0.01 |
RF-dragonX^{ b } | 0.83 ± 0.01 |
AT3D | 0.81 ± 0.01 |
kNN-dragonX^{ b } | 0.79 ± 0.01 |
DFS | 0.78 ± 0.01 |
AP2D | 0.78 ± 0.03 |
AP3D | 0.76 ± 0.02 |
Java API and Command-line Interface
Java API Usage Example
The core of the library is a Java API, which makes it possible to process chemical information at an abstract level, similar to a workflow tool. The example given in Appendix 1 reads an MDL SD file, converts the compounds to feature maps, and calculates all pairwise similarities.
Command-Line Interface Example
Using the defaults (or via -ff 0), jCompoundMapper generates hashed fingerprints in LIBSVM output format, using the depth-first search encoding with element plus neighbor count atom types.
In the following, we process the training set and the known test set from the environmental toxicity challenge (http://www.cadaster.eu/node/65), which were converted to MDL SD format. The label (an MDL property) to be learned is log(IGC50-1). With these settings, the structures of the training set are mapped to hashed fingerprints using the defaults.
java -jar jCMapperCLI.jar -f challenge_train.sdf -l "log(IGC50-1)"
After the computation, overall statistics are printed, showing e.g. the average number of features per fingerprint. In the next step, we map the test file to the same representation. Bits in the test file are set at exactly the same positions in the vector because the random numbers are generated from a seed value defined by the features themselves.
java -jar jCMapperCLI.jar -f challenge_test_known.sdf -l "log(IGC50-1)"
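The determinism noted above can be illustrated with a small feature-hashing sketch: the bucket index is derived from the feature itself, so mapping the training and the test file independently still yields compatible vectors. This only mimics the idea; jCompoundMapper's actual hashing internals may differ, and the feature strings below are made up.

```java
// Deterministic feature hashing: the same feature string always sets the same
// bit, regardless of which file it occurs in, because the index depends only
// on the feature (here via String.hashCode, acting like a feature-defined seed).
public class HashedFingerprint {

    static int[] hashFeatures(String[] features, int bits) {
        int[] vector = new int[bits];
        for (String f : features) {
            // non-negative modulo, so negative hash codes map into [0, bits)
            int index = ((f.hashCode() % bits) + bits) % bits;
            vector[index] = 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        // A feature shared by a training and a test compound lands on the same bit.
        int[] train = hashFeatures(new String[]{"C.3-C.3", "C.3-N.2"}, 1024);
        int[] test = hashFeatures(new String[]{"C.3-N.2"}, 1024);
        System.out.println("train bits set: " + sum(train) + ", test bits set: " + sum(test));
    }

    static int sum(int[] v) {
        int s = 0;
        for (int x : v) s += x;
        return s;
    }
}
```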
In the next step, a cross-validation is conducted using the precompiled binary distribution of LIBSVM, which can be downloaded from the LIBSVM homepage. The parameters are set as follows: -t 0 selects the linear kernel (dot product), -s 3 selects ϵ-support vector regression, and -c 2 sets the cost parameter C to 2. The input file was produced in the previous step.
svmtrain -t 0 -s 3 -v 10 -c 2 challenge_train.DFS.LIBSVM_SPARSE
LIBSVM produces no model file in cross-validation mode. However, the cross-validation statistics show that the model achieves an MSE of 0.32 and a Q^{2} of 0.71, indicating a reasonable parametrization.
Cross Validation Mean squared error = 0.324891
Cross Validation Squared correlation coefficient = 0.712412
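The two statistics LIBSVM reports can be re-derived by hand: MSE is the mean of the squared residuals, and the squared correlation coefficient is the square of Pearson's r between predicted and observed values. The following is an independent illustration of the formulas, not LIBSVM source code.

```java
// MSE and squared correlation coefficient, as reported by LIBSVM after
// regression: mse = mean((y - yHat)^2), r^2 = Pearson correlation squared.
public class RegressionStats {

    static double mse(double[] y, double[] yHat) {
        double sse = 0.0;
        for (int i = 0; i < y.length; i++) {
            double d = y[i] - yHat[i];
            sse += d * d;
        }
        return sse / y.length;
    }

    static double squaredCorrelation(double[] y, double[] yHat) {
        int n = y.length;
        double sy = 0, syh = 0, syy = 0, syhyh = 0, syyh = 0;
        for (int i = 0; i < n; i++) {
            sy += y[i];
            syh += yHat[i];
            syy += y[i] * y[i];
            syhyh += yHat[i] * yHat[i];
            syyh += y[i] * yHat[i];
        }
        double num = n * syyh - sy * syh;
        double den = (n * syy - sy * sy) * (n * syhyh - syh * syh);
        return num * num / den;
    }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0};
        double[] yHat = {1.1, 1.9, 3.2};
        System.out.println("MSE = " + mse(y, yHat));
        System.out.println("r^2 = " + squaredCorrelation(y, yHat));
    }
}
```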
Finally, the model is trained by omitting the cross-validation flag -v.
svmtrain -t 0 -s 3 -c 2 challenge_train.DFS.LIBSVM_SPARSE
This step produces a separate model file, which can be used to predict the external test set generated in the second step. This is done by calling svmpredict.
svmpredict challenge_test_known.DFS.LIBSVM_SPARSE challenge_train.DFS.LIBSVM_SPARSE.model result
LIBSVM prints the results, showing that the performance on the external test set is MSE = 0.29 and R^{2} = 0.74. This result on the known test set of the environmental toxicity prediction challenge would rank among the top submissions of the competition. The individual prediction values can be inspected by opening the LIBSVM output file result in an editor.
Mean squared error = 0.291011 (regression)
Squared correlation coefficient = 0.742283 (regression)
Conclusions
jCompoundMapper is an open source library for molecular fingerprinting with a focus on machine learning and data mining applications. For users not familiar with programming, a command-line interface allows simple usage from the shell or in scripts. The architecture provides the functionality to derive new fingerprints from existing ones or to integrate custom encodings. In contrast to closed source fingerprinting toolkits, a scientist knows exactly how a fingerprint is computed (e.g. the labeling function and the distance cut-offs) and can even inspect the source code of the generation routine. We compared the performance using linear and non-linear support vector machines on standard machine learning benchmarks in the research field. The results show that the machine learning performance of the encodings with default parameters is already close to that of more sophisticated state-of-the-art descriptors. The binary distribution provides a command-line interface allowing for the generation of models from the shell with open source software such as LIBSVM or WEKA in reasonable time on average desktop computers. The library itself uses only functionality of open source software licensed under the LGPL and can therefore be used in any project compatible with the CDK. Further projects based on the library, such as a KNIME node wrapping jCompoundMapper, are planned.
Availability
- 1.
External library, which can be integrated as Java jar library file
- 2.
External library, including sources
- 3.
Binary command-line tool (requires a Java runtime environment) and a short tutorial with a prepared data set
Appendix
Appendix 1 - Usage of the API
Example of using the API: read molecules, map the compounds to encodings, and compute a similarity matrix.

// read SD file
RandomAccessMDLReader reader = new RandomAccessMDLReader(new File("molecules.sdf"));

// convert all compounds in the data set to feature maps
final ArrayList<FeatureMap> featureMaps = new ArrayList<FeatureMap>();
final FingerPrinter fingerprinter = new Encoding2DAllShortestPaths();
for (int i = 0; i < reader.getSize(); i++) {
    ArrayList<IFeature> rawFeatures = fingerprinter.getFingerprint(reader.getMol(i));
    featureMaps.add(new FeatureMap(rawFeatures));
}

// compute all pairwise similarities on the feature maps
final int dim = featureMaps.size();
double[][] matrix = new double[dim][dim];
IDistanceMeasure similarity = new DistanceTanimoto();
for (int i = 0; i < dim; i++) {
    for (int j = i; j < dim; j++) {
        matrix[i][j] = similarity.getSimilarity(featureMaps.get(i), featureMaps.get(j));
    }
}
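The DistanceTanimoto step in the listing above computes the Tanimoto coefficient, |A ∩ B| / |A ∪ B|. The following sketch writes it out on plain string sets for illustration; the library itself operates on FeatureMap objects, and the feature strings here are made up.

```java
import java.util.HashSet;
import java.util.Set;

// Tanimoto similarity on two feature sets: the size of the intersection
// divided by the size of the union.
public class TanimotoSketch {

    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 1.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        Set<String> fp1 = new HashSet<String>();
        fp1.add("C-C"); fp1.add("C-N"); fp1.add("C-O");
        Set<String> fp2 = new HashSet<String>();
        fp2.add("C-C"); fp2.add("C-N");
        // 2 shared features out of 3 in the union
        System.out.println("Tanimoto = " + tanimoto(fp1, fp2));
    }
}
```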
Supplementary material
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.