jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints
Decomposing a chemical graph is a convenient way to encode information about the corresponding organic compound. While several commercial toolkits encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. The library provides several options, such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it can combine, compare, or export the fingerprints in several formats.
We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit (CDK) toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest-path fingerprint, which includes only the shortest paths from the full set of paths of the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and the number of features for each encoding and describe the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an area under the ROC curve (AUC) of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data sets, the best fingerprint encodings showed a comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.
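The relationship between the depth-first search fingerprint and the all-shortest-path fingerprint can be illustrated with a minimal sketch. It is not the jCompoundMapper implementation and does not use the CDK: the class `PathFingerprintSketch`, its toy graph model (atom labels plus an adjacency list), and the plain label strings used as features are assumptions made for illustration only. Bond orders, atom typing schemes, and hashing are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model: not the jCompoundMapper or CDK data structures.
public class PathFingerprintSketch {
    private final String[] labels;
    private final List<List<Integer>> adjacency = new ArrayList<List<Integer>>();

    public PathFingerprintSketch(String[] labels, int[][] bonds) {
        this.labels = labels;
        for (int i = 0; i < labels.length; i++) adjacency.add(new ArrayList<Integer>());
        for (int[] bond : bonds) {
            adjacency.get(bond[0]).add(bond[1]);
            adjacency.get(bond[1]).add(bond[0]);
        }
    }

    // Label strings of all simple paths with at most maxDepth bonds.
    // With shortestOnly, a path is kept only if no shorter path joins its end atoms.
    public Set<String> paths(int maxDepth, boolean shortestOnly) {
        Set<String> features = new TreeSet<String>();
        for (int start = 0; start < labels.length; start++) {
            List<Integer> path = new ArrayList<Integer>();
            path.add(start);
            extend(path, maxDepth, shortestOnly, distancesFrom(start), features);
        }
        return features;
    }

    // Depth-first extension of the current path by one unvisited neighbour at a time.
    private void extend(List<Integer> path, int maxDepth, boolean shortestOnly,
                        int[] distance, Set<String> features) {
        int end = path.get(path.size() - 1);
        int bonds = path.size() - 1;
        // Every prefix of a shortest path is itself shortest, so pruning here is safe.
        if (shortestOnly && distance[end] != bonds) return;
        features.add(canonical(path));
        if (bonds == maxDepth) return;
        for (int next : adjacency.get(end)) {
            if (!path.contains(next)) {
                path.add(next);
                extend(path, maxDepth, shortestOnly, distance, features);
                path.remove(path.size() - 1);
            }
        }
    }

    // Breadth-first search: number of bonds from start to every atom (-1 if unreachable).
    private int[] distancesFrom(int start) {
        int[] distance = new int[labels.length];
        Arrays.fill(distance, -1);
        distance[start] = 0;
        Deque<Integer> queue = new ArrayDeque<Integer>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int atom = queue.poll();
            for (int next : adjacency.get(atom)) {
                if (distance[next] < 0) {
                    distance[next] = distance[atom] + 1;
                    queue.add(next);
                }
            }
        }
        return distance;
    }

    // A path and its reverse describe the same feature: keep the smaller string.
    private String canonical(List<Integer> path) {
        StringBuilder forward = new StringBuilder();
        StringBuilder backward = new StringBuilder();
        for (int i = 0; i < path.size(); i++) {
            if (i > 0) {
                forward.append('-');
                backward.append('-');
            }
            forward.append(labels[path.get(i)]);
            backward.append(labels[path.get(path.size() - 1 - i)]);
        }
        String f = forward.toString();
        String b = backward.toString();
        return f.compareTo(b) <= 0 ? f : b;
    }

    public static void main(String[] args) {
        // Four-membered ring with one oxygen (an oxetane-like skeleton, hydrogens omitted).
        PathFingerprintSketch ring = new PathFingerprintSketch(
                new String[] {"C", "C", "C", "O"},
                new int[][] {{0, 1}, {1, 2}, {2, 3}, {3, 0}});
        System.out.println("all paths:      " + ring.paths(3, false));
        System.out.println("shortest paths: " + ring.paths(3, true));
    }
}
```

In the ring example, the three-bond paths travel the long way around the ring between two directly bonded atoms, so they appear in the full path set but not in the shortest-path subset.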
jCompoundMapper is a library for chemical graph fingerprints with several tuning options and exporters for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
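As a rough illustration of what exporting fingerprints for a data mining toolkit can involve, the following sketch writes one compound as a line in the sparse `label index:value` text format read by LIBSVM. It is an assumption-laden example, not the library's exporter: the class `SparseExportSketch`, the method `toLibSvmLine`, the use of `String.hashCode` for hashing, and the dimension of 1024 are all hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: not part of jCompoundMapper.
public class SparseExportSketch {

    // Hashes each feature string to a 1-based index, counts occurrences per index,
    // and renders the result as "label index:value index:value ...".
    public static String toLibSvmLine(int label, Iterable<String> features, int dimension) {
        // TreeMap keeps indices sorted; the sparse format expects ascending indices.
        Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        for (String feature : features) {
            int index = Math.floorMod(feature.hashCode(), dimension) + 1;
            counts.merge(index, 1, Integer::sum);
        }
        StringBuilder line = new StringBuilder(Integer.toString(label));
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Three feature occurrences for one compound with class label 1.
        System.out.println(toLibSvmLine(1, Arrays.asList("C-C", "C-C", "C-O"), 1024));
    }
}
```

Hashing into a fixed dimension keeps the vectors compact but allows collisions, in which case two different features share one index and their counts are added.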
- Brown, N (2009) Chemoinformatics - An Introduction for Computer Scientists. ACM Comput Surv 41: pp. 8:1-8:38
- Willett, P, Barnard, JM, Downs, GM (1998) Chemical Similarity Searching. J Chem Inf Comput Sci 38: pp. 983-996
- Steinbeck, C, Han, Y, Kuhn, S, Horlacher, O, Luttmann, E, Willighagen, E (2003) The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 43: pp. 493-500
- Bender, A, Mussa, HY, Glen, RC, Reiling, S (2004) Similarity Searching of Chemical Databases Using Atom Environment Descriptors (MOLPRINT 2D): Evaluation of Performance. J Chem Inf Comput Sci 44: pp. 1708-1718
- Rogers, D, Hahn, M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50: pp. 742-754
- Ralaivola, L, Swamidass, SJ, Saigo, H, Baldi, P (2005) Graph kernels for chemical informatics. Neural Networks 18: pp. 1093-1110
- Renner, S, Fechner, U, Schneider, G Alignment-free Pharmacophore Patterns - A Correlation Vector Approach. In: Langer, T, Hoffmann, R eds. (2006) Pharmacophores and Pharmacophore Searches. Wiley-VCH, Weinheim, pp. 49-79
- Carhart, RE, Smith, DH, Venkataraghavan, R (1985) Atom Pairs as Features in Structure-Activity Studies: Definition and Applications. J Chem Inf Comput Sci 25: pp. 64-73
- Mahé, P, Ralaivola, L, Stoven, V, Vert, JP (2006) The Pharmacophore Kernel for Virtual Screening with Support Vector Machines. J Chem Inf Model 46: pp. 2003-2014
- Bender, A, Mussa, HY, Gill, GS, Glen, RC (2004) Molecular Surface Point Environments for Virtual Screening and the Elucidation of Binding Patterns (MOLPRINT 3D). J Med Chem 47: pp. 6569-6583
- Brown, N, McKay, B, Gasteiger, J (2005) Fingal: A Novel Approach to Geometric Fingerprinting and a Comparative Study of Its Application to 3D-QSAR Modelling. QSAR Comb Sci 24: pp. 480-484
- Chang, CC, Lin, CJ (2001) LIBSVM: A Library for Support Vector Machines.
- Hall, M, Frank, E, Holmes, G, Pfahringer, B, Reutemann, P, Witten, IH (2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations 11: pp. 10-18
- Sutherland, JJ, O'Brien, LA, Weaver, DF (2004) A Comparison of Methods for Modeling Quantitative Structure-Activity Relationships. J Med Chem 47: pp. 5541-5554
- Hansen, K, Mika, S, Schroeter, T, Sutter, A, ter Laak, A, Steger-Hartmann, T, Heinrich, N, Müller, KR (2009) Benchmark Data Set for in Silico Prediction of Ames Mutagenicity. J Chem Inf Model 49: pp. 2077-2081
- Fan, RE, Chang, KW, Hsieh, CJ, Wang, XR, Lin, CJ (2008) LIBLINEAR: A Library for Large Linear Classification. J Mach Learn Res 9: pp. 1871-1874
- Fechner, N, Jahn, A, Hinselmann, G, Zell, A (2010) Estimation of the applicability domain of kernel-based machine learning models for virtual screening. J Cheminf 2: pp. 2
- Hinselmann, G, Fechner, N, Jahn, A, Eckert, M, Zell, A (2010) Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74: pp. 219-229
- Hinselmann, G, Jahn, A, Fechner, N, Zell, A (2009) Chronic Rat Toxicity Prediction of Chemical Compounds Using Kernel Machines. Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics: 7th European Conference (EvoBio 2009). Springer, Tübingen, Germany, pp. 25-36
- Jahn, A, Hinselmann, G, Fechner, N, Zell, A (2009) Optimal Assignment Methods for Ligand-Based Virtual Screening. J Cheminf 1: pp. 14
- Jahn, A, Hinselmann, G, Fechner, N, Henneges, C, Zell, A (2010) Probabilistic Modeling of Conformational Space for 3D Machine Learning Approaches. Molecular Informatics 29: pp. 441-455
- Borgwardt, KM, Ong, CS, Schönauer, S, Vishwanathan, SVN, Smola, AJ, Kriegel, HP (2005) Protein function prediction via graph kernels. Bioinformatics 21: pp. 47-56
- Schneider, G, Neidhart, W, Giller, T, Schmid, G (1999) Scaffold-Hopping by Topological Pharmacophore Search: A Contribution to Virtual Screening. Angew Chem Int Ed 38: pp. 2894-2896
- Gregori-Puigjané, E, Mestres, J (2006) SHED: Shannon Entropy Descriptors from Topological Feature Distributions. J Chem Inf Model 46: pp. 1615-1622
- Bender, A, Mussa, HY, Glen, RC (2005) Screening for Dihydrofolate Reductase Inhibitors Using MOLPRINT 2D, a Fast Fragment-Based Method Employing the Naive Bayesian Classifier: Limitations of the Descriptor and the Importance of Balanced Chemistry in Training and Test Sets. J Biomol Screen 10: pp. 658-666
- Steinbeck, C, Hoppe, C, Kuhn, S, Floris, M, Guha, R, Willighagen, E (2006) Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 12: pp. 2111-2120
- Swamidass, SJ, Chen, J, Bruand, J, Phung, P, Ralaivola, L, Baldi, P (2005) Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics 21: pp. 359-368
- Nasr, R, Swamidass, SJ, Baldi, P (2009) Large scale study of multiple-molecule queries. J Cheminf 1: pp. 7
- Chen, J, Swamidass, SJ, Dou, Y, Baldi, P (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21: pp. 4133-4139
- Gasteiger, J, Rudolph, C, Sadowski, J (1992) Automatic Generation of 3D-Atomic Coordinates for Organic Molecules. Tetrahedron Comput Methodol 3: pp. 537-547
- Schrödinger, LLC (2008) Schrödinger MacroModel 9.6. Schrödinger, LLC, New York, NY
- Bouckaert, RR, Frank, E Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. In: Dai, H, Srikant, R, Zhang, C eds. (2004) Advances in Knowledge Discovery and Data Mining - Proceedings of 8th Pacific-Asia Conference, PAKDD 2004. pp. 3-12
- Fechner, N, Jahn, A, Hinselmann, G, Zell, A (2009) Atomic Local Neighborhood Flexibility Incorporation into a Structured Similarity Measure for QSAR. J Chem Inf Model 49: pp. 549-560
- Talete srl, Milano, Italy: [http://www.talete.mi.it/] dragonX 1.4 for Linux (Molecular Descriptor Calculation Software).
Journal of Cheminformatics
- Online Date: January 2011
- Publisher: Chemistry Central