jCompoundMapper: An open source Java library and command-line tool for chemical fingerprints
The decomposition of a chemical graph is a convenient way to encode information about the corresponding organic compound. While several commercial toolkits encode molecules as so-called fingerprints, only a few open source implementations are available. The aim of this work is to introduce a library for exactly defined molecular decompositions, with a strong focus on the application of these features in machine learning and data mining. The library provides several options, such as search depth, distance cut-offs, and atom and pharmacophore typing. Furthermore, it can combine, compare, and export fingerprints in several formats.
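To make the idea of comparing fingerprints concrete, the following minimal sketch computes the Tanimoto coefficient of two feature sets. It assumes fingerprints are represented as plain sets of feature strings; the class `TanimotoSketch` and its method are hypothetical names chosen for illustration and are not part of the jCompoundMapper API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; hypothetical helper, not jCompoundMapper API. */
final class TanimotoSketch {

    /** Tanimoto coefficient: shared features divided by the size of the union. */
    static double tanimoto(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention: two empty fingerprints are identical
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    public static void main(String[] args) {
        // Toy feature sets; real fingerprints would contain many more features.
        Set<String> x = new HashSet<>(Arrays.asList("C-C", "C-O", "C-N"));
        Set<String> y = new HashSet<>(Arrays.asList("C-C", "C-O", "C=O", "O-H"));
        System.out.println(tanimoto(x, y)); // 2 shared features, 5 in the union
    }
}
```

Sets are used here for simplicity; a count-based or hashed representation would need a correspondingly adapted similarity measure.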
We provide a Java 1.6 library for the decomposition of chemical graphs based on the open source Chemistry Development Kit toolkit. We reimplemented popular fingerprinting algorithms such as depth-first search fingerprints, extended connectivity fingerprints, autocorrelation fingerprints (e.g. CATS2D), radial fingerprints (e.g. Molprint2D), geometrical Molprint, atom pairs, and pharmacophore fingerprints. We also implemented custom fingerprints such as the all-shortest path fingerprint, which includes only the shortest paths from the full set of paths enumerated by the depth-first search fingerprint. As an application of jCompoundMapper, we provide a command-line executable binary. We measured the conversion speed and the number of features for each encoding and described the composition of the features in detail. The quality of the encodings was tested using the default parametrizations in combination with a support vector machine on the Sutherland QSAR data sets. Additionally, we benchmarked the fingerprint encodings on the large-scale Ames toxicity benchmark using a large-scale linear support vector machine. The results were promising and could often compete with literature results. On the large Ames benchmark, for example, we obtained an area under the ROC curve (AUC) of 0.87 with a reimplementation of the extended connectivity fingerprint. This result is comparable to the performance achieved by a non-linear support vector machine using state-of-the-art descriptors. On the Sutherland QSAR data sets, the best fingerprint encodings showed comparable or better performance on 5 of the 8 benchmarks when compared against the results of the best descriptors published in the paper of Sutherland et al.
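The relation between the depth-first search fingerprint and the all-shortest path fingerprint can be illustrated with a small sketch. It is written under simplifying assumptions: the molecule is a hand-coded adjacency list with element symbols as atom types, bond orders are ignored, and features are kept as readable strings rather than hashed. The class `PathFingerprintSketch` and its methods are hypothetical and do not reproduce the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; class and method names are hypothetical, not jCompoundMapper API. */
final class PathFingerprintSketch {
    private final String[] labels;   // atom type per vertex
    private final int[][] neighbors; // adjacency list of the chemical graph

    PathFingerprintSketch(String[] labels, int[][] neighbors) {
        this.labels = labels;
        this.neighbors = neighbors;
    }

    /** Enumerates simple paths with at most maxDepth bonds; optionally keeps only shortest paths. */
    Set<String> paths(int maxDepth, boolean shortestOnly) {
        int[][] dist = shortestOnly ? allPairsDistances() : null;
        Set<String> features = new TreeSet<>();
        for (int start = 0; start < labels.length; start++) {
            extend(start, maxDepth, new boolean[labels.length], new ArrayList<Integer>(), dist, features);
        }
        return features;
    }

    /** Depth-first extension of the current path by one atom, with backtracking. */
    private void extend(int atom, int maxDepth, boolean[] visited, List<Integer> path,
                        int[][] dist, Set<String> out) {
        visited[atom] = true;
        path.add(atom);
        int bonds = path.size() - 1;
        // Without the shortest-path filter every path is a feature; with it, a path is kept
        // only if its length equals the graph distance between its two end atoms.
        if (dist == null || dist[path.get(0)][atom] == bonds) {
            out.add(canonical(path));
        }
        if (bonds < maxDepth) {
            for (int next : neighbors[atom]) {
                if (!visited[next]) {
                    extend(next, maxDepth, visited, path, dist, out);
                }
            }
        }
        path.remove(path.size() - 1);
        visited[atom] = false;
    }

    /** A path and its reverse are the same feature: keep the lexicographically smaller string. */
    private String canonical(List<Integer> path) {
        StringBuilder forward = new StringBuilder();
        StringBuilder backward = new StringBuilder();
        for (int i = 0; i < path.size(); i++) {
            if (i > 0) {
                forward.append('-');
                backward.append('-');
            }
            forward.append(labels[path.get(i)]);
            backward.append(labels[path.get(path.size() - 1 - i)]);
        }
        String f = forward.toString();
        String b = backward.toString();
        return f.compareTo(b) <= 0 ? f : b;
    }

    /** Breadth-first search from every atom gives the topological distance matrix. */
    private int[][] allPairsDistances() {
        int n = labels.length;
        int[][] dist = new int[n][n];
        for (int source = 0; source < n; source++) {
            Arrays.fill(dist[source], -1);
            dist[source][source] = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(source);
            while (!queue.isEmpty()) {
                int atom = queue.remove();
                for (int next : neighbors[atom]) {
                    if (dist[source][next] < 0) {
                        dist[source][next] = dist[source][atom] + 1;
                        queue.add(next);
                    }
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // Heavy-atom graph of oxirane: a three-membered C-C-O ring.
        PathFingerprintSketch ring = new PathFingerprintSketch(
                new String[] {"C", "C", "O"}, new int[][] {{1, 2}, {0, 2}, {0, 1}});
        System.out.println(ring.paths(2, false)); // all paths up to two bonds
        System.out.println(ring.paths(2, true));  // shortest paths only
    }
}
```

For the three-membered ring, the unrestricted enumeration with a search depth of two yields six distinct features, while the shortest-path variant keeps four, because every two-bond path in the ring connects atoms that are also directly bonded. On an acyclic graph both variants produce the same feature set.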
jCompoundMapper is a library for chemical graph fingerprints with several tuning options and export options for open source data mining toolkits. The quality of the data mining results, the conversion speed, the LGPL software license, the command-line interface, and the exporters should be useful for many applications in cheminformatics, such as benchmarks against literature methods, comparison of data mining algorithms, similarity searching, and similarity-based data mining.
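As an illustration of what exporting a fingerprint for a data mining toolkit can involve, the sketch below writes one molecule as a line in the sparse text format read by LIBSVM and LIBLINEAR (`label index:value ...`, with ascending indices). It assumes string features are folded into a fixed-size index space via `String.hashCode()`; the class `LibsvmExportSketch`, its methods, and the choice of hash are hypothetical and say nothing about how the library's own exporters work.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; hypothetical helper, not jCompoundMapper API. */
final class LibsvmExportSketch {

    /** Folds string features into indices 1..size and counts how often each index occurs. */
    static Map<Integer, Integer> hashCounts(Iterable<String> features, int size) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String feature : features) {
            // floorMod keeps the result non-negative even for negative hash codes.
            int index = Math.floorMod(feature.hashCode(), size) + 1;
            counts.merge(index, 1, Integer::sum);
        }
        return counts;
    }

    /** Formats one sample as "label index:value ..." with indices in ascending order. */
    static String toLibsvmLine(int label, Map<Integer, Integer> counts) {
        StringBuilder line = new StringBuilder().append(label);
        for (Map.Entry<Integer, Integer> entry : new TreeMap<>(counts).entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Toy features of one molecule with class label 1.
        Map<Integer, Integer> counts = hashCounts(Arrays.asList("C-C", "C-O", "C-C"), 1024);
        System.out.println(toLibsvmLine(1, counts));
    }
}
```

Folding features into a fixed index space can cause collisions between distinct features; a larger `size` reduces them at the cost of a wider feature space.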
- Brown N: Chemoinformatics - An Introduction for Computer Scientists. ACM Comput Surv 2009, 41:8:1–8:38.
- Willett P, Barnard JM, Downs GM: Chemical Similarity Searching. J Chem Inf Comput Sci 1998, 38(6):983–996.
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43(2):493–500.
- Bender A, Mussa HY, Glen RC, Reiling S: Similarity Searching of Chemical Databases Using Atom Environment Descriptors (MOLPRINT 2D): Evaluation of Performance. J Chem Inf Comput Sci 2004, 44(5):1708–1718.
- Rogers D, Hahn M: Extended-connectivity fingerprints. J Chem Inf Model 2010, 50(5):742–754.
- Ralaivola L, Swamidass SJ, Saigo H, Baldi P: Graph kernels for chemical informatics. Neural Networks 2005, 18(8):1093–1110.
- Renner S, Fechner U, Schneider G: Alignment-free Pharmacophore Patterns - A Correlation Vector Approach. In Pharmacophores and Pharmacophore Searches. Edited by: Langer T, Hoffmann R. Weinheim: Wiley-VCH; 2006:49–79.
- Carhart RE, Smith DH, Venkataraghavan R: Atom Pairs as Features in Structure-Activity Studies: Definition and Applications. J Chem Inf Comput Sci 1985, 25:64–73.
- Mahé P, Ralaivola L, Stoven V, Vert JP: The Pharmacophore Kernel for Virtual Screening with Support Vector Machines. J Chem Inf Model 2006, 46(5):2003–2014.
- Bender A, Mussa HY, Gill GS, Glen RC: Molecular Surface Point Environments for Virtual Screening and the Elucidation of Binding Patterns (MOLPRINT 3D). J Med Chem 2004, 47(26):6569–6583.
- Brown N, McKay B, Gasteiger J: Fingal: A Novel Approach to Geometric Fingerprinting and a Comparative Study of Its Application to 3D-QSAR Modelling. QSAR Comb Sci 2005, 24:480–484.
- Chang CC, Lin CJ: LIBSVM: A Library for Support Vector Machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA Data Mining Software: An Update. SIGKDD Explorations 2009, 11:10–18.
- Sutherland JJ, O'Brien LA, Weaver DF: A Comparison of Methods for Modeling Quantitative Structure-Activity Relationships. J Med Chem 2004, 47(22):5541–5554.
- Hansen K, Mika S, Schroeter T, Sutter A, ter Laak A, Steger-Hartmann T, Heinrich N, Müller KR: Benchmark Data Set for in Silico Prediction of Ames Mutagenicity. J Chem Inf Model 2009, 49(9):2077–2081.
- Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ: LIBLINEAR: A Library for Large Linear Classification. J Mach Learn Res 2008, 9:1871–1874.
- Fechner N, Jahn A, Hinselmann G, Zell A: Estimation of the applicability domain of kernel-based machine learning models for virtual screening. J Cheminf 2010, 2:2.
- Hinselmann G, Fechner N, Jahn A, Eckert M, Zell A: Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 2010, 74:219–229.
- Hinselmann G, Jahn A, Fechner N, Zell A: Chronic Rat Toxicity Prediction of Chemical Compounds Using Kernel Machines. In Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics: 7th European Conference (EvoBio 2009). Volume 5483. Tübingen, Germany: Springer; 2009:25–36.
- Jahn A, Hinselmann G, Fechner N, Zell A: Optimal Assignment Methods for Ligand-Based Virtual Screening. J Cheminf 2009, 1:14.
- Jahn A, Hinselmann G, Fechner N, Henneges C, Zell A: Probabilistic Modeling of Conformational Space for 3D Machine Learning Approaches. Molecular Informatics 2010, 29(5):441–455.
- Borgwardt KM, Ong CS, Schönauer S, Vishwanathan SVN, Smola AJ, Kriegel HP: Protein function prediction via graph kernels. Bioinformatics 2005, 21:47–56.
- Schneider G, Neidhart W, Giller T, Schmid G: Scaffold-Hopping by Topological Pharmacophore Search: A Contribution to Virtual Screening. Angew Chem, Int Ed 1999, 38(19):2894–2896.
- Gregori-Puigjané E, Mestres J: SHED: Shannon Entropy Descriptors from Topological Feature Distributions. J Chem Inf Model 2006, 46(4):1615–1622.
- Bender A, Mussa HY, Glen RC: Screening for Dihydrofolate Reductase Inhibitors Using MOLPRINT 2D, a Fast Fragment-Based Method Employing the Naive Bayesian Classifier: Limitations of the Descriptor and the Importance of Balanced Chemistry in Training and Test Sets. J Biomol Screen 2005, 10(7):658–666.
- Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen E: Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 2006, 12(17):2111–2120.
- Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P: Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics 2005, 21:359–368.
- Nasr R, Swamidass SJ, Baldi P: Large scale study of multiple-molecule queries. J Cheminf 2009, 1:7.
- Chen J, Swamidass SJ, Dou Y, Baldi P: ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 2005, 21:4133–4139.
- Gasteiger J, Rudolph C, Sadowski J: Automatic Generation of 3D-Atomic Coordinates for Organic Molecules. Tetrahedron Comput Methodol 1992, 3:537–547.
- Schrödinger LLC: Schrödinger MacroModel 9.6. Schrödinger, LLC, New York, NY; 2008.
- Bouckaert RR, Frank E: Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms. In Advances in Knowledge Discovery and Data Mining - Proceedings of 8th Pacific-Asia Conference, PAKDD 2004. Volume 3056. Edited by: Dai H, Srikant R, Zhang C. Springer; 2004:3–12.
- Fechner N, Jahn A, Hinselmann G, Zell A: Atomic Local Neighborhood Flexibility Incorporation into a Structured Similarity Measure for QSAR. J Chem Inf Model 2009, 49(3):549–560.
- Talete srl, Milano, Italy: dragonX 1.4 for Linux (Molecular Descriptor Calculation Software). [http://www.talete.mi.it/]
- Open Access
Journal of Cheminformatics
- Online Date: January 2011
- Chemistry Central