Ontology of core data mining entities

Abstract

In this article, we present OntoDM-core, an ontology of core data mining entities. OntoDM-core defines the most essential data mining entities in a three-layered ontological structure comprising a specification, an implementation and an application layer. It provides a representational framework for describing the mining of structured data, and also provides taxonomies of datasets, data mining tasks, generalizations, data mining algorithms and constraints, based on the type of data. OntoDM-core is designed to support a wide range of applications and use cases, such as semantic annotation of data mining algorithms, datasets and results; annotation of QSAR studies in the context of drug discovery investigations; and disambiguation of terms in text mining. The ontology has been thoroughly assessed following established practices in ontology engineering, is fully interoperable with many domain resources and is easy to extend. OntoDM-core is available at http://www.ontodm.com.

Notes

  1. OBO: http://www.obofoundry.org (accessed 1 June 2014).

  2. GO: http://www.geneontology.org (accessed 1 June 2014).

  3. NCI Thesaurus: http://ncit.nci.nih.gov (accessed 1 June 2014).

  4. FMA: http://sig.biostr.washington.edu/projects/fm (accessed 1 June 2014).

  5. SNOMED-CT: http://www.ihtsdo.org/snomed-ct (accessed 1 June 2014).

  6. OBI: http://www.obi-ontology.org (accessed 1 June 2014).

  7. BioPortal: http://bioportal.bioontology.org (accessed 1 June 2014).

  8. BFO: http://www.ifomis.org/bfo (accessed 1 June 2014).

  9. RO: http://obofoundry.org/ro (accessed 1 June 2014).

  10. PMML: http://www.dmg.org/ (accessed 1 June 2014).

  11. DOLCE: http://www.loa.istc.cnr.it/old/DOLCE.html (accessed 1 June 2014).

  12. IAO: http://code.google.com/p/information-artifact-ontology (accessed 1 June 2014).

  13. OWL-DL: http://www.w3.org/TR/owl-guide (accessed 1 June 2014).

  14. Protégé: http://protege.stanford.edu (accessed 1 June 2014).

  15. OBO Foundry principles: http://obofoundry.org/crit.shtml (accessed 1 June 2014).

  16. Table 7 in the Appendix lists all relations used in OntoDM-core.

  17. OntoFox: http://ontofox.hegroup.org (accessed 1 June 2014).

  18. In the remainder of the article, italic formatting denotes an ontology class.

  19. Table 8 in the Appendix lists the typical competency questions OntoDM-core is designed to answer.

  20. Oxford dictionary: http://oxforddictionaries.com/definition/scenario (accessed 1 June 2014).

  21. Clus: http://sourceforge.net/projects/clus/ (accessed 1 June 2014).

  22. Clus OntoDM-core instances: http://ontodm.com/lib/exe/fetch.php?media=clus_instances.owl (accessed 1 June 2014).

  23. Hermit reasoner: http://www.hermit-reasoner.com/ (accessed 1 June 2014).

  24. OntoDM-core inferred segment: http://ontodm.com/lib/exe/fetch.php?media=clus_inferred.owl (accessed 1 June 2014).

  25. BioPortal SPARQL endpoint: http://sparql.bioontology.org (accessed 1 June 2014).

  26. OWL2Query: http://krizik.felk.cvut.cz/km/owl2query/index.html (accessed 1 June 2014).

  27. SPARQLer: http://www.sparql.org/sparql.html (accessed 1 June 2014).

  28. Robot Scientist Project: http://goo.gl/6wazqw and http://goo.gl/Iq6WGS (accessed 1 June 2014).

  29. BAO: http://bioassayontology.org (accessed 1 June 2014).

  30. QSAR Chemoinformatics repository: http://cheminformatics.org/datasets/#qsar (accessed 1 June 2014).

  31. Example MDL Molfile: http://mychem.sourceforge.net/doc/apes06.html (accessed 1 June 2014).

  32. ChEMBL repository: https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL1135798 (accessed 1 June 2014).

  33. A pair of molecules consisting of one chiral molecule and the mirror image of this molecule. The molecules making up an enantiomeric pair rotate the plane of polarized light in equal, but opposite, directions.

  34. ART: http://www.aber.ac.uk/en/cs/research/cb/projects/art/art-corpus (accessed 1 June 2014).

  35. NACTEM centre: http://www.nactem.ac.uk/cheta (accessed 1 June 2014).

  36. ChEBI: http://www.ebi.ac.uk/chebi (accessed 1 June 2014).

  37. FIX: http://obofoundry.org/cgi-bin/detail.cgi?fix (accessed 1 June 2014).

  38. REX: http://www.obofoundry.org/cgi-bin/detail.cgi?id=rex (accessed 1 June 2014).

  39. OGMS: www.obofoundry.org/cgi-bin/detail.cgi?id=OGMS (accessed 1 June 2014).

  40. CEO: http://goo.gl/AUktCK (accessed 1 June 2014).

  41. WSMO: http://www.wsmo.org (accessed 1 June 2014).

  42. DMO Jamboree 2010: http://kt.ijs.si/janez_kranjc/dmo_jamboree/ (accessed 1 June 2014).

Acknowledgments

We would like to acknowledge the support of the European Commission through the project MAESTRA—Learning from Massive, Incompletely annotated, and Structured Data (Grant Number ICT-2013-612944).

Author information

Corresponding author

Correspondence to Panče Panov.

Additional information

Responsible editor: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Železný

Appendix

Table 7 Relations used in the OntoDM-core ontology
Table 8 Examples of OntoDM-core competency questions
Table 9 Scope and structure assessment
Table 10 Naming and vocabulary assessment
Table 11 Documentation and collaboration assessment
Table 12 Availability, maintenance and use assessment
Table 13 Formalization of the OntoDM-core competency questions using the SPARQL-DL language

About this article

Cite this article

Panov, P., Soldatova, L. & Džeroski, S. Ontology of core data mining entities. Data Min Knowl Disc 28, 1222–1265 (2014). https://doi.org/10.1007/s10618-014-0363-0

  • DOI: https://doi.org/10.1007/s10618-014-0363-0

Keywords

  • Ontology of data mining
  • Mining structured data
  • Domain ontology