OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data

  • David Perez-Rey
  • Alberto Anguita
  • Jose Crespo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4345)


Within the knowledge discovery in databases (KDD) process, the phases that precede data mining consume most of the time spent analysing data. Few research efforts have addressed these steps compared to data mining, suggesting that new approaches and tools are needed to support data preparation. In this regard, we present in this paper a new methodology for ontology-based KDD that adopts a federated approach to database integration and retrieval. Within this model, an ontology-based system called OntoDataClean has been developed to deal with instance-level integration and data preprocessing. During the development of OntoDataClean, a preprocessing ontology was built to store the information about the required transformations. Various biomedical experiments were carried out, showing that data were correctly transformed using the preprocessing ontology. Although OntoDataClean does not cover every possible data transformation, the results suggest that ontologies are a suitable mechanism to improve quality in the various steps of KDD processes.
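The idea of storing the required transformations as data, separate from the code that applies them, can be illustrated with a minimal Python sketch. This is not the OntoDataClean implementation and does not use its ontology: the `Transformation` class, the three transformation kinds ("missing", "synonym", "scale"), and the sample records are all hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Transformation:
    """Hypothetical stand-in for one entry of a preprocessing ontology."""
    column: str                    # field the transformation applies to
    kind: str                      # "missing", "synonym" or "scale" (assumed kinds)
    params: Dict[str, Any] = field(default_factory=dict)


def apply_one(t: Transformation, value: Any) -> Any:
    """Apply a single declared transformation to one value."""
    if t.kind == "missing":
        # Map source-specific missing-value codes to None.
        return None if value in t.params["codes"] else value
    if value is None:
        # Nothing left to transform once a value is marked missing.
        return None
    if t.kind == "synonym":
        # Replace source-specific labels with a shared term.
        return t.params["mapping"].get(value, value)
    if t.kind == "scale":
        # Linear unit conversion: value * factor + offset.
        return value * t.params["factor"] + t.params.get("offset", 0.0)
    raise ValueError(f"unknown transformation kind: {t.kind}")


def preprocess(rows: List[Dict[str, Any]],
               transformations: List[Transformation]) -> List[Dict[str, Any]]:
    """Return cleaned copies of the rows; the input rows are not modified."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for t in transformations:
            if t.column in new_row:
                new_row[t.column] = apply_one(t, new_row[t.column])
        cleaned.append(new_row)
    return cleaned


if __name__ == "__main__":
    # Illustrative records from an imagined source database.
    rows = [
        {"sex": "M", "weight": 150.0},   # weight recorded in pounds
        {"sex": "?", "weight": -1},      # "?" and -1 used as missing codes
    ]
    transformations = [
        Transformation("sex", "missing", {"codes": ("", "?")}),
        Transformation("sex", "synonym", {"mapping": {"M": "male", "F": "female"}}),
        Transformation("weight", "missing", {"codes": (-1,)}),
        Transformation("weight", "scale", {"factor": 0.45359237}),  # lb to kg
    ]
    for cleaned_row in preprocess(rows, transformations):
        print(cleaned_row)
```

Keeping the transformation descriptions declarative means the same applying code can serve several data sources, each with its own list of entries. An ontology could play the role of the list here, with the added benefit of a shared, formally defined vocabulary.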


Knowledge Discovery in Databases, Preprocessing, Data Cleaning, Database Integration, Ontologies





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David Perez-Rey
    • 1
  • Alberto Anguita
    • 1
  • Jose Crespo
    • 1
  1. Biomedical Informatics Group, Artificial Intelligence Laboratory, School of Computer Science, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid
