Knowledge Management for Model Driven Data Cleaning of Very Large Database

  • Yucong Duan
  • Roger Lee
Part of the Studies in Computational Intelligence book series (SCI, volume 443)


From a knowledge management perspective, we explore data cleaning of very large databases, with a focus on semantically rich data and linked data. We identify four aspects of complexity which, if not explicitly addressed and fully managed, will hinder both recognizing and attaining the best result: (a) the inconsistency of solution knowledge due to its partial applicability across multiple concerns; (b) the side effects introduced when solution knowledge is applied in pursuit of precision where multiple semantics exist; (c) unconscious ignorance of the implicit weights of some parameters in value computation; and (d) holism-based reasoning, which in some situations cannot be replaced by simplification. After analyzing the state of the art, we propose an ongoing Model Driven Engineering (MDE) based knowledge management platform for identifying, refining, organizing and evaluating related variants and solutions with mitigated complexity.


False Negative; Knowledge Management; Data Cleaning; Data Clean; Semantic State
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Dipartimento Informatica Sistemistica e Comunicazione, University of Milano-Bicocca, Milano, Italy
  2. Software Engineering and Information Technology Institute, Central Michigan University, Mount Pleasant, U.S.A.
