Data Cleansing: A Prelude to Knowledge Discovery

  • Jonathan I. Maletic
  • Andrian Marcus
Chapter

Summary

This chapter analyzes the problem of data cleansing and the identification of potential errors in data sets. It surveys the differing views of data cleansing and gives a brief overview of existing data cleansing tools. It then presents a general framework for the data cleansing process, along with a set of general methods that can be used to address the problem. The applicable methods include statistical outlier detection, pattern matching, clustering, and data mining techniques. Experimental results from applying these methods to a real-world data set are also given. Finally, the chapter discusses the research directions needed to further address the data cleansing problem.

Key words

Data Cleansing; Data Cleaning; Data Mining; Ordinal Rules; Data Quality; Error Detection; Ordinal Association Rules

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Jonathan I. Maletic (1)
  • Andrian Marcus (2)
  1. Kent State University, Kent, USA
  2. Wayne State University, Detroit, USA