Summary
This chapter analyzes the problem of data cleansing and the identification of potential errors in data sets. The differing views of data cleansing are surveyed, and a brief overview of existing data cleansing tools is given. A general framework of the data cleansing process is presented, along with a set of general methods that can be used to address the problem. The applicable methods include statistical outlier detection, pattern matching, clustering, and data mining techniques. Experimental results of applying these methods to a real-world data set are also given. Finally, research directions necessary to further address the data cleansing problem are discussed.
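As a rough illustration of the first method named above, statistical outlier detection can be sketched as flagging field values that lie more than k standard deviations from the field mean. This is a minimal sketch, not the chapter's own implementation: the function name flag_outliers, the threshold parameter k, and the sample values are assumptions made for this example.

```python
from statistics import mean, stdev


def flag_outliers(values, k=3.0):
    """Return the indices of values lying more than k sample standard
    deviations from the mean (hypothetical helper, for illustration only)."""
    if len(values) < 2:
        # The sample standard deviation is undefined for fewer than two values.
        return []
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be flagged.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


# Made-up numeric field with one suspicious entry at index 9.
ages = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]
suspects = flag_outliers(ages, k=2.0)
```

A flagged index marks a candidate error for review rather than a confirmed one. In small samples a single extreme value inflates the standard deviation, so a lower k or a more robust statistic may be needed.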
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Maletic, J.I., Marcus, A. (2009). Data Cleansing: A Prelude to Knowledge Discovery. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_2
DOI: https://doi.org/10.1007/978-0-387-09823-4_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4
eBook Packages: Computer Science (R0)