Data Mining and Knowledge Discovery, Volume 2, Issue 1, pp 9–37

Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

  • Mauricio A. Hernández
  • Salvatore J. Stolfo

Abstract

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
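The multi-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the window-based comparison of neighbouring records after each sort, the toy `is_match` rule standing in for an equational theory, the record fields, and all function names are assumptions made for illustration. Each pass sorts the records on a different key and compares only nearby records; a union-find structure then takes the transitive closure over the pairs found in all passes.

```python
def find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    # Merge two clusters; the smaller index becomes the representative.
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def is_match(r1, r2):
    # Hypothetical stand-in for a domain-dependent "equational theory":
    # same last name and first initial, or an identical address string.
    same_name = (r1["last"].lower() == r2["last"].lower()
                 and r1["first"][:1].lower() == r2["first"][:1].lower())
    same_addr = r1["addr"].lower() == r2["addr"].lower()
    return same_name or same_addr


def one_pass(records, key, window):
    # Sort record indices on one key, then compare each record only with
    # the next (window - 1) records in sorted order.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def merge_purge(records, keys, window=2):
    # Run one pass per sort key, then close the matched pairs transitively.
    parent = list(range(len(records)))
    for key in keys:
        for a, b in one_pass(records, key, window):
            union(parent, a, b)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Invented sample data: records 0, 1 and 2 describe the same person.
records = [
    {"first": "Maria", "last": "Lopez", "addr": "12 Oak St"},
    {"first": "M.", "last": "Lopez", "addr": "12 Oak Street"},
    {"first": "Maria", "last": "Lopes", "addr": "12 Oak St"},
    {"first": "John", "last": "Smith", "addr": "9 Elm Rd"},
]


def by_name(r):
    return (r["last"].lower(), r["first"].lower())


def by_addr(r):
    return r["addr"].lower()


print(merge_purge(records, [by_name]))           # name pass only
print(merge_purge(records, [by_addr]))           # address pass only
print(merge_purge(records, [by_name, by_addr]))  # both passes combined
```

On this sample, the name pass alone links records 0 and 1, the address pass alone links records 0 and 2, and only the combination places all three in one cluster, even though records 1 and 2 never match each other directly. Small windows over several keys keep the number of comparisons per pass low, which is the cost argument the abstract makes for combining passes.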

Keywords: data cleaning, data cleansing, duplicate elimination, semantic integration

Copyright information

© Kluwer Academic Publishers 1998

Authors and Affiliations

  • Mauricio A. Hernández (1)
  • Salvatore J. Stolfo (1)
  1. Department of Computer Science, Columbia University, New York
