Data Preprocessing in Data Mining pp 39-57

Part of the Intelligent Systems Reference Library book series (ISRL, volume 72)

Data Preparation Basic Models

  • Salvador García
  • Julián Luengo
  • Francisco Herrera
Chapter

Abstract

The basic preprocessing steps carried out in Data Mining convert real-world data to a computer-readable format. An overview of this topic is given in Sect. 3.1. When the data comes from several or heterogeneous sources, it must be integrated. This task is discussed in Sect. 3.2. Once the data is computer readable and constitutes a single source, it usually goes through a cleaning phase in which data inaccuracies are corrected. Section 3.3 focuses on this cleaning task. Finally, some Data Mining applications impose particular constraints, such as ranges for the data features, which may require the normalization of the features (Sect. 3.4) or the transformation of the features of the data distribution (Sect. 3.5).
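As a rough illustration of the range constraint mentioned above, the sketch below rescales one numeric feature to a target interval using min-max normalization. It is only one common way to normalize a feature and is not taken from the chapter itself; the function name `min_max_normalize`, its parameters, and the sample values are assumptions made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so that they span [new_min, new_max].

    Hypothetical helper written for illustration only.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to new_min.
        return [new_min for _ in values]
    width = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * width for v in values]


# Illustrative data: one numeric feature with an arbitrary range.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))             # rescaled to [0, 1]
print(min_max_normalize(ages, -1.0, 1.0))  # rescaled to [-1, 1]
```

The smallest observed value maps to `new_min` and the largest to `new_max`, so a single extreme value can compress all the others into a narrow band.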

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Salvador García (1)
  • Julián Luengo (2)
  • Francisco Herrera (3)

  1. Department of Computer Science, University of Jaén, Jaén, Spain
  2. Department of Civil Engineering, University of Burgos, Burgos, Spain
  3. Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
