Abstract
The basic preprocessing steps carried out in Data Mining convert real-world data to a computer-readable format. Section 3.1 gives an overview of this topic. When there are several or heterogeneous sources of data, the data must be integrated; this task is discussed in Sect. 3.2. Once the data is computer readable and constitutes a unique source, it usually goes through a cleaning phase in which data inaccuracies are corrected. Section 3.3 focuses on this task. Finally, some Data Mining applications impose particular constraints, such as ranges for the data features, which may require normalizing the features (Sect. 3.4) or transforming the features of the data distribution (Sect. 3.5).
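As an illustration of the range constraint mentioned above, the following minimal sketch rescales a numeric feature into a target interval using min-max normalization. It is one common normalization scheme, chosen here as an assumption; the abstract does not say which methods Sect. 3.4 covers. The function name `min_max_normalize`, its parameters, and the sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a list of numbers into [new_min, new_max].

    Illustrative helper only (hypothetical name, not taken from the chapter).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to new_min.
        return [new_min for _ in values]
    span = hi - lo
    # Map lo -> new_min and hi -> new_max, preserving relative distances.
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


# Made-up sample feature, for illustration only.
ages = [18, 30, 42, 66]
normalized = min_max_normalize(ages)
```

With the default arguments the smallest value maps to 0 and the largest to 1; passing other bounds, such as `new_min=-1.0` and `new_max=1.0`, targets a different range.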
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
García, S., Luengo, J., Herrera, F. (2015). Data Preparation Basic Models. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-319-10247-4_3
DOI: https://doi.org/10.1007/978-3-319-10247-4_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10246-7
Online ISBN: 978-3-319-10247-4
eBook Packages: Engineering, Engineering (R0)