Data Preparation Basic Models

García, Salvador; Luengo, Julián; Herrera, Francisco

doi:10.1007/978-3-319-10247-4_3

Chapter
First Online: 01 January 2014

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 72)

Abstract

The basic preprocessing steps carried out in Data Mining convert real-world data to a computer-readable format. An overview of this topic is given in Sect. 3.1. When the data come from several or heterogeneous sources, they must first be integrated. This task is discussed in Sect. 3.2. Once the data are computer readable and constitute a single source, they usually go through a cleaning phase in which data inaccuracies are corrected. Section 3.3 focuses on this task. Finally, some Data Mining applications impose particular constraints, such as ranges for the data features, which may imply the normalization of the features (Sect. 3.4) or the transformation of the features of the data distribution (Sect. 3.5).
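To illustrate the kind of range constraint the abstract mentions, min-max normalization is one common way to rescale a feature into a fixed interval. The sketch below is not taken from the chapter; the function name `min_max_normalize`, its parameters, and the sample values are hypothetical and only show the idea under the assumption of a single numeric feature.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a list of numbers into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to rescale, so map every value to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]


# Hypothetical feature values (e.g. ages) rescaled into [0, 1].
ages = [18, 30, 42, 66]
print(min_max_normalize(ages))
```

The constant-feature branch is a design choice made for this sketch; other conventions, such as mapping to the midpoint of the interval, are equally plausible.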

Author information

Authors and Affiliations

Department of Computer Science, University of Jaén, Jaén, Spain
Salvador García
Department of Civil Engineering, University of Burgos, Burgos, Spain
Julián Luengo
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Francisco Herrera

Corresponding author

Correspondence to Salvador García.

About this chapter

Cite this chapter

García, S., Luengo, J., Herrera, F. (2015). Data Preparation Basic Models. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-319-10247-4_3

DOI: https://doi.org/10.1007/978-3-319-10247-4_3
Published: 31 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10246-7
Online ISBN: 978-3-319-10247-4
eBook Packages: Engineering, Engineering (R0)
