Abstract
Data cleaning is an important step in information systems, as low-quality data may seriously impair the business processes that rely on a given system. Textual attributes are the most prone to errors, especially during the data input stage. In this article, we propose a novel approach to the automatic correction of text attribute values. The method combines approaches based on textual similarity with those using features of the data distribution. Unlike other methods in the area, our approach does not require third-party reference data. Experiments performed on real-world address data show that the method can clean the data effectively and with high accuracy.
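The general idea of combining textual similarity with data distribution can be illustrated with a minimal sketch. This is not the algorithm from the paper, which the abstract does not specify. It only assumes that a rare value that closely resembles a much more frequent value in the same column is probably a misspelling of it, so no external reference dictionary is needed. The function names, the use of `difflib` as the similarity measure, the thresholds `min_similarity` and `min_ratio`, and the sample city names are all hypothetical choices made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive textual similarity in [0, 1].

    difflib's ratio is an assumed stand-in; any string similarity
    measure could be substituted here.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_corrections(values, min_similarity=0.8, min_ratio=5.0):
    """Map each rare value to a similar, much more frequent value.

    A candidate qualifies only if it occurs at least `min_ratio` times
    as often as the value (distribution feature) and is at least
    `min_similarity` similar to it (textual feature).
    """
    counts = Counter(values)
    corrections = {}
    for value, count in counts.items():
        best, best_score = None, 0.0
        for candidate, cand_count in counts.items():
            if candidate == value or cand_count < min_ratio * count:
                continue
            score = similarity(value, candidate)
            if score >= min_similarity and score > best_score:
                best, best_score = candidate, score
        if best is not None:
            corrections[value] = best
    return corrections


def standardize(values, **kwargs):
    """Return the values with the inferred corrections applied."""
    corrections = build_corrections(values, **kwargs)
    return [corrections.get(v, v) for v in values]


if __name__ == "__main__":
    # Hypothetical sample column: frequent values plus a few typos.
    cities = (["Warszawa"] * 10 + ["Warszwa", "Warzsawa"]
              + ["Krakow"] * 8 + ["Krakw"])
    print(build_corrections(cities))
```

In this sketch the frequent values in the column itself play the role that a reference data set would otherwise play. The thresholds trade recall against the risk of merging two legitimate but similar values, so they would need tuning on real data.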
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ciszak, Ł. (2009). A Method for Automatic Standardization of Text Attributes without Reference Data Sets. In: Cyran, K.A., Kozielski, S., Peters, J.F., Stańczyk, U., Wakulicz-Deja, A. (eds) Man-Machine Interactions. Advances in Intelligent and Soft Computing, vol 59. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00563-3_51
DOI: https://doi.org/10.1007/978-3-642-00563-3_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00562-6
Online ISBN: 978-3-642-00563-3
eBook Packages: Engineering, Engineering (R0)