A Method for Automatic Standardization of Text Attributes without Reference Data Sets

Ciszak, Łukasz

doi:10.1007/978-3-642-00563-3_51

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 59)

Abstract

Data cleaning is an important step in information systems, as low data quality may seriously impact the business processes that rely on the given system. Textual attributes are the most prone to errors, especially during the data input stage. In this article, we propose a novel approach to the automatic correction of text attribute values. The method combines approaches based on textual similarity with those using data distribution features. Unlike the other methods in the area, our approach does not require third-party reference data. Experiments performed on real-world address data show that the method can clean the data effectively and with high accuracy.
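
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: pairing a textual similarity measure with the frequency distribution of the values themselves, so that no external reference dictionary is needed. It is not the author's method. The similarity measure (Python's `difflib` ratio), the function names, and both thresholds (`min_similarity`, `min_freq_ratio`) are assumptions chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_corrections(values, min_similarity=0.8, min_freq_ratio=5.0):
    """Map each rare value to a similar value that occurs far more often.

    Both thresholds are hypothetical illustration parameters, not values
    taken from the paper.
    """
    counts = Counter(values)
    corrections = {}
    for value, freq in counts.items():
        best = None  # (score, candidate frequency, candidate)
        for candidate, cand_freq in counts.items():
            # Only much more frequent values may act as the "correct" form.
            if candidate == value or cand_freq < freq * min_freq_ratio:
                continue
            score = similarity(value, candidate)
            if score >= min_similarity and (
                best is None or (score, cand_freq) > best[:2]
            ):
                best = (score, cand_freq, candidate)
        if best is not None:
            corrections[value] = best[2]
    return corrections


def standardize(values, **kwargs):
    """Return the values with every detected variant replaced."""
    corrections = build_corrections(values, **kwargs)
    return [corrections.get(v, v) for v in values]


# Toy data: two misspellings of a frequent city name and one unrelated city.
cities = ["Warsaw"] * 20 + ["Warsaww", "Warsw"] + ["Gliwice"] * 10
corrections = build_corrections(cities)
cleaned = standardize(cities)
```

In this toy run the two rare misspellings are mapped to the frequent form, while the unrelated value is left alone because nothing similar to it is frequent enough. The frequency condition is what stands in for a reference data set: the data's own dominant values serve as the standard.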

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, Nowowiejska 15/19, 00-665, Warsaw, Poland
Łukasz Ciszak

Editor information

Editors and Affiliations

Silesian University of Technology, Gliwice, Poland
Krzysztof A. Cyran, Stanisław Kozielski, Urszula Stańczyk & Alicja Wakulicz-Deja
University of Manitoba, Winnipeg, Canada
James F. Peters

About this paper

Cite this paper

Ciszak, Ł. (2009). A Method for Automatic Standardization of Text Attributes without Reference Data Sets. In: Cyran, K.A., Kozielski, S., Peters, J.F., Stańczyk, U., Wakulicz-Deja, A. (eds) Man-Machine Interactions. Advances in Intelligent and Soft Computing, vol 59. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00563-3_51

DOI: https://doi.org/10.1007/978-3-642-00563-3_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00562-6
Online ISBN: 978-3-642-00563-3
eBook Packages: Engineering, Engineering (R0)
