A Method for Automatic Standardization of Text Attributes without Reference Data Sets

  • Conference paper
Man-Machine Interactions

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 59)

Abstract

Data cleaning is an important step in information systems, as low-quality data may seriously impact the business processes that rely on a given system. Textual attributes are the most prone to errors, especially during the data input stage. In this article, we propose a novel approach to the automatic correction of text attribute values. The method combines approaches based on textual similarity with those using data distribution features. Unlike all other methods in the area, our approach does not require third-party reference data. Experiments performed on real-world address data show that the method can clean the data effectively and with high accuracy.
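
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of combining textual similarity with the distribution of values, not the author's method. It assumes that a rare value which closely resembles a much more frequent value in the same column is probably a misspelling of it. The function name `standardize`, the thresholds `min_similarity` and `min_ratio`, and the sample city names are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def standardize(values, min_similarity=0.8, min_ratio=5.0):
    """Map each value to a similar, much more frequent value if one exists.

    Assumption (not from the paper): a value is treated as an error when
    another value is at least `min_ratio` times more frequent and their
    string similarity is at least `min_similarity`.
    """
    counts = Counter(values)
    mapping = {}
    for value, count in counts.items():
        best, best_count = value, count
        for candidate, candidate_count in counts.items():
            if candidate == value:
                continue
            # Distribution feature: the candidate must dominate in frequency.
            if candidate_count < min_ratio * count:
                continue
            # Textual similarity: ratio in [0, 1] from difflib.
            similarity = SequenceMatcher(None, value, candidate).ratio()
            if similarity >= min_similarity and candidate_count > best_count:
                best, best_count = candidate, candidate_count
        mapping[value] = best
    return mapping


# Hypothetical address column: no external reference list of cities is used.
cities = ["Warszawa"] * 10 + ["Warszwa", "Warzsawa"] + ["Krakow"] * 6 + ["Krakw"]
mapping = standardize(cities)
cleaned = [mapping[c] for c in cities]
```

In this sketch the frequent values act as the reference set, which is one way to avoid depending on third-party reference data; the thresholds would need tuning on real data.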


Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ciszak, Ł. (2009). A Method for Automatic Standardization of Text Attributes without Reference Data Sets. In: Cyran, K.A., Kozielski, S., Peters, J.F., Stańczyk, U., Wakulicz-Deja, A. (eds) Man-Machine Interactions. Advances in Intelligent and Soft Computing, vol 59. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00563-3_51

  • DOI: https://doi.org/10.1007/978-3-642-00563-3_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00562-6

  • Online ISBN: 978-3-642-00563-3

  • eBook Packages: Engineering, Engineering (R0)
