A Method for Automatic Discovery of Reference Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5579)

Abstract

Data quality assessment consists of several phases, the first of which is data profiling. The result of this step is a set of up-to-date metadata describing the examined data set. We present a method for the automatic discovery of reference data for textual attributes. The method combines a textual similarity approach with the characteristics of the attribute value distribution. It can discover the correct reference data values even when the data contain a large number of impurities. Experiments performed on real address data show that the method can effectively discover the current reference data.
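
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of pairing textual similarity with value frequencies, not the paper's method. It assumes that frequent values are likely to be correct and that a rarer value closely resembling a more frequent one is a misspelling of it. The function name `discover_reference_values`, the thresholds `min_similarity` and `min_count`, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] computed by the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def discover_reference_values(values, min_similarity=0.8, min_count=2):
    """Return candidate reference values for one textual attribute.

    Hypothetical heuristic (not taken from the paper): walk the distinct
    values from most to least frequent; skip a value that is close to an
    already accepted candidate, and accept the rest if frequent enough.
    """
    # Normalise case and surrounding whitespace before counting.
    counts = Counter(v.strip().lower() for v in values)
    references = []
    for value, count in counts.most_common():
        # A value similar to a more frequent candidate is treated as
        # a probable misspelling of that candidate.
        if any(similarity(value, ref) >= min_similarity for ref in references):
            continue
        # Rare values with no close neighbour are treated as noise.
        if count >= min_count:
            references.append(value)
    return references


cities = (
    ["Warszawa"] * 5 + ["Warszwa", "Warzsawa"]
    + ["Krakow"] * 4 + ["Krakw"]
    + ["Gdansk"] * 3 + ["xyz"]
)
candidates = discover_reference_values(cities)
print(candidates)
```

On this made-up sample the three frequent spellings are kept as candidates, while the rare variants and the isolated value are dropped. Both thresholds would need tuning against real data.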

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ciszak, L. (2009). A Method for Automatic Discovery of Reference Data. In: Chien, BC., Hong, TP., Chen, SM., Ali, M. (eds) Next-Generation Applied Intelligence. IEA/AIE 2009. Lecture Notes in Computer Science, vol 5579. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02568-6_81

  • DOI: https://doi.org/10.1007/978-3-642-02568-6_81

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02567-9

  • Online ISBN: 978-3-642-02568-6

  • eBook Packages: Computer Science, Computer Science (R0)
