Abstract
Data quality assessment consists of several phases, the first of which is data profiling. The result of this step is a set of up-to-date metadata describing the examined data set. We present a method for the automatic discovery of reference data for textual attributes. The method combines a textual similarity approach with the characteristics of the attribute value distribution. It can discover the correct reference data values even when the data contain a large number of impurities. Experiments performed on real address data show that the method can effectively discover the current reference data.
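The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining textual similarity with the value distribution; it is not the author's method. It assumes that frequent values are likely reference values and that rare values similar to a frequent one are impurities of it. The function names, the `sim_threshold` and `min_count` parameters, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in textual similarity in [0, 1] (difflib ratio; an assumption,
    not the measure used in the paper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_reference_values(values, sim_threshold=0.8, min_count=2):
    """Hypothetical sketch: visit distinct values from most to least frequent.

    A value that is similar enough to an already accepted reference value is
    treated as an impurity of it; otherwise it becomes a new reference value,
    provided it occurs at least `min_count` times.
    """
    counts = Counter(values)
    reference = []  # accepted reference values, most frequent first
    mapping = {}    # impure value -> reference value it was attached to
    # Sort by descending frequency; ties are broken alphabetically.
    for value, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        best = max(reference, key=lambda r: similarity(value, r), default=None)
        if best is not None and similarity(value, best) >= sim_threshold:
            mapping[value] = best
        elif count >= min_count:
            reference.append(value)
        # Rare values with no similar reference value are left unresolved.
    return reference, mapping


# Made-up city names standing in for an address attribute.
cities = (["Warszawa"] * 5 + ["Warszwa", "Warzsawa"]
          + ["Kraków"] * 4 + ["Krakow"]
          + ["Gdańsk"] * 3 + ["Gdansk", "Xyz"])
reference, mapping = discover_reference_values(cities)
```

Visiting values in descending frequency order means a misspelling can only be attached to a value that is at least as frequent as itself, which is one simple way to let the distribution decide which of two similar strings is the reference.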
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ciszak, L. (2009). A Method for Automatic Discovery of Reference Data. In: Chien, BC., Hong, TP., Chen, SM., Ali, M. (eds) Next-Generation Applied Intelligence. IEA/AIE 2009. Lecture Notes in Computer Science, vol 5579. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02568-6_81
DOI: https://doi.org/10.1007/978-3-642-02568-6_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02567-9
Online ISBN: 978-3-642-02568-6
eBook Packages: Computer Science, Computer Science (R0)