A Method for Automatic Discovery of Reference Data

Ciszak, Lukasz

doi:10.1007/978-3-642-02568-6_81

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5579)

Abstract

Data quality assessment consists of several phases, the first of which is data profiling. This step yields the most current metadata describing the examined data set. We present a method for the automatic discovery of reference data for textual attributes. The method combines a textual similarity approach with the characteristics of the attribute value distribution. It can discover the correct reference data values even when the data contain a large number of impurities. Experiments on real address data show that the method can effectively discover the current reference data.
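The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the general idea of combining textual similarity with the value distribution, not the paper's method. It assumes that frequent values are likely to be correct and that rare values textually close to a frequent one are its misspellings. The function names, the `min_similarity` and `min_count` thresholds, and the choice of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_reference_values(values, min_similarity=0.8, min_count=2):
    """Map each candidate reference value to the variants assigned to it.

    Values are visited from most to least frequent. A value that is
    similar enough to an already accepted reference value is treated as
    its variant; otherwise it becomes a new reference value if it occurs
    at least `min_count` times. Both thresholds are hypothetical.
    """
    counts = Counter(values)
    reference = {}
    for value, count in counts.most_common():
        # Closest reference value accepted so far (None if there is none yet).
        best = max(reference, key=lambda r: similarity(value, r), default=None)
        if best is not None and similarity(value, best) >= min_similarity:
            reference[best].append(value)
        elif count >= min_count:
            reference[value] = []
        # Rare values unlike any reference value are left unresolved.
    return reference


# Hypothetical city attribute with a few misspellings and one outlier.
cities = (
    ["Warszawa"] * 5 + ["Warszwa", "Warsawa"]
    + ["Kraków"] * 4 + ["Krakow"]
    + ["Gdańsk"] * 3 + ["Xyz"]
)
result = discover_reference_values(cities)
for ref, variants in result.items():
    print(ref, variants)
```

In this sketch the frequency ordering stands in for the "attribute value distribution": a misspelling can only be absorbed by a value at least as frequent as itself, so a dominant correct spelling is never merged into a rare variant.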

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665, Warsaw, Poland
Lukasz Ciszak

Editor information

Editors and Affiliations

Department of Computer Science and Information Engineering, National University of Tainan, 700, Tainan, Taiwan
Been-Chian Chien
Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan
Shyi-Ming Chen
Department of Computer Science, Texas State University-San Marcos, 601 University Drive, 78666-4616, San Marcos, TX, USA
Moonis Ali

About this paper

Cite this paper

Ciszak, L. (2009). A Method for Automatic Discovery of Reference Data. In: Chien, BC., Hong, TP., Chen, SM., Ali, M. (eds) Next-Generation Applied Intelligence. IEA/AIE 2009. Lecture Notes in Computer Science, vol 5579. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02568-6_81

DOI: https://doi.org/10.1007/978-3-642-02568-6_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02567-9
Online ISBN: 978-3-642-02568-6
eBook Packages: Computer Science, Computer Science (R0)
