A Data Cleaning Framework Based on User Feedback

  • Hui Xie
  • Hongzhi Wang
  • Jianzhong Li
  • Hong Gao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7923)

Abstract

In this paper, we present the design of a data cleaning framework that combines data quality rules (CFDs, CINDs and MDs) with user feedback through an interactive process. First, to generate candidate repairs for each potentially dirty attribute, we propose an optimization model based on a genetic algorithm. We then build a Bayesian machine learning model with several committees to predict the correctness of each repair, and rank the repairs by uncertainty score to improve the learned model. While users inspect the suggestions, their feedback is used to decide whether the model is accurate. Finally, our experiments on real-world datasets show a significant improvement in data quality.
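The ranking step can be illustrated with a minimal sketch. It is not the authors' model: the abstract does not define the uncertainty score, so the sketch assumes vote entropy, a common query-by-committee measure, in place of the Bayesian predictor. The names `vote_entropy` and `rank_by_uncertainty`, the repair fields `similarity` and `rule_support`, and the threshold-based committee members are all hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (in bits) of the committee's votes; 0.0 means unanimous."""
    total = len(votes)
    counts = Counter(votes)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def rank_by_uncertainty(candidates, committee):
    """Return (score, repair) pairs, most contested repairs first."""
    scored = []
    for repair in candidates:
        # Each committee member votes True ("repair looks correct") or False.
        votes = [member(repair) for member in committee]
        scored.append((vote_entropy(votes), repair))
    # Sort on the score only, so the repair dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical committee: simple threshold rules standing in for learned models.
committee = [
    lambda r: r["similarity"] > 0.5,
    lambda r: r["similarity"] > 0.7,
    lambda r: r["similarity"] > 0.9,
    lambda r: r["rule_support"] >= 2,
]

# Hypothetical candidate repairs (illustrative values, not data from the paper).
candidates = [
    {"cell": ("t1", "city"), "value": "Harbin", "similarity": 0.95, "rule_support": 3},
    {"cell": ("t2", "zip"), "value": "150001", "similarity": 0.8, "rule_support": 1},
    {"cell": ("t3", "city"), "value": "Harbn", "similarity": 0.2, "rule_support": 0},
]

ranked = rank_by_uncertainty(candidates, committee)
```

In this sketch the repair for `("t2", "zip")` splits the committee evenly, so it is ranked first and would be the first one shown to the user. Asking about contested repairs first is one plausible way for user feedback to improve the learned model fastest.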

Keywords

data clean, user feedback, Bayesian decision, data quality rules

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Hui Xie (1)
  • Hongzhi Wang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)
  1. Harbin Institute of Technology, China
