Abstract
Most researchers agree that the quality of real-life data archives is often very poor, and this makes the definition and realisation of automatic techniques for cleansing data a relevant issue. In such a scenario, the Universal Cleansing framework has recently been proposed to automatically identify the most accurate cleansing alternatives among those synthesised through model-checking techniques. However, the identification of some values of the cleansed instances still relies on the rules defined by domain-experts and common practice, due to the difficulty to automatically derive them (e.g. the date value of an event to be added).
In this paper we extend this framework by including well-known machine learning algorithms - trained on the data recognised as consistent - to identify the information that the model based cleanser couldn’t produce. The proposed framework has been implemented and successfully evaluated on a real dataset describing the working careers of a population.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Here the term believability is intended as “the extent to which data are accepted or regarded as true, real and credible” [48].
- 2.
The term “state” here is considered in terms of a value assignment to a set of finite-domain state variables.
References
Abello, J., Pardalos, P.M., Resende, M.G.: Handbook of Massive Data Sets, vol. 4. Springer, US (2002)
Bertossi, L.: Consistent query answering in databases. ACM Sigmod Rec. 35(2), 68–76 (2006)
Bishop, C.M., et al.: Pattern Recognition and Machine Learning, vol. 1. Springer, New York (2006)
Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1), 245–271 (1997)
Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Inconsistency knowledge discovery for longitudinal data management: a model-based approach. In: Holzinger, A., Pasi, G. (eds.) HCI-KDD 2013. LNCS, vol. 7947, pp. 183–194. Springer, Heidelberg (2013)
Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Planning meets data cleansing. In: The 24th International Conference on Automated Planning and Scheduling (ICAPS), pp. 439–443. AAAI (2014)
Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: A policy-based cleansing and integration framework for labour and healthcare data. In: Holzinger, A., Jurisica, I. (eds.) Knowledge Discovery and Data Mining. LNCS, vol. 8401, pp. 141–168. Springer, Heidelberg (2014)
Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M.: Towards data cleansing via planning. Intelligenza Artificiale 8(1), 57–69 (2014)
Chomicki, J., Marcinkowski, J.: Minimal-change integrity maintenance using tuple deletions. Inf. Comput. 197(1), 90–121 (2005)
Chomicki, J., Marcinkowski, J.: On the computational complexity of minimal-change integrity maintenance in relational databases. In: Bertossi, L., Hunter, A., Schaub, T. (eds.) Inconsistency Tolerance. LNCS, vol. 3300, pp. 119–150. Springer, Heidelberg (2005)
Clemente, P., Kaba, B., Rouzaud-Cornabas, J., Alexandre, M., Aujay, G.: SPTrack: visual analysis of information flows within SELinux policies and attack logs. In: Huang, R., Ghorbani, A.A., Pasi, G., Yamaguchi, T., Yen, N.Y., Jin, B. (eds.) AMT 2012. LNCS, vol. 7669, pp. 596–605. Springer, Heidelberg (2012)
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 315–326. VLDB Endowment (2007)
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Tang, N.: Nadeef: a commodity data cleaning system. In: Ross, K.A., Srivastava, D., Papadias, D. (eds.) SIGMOD Conference, pp. 541–552. ACM (2013)
De Silva, V., Carlsson, G.: Topological estimation using witness complexes. In: Proceedings of the First Eurographics Conference on Point-Based Graphics, pp. 157–166. Eurographics Association (2004)
Devaraj, S., Kohli, R.: Information technology payoff in the health-care industry: a longitudinal study. J. Manag. Inf. Syst. 16(4), 41–68 (2000)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. In: Proceedings of the VLDB Endowment, vol. 3(1–2), pp. 173–184 (2010)
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The kdd process for extracting useful knowledge from volumes of data. Commun. ACM 39(11), 27–34 (1996)
Fellegi, I.P., Holt, D.: A systematic approach to automatic edit and imputation. J. Am. Stat. Assoc. 71(353), 17–35 (1976)
Fisher, C., LaurĂa, E., Chengalur-Smith, S., Wang, R.: Introduction to Information Quality. AuthorHouse, USA (2012)
Freitag, D.: Machine learning for information extraction in informal domains. Mach. Learn. 39(2–3), 169–202 (2000)
Hansen, P., Järvelin, K.: Collaborative information retrieval in an information-intensive domain. Inf. Process. Manag. 41(5), 1101–1119 (2005)
Holzinger, A.: On knowledge discovery and interactive intelligent visualization of biomedical data - challenges in human-computer interaction & biomedical informatics. In: Helfert, M., Francalanci, C., Filipe, J. (eds.) DATA. SciTePress (2012)
Holzinger, A., Bruschi, M., Eder, W.: On interactive data visualization of physiological low-cost-sensor data with focus on mental stress. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 469–480. Springer, Heidelberg (2013)
Holzinger, A., Yildirim, P., Geier, M., Simonic, K.M.: Quality-based knowledge discovery from medical text on the web. In: Pasi et al. [38], pp. 145–158
Holzinger, A., Zupan, M.: Knodwat: a scientific framework application for testing knowledge discovery methods for the biomedical domain. BMC Bioinf. 14, 191 (2013)
Kapovich, I., Myasnikov, A., Schupp, P., Shpilrain, V.: Generic-case complexity, decision problems in group theory, and random walks. J. Algebra 264(2), 665–694 (2003)
Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, IJCAI 1995, vol. 2, pp. 1137–1143. Morgan Kaufmann Publishers Inc., San Francisco (1995). http://dl.acm.org/citation.cfm?id=1643031.1643047
Kolahi, S., Lakshmanan, L.V.: On approximating optimum repairs for functional dependency violations. In: Proceedings of the 12th International Conference on Database Theory, pp. 53–62. ACM (2009)
Lovaglio, P.G., Mezzanzanica, M.: Classification of longitudinal career paths. Qual. Quant. 47(2), 989–1008 (2013)
Madnick, S.E., Wang, R.Y., Lee, Y.W., Zhu, H.: Overview and framework for data and information quality research. J. Data Inf. Qual. 1(1), 2:1–2:22 (2009)
Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Data Quality through Model Checking Techniques. In: Gama, J., Bradley, E., Hollmén, J. (eds.) IDA 2011. LNCS, vol. 7014, pp. 270–281. Springer, Heidelberg (2011)
Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Data quality sensitivity analysis on aggregate indicators. In: Helfert, M., Francalanci , C., Filipe, J. (eds.) DATA 2012-The International Conference on Data Technologies and Applications, pp. 97-108. SciTePress (2012). 10.5220/0004040300970108
Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Automatic synthesis of data cleansing activities. In: Helfert, M., Francalanci, C. (eds.) The 2nd International Conference on Data Management Technologies and Applications (DATA), pp. 138–149. Scitepress (2013)
Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: Improving data cleansing accuracy: a model-based approach. In: The 3rd International Conference on Data Technologies and Applications, pp. 189–201. Insticc (2014)
Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: A model-based evaluation of data quality activities in KDD. Inf. Process. Manag. 51(2), 144–166 (2015). doi:10.1016/j.ipm.2014.07.007
Mezzanzanica, M., Boselli, R., Cesarini, M., Mercorio, F.: A model-based approach for developing data cleansing solutions. ACM J. Data Inf. Qual. 5(4), 1–28 (2015). doi:10.1145/2641575
Ng, A.Y.: Feature selection, l 1 vs. l 2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 78. ACM (2004)
de Oliveira, M.C.F., Levkowitz, H.: From visual data exploration to visual data mining: a survey. IEEE Trans. Vis. Comput. Graph. 9(3), 378–394 (2003)
Pasi, G., Bordogna, G., Jain, L.C.: An introduction to quality issues in the management of web information. In: Quality Issues in the Management of Web Information [38], pp. 1–3
Pasi, G., Bordogna, G., Jain, L.C. (eds.): Quality Issues in the Management of Web Information. Intelligent Systems Reference Library, vol. 50. Springer, Heidelberg (2013)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Penna, G.D., Intrigila, B., Magazzeni, D., Mercorio, F.: UPMurphi: a tool for universal planning on pddl+ problems. In: Proceedings of the 19th International Conference on Automated Planning and Scheduling (ICAPS 2009), pp. 106–113. AAAI Press, Thessaloniki, Greece (2009). http://aaai.org/ocs/index.php/ICAPS/ICAPS09/paper/view/707
Penna, G.D., Magazzeni, D., Mercorio, F.: A universal planning system for hybrid domains. Appl. Intell. 36(4), 932–959 (2012). doi:10.1007/s10489-011-0306-z
Prinzie, A., Van den Poel, D.: odeling complex longitudinal consumer behavior with dynamic bayesian networks: an acquisition pattern analysis application. J. Intell. Inf. Syst. 36(3), 283–304 (2011)
Rahm, E., Do, H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Vardi, M.: Fundamentals of dependency theory. In: Borger, E. (ed.) Trends in Theoretical Computer Science, pp. 171–224. Computer Science Press, Rockville (1987)
Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1996)
Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: Proceedings of the 2013 International Conference on Management of Data, pp. 553–564. ACM (2013)
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M. (2015). Accurate Data Cleansing through Model Checking and Machine Learning Techniques. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2014. Communications in Computer and Information Science, vol 178. Springer, Cham. https://doi.org/10.1007/978-3-319-25936-9_5
Download citation
DOI: https://doi.org/10.1007/978-3-319-25936-9_5
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25935-2
Online ISBN: 978-3-319-25936-9
eBook Packages: Computer ScienceComputer Science (R0)