Syntactical Heuristics for the Open Data Quality Assessment and Their Applications

  • Donato Pirozzi
  • Vittorio Scarano
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 339)


Open Government Data initiatives are valuable for promoting transparency, accountability, and openness. The expectation is that they increase participation by engaging citizens, non-profit organisations, and companies in reusing Open Data (OD). A potential barrier to the exploitation of OD and to the engagement of the target audience is the low quality of available datasets [3, 14, 16]. Non-technical consumers are often unaware that data may have quality issues, taking for granted that datasets can be used immediately without any further manipulation. In reality, in order to reuse data, for instance to create visualisations, they need to perform data cleaning, which requires time, resources, and proper skills. This reduces the chance of involving citizens.

This paper tackles the quality barrier of raw tabular datasets (i.e. CSV), a popular format for Governmental Open Data (three stars in Tim Berners-Lee's five-star scheme). The objective is to increase awareness and to support data cleaning operations, both for Public Administrations (PAs), so that they produce better-quality Open Data, and for non-technical data consumers, so that they can reuse datasets. DataChecker is an open-source, modular JavaScript library, shared with the community and available on GitHub, that takes a tabular dataset as input and generates a machine-readable report based on data type inferencing (a data profiling technique). Based on this report, the Social Platform for Open Data (SPOD) provides quality cleaning suggestions to both PAs and end-users.
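To make the idea of a type-inference-based quality report concrete, the following is a minimal sketch in JavaScript. It is not the DataChecker API: the function names (`inferCellType`, `parseCsv`, `profileCsv`), the set of recognised types, the report layout, and the sample rows are all assumptions made for illustration. The sketch classifies each cell with purely syntactic patterns, takes the most frequent non-empty type as the column type, and lists the rows that deviate from it.

```javascript
// Illustrative sketch only: NOT the DataChecker API. All names are hypothetical.

// Ordered syntactic checks; the first matching pattern wins.
const TYPE_PATTERNS = [
  ["empty", /^\s*$/],
  ["integer", /^[+-]?\d+$/],
  ["number", /^[+-]?\d+\.\d+$/],
  ["date", /^\d{4}-\d{2}-\d{2}$/], // ISO-like dates only, by assumption
];

// Classify a single cell by its syntax; anything unmatched is "text".
function inferCellType(value) {
  for (const [type, pattern] of TYPE_PATTERNS) {
    if (pattern.test(value)) return type;
  }
  return "text";
}

// Naive CSV split: assumes no quoted fields and no embedded separators.
function parseCsv(text, separator = ",") {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => line.split(separator).map((cell) => cell.trim()));
}

// Build a machine-readable report with one entry per column.
function profileCsv(text, separator = ",") {
  const [header, ...rows] = parseCsv(text, separator);
  const columns = header.map((name, index) => {
    const cellTypes = rows.map((row) => inferCellType(row[index] ?? ""));

    // Count how often each syntactic type occurs in this column.
    const typeCounts = {};
    for (const type of cellTypes) {
      typeCounts[type] = (typeCounts[type] || 0) + 1;
    }

    // The column type is the most frequent non-empty type.
    let inferredType = "empty";
    let best = 0;
    for (const [type, count] of Object.entries(typeCounts)) {
      if (type !== "empty" && count > best) {
        inferredType = type;
        best = count;
      }
    }

    // 1-based data-row numbers whose cell deviates from the column type.
    const mismatchRows = [];
    cellTypes.forEach((type, rowIndex) => {
      if (type !== inferredType) mismatchRows.push(rowIndex + 1);
    });

    return { name, inferredType, typeCounts, mismatchRows };
  });
  return { rowCount: rows.length, columns };
}

// Usage with made-up sample data.
const sample = [
  "area,residents,updated",
  "Alpha,100,2018-05-23",
  "Beta,n/a,2018-05-05",
  "Gamma,250,23/05/2018",
].join("\n");

console.log(JSON.stringify(profileCsv(sample), null, 2));
```

In this sketch, the rows listed in `mismatchRows` are the candidates for a cleaning suggestion, for instance a textual placeholder in a numeric column or a date written in a different format from the rest of the column.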


Keywords: Open data, Quality assessment, Type inferencing



The research leading to the results presented in this paper has been conducted in the project ROUTE-TO-PA, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 645860. We gratefully acknowledge discussions with the project participants, who stimulated our work. The authors would like to thank the anonymous reviewers for their interesting and valuable feedback.


  1. Ambrosino, M.A., et al.: Protection and preservation of Campania cultural heritage engaging local communities via the use of open data. In: Proceedings of the 19th International Conference on Digital Government Research. ACM (2018)
  2. Andriessen, J., et al.: Increasing public value through co-creation of open knowledge. In: 2017 Fourth International Conference on eDemocracy & eGovernment (ICEDEG), pp. 47–54. IEEE (2017)
  3. Beno, M., Figl, K., Umbrich, J., Polleres, A.: Open data hopes and fears: determining the barriers of open data. In: 2017 Conference for E-Democracy and Open Government (CeDEM), pp. 69–81. IEEE (2017)
  4. Berners-Lee, T.: Linked data - design issues. Accessed 03 May 2018
  5. Castro, D., Korte, T.: Open data in the G8: a review of progress on the open data charter (2015). Accessed 23 May 2018
  6.
  7. European Commission: Open data portal (2017)
  8.
  9. Cordasco, G., et al.: Engaging citizens with a social platform for open data. In: Proceedings of the 18th Annual International Conference on Digital Government Research, pp. 242–249. ACM (2017)
  10. Dawes, S.S., Helbig, N.: Information strategies for open government: challenges and prospects for deriving public value from government transparency. In: Wimmer, M.A., Chappelet, J.-L., Janssen, M., Scholl, H.J. (eds.) EGOV 2010. LNCS, vol. 6228, pp. 50–60. Springer, Heidelberg (2010)
  11. De Donato, R., et al.: Agile production of high quality open data. In: Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age, p. 84. ACM (2018)
  12. De Donato, R., et al.: Datalet-ecosystem provider (DEEP): scalable architecture for reusable, portable and user-friendly visualizations of open data. In: 2017 Conference for E-Democracy and Open Government (CeDEM), pp. 92–101. IEEE (2017)
  13. Döhmen, T., Mühleisen, H., Boncz, P.: Multi-hypothesis CSV parsing. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, p. 16. ACM (2017)
  14. European Data Portal: Open data goldbook for data manager and data holders. Accessed 23 May 2018
  15. Fish, A., Gargiulo, C., Malandrino, D., Pirozzi, D., Scarano, V.: Visual exploration system in an industrial context. IEEE Trans. Industr. Inf. 12(2), 567–575 (2016)
  16. World Wide Web Foundation: Open data barometer, 4th edn. Global Report, May 2017
  17. Helbig, N., Cresswell, A.M., Burke, G.B., Luna-Reyes, L.: The dynamics of opening government data. Center for Technology in Government (2012). Accessed 23 May 2018
  18. Open Knowledge International: Open data handbook. Accessed 05 May 2018
  19. Maydanchik, A.: Data Quality Assessment. Technics Publications, Denville (2007)
  20. Naumann, F.: Data profiling revisited. ACM SIGMOD Rec. 42(4), 40–49 (2014)
  21. Open Data Charter: Open data charter web site. Accessed 23 May 2018
  22. Open Knowledge International: Open definition (2018). Accessed 05 May 2018
  23. Pirozzi, D., Scarano, V.: Support citizens in visualising open data. In: 20th International Conference on Information Visualisation (IV), pp. 271–276. IEEE (2016)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Dipartimento di Informatica, Università degli Studi di Salerno, Fisciano, Italy
