Discovering Data Quality Problems

The Case of Repurposed Data

  • Research Paper
Business & Information Systems Engineering

Abstract

Existing methodologies for identifying data quality problems are typically user-centric: data quality requirements are first determined in a top-down manner, following well-established design guidelines, organizational structures and data governance frameworks. In the current data landscape, however, users are often confronted with new, unexplored datasets that they do not own but that they perceive as relevant and potentially valuable. Such repurposed datasets can be found in government open data portals, data markets and many publicly available data repositories. In such scenarios, top-down data quality checking is not feasible, as the consumers of the data have no control over its creation and governance. Hence, data consumers (data scientists and analysts) need to be empowered with data exploration capabilities that allow them to investigate and understand the quality of such datasets, so that they can make well-informed decisions about their use. This research aims to develop such an approach for discovering data quality problems using generic exploratory methods that can be effectively applied in settings where data creation and use are separated. The approach, named LANG, is developed following a Design Science approach on the basis of semiotic theory and data quality dimensions. LANG is empirically validated in terms of soundness of the approach, its repeatability and generalizability.
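
As a rough illustration of the kind of generic, bottom-up exploration the abstract describes, the following Python sketch profiles a small table without any prior knowledge of its schema. It counts missing and distinct values per column and flags columns whose values are complete and unique as candidate keys. This is not the LANG approach itself, nor its SQL instantiation; the helper name `profile_columns`, the set of tokens treated as missing, and the sample data are all assumptions made for illustration.

```python
import csv
import io

# Tokens treated as "missing". This set is an assumption for the sketch;
# real datasets may use other placeholders.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def profile_columns(rows):
    """Return a per-column profile for a list of dict rows (hypothetical helper)."""
    if not rows:
        return {}
    profile = {}
    for column in rows[0].keys():
        # Normalise absent cells to an empty string and trim whitespace.
        values = [(row.get(column) or "").strip() for row in rows]
        present = [v for v in values if v.lower() not in MISSING_TOKENS]
        distinct = set(present)  # case-sensitive on purpose
        profile[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(distinct),
            # Candidate key: no missing values and every value unique.
            "candidate_key": len(present) == len(values)
            and len(distinct) == len(values),
        }
    return profile


# Invented sample data, used only to exercise the helper.
SAMPLE_CSV = """id,city,inspection_date
1,Brisbane,2016-06-01
2,brisbane,N/A
3,Sydney,2016-07-15
4,,2016-07-15
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_columns(rows)
for column, stats in report.items():
    print(column, stats)
```

In this sample, the distinct count is case-sensitive, so `Brisbane` and `brisbane` are counted separately. A data consumer might treat such a pair as a hint of inconsistent representation that warrants a closer look before using the column.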

Fig. 1. Source: Sadiq and Indulska (2017), data available from https://catalog.data.gov

Fig. 2. Source: Peffers et al. (2007)

Fig. 3. Source: Authors, informed by Peffers et al. (2007)

Fig. 4. Source: Authors

Notes

  1. The researchers named the approach ‘LANG’. ‘Lang’ conveys the meaning of ‘becoming clear’ in Chinese, which fits the aim of the approach: to make clear the data quality requirements of a dataset.

  2. The mapping is omitted due to length considerations but is available from the authors upon request.

  3. The datasets were downloaded between June and August 2016. The datasets are frequently updated in their respective open data portals, including changes to metadata such as adding or removing columns, and providing or removing other documentation related to the dataset. Hence, the current versions of the datasets may not have the same data quality problems as those identified in our study.

  4. In this paper we demonstrate the application of LANG using a relational database (MySQL). We present the overall approach in the body of the paper and the SQL instantiation of the method in Appendix A.

  5. Some detail is abstracted in this figure for visual simplicity, in particular the sequences between some of the individual checks, which may result in certain checks or stages being skipped (as relevant on the basis of the analysis results).

  6. “The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more.” (jupyter.org).

References

  • Abedjan Z, Golab L, Naumann F (2015) Profiling relational data: a survey. VLDB J Int J Very Large Data Bases 24(4):557–581

  • Almars A (2016) Automated data quality discovery tool. Master's thesis, The University of Queensland

  • Batini C, Scannapieco M (2006) Data quality—concepts, methodologies and techniques. Springer, Heidelberg

  • Batini C, Francalanci C, Cappiello C, Maurino A (2009) Methodologies for data quality assessment and improvement. ACM Comput Surv 41(3):1–52

  • Belkin R, Patil D (2013) Everything we wish we’d known about building data products. http://firstround.com/review/everything-we-wish-wed-known-about-building-data-products/. Accessed 14 Nov 2018

  • Bohannon P, Fan W, Geerts F, Jia X, Kementsietsidis A (2007) Conditional functional dependencies for data cleaning. In: IEEE 23rd international conference on data engineering, pp 746–755

  • Byrne B, Kling J, Mccarty D, Sauter G, Smith H, Worcester P (2008) The information perspective of SOA design, part 6: the value of applying the data quality analysis pattern in SOA. IBM Corporation

  • Caballero I, Verbo E, Calero C, Piattini M (2007) A data quality measurement information model based on ISO/IEC 15939. In: Proceedings of the 12th international conference on information quality, pp 393–408

  • Caballero I, Verbo E, Calero C, Piattini M (2008) MMPRO: a methodology based on ISO/IEC 15939 to draw up data quality measurement processes. In: Proceedings of the 13th international conference on information quality, pp 326–340

  • Chakraborti S, Dey S (2019) Analysis of competitor intelligence in the era of big data. Bus Inf Syst Eng 61(3):345–355

  • Clarke R (2016) Big data, big risks. Inf Syst J 26(1):77–90

  • Corsar D, Edwards P (2017) Challenges of open data quality: more than just license, format, and customer support. ACM J Data Inf Qual 9(1):3:1–3:4

  • Dasu T, Johnson T (2003) Exploratory data mining and data cleaning. Wiley, New York

  • Duus R, Cooray M (2016) The future will be built on open data—here’s why. http://theconversation.com/the-future-will-be-built-on-open-data-heres-why-52785. Accessed 14 Nov 2018

  • Ehling M, Körner T (2007) Handbook on data quality assessment methods and tools. European Commission, Eurostat

  • Elbaz G (2012) Data markets: the emerging data economy. http://techcrunch.com/2012/09/30/data-markets-the-emerging-data-economy/. Accessed 14 Nov 2018

  • English LP (1999) Improving data warehouse and business information quality. Wiley

  • English LP (2009) Information quality applied. Best practices for improving business information, processes and systems. Wiley, New York

  • Eppler MJ (2001) The concept of information quality. Stud Commun Sci 1(2):167–182

  • Fan W, Geerts F (2012) Foundations of data quality management. Synth Lect Data Manag 4(5):1–217

  • Fisher T (2009) The data asset: how smart companies govern their data for business success. Wiley, New York

  • Gatling GCBR, Champlin R, Stefani H, Weigel G (2007) Enterprise information management with SAP. Galileo, Boston

  • Gregor S, Jones D (2007) The anatomy of a design theory. J Assoc Inf Syst 8(5):312–335

  • Hernández MA, Stolfo SJ (1998) Real-world data is dirty. Data cleansing and the merge/purge problem. Data Min Knowl Discov 2(1):9–37

  • Hevner AR, March ST, Park J, Ram S (2004) Design science in information systems research. MIS Q 28(1):75–105

  • Hey AJG, Trefethen AE (2003) The data deluge. An e-science perspective. https://eprints.soton.ac.uk/257648/1/The_Data_Deluge.pdf. Accessed 3 July 2019, pp 809–824

  • HIQA (2011) International review of data quality. Health Information and Quality Authority (HIQA), Ireland. http://www.hiqa.ie/press-release/2011-04-28-international-review-data-quality. Accessed 2 Oct 2017

  • ISO (2011) ISO/TS 8000-1 data quality part 1: overview. ISO

  • ISO (2012) ISO 8000-2 data quality part 2: vocabulary. ISO

  • Jayawardene V, Sadiq S, Indulska M (2013a) An analysis of data quality dimensions. School of Information Technology and Electrical Engineering, The University of Queensland, ITEE Technical Report

  • Jayawardene V, Sadiq S, Indulska M (2013b) The curse of dimensionality in data quality. In: 24th Australasian conference on information systems. RMIT University, pp 1–11

  • Judah S, Friedman T (2015) Magic quadrant for data quality tools. Gartner

  • Kenett RS, Shmueli G (2014) On information quality. J R Stat Soc Ser A 177(1):3–38

  • Kim J, Hausenblas M (2012) 5 * Open Data. https://5stardata.info/en/. Accessed 14 Nov 2018

  • Köhler H, Leck U, Link S (2013) Possible and certain SQL keys. Department of Computer Science, The University of Auckland

  • Köhler H, Link S, Zhou X (2015) Possible and certain SQL keys. Proc VLDB Endow 8(11):1118–1129

  • Krogstie J (2002) A semiotic approach to quality in requirements specifications. In: Proceedings of the IFIP TC8/WG8 (1), pp 231–249

  • Krogstie J, Lindland OI, Sindre G (1995a) Defining quality aspects for conceptual models. In: Falkenberg ED, Hesse W, Olivé A (eds) Information system concepts. Springer, Boston, pp 216–231

  • Krogstie J, Lindland OI, Sindre G (1995b) Towards a deeper understanding of quality in requirements engineering. In: International conference on advanced information systems engineering. Springer, Heidelberg, pp 82–95

  • Krueger R, Casey M (1994) Focus groups. A practical guide for applied research. Sage Publications, Thousand Oaks

  • Lee YW, Strong DM, Kahn BK, Wang RY (2002) AIMQ: a methodology for information quality assessment. Inf Manag 40(2):133–146

  • Lindland OI, Sindre G, Solvberg A (1994) Understanding quality in conceptual modeling. IEEE Softw 11(2):42–49

  • Loshin D (2001) Enterprise knowledge management. The data quality approach. Morgan Kaufmann, Burlington

  • Loshin D (2006) Monitoring data quality performance using data quality metrics. Informatica Corporation, Redwood City

  • Maydanchik A (2007) Data quality assessment. Technics Publications, New Jersey

  • McGilvray D (2008) Executing data quality projects: ten steps to quality data and trusted information. Morgan Kaufmann, Burlington

  • Morgan DL (ed) (1993) Sage focus editions. Successful focus groups: advancing the state of the art, vol 156. Sage Publications, Thousand Oaks

  • Morris CW (1938) Foundations of the theory of signs. In: Langford CH (ed) International encyclopedia of unified science. University of Chicago Press, London

  • Naumann F, Rolker C (2000) Assessment methods for information quality criteria. Humboldt-Universität zu Berlin, Informatik-Berichte, Berlin

  • OMB U (2002) Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies, part IX. Office of Management and Budget

  • Peffers K, Tuunanen T, Rothenberger MA, Chatterjee S (2007) A design science research methodology for information systems research. J Manag Inf Syst 24(3):45–77

  • Peirce CS (1931–1935) Collected papers. Harvard University Press, Cambridge

  • Pipino L, Lee YW, Wang RY (2002) Data quality assessment. Commun ACM 45(4):211–218

  • Powell RA, Single HM (1996) Focus groups. Int J Qual Health Care 8:499–504. https://doi.org/10.1093/intqhc/8.5.499

  • Prat N (2019) Augmented analytics. Bus Inf Syst Eng 61(3):375–380

  • Price R, Shanks G (2004) A semiotic information quality framework. In: Proceedings of the international conference on decision support systems, pp 658–672

  • Price R, Shanks G (2005a) A semiotic information quality framework: development and comparative analysis. J Inf Technol 20(2):88–102

  • Price RJ, Shanks G (2005b) Empirical refinement of a semiotic information quality framework. In: Proceedings of the 38th annual Hawaii international conference on system sciences, Big Island, pp 216a

  • Raman V, Hellerstein JM (2001) Potter’s wheel: an interactive data cleaning system. In: Proceedings of the 27th VLDB conference, Rome, pp 381–390

  • Rosemann M, Vessey I (2008) Toward improving the relevance of information systems research to practice: the role of applicability checks. MIS Q 32(1):1–22

  • Sadiq S, Indulska M (2017) Open data: quality over quantity. Int J Inf Manag 37(3):150–154

  • Sadiq S, Yeganeh NK, Indulska M (2011) 20 years of data quality research: themes, trends and synergies. In: 22nd Australasian database conference, Perth, pp 153–162

  • Scannapieco M, Virgillito A, Marchetti C, Mecella M, Baldoni R (2004) The DaQuinCIS architecture: a platform for exchanging and improving data quality in cooperative information systems. Inf Syst 29(7):551–582

  • Selvage M, Saul J, Jain A (2017) Magic quadrant for data quality tools. Gartner

  • Shanks GG, Darke P (1998) Understanding data quality and data warehousing: a semiotic approach. IQ, pp 292–309

  • Shanks G, Tansley E (2002) Data quality tagging and decision outcomes. An experimental study. IFIP Working Group, pp 399–410

  • Sismanis Y, Brown P, Haas PJ, Reinwald B (2006) Gordian: efficient and scalable discovery of composite keys. In: Proceedings of the 32nd international conference on very large data bases, VLDB Endowment, pp 691–702

  • Song S, Chen L (2011) Differential dependencies: reasoning and discovery. ACM Trans Database Syst 36(3):16

  • Sonnenberg C, vom Brocke J (2012) Evaluations in the science of the artificial. Reconsidering the build-evaluate pattern in design science research. In: Peffers K, Rothenberger M, Kuechler B (eds) Design science research in information systems, vol 7286. Advances in theory and practice. DESRIST. Lecture notes in computer science. Springer, Heidelberg

  • Stamper RK (1992) Review of Andersen “Theory of Computer Semiotics”. Comput J 1

  • Stamper R (1993) A semiotic theory of information and information systems/applied semiotics. In: Invited Papers for the ICL/University of Newcastle Seminar on “Information”, September 6–10 

  • Storey V, Wang R (2001) Extending the ER model to represent data quality requirements. Kluwer, Dordrecht

  • Sturm B, Sunyaev A (2019) Design principles for systematic search systems. Bus Inf Syst Eng 61(1):91–111

  • Stvilia B, Gasser L, Twidale MB, Smith LC (2007) A framework for information quality assessment. J Am Soc Inf Sci Technol 58(12):1720–1733

  • Tu SY, Wang Y-YR (1993) Modeling data quality and context through extension of the ER model. Total Data Quality Management Research Program, Sloan School of Management, Massachusetts Institute of Technology, Cambridge

  • Venable J, Pries-Heje J, Baskerville R (2012) A comprehensive framework for evaluation in design science research. In: Peffers K, Rothenberger M, Kuechler B (eds) Design science research in information systems, vol 7286. Advances in theory and practice. Springer, Heidelberg, pp 423–438

  • Venable J, Pries-Heje J, Baskerville R (2016) FEDS: a framework for evaluation in design science research. Eur J Inf Syst 25(1):77–89

  • Wand Y, Wang RY (1996) Anchoring data quality dimensions in ontological foundations. Commun ACM 39(11):86–95

  • Wang R (1998) A product perspective on total data quality management. Commun ACM 41(2):58–65

  • Wang RY, Strong DM (1996) Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst 12(4):5–33

  • Wang R, Ziad M, Lee Y (2001) Data quality. Kluwer, Dordrecht

  • Zhang R, Jayawardene V, Indulska M, Sadiq S, Zhou X (2014) A data driven approach for discovering data quality requirements. In: 35th international conference on information systems, Auckland

Author information

Corresponding author

Correspondence to Marta Indulska.

Additional information

Accepted after two revisions by Matthias Jarke.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 90 kb)

About this article

Cite this article

Zhang, R., Indulska, M. & Sadiq, S. Discovering Data Quality Problems. Bus Inf Syst Eng 61, 575–593 (2019). https://doi.org/10.1007/s12599-019-00608-0

  • DOI: https://doi.org/10.1007/s12599-019-00608-0
