Discovering Data Quality Problems: The Case of Repurposed Data

Zhang, Ruojing; Indulska, Marta; Sadiq, Shazia

doi:10.1007/s12599-019-00608-0

Research Paper
Published: 22 July 2019

Business & Information Systems Engineering, Volume 61, pages 575–593 (2019)

Ruojing Zhang¹,
Marta Indulska² &
Shazia Sadiq¹

Abstract

Existing methodologies for identifying data quality problems are typically user-centric, where data quality requirements are first determined in a top-down manner following well-established design guidelines, organizational structures and data governance frameworks. In the current data landscape, however, users are often confronted with new, unexplored datasets that they may not have any ownership of, but that are perceived to have relevance and potential to create value for them. Such repurposed datasets can be found in government open data portals, data markets and several publicly available data repositories. In such scenarios, applying top-down data quality checking approaches is not feasible, as the consumers of the data have no control over its creation and governance. Hence, data consumers – data scientists and analysts – need to be empowered with data exploration capabilities that allow them to investigate and understand the quality of such datasets to facilitate well-informed decisions on their use. This research aims to develop such an approach for discovering data quality problems using generic exploratory methods that can be effectively applied in settings where data creation and use are separated. The approach, named LANG, is developed through a Design Science approach on the basis of semiotics theory and data quality dimensions. LANG is empirically validated in terms of the soundness of the approach, its repeatability and its generalizability.
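
As an illustration of the kind of generic, exploratory check a data consumer might run on an unfamiliar dataset, the following is a minimal Python sketch. It is not the LANG method or its SQL instantiation, which are not reproduced here. The function name `profile_columns`, the list of missing-value tokens and the sample data are all assumptions made for this example.

```python
import csv
import io
from collections import Counter

# Tokens treated as "missing"; this list is an assumption, not taken from the paper.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def profile_columns(rows):
    """Return simple per-column profile statistics for a list of dict rows."""
    if not rows:
        return {}
    profile = {}
    for column in rows[0].keys():
        values = [(row.get(column) or "").strip() for row in rows]
        present = [v for v in values if v.lower() not in MISSING_TOKENS]
        counts = Counter(present)
        profile[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            # A column is a candidate key only if every row has a value
            # and no value repeats.
            "candidate_key": len(present) == len(values)
            and len(counts) == len(values),
        }
    return profile


# Hypothetical sample standing in for a dataset downloaded from an open data portal.
SAMPLE_CSV = """id,suburb,postcode
1,St Lucia,4067
2,Toowong,
3,st lucia,4067
3,Indooroopilly,N/A
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_columns(rows)
for name, stats in report.items():
    print(name, stats)
```

In this sample, `id` is not a candidate key because the value 3 repeats, and `postcode` has two missing entries. The `suburb` column passes the uniqueness test only because "St Lucia" and "st lucia" differ in case, which shows why a purely syntactic check would need follow-up checks on representation consistency.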

Notes

The researchers named the approach ‘LANG’; ‘Lang’ conveys the meaning of ‘becoming clear’ in the Chinese language, which fits the aim of the approach, that is, to make clear the data quality requirements of a dataset.
The mapping is omitted due to length considerations but is available from the authors upon request.
The datasets were downloaded between June and August 2016. We note that the datasets are frequently updated in the respective open data portals, including changes to metadata such as adding or removing columns, as well as providing or removing other documentation related to the dataset. Hence, the current versions of the datasets may not have the same data quality problems as those identified in our study.
In this paper we demonstrate the application of LANG with the help of a relational database (MySQL). We present the overall approach in the body of the paper and the SQL instantiation of the method in Appendix A.
Some detail is abstracted in this figure for visual simplicity, in particular the sequences between some of the individual checks, which may result in certain checks or stages being skipped (as relevant on the basis of analysis results).
“The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more.” (jupyter.org).

Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, St Lucia, QLD, 4072, Australia
Ruojing Zhang & Shazia Sadiq
UQ Business School, The University of Queensland, St Lucia, QLD, 4072, Australia
Marta Indulska

Corresponding author

Correspondence to Marta Indulska.

Additional information

Accepted after two revisions by Matthias Jarke.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 90 kb)

About this article

Cite this article

Zhang, R., Indulska, M. & Sadiq, S. Discovering Data Quality Problems. Bus Inf Syst Eng 61, 575–593 (2019). https://doi.org/10.1007/s12599-019-00608-0

Received: 03 October 2017
Accepted: 28 June 2019
Published: 22 July 2019
Issue Date: October 2019
DOI: https://doi.org/10.1007/s12599-019-00608-0
