Abstract
With the rapid growth in the amount of data generated worldwide, ensuring adequate data quality (DQ) is increasingly becoming a challenge for companies: among other things, data must be timely, complete, consistent, valid, and accessible. Given this multidimensionality, DQ improvements (DQIs) need to be purposefully chosen and, because there can be path dependencies, arranged in an optimal sequence. This research therefore supports the complex multidimensional task of ensuring adequate DQ in an economically reasonable manner by providing a formal decision model for identifying an optimal data quality improvement plan (DQIP). This DQIP comprises both an economically reasonable selection of DQIs and their execution sequence, based on existing interrelationships between different DQ dimensions. Furthermore, a comprehensive Monte Carlo simulation provides insights into the implications of putting the decision model into operation. For practitioners, the decision model enables efficient allocation of resources to DQIs. The model also gives advice on how to sequence DQIs and draws attention to the complex problem context of DQ in order to support valid managerial decisions.
Notes
The weights \( w_{d_i} \) are not varied, as varying them would not provide additional informative value.
This proportionality also holds for a variation of ±1% or ±5%.
Acknowledgments
Supportive inputs and helpful comments by Dr. Quirin Görz on an earlier version of this paper are gratefully acknowledged.
Additional information
Responsible Editor: Hans-Dieter Zimmermann
Appendix
Supplementing the chapter “Development of the Decision Model”, a more detailed derivation of the objective function is given in the following.
The decision to include a DQI \( v_j^p \) in a DQIP is described formally using the binary decision variable \( x_{v_j^p} \), which is equal to 1 if DQI \( v_j^p \) is part of the DQIP, and 0 if not. Thus, the resulting DQ level \( Q_{d_i, v_j}^p \) of DQ dimension \( d_i \), after applying DQI \( v_j^p \), is calculated in the following manner:
After rearranging and, for mathematical reasons, introducing the substitute variables \( f_{d_i,v_j^p} \) and \( s_{d_i,v_j^p} \), the resulting DQ level \( Q_{d_i, v_j}^p \) can also be written as
Based on this, the DQ level \( Q_{d_i}^m \) for dimension \( d_i \), after applying a complete DQIP of m DQIs \( v_j^p \), is calculated in the following manner:
The binary variables \( x_{v_j^p} \), which are part of \( f_{d_i,v_j^p} \) and \( s_{d_i,v_j^p} \), neutralize the effect of DQI \( v_j^p \) by taking the value 0 if a specific DQI \( v_j^p \) is not part of the DQIP. Although term (3) contains all m DQIs \( v_j^p \), a DQI selection is possible through this neutralization. Knowing the DQ level \( Q_{d_i}^m \) for each DQ dimension \( d_i \), the overall DQ level can be calculated. As there can be context-dependent differences between the DQ dimensions, allocating weights to the different DQ dimensions is reasonable. Therefore, we use a weight \( w_{d_i} \) (with \( \sum_{i=1}^n w_{d_i}=1 \)) to weight the DQ dimensions in our decision model.
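One natural reading of this weighting, stated here as an assumption rather than as the model's exact formula, is a weighted sum over the n DQ dimensions. The symbol \( Q^m \) for the overall DQ level is introduced only for this illustration:

\[ Q^m = \sum_{i=1}^{n} w_{d_i} \, Q_{d_i}^m \]

For instance, with two dimensions weighted \( w_{d_1}=0.7 \) and \( w_{d_2}=0.3 \) and DQ levels \( Q_{d_1}^m=0.8 \) and \( Q_{d_2}^m=0.6 \), this form would give \( Q^m = 0.7 \cdot 0.8 + 0.3 \cdot 0.6 = 0.74 \).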
In contrast to former approaches that have been criticized in the literature for not considering interrelationships when aggregating DQ dimensions to an overall DQ level, in this decision model, interrelationships between the DQ dimensions \( d_i \) are implicitly considered by the impacts \( I_{d_i,v_j^p} \).
Since all DQIs are applied on the same dataset, which has a fixed size, and a DQIP is realized within one period (cf. A.3), the costs \( c_{v_j} \) for applying DQI \( v_j \) are fixed. As a result, the overall DQIP costs are
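Given fixed costs per DQI and the binary selection variables, a form consistent with these definitions is the sum of the costs of the selected DQIs. The symbol \( C \) and the summation index are assumptions made for this sketch:

\[ C = \sum_{j=1}^{m} c_{v_j} \, x_{v_j^p} \]

Because \( x_{v_j^p} = 0 \) for every DQI that is not part of the DQIP, only the selected DQIs contribute to \( C \).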
According to assumption A.8, the costs of a DQIP must not exceed budget B; thus, the budget constraint for the decision model is
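Written with the selection variables, such a constraint would plausibly take the following form (the summation index is an assumption of this sketch):

\[ \sum_{j=1}^{m} c_{v_j} \, x_{v_j^p} \le B \]

As a purely illustrative example with made-up numbers, consider three DQIs with costs 40, 50, and 70 and a budget \( B = 100 \). Selecting the first two costs 90 and satisfies the constraint, whereas adding the third raises the costs to 160 and violates it.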
In order to allocate a given budget in an economically reasonable manner to an available set of DQIs, all possible DQIPs need to be compared to each other. As a comparison criterion, we calculate the respective effectiveness \( E^{p=m} \) of a DQIP as described in the chapter “Development of the Decision Model”. The objective function maximizes the effectiveness of a DQIP in the following manner:
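The comparison of all possible DQIPs can be illustrated with a small brute-force search over every ordered selection of DQIs that fits the budget. This is a minimal sketch under stated assumptions, not the decision model itself: the update rule in `apply_dqi`, the use of the weighted DQ gain as a stand-in for effectiveness, and all names and numbers are hypothetical and chosen only to show why both selection and sequence matter.

```python
from itertools import permutations


def apply_dqi(levels, impacts):
    """Return the DQ levels after one DQI (hypothetical update rule)."""
    updated = {}
    for dim, q in levels.items():
        impact = impacts.get(dim, 0.0)
        if impact >= 0:
            # Assumption: a positive impact closes that share of the gap to 1.
            updated[dim] = q + impact * (1.0 - q)
        else:
            # Assumption: a negative impact removes that share of the current level.
            updated[dim] = q + impact * q
    return updated


def weighted_level(levels, weights):
    """Aggregate the DQ dimensions into one level using a weighted sum."""
    return sum(weights[dim] * q for dim, q in levels.items())


def best_dqip(initial, weights, dqis, budget):
    """Enumerate all ordered DQI selections within budget; return the best one."""
    base = weighted_level(initial, weights)
    best_plan, best_gain = (), 0.0  # the empty plan is always feasible
    for size in range(1, len(dqis) + 1):
        for plan in permutations(dqis, size):
            cost = sum(dqis[name]["cost"] for name in plan)
            if cost > budget:
                continue  # violates the budget constraint
            levels = initial
            for name in plan:  # the order of application matters
                levels = apply_dqi(levels, dqis[name]["impacts"])
            # Assumption: weighted DQ gain stands in for effectiveness.
            gain = weighted_level(levels, weights) - base
            if gain > best_gain:
                best_plan, best_gain = plan, gain
    return best_plan, best_gain


# Hypothetical example data: two dimensions, three candidate DQIs.
initial = {"completeness": 0.6, "timeliness": 0.5}
weights = {"completeness": 0.5, "timeliness": 0.5}
dqis = {
    "A": {"cost": 40, "impacts": {"completeness": 0.5, "timeliness": -0.2}},
    "B": {"cost": 50, "impacts": {"timeliness": 0.4}},
    "C": {"cost": 70, "impacts": {"completeness": 0.2}},
}
plan, gain = best_dqip(initial, weights, dqis, budget=100)
print(plan, round(gain, 2))
```

With these made-up numbers, the search selects DQIs A and B and applies A before B: the reverse order would let A's negative side effect erode the timeliness level that B has just raised. Full enumeration grows factorially with the number of DQIs, so it is only practical for small candidate sets.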
Cite this article
Kleindienst, D. The data quality improvement plan: deciding on choice and sequence of data quality improvements. Electron Markets 27, 387–398 (2017). https://doi.org/10.1007/s12525-017-0245-6