
The data quality improvement plan: deciding on choice and sequence of data quality improvements

  • Research Paper
  • Electronic Markets

Abstract

With the rapid growth in the amount of data generated worldwide, ensuring adequate data quality (DQ) is increasingly becoming a challenge for companies: data are required to be, among other things, timely, complete, consistent, valid, and accessible. Given this multidimensionality, DQ improvements (DQIs) need to be purposefully chosen and, as there can be path dependencies, arranged in an optimal sequence. Thus, this research contributes to performing the complex multidimensional task of ensuring adequate DQ in an economically reasonable manner by providing a formal decision model for identifying an optimal data quality improvement plan (DQIP). This DQIP comprises both an economically reasonable selection and an execution sequence of DQIs based on existing interrelationships between different DQ dimensions. Furthermore, a comprehensive Monte Carlo simulation provides insights into the implications of putting the decision model into operation. For practitioners, the decision model enables efficient allocation of resources to DQIs. The model also gives advice on how to sequence DQIs and draws attention to the complex problem context of DQ in order to support valid managerial decisions.

Notes

  1. Weights, \( {w}_{d_i} \), are not varied, as this would not provide adequate informative value.

  2. This proportionality holds for a ±1% or ±5% variation too.

Acknowledgments

Supportive inputs and helpful comments by Dr. Quirin Görz on an earlier version of this paper are gratefully acknowledged.

Author information

Corresponding author

Correspondence to Dominikus Kleindienst.

Additional information

Responsible Editor: Hans-Dieter Zimmermann

Appendix

Supplementing the chapter “Development of the Decision Model”, a more detailed derivation of the objective function is given in the following.

The decision to include a DQI \( v_j^p \) in a DQIP is described formally by the binary decision variable \( x_{v_j^p} \), which equals 1 if DQI \( v_j^p \) is part of the DQIP and 0 otherwise. The resulting DQ level \( Q_{d_i,v_j}^{p} \) of DQ dimension \( d_i \) after applying DQI \( v_j^p \) is then calculated in the following manner:

$$ Q_{d_i,v_j}^{p} = Q_{d_i,v_j}^{p-1} + \begin{cases} \left(1 - Q_{d_i,v_j}^{p-1}\right)\cdot I_{d_i,v_j^p}\cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} \ge 0 \\ Q_{d_i,v_j}^{p-1}\cdot I_{d_i,v_j^p}\cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} < 0 \end{cases} \tag{1} $$
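The update rule above can be illustrated with a minimal Python sketch. The function name `apply_dqi` and the example numbers are hypothetical and not taken from the paper; the sketch also assumes that DQ levels lie in [0, 1] and impacts in [-1, 1], which this appendix does not state explicitly.

```python
def apply_dqi(q_prev: float, impact: float, selected: bool = True) -> float:
    """Return the DQ level of one dimension after one DQI (update rule above).

    q_prev   -- DQ level before the DQI (assumed to lie in [0, 1])
    impact   -- impact of the DQI on this dimension (assumed to lie in [-1, 1])
    selected -- plays the role of the binary decision variable x
    """
    x = 1.0 if selected else 0.0
    if impact >= 0:
        # A non-negative impact closes a share of the remaining gap to 1.
        return q_prev + (1.0 - q_prev) * impact * x
    # A negative impact removes a share of the level already reached.
    return q_prev + q_prev * impact * x


if __name__ == "__main__":
    print(apply_dqi(0.60, 0.50))         # half of the remaining gap is closed
    print(apply_dqi(0.60, -0.25))        # a quarter of the current level is lost
    print(apply_dqi(0.60, 0.50, False))  # DQI not selected: level unchanged
```

Because a positive impact scales with the remaining gap and a negative impact scales with the current level, the result stays within [0, 1] under the stated assumptions.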

After remodelling and, for mathematical reasons, introducing the substitute variables \( f_{d_i,v_j^p} \) and \( s_{d_i,v_j^p} \), the resulting DQ level \( Q_{d_i,v_j}^{p} \) can also be written as

$$ Q_{d_i,v_j}^{p} = Q_{d_i,v_j}^{p-1}\cdot f_{d_i,v_j^p} + s_{d_i,v_j^p}, \quad \text{with } f_{d_i,v_j^p} = \begin{cases} 1 - I_{d_i,v_j^p}\cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} \ge 0 \\ 1 + I_{d_i,v_j^p}\cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} < 0 \end{cases} \quad \text{and } s_{d_i,v_j^p} = \begin{cases} I_{d_i,v_j^p}\cdot x_{v_j^p} & \text{if } I_{d_i,v_j^p} \ge 0 \\ 0 & \text{if } I_{d_i,v_j^p} < 0 \end{cases} \tag{2} $$
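As a quick check of this substitution (a worked step added for illustration, not part of the original derivation), both cases of the update rule can be expanded directly. The shorthand \( Q = Q_{d_i,v_j}^{p-1} \), \( I = I_{d_i,v_j^p} \), \( x = x_{v_j^p} \), \( f = f_{d_i,v_j^p} \) and \( s = s_{d_i,v_j^p} \) is introduced here only to keep the lines short:

$$ \begin{aligned} I \ge 0: &\quad Q + (1 - Q)\cdot I\cdot x = Q\cdot(1 - I\cdot x) + I\cdot x = Q\cdot f + s \\ I < 0: &\quad Q + Q\cdot I\cdot x = Q\cdot(1 + I\cdot x) + 0 = Q\cdot f + s \end{aligned} $$

In both cases the new DQ level is an affine function of the previous one, which is what allows the updates of successive DQIs to be chained.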

Based on this, the DQ level \( Q_{d_i}^m \) for dimension \( d_i \), after applying a complete DQIP for \( m \) DQIs \( v_j^p \), is calculated in the following manner:

$$ Q_{d_i}^{m} = \left(\left(\left(\dots\left(\left(Q_{d_i}^{p=0}\cdot f_{d_i,v_j^{p=1}} + s_{d_i,v_j^{p=1}}\right)\cdot f_{d_i,v_j^{p=2}} + s_{d_i,v_j^{p=2}}\right)\cdot\dots\right)\cdot f_{d_i,v_j^{p=m-1}} + s_{d_i,v_j^{p=m-1}}\right)\cdot f_{d_i,v_j^{p=m}} + s_{d_i,v_j^{p=m}}\right) \tag{3} $$

The binary variables \( x_{v_j^p} \), which are part of \( f_{d_i,v_j^p} \) and \( s_{d_i,v_j^p} \), neutralize the effect of DQI \( v_j^p \) by taking the value 0 if that DQI is not part of the DQIP. Although term (3) contains all \( m \) DQIs \( v_j^p \), a DQI selection is therefore still possible through this neutralization. Knowing the DQ level \( Q_{d_i}^m \) for each DQ dimension \( d_i \), the overall DQ level can be calculated. As there can be context-dependent differences between the DQ dimensions, an allocation of weights to different DQ dimensions is reasonable. Therefore, we use a weight \( w_{d_i} \) (with \( \sum_{i=1}^n w_{d_i} = 1 \)) to weight the DQ dimensions in our decision model, which gives the overall DQ level:

$$ \sum_{i=1}^n w_{d_i}\cdot Q_{d_i}^m \tag{4} $$

In contrast to former approaches that have been criticized in literature for not considering interrelationships when aggregating DQ dimensions to an overall DQ level, in this decision model, interrelationships between the DQ dimensions \( d_i \) are implicitly considered by the impacts \( I_{d_i,v_j^p} \).
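The chained update and the weighted aggregation can be sketched in Python as follows. This is an illustrative sketch only: the function names and all numbers are hypothetical, the list `impacts` is assumed to contain only the impacts of DQIs that are actually selected (so the decision variable equals 1 throughout), and each impact is assumed not to depend on the position at which its DQI is executed.

```python
from typing import Sequence


def dq_level_after_plan(q0: float, impacts: Sequence[float]) -> float:
    """Chain Q <- Q * f + s over the impacts of the selected DQIs, in order."""
    q = q0
    for impact in impacts:
        f = 1.0 - impact if impact >= 0 else 1.0 + impact
        s = impact if impact >= 0 else 0.0
        q = q * f + s
    return q


def overall_dq(levels: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the per-dimension DQ levels; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * q for w, q in zip(weights, levels))


if __name__ == "__main__":
    # One dimension, two DQIs: one raises the level, one lowers it.
    improve_first = dq_level_after_plan(0.5, [0.4, -0.2])
    degrade_first = dq_level_after_plan(0.5, [-0.2, 0.4])
    print(improve_first, degrade_first)  # the two orders give different levels

    # Aggregate two dimensions with equal (hypothetical) weights.
    print(overall_dq([degrade_first, 0.9], [0.5, 0.5]))
```

In this example the same two DQIs end at a different DQ level depending on their order, which is the path dependency that makes the execution sequence part of the decision.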

Since all DQIs are applied on the same dataset, which has a fixed size, and a DQIP is realized within one period (cf. A.3), the costs \( c_{v_j} \) for applying DQI \( v_j \) are fixed. As a result, the overall DQIP costs are

$$ \sum_{j=1}^m c_{v_j}\cdot x_{v_j}. \tag{5} $$

According to assumption A.8, the costs of a DQIP must not exceed budget B; thus, the budget constraint for the decision model is

$$ \sum_{j=1}^m c_{v_j}\cdot x_{v_j} \le B \tag{6} $$

In order to allocate a given budget in an economically reasonable manner to an available set of DQIs, all possible DQIPs need to be compared to each other. As a comparison criterion, we calculate the respective effectiveness \( E^{p=m} \) of a DQIP in the way described in the chapter “Development of the Decision Model”. The objective function maximizes the effectiveness of a DQIP in the following manner:

$$ \text{Maximize}\quad E^{p=m} = \frac{\sum_{i=1}^n w_{d_i}\cdot\left(Q_{d_i}^m - Q_{d_i}^{p=0}\right)}{1 - \sum_{i=1}^n w_{d_i}\cdot Q_{d_i}^{p=0}} \quad \text{subject to} \quad \sum_{j=1}^m c_{v_j}\cdot x_{v_j} \le B \tag{7} $$
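For a small set of DQIs, the comparison of all possible DQIPs can be sketched as an exhaustive search over every ordered selection that fits the budget. The sketch below is not the paper's solution procedure; the function names, DQI labels, impacts, costs and budget are all hypothetical, and impacts are assumed to be independent of the position at which a DQI is executed.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def levels_after_plan(q0: Sequence[float], plan: Sequence[str],
                      impacts: Dict[str, Sequence[float]]) -> List[float]:
    """Apply the DQIs of a plan in order and return the level per dimension."""
    levels = list(q0)
    for dqi in plan:
        for i, impact in enumerate(impacts[dqi]):
            if impact >= 0:
                levels[i] += (1.0 - levels[i]) * impact
            else:
                levels[i] += levels[i] * impact
    return levels


def effectiveness(q0: Sequence[float], qm: Sequence[float],
                  weights: Sequence[float]) -> float:
    """Weighted DQ gain relative to the weighted gap that existed initially."""
    start = sum(w * q for w, q in zip(weights, q0))
    if start >= 1.0:
        raise ValueError("initial overall DQ level must be below 1")
    gain = sum(w * (after - before)
               for w, after, before in zip(weights, qm, q0))
    return gain / (1.0 - start)


def best_dqip(q0: Sequence[float], weights: Sequence[float],
              impacts: Dict[str, Sequence[float]], costs: Dict[str, float],
              budget: float) -> Tuple[Tuple[str, ...], float]:
    """Enumerate every ordered selection of DQIs within budget; keep the best."""
    best_plan: Tuple[str, ...] = ()  # the empty plan has effectiveness 0
    best_e = 0.0
    for size in range(1, len(costs) + 1):
        for plan in permutations(costs, size):
            if sum(costs[v] for v in plan) > budget:
                continue  # violates the budget constraint
            e = effectiveness(q0, levels_after_plan(q0, plan, impacts), weights)
            if e > best_e:
                best_plan, best_e = plan, e
    return best_plan, best_e


if __name__ == "__main__":
    # Two DQ dimensions and three candidate DQIs (all values hypothetical).
    impacts = {"A": [0.4, -0.1], "B": [-0.2, 0.5], "C": [0.3, 0.0]}
    costs = {"A": 30.0, "B": 40.0, "C": 50.0}
    plan, e = best_dqip([0.5, 0.6], [0.5, 0.5], impacts, costs, budget=80.0)
    print(plan, round(e, 4))
```

The number of ordered selections grows factorially with the number of candidate DQIs, so this brute-force comparison is only practical for small instances.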

Cite this article

Kleindienst, D. The data quality improvement plan: deciding on choice and sequence of data quality improvements. Electron Markets 27, 387–398 (2017). https://doi.org/10.1007/s12525-017-0245-6

  • DOI: https://doi.org/10.1007/s12525-017-0245-6
