Managing Data Quality of the Data Warehouse: A Chance-Constrained Programming Approach

  • Qi Liu
  • Gengzhong Feng (Email author)
  • Giri Kumar Tayi
  • Jun Tian
Article

Abstract

To make informed decisions, managers establish data warehouses that integrate multiple data sources. However, the outcomes of data warehouse-based decisions are not always satisfactory because of low data quality. Although many studies have focused on data quality management, little effort has been made to explore effective data quality control strategies for the data warehouse. In this study, we propose a chance-constrained programming model that determines the optimal allocation of control resources to mitigate the data quality problems of the data warehouse. We develop a modified Artificial Bee Colony algorithm to solve the model. Our work contributes to the literature on evaluating how data quality problems propagate through the data integration process and on data quality control for the data sources that make up the data warehouse. We use a data warehouse in a healthcare organization to illustrate the model and the effectiveness of the algorithm.

Keywords

Data quality; Data warehouse; Chance-constrained programming; Optimization model; Artificial Bee Colony algorithm

Notes

Acknowledgements

The research presented in this paper is supported by the National Natural Science Foundation Project of China (71572145).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Qi Liu (1, 2)
  • Gengzhong Feng (1, 2), Email author
  • Giri Kumar Tayi (3)
  • Jun Tian (1, 2)

  1. School of Management, Xi’an JiaoTong University, Xi’an, China
  2. The Key Lab of the Ministry of Education for Process Control and Efficiency Engineering, Xi’an, China
  3. School of Business, SUNY at Albany, Albany, USA
