Investigating structural metrics for understandability prediction of data warehouse multidimensional schemas using machine learning techniques

Gosain, Anjana; Singh, Jaspreeti

doi:10.1007/s11334-017-0308-z

Original Paper
Published: 20 December 2017

Innovations in Systems and Software Engineering, volume 14, pages 59–80 (2018)

Abstract

Data warehouse (DW) quality metrics help in evaluating quality attributes and in building classification models that predict whether multidimensional (MD) schemas are understandable or non-understandable, thereby assisting DW maintenance. To evaluate DW MD schema quality, we earlier proposed a set of metrics based on important aspects of dimension hierarchies and their sharing (such as sharing of a few hierarchy levels within a dimension, and sharing of a few hierarchy levels between dimensions, within and across facts), which may increase the structural complexity of MD schemas and thereby affect their quality. A preliminary empirical validation of these metrics using classical statistical techniques (correlation and linear regression) indicated that some of them are possible understandability indicators. However, machine learning (ML) techniques can better model the complex associations between DW structural metrics and quality attributes. Therefore, this work employs five ML classifiers [J48, partial decision trees (PART), Naïve Bayes, support vector machines (SVM) and logistic regression] to investigate empirically whether accurate prediction models can be built from our structural metrics for use as understandability predictors. The results reveal that four of our metrics are good predictors of the understandability of DW MD schemas. The experimentation further compared the classifiers using five main performance measures: accuracy, precision, sensitivity, specificity and area under the receiver operating characteristic curve. The study confirmed the predictive capability of ML techniques for understandability prediction of DW MD schemas. The results also suggest that the SVM and Naïve Bayes classifiers perform better than the other classifiers included in the study. Further, the commonly used logistic regression technique gave results that were reasonably competitive with the more sophisticated techniques. However, the tree-based (J48) and rule-based (PART) techniques performed significantly worse than the best performing techniques.
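
The five performance measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation or experimental setup; it only shows the standard definitions for a binary classifier. It assumes that "understandable" is the positive class (label 1), and the function names, the 0.5 decision threshold and the toy labels and scores are all hypothetical values chosen for illustration, not data from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # positives (assumed: understandable) predicted positive
    fp: int  # negatives predicted positive
    tn: int  # negatives predicted negative
    fn: int  # positives predicted negative


def confusion_counts(y_true, y_pred):
    """Count the four outcomes for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
    )


def _ratio(numerator, denominator):
    """Divide, raising a clear error when a measure is undefined."""
    if denominator == 0:
        raise ValueError("measure is undefined: empty denominator")
    return numerator / denominator


def accuracy(c):
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)


def precision(c):
    return _ratio(c.tp, c.tp + c.fp)


def sensitivity(c):
    # Also called recall or true positive rate.
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c):
    # Also called true negative rate.
    return _ratio(c.tn, c.tn + c.fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return _ratio(wins, len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold
    c = confusion_counts(y_true, y_pred)
    print(accuracy(c), precision(c), sensitivity(c), specificity(c))
    print(auc(y_true, scores))
```

Accuracy, precision, sensitivity and specificity depend on the chosen decision threshold, whereas the area under the ROC curve is computed from the raw scores and so summarises ranking quality across all thresholds.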

Notes

WEKA (Waikato Environment for Knowledge Analysis). http://www.cs.waikato.ac.nz/~ml/weka/.
CGPA stands for cumulative grade point average.

Author information

Authors and Affiliations

University School of Information, Communication & Technology, Guru Gobind Singh Indraprastha University, Dwarka, 110078, New Delhi, India
Anjana Gosain & Jaspreeti Singh

Corresponding author

Correspondence to Jaspreeti Singh.

About this article

Cite this article

Gosain, A., Singh, J. Investigating structural metrics for understandability prediction of data warehouse multidimensional schemas using machine learning techniques. Innovations Syst Softw Eng 14, 59–80 (2018). https://doi.org/10.1007/s11334-017-0308-z

Received: 13 December 2016
Accepted: 05 December 2017
Published: 20 December 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s11334-017-0308-z
