An ensemble-based model for predicting agile software development effort

Empirical Software Engineering

Abstract

To support agile software development projects, an array of tools and systems is available to plan, design, track, and manage the development process. In this paper, we explore a critical aspect of agile development, namely effort prediction, that cuts across these tools and agile project teams. Accurate effort prediction can improve sprint planning by enabling optimal assignments of both stories and developers. We develop a model for story-effort prediction using variables that are readily available when a story is created, and we use seven predictive algorithms to predict a story’s effort. Interestingly, no single predictive algorithm consistently outperforms the others in predicting story effort across our test data of 423 stories. We therefore develop an ensemble-based method, built on our model, for predicting story effort. We conduct computational experiments to show that our ensemble-based approach performs better than other ensemble-based benchmark approaches. We then demonstrate the practical application of our predictive model and our ensemble-based approach by optimizing sprint planning for two projects from our dataset using an optimization model.

Notes

  1. We thank the anonymous reviewer for pointing this out.

  2. Dejaeger et al. (2012) identify 13 data mining techniques used for software effort estimation in a traditional setting. In this paper, we identify representative techniques (regression, decision trees, support vector machines, neural networks, Bayesian networks) that perform better than their variants and augment them with newer data mining techniques (k-nearest neighbor and ensemble approaches). For example, a radial kernel outperformed other kernels in support vector machines.

  3. Prior effort estimation studies have considered different variants of the algorithms considered in this study. Readers are referred to studies by Jørgensen and Shepperd (2007), Wen et al. (2012), and Idri et al. (2016) for a review of candidate algorithms in traditional software development projects.

  4. Log transformation of the dependent variable led to worse performance.

  5. We thank the anonymous reviewer for pointing this out.

  6. The measure has four categories: (a) negligible effect (|d| < 0.147), (b) small effect (|d| < 0.33), (c) medium effect (|d| < 0.474), and (d) large effect (|d| > 0.474).

  7. To facilitate practical interpretation, we also provide the Vargha and Delaney (2000) statistic (\(\hat {A_{12}}\)) for each pair of predictive algorithms.

  8. Increasing β leads to higher computation costs. We chose 8.0 as the maximum value based on our experimental results; values greater than 8.0 did not significantly improve the results.

References

  • Abrahamsson P, Salo O, Ronkainen J, Warsta J (2002) Agile software development methods: Review and analysis. Report, VTT

  • Abrahamsson P, Moser R, Pedrycz W, Sillitti A, Succi G (2007) Effort prediction in iterative software development processes - incremental versus global prediction models. In: International Symposium on Empirical Software Engineering and Measurement, pp 344–353

  • Abrahamsson P, Fronza I, Moser R, Vlasenko J, Pedrycz W (2011) Predicting development effort from user stories. In: International Symposium on Empirical Software Engineering and Measurement, pp 400–403

  • Aggarwal C (2015) Data Mining: The Textbook. Springer, New York

  • Azhar D, Riddle P, Mendes E, Mittas N, Angelis L (2013) Using ensembles for web effort estimation. In: ACM/IEEE International Symposium on Empirical Software Engineering and Measurement

  • Azzeh M, Nassif AB, Minku L (2015) An empirical evaluation of ensemble adjustment methods for analogy-based effort estimation. The Journal of Systems and Software 103:36–52

  • Bayley S, Falessi D (2018) Optimizing prediction intervals by tuning random forest via meta-validation. arXiv:1801.07194

  • Beck K, Andres C (2004) Extreme Programming Explained: Embrace Change. Addison-Wesley, Reading

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological) 57 (1):289–300

  • Bergmeir C, Benitez JM (2011) Forecaster performance evaluation with cross-validation and variants. In: 11th International Conference on Intelligent Systems Design and Applications (ISDA). IEEE, pp 849–854

  • Chari K, Agrawal M (2018) Impact of incorrect and new requirements on waterfall software project outcomes. Empir Softw Eng 23(1):165–185

  • Chowdhury S, Di Nardo S, Hindle A, Jiang ZMJ (2018) An exploratory study on assessing the energy impact of logging on android applications. Empir Softw Eng 23(3):1422–1456

  • Cinnéide MÓ, Moghadam IH, Harman M, Counsell S, Tratt L (2017) An experimental search-based approach to cohesion metric evaluation. Empir Softw Eng 22(1):292–329

  • Conboy K (2009) Agility from first principles: Reconstructing the concept of agility in information systems development. Inf Syst Res 20(3):329–354

  • Dejaeger K, Verbeke W, Martens D, Baesens B (2012) Data mining techniques for software effort estimation: A comparative study. IEEE Trans Softw Eng 38(2):375–97

  • Grenning J (2002) Planning poker or how to avoid analysis paralysis while release planning. Report, Hawthorn Woods: Renaissance Software Consulting

  • Hastie T, Tibshirani R, Friedman J (2008) The Elements of Statistical Learning. Springer, New York

  • Haugen NC (2006) An empirical study of using planning poker for user story estimation. In: Agile Conference, 2006. IEEE, 9 pp

  • Hearty P, Fenton N, Marquez D, Neil M (2009) Predicting project velocity in XP using a learning dynamic bayesian network model. IEEE Trans Softw Eng 35 (1):124–137

  • Hussain I, Kosseim L, Ormandjieva O (2013) Approximation of cosmic functional size to support early effort estimation in agile. Data and Knowledge Engineering 85:2–14

  • Idri A, Hosni M, Abran A (2016) Systematic literature review of ensemble effort estimation. J Syst Softw 118:151–175

  • Jahedpari F (2016) Artificial prediction markets for online prediction of continuous variables. PhD thesis, University of Bath, Bath

  • James G, Witten D, Hastie T, Tibshirani R (2015) An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics. Springer, New York

  • Jonsson L, Borg M, Broman D, Sandahl K, Eldh S, Runeson P (2016) Automated bug assignment: Ensemble-based machine learning in large scale industrial contexts. Empir Softw Eng 21(4):1533–1578

  • Jørgensen M, Shepperd M (2007) A systematic review of software development cost estimation studies. IEEE Trans Softw Eng 33(1):33–53

  • Karner G (1993) Resource estimation for objectory projects. Objective Systems SF AB, p 17

  • Kocaguneli E, Menzies T, Keung JW (2012) On the value of ensemble effort estimation. IEEE Trans Softw Eng 38(6):1403–1416

  • Kultur Y, Turhan B, Bener AB (2008) ENNA: software effort estimation using ensemble of neural networks with associative memory. In: 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering

  • Lee D (2016) Alternatives to p value: confidence interval and effect size. Korean Journal of Anesthesiology 69(6):555–562

  • Li Y, Yue T, Ali S, Zhang L (2017) Zen-ReqOptimizer: A search-based approach for requirements assignment optimization. Empir Softw Eng 22(1):175–234

  • Logue K, McDaid K, Greer D (2007) Allowing for task uncertainties and dependencies in agile release planning. In: 4th Proceedings of the Software Measurement European Forum, pp 275–284

  • Lokan C, Mendes E (2014) Investigating the use of duration-based moving windows to improve software effort prediction: A replicated study. Inf Softw Technol 56(9):1063–1075

  • MacDonell SG, Shepperd M (2010) Data accumulation and software effort prediction. In: ACM-IEEE International Symposium on Empirical Software Engineering and Measurement

  • Magazinius A, Börjesson S, Feldt R (2012) Investigating intentional distortions in software cost estimation–an exploratory study. J Syst Softw 85(8):1770–1781

  • Mahnic V, Hovelja T (2012) On using planning poker for estimating user stories. The Journal of Systems and Software 85:2086–2095

  • Minku L, Yao X (2013) Ensembles and locality: Insight on improving software effort estimation. Inf Softw Technol 55(8):1512–1528

  • Miyazaki Y, Takanou A, Nozaki H, Nakagawa N, Okada K (1991) Method to estimate parameter values in software prediction models. Inf Softw Technol 33:239–243

  • Neill J (2008) Why use effect sizes instead of significance testing in program evaluation? http://www.wilderdom.com/research/effectsizes.html, accessed: 2018-07

  • Nunes N, Constantine L, Kazman R (2011) iUCP: Estimating interactive-software project size with enhanced use-case points. IEEE Software 28(04):64–73

  • Palmer S, Felsing J (2002) A Practical Guide to Feature-driven Development. Prentice Hall, Upper Saddle River

  • Papatheocharous E, Papadopoulos H, Andreou A (2010) Feature subset selection for software cost modelling and estimation. Eng Intell Syst 18:233–246

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

  • Pendharkar P, Subramanian G, Rodger J (2005) A probabilistic model for predicting software development effort. IEEE Trans Softw Eng 31(7):615–624

  • Perols J, Chari K, Agrawal M (2009) Information market-based decision fusion. Manag Sci 55(5):827–842

  • Pikkarainen M, Haikara J, Salo O, Abrahamsson P, Still J (2008) The impact of agile practices on communication in software development. Empir Softw Eng 13(3):303–337

  • Santana C, Leoneo F, Vasconcelos A, Gusmão C (2011) Using function points in agile projects. In: International Conference on Agile Software Development. Springer, pp 176–191

  • Schwaber K, Sutherland J (2016) The Scrum Guide (2013). Available at: http://www.scrumguides.org/docs/scrumguide/v1/scrum-guide-us.pdf (accessed 28 April 2016)

  • Shmueli G, Bruce P, Patel N (2016) Data Mining for Business Analytics: Concepts, Techniques, and Applications with XLMiner. Wiley, Hoboken

  • Stapleton J (1997) Dynamic systems development method. Addison-Wesley, Boston

  • Usman M, Mendes E, Weidt F, Britto R (2014) Effort estimation in agile software development: a systematic literature review. In: 10th International Conference on Predictive Models in Software Engineering, pp 82–91

  • Usman M, Mendes E, Börstler J (2015) Effort estimation in agile software development: a survey on the state of the practice. In: 19th International Conference on Evaluation and Assessment in Software Engineering

  • Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132

  • VersionOne (2016) 10th annual state of agile report. Technical report

  • Vidgen R, Wang X (2009) Coevolving systems and the organization of agile software development. Inf Syst Res 20(3):355–376

  • Wen J, Li S, Lin Z, Hu Y, Huang C (2012) Systematic literature review of machine learning based software development effort estimation models. Inf Softw Technol 54(1):41–59

  • Wolpert DH (1992) Stacked generalization. Neural networks 5(2):241–259

Acknowledgements

This paper has benefited from the feedback received at the Workshop on Information Technology and Systems (WITS) 2014 (Auckland) and WITS 2016 (Dublin), where preliminary versions of the paper were presented. Many individuals helped us with this research project. First, we would like to thank the individuals at our data site who helped us gain access to the dataset. We would also like to thank Dr. Terry Sincich for helping us with the design and choice of statistical tests, the developers and project managers who shared their insights on the implications of this research for practice, and Dr. Patricia Nickinson for proofreading and editing our draft. Finally, we would like to express our sincere gratitude to the three anonymous reviewers and the editors for providing constructive feedback on our earlier submission.

Author information

Corresponding author

Correspondence to Onkar Malgonde.

Additional information

Communicated by: Martin Shepperd

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

The K-Nearest Neighbor (KNN) algorithm uses the k closest data points in the training data (determined by a distance metric, Euclidean in our case), where k is provided by the user, to determine its prediction. Each neighbor of a target point can influence the prediction equally (uniform weighting) or in proportion to the inverse of its distance from the target point (distance-based weighting). For our experiments, the parameter optimized was the number of neighbors (k). An exhaustive search from 2 neighbors to the maximum possible in the training dataset was used; the number of neighbors that minimized prediction error was 14. Further, we used distance-based weighting so that closer points had a higher influence than those farther away.
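
The following is a minimal sketch (not the authors' code) of such a distance-weighted KNN regressor using scikit-learn; X_train, y_train, and X_test are placeholders for story features and efforts, and the grid bounds and cross-validation folds shown are illustrative assumptions.

```python
# Hedged sketch of a distance-weighted KNN regressor with k tuned by grid search.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

knn = KNeighborsRegressor(weights="distance", metric="euclidean")
param_grid = {"n_neighbors": range(2, 50)}   # the paper searches up to the training-set size
search = GridSearchCV(knn, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)                 # reported optimum: k = 14
effort_pred = search.predict(X_test)
```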

Decision trees (DT) represent a set of hierarchical decisions on the feature variables of the training data. A decision at a particular node, called a split criterion, is conditional on one or more feature variables. Different criteria can be used to measure the quality of a split; in our case, mean absolute error was used. As the depth of a tree increases, the tree overfits the training data, i.e., every instance can end up being classified by a dedicated leaf node of its own. However, such trees perform poorly on test datasets. To limit overfitting, the maximum depth of the tree was optimized and found to be 2. We considered depth values from 1 to 199.
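
A hedged scikit-learn sketch of this setup follows; the data names and cross-validation choice are assumptions rather than the authors' implementation.

```python
# Sketch of a regression tree with an MAE split criterion and a tuned maximum depth.
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

tree = DecisionTreeRegressor(criterion="absolute_error")  # "mae" in older scikit-learn
search = GridSearchCV(tree, {"max_depth": range(1, 200)},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)   # the paper reports an optimal depth of 2
```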

Ridge regression (RR) minimizes the residual sum of squares (RSS) plus a shrinkage penalty whose weight is controlled by a tuning parameter. As the tuning parameter approaches zero, ridge regression reduces to a least squares model. As the tuning parameter increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. The optimal value of the tuning parameter was found to be 3.51. The tuning parameter was varied from 0.0001 to 50 in increments of 0.001.
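
A minimal sketch of this tuning, assuming the shrinkage parameter corresponds to scikit-learn's alpha; X_train and y_train are placeholders.

```python
# Sketch of ridge regression with the shrinkage parameter tuned over a dense grid.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

alphas = np.arange(0.0001, 50, 0.001)        # grid reported in the paper
search = GridSearchCV(Ridge(), {"alpha": alphas},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)                 # reported optimum: alpha = 3.51
```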

A Bayesian network (BN) learns the joint probability distribution of the target variable using a causal model that provides the prior and posterior probabilities of the variables. New information can be incorporated into the model using belief-updating procedures. Bayesian networks are popular in the machine learning and software effort prediction literature due to their accuracy (Hearty et al. 2009). We implemented the procedure in Pendharkar et al. (2005), which provides point predictions for a target story.

A Support Vector Machine (SVM) is a generalization of the maximal margin classifier, which identifies a separating hyperplane (a (p-1)-dimensional flat subspace in a p-dimensional space) that separates the observations into classes. However, this approach is limited to cases where a perfectly separating hyperplane exists. Support vector classifiers address this problem by identifying soft hyperplanes that almost separate the classes. Data points that lie on or within the margin are known as support vectors; these points determine the support vector classifier's performance. Tolerance for points on the wrong side of the hyperplane can be tuned by a parameter C; interpreted as a budget for margin violations, a greater value of C increases the tolerance for data points on the wrong side. The decision boundary can be linear or nonlinear. Support Vector Regression (SVR), a nonparametric method, uses kernel functions to identify the decision boundary. Kernel functions can be linear, polynomial, radial, or sigmoid, among others. In our experiments, the parameter C was identified as 1.9 using a Radial Basis Function (RBF) kernel whose width was identified as 0.3. We used a multi-level grid search to tune the parameters. Possible values for C were varied from 0.01 to 10, whereas values for the kernel width were varied from 0.1 to 10.
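
A hedged sketch of the corresponding SVR grid search follows; mapping the reported kernel width onto scikit-learn's gamma parameter, the grid resolution, and the data names are assumptions.

```python
# Sketch of RBF-kernel support vector regression with C and the kernel width tuned.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {"C": np.linspace(0.01, 10, 40),      # illustrative grid resolution
              "gamma": np.linspace(0.1, 10, 40)}   # gamma stands in for the kernel width
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)   # paper reports C = 1.9 and kernel width 0.3
```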

In an Artificial Neural Network (ANN), derived features are created from non-linear combinations of the input variables (input nodes). The target variable to be predicted, in turn, is modeled as a linear function of the derived features. The units computing the derived features are known as hidden nodes, since they are not directly visible to the user. Each hidden node transforms its inputs using a weighted linear summation followed by a non-linear activation. A regularization parameter is used to limit overfitting of the neural network to the training data. In our experiments, the regularization parameter was determined to be 0.000081. The regularization parameter was varied from 1.00E-06 to 1 in increments of 1.00E-05.
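
As an illustration only, the sketch below tunes the L2 regularization term (alpha) of scikit-learn's MLPRegressor; the hidden-layer architecture is left at library defaults, which the paper does not specify here, and a coarser logarithmic grid replaces the paper's fine linear scan to keep the sketch cheap to run.

```python
# Sketch of tuning the neural network's regularization parameter.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

alphas = np.logspace(-6, 0, 25)                   # paper scans 1e-6 to 1 in 1e-5 steps
search = GridSearchCV(MLPRegressor(max_iter=2000), {"alpha": alphas},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)                      # reported optimum: alpha = 0.000081
```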

In a Random Forest (RF), a number of decision trees are fitted to subsamples of the training dataset. Each time a split is made within these trees, a different random subset of predictors is chosen as split candidates. This produces decorrelated trees, which reduces the overall variance. For our experiments, the random forest was optimized to 5 trees with 4 features considered when determining the best split. Possible values for the number of trees and the number of features were each varied from 1 to 20, with all possible combinations considered to identify the optimal parameter values.
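
A minimal scikit-learn sketch of this search, with placeholder data names; the grid is capped at the number of available story features so the sketch remains valid for small feature sets.

```python
# Sketch of a random forest with the number of trees and split features tuned jointly.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

n_feats = X_train.shape[1]                       # number of story features
param_grid = {"n_estimators": range(1, 21),
              "max_features": range(1, min(21, n_feats + 1))}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)   # paper reports 5 trees and 4 split features
```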

Extra Trees (ET), similar to RF, fits multiple randomized decision trees to the dataset. However, each tree is trained on the entire dataset rather than a bootstrap sample, and split thresholds are chosen at random. The number of trees was optimized to 5, with 4 features considered when determining the best split. Possible values for the number of trees and the number of features were each varied from 1 to 20, with all possible combinations considered to identify the optimal parameter values.
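
Relative to the random-forest sketch above, only the estimator class changes; the grid and data names remain illustrative assumptions.

```python
# Sketch of extremely randomized trees: no bootstrapping, random split thresholds.
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": range(1, 21),
              "max_features": range(1, min(21, X_train.shape[1] + 1))}
search = GridSearchCV(ExtraTreesRegressor(bootstrap=False, random_state=0), param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)   # paper reports 5 trees and 4 split features
```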

Adaptive Boosting (AB) works by incrementally focusing on cases that are difficult to predict. A base estimator (in our case, a decision tree) is first fitted to the original dataset. Adjusted estimators are then fitted incrementally such that the weights of incorrectly predicted data points are increased. In our experiments, boosting was terminated after two estimators were fitted to the data. Possible values considered ranged from 1 to 30 estimators.
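
A hedged sketch of this setup; scikit-learn's AdaBoostRegressor uses a shallow decision tree as its default base estimator, and the cross-validation choice and data names are assumptions.

```python
# Sketch of AdaBoost regression with the number of boosting rounds tuned.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(AdaBoostRegressor(random_state=0),
                      {"n_estimators": range(1, 31)},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)   # paper reports boosting stopping after 2 estimators
```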

Gradient Boosting (GB) optimizes a least squares loss function by sequentially adding weak learners (decision trees) to an additive model. As new trees are added at each boosting stage, the existing trees are not changed. The optimal number of boosting stages was identified as 28. The range of stages considered was 1 to 60.
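
A minimal sketch of tuning the number of boosting stages in scikit-learn, with placeholder data names; the default least-squares loss is assumed to match the description above.

```python
# Sketch of gradient boosting with the number of boosting stages tuned over 1-60.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      {"n_estimators": range(1, 61)},
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)   # paper reports 28 boosting stages
```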

Stacking (ST) is an ensemble approach that aggregates individual first-level algorithms with a second-level algorithm (often known as the meta-regressor). The individual first-level algorithms provide predictions, which form the input to the meta-regressor. The five algorithms selected in our experiments formed the first-level algorithms, with their respective hyperparameters identified in the experiments. We used a Support Vector Machine with an RBF kernel as the meta-regressor.
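
The sketch below illustrates such a stacked ensemble with scikit-learn's StackingRegressor; the particular choice of five first-level algorithms is an assumption (the paper does not list them in this appendix), their hyperparameters are taken from the values reported above, and the RBF-kernel SVR serves as the meta-regressor.

```python
# Sketch of stacking: first-level predictions feed an RBF-kernel SVR meta-regressor.
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

first_level = [                                   # illustrative selection of algorithms
    ("knn",   KNeighborsRegressor(n_neighbors=14, weights="distance")),
    ("tree",  DecisionTreeRegressor(max_depth=2)),
    ("ridge", Ridge(alpha=3.51)),
    ("svr",   SVR(kernel="rbf", C=1.9, gamma=0.3)),
    ("rf",    RandomForestRegressor(n_estimators=5, max_features=4)),
]
stack = StackingRegressor(estimators=first_level,
                          final_estimator=SVR(kernel="rbf"))
stack.fit(X_train, y_train)
effort_pred = stack.predict(X_test)
```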

For an extensive discussion of each of these algorithms, readers are referred to Hastie et al. (2008), James et al. (2015), Aggarwal (2015), and Shmueli et al. (2016).

Appendix B

Tables in this appendix (Tables 24, 25, and 26) provide effect sizes with Cliff’s delta, 95% confidence intervals, and the Vargha and Delaney (2000) (\(\hat {A_{12}}\)) statistic (VDA), which compares the probability of yielding a lower absolute error for each pair of individual predictive algorithms. For each effect size computation, we consider the error values across the entire test dataset (N = 423). Reversing the order of the algorithms yields the same Cliff’s delta multiplied by -1; similarly, the confidence interval bounds are reversed with opposite signs. The VDA statistic for the reverse order of the algorithms is VDAreverse = |1 − VDAoriginal|. For example, the effect size of RR and SVM is 0.10752, the confidence interval is [0.02946, 0.18394], and the VDA statistic is 0.553759. Statistically significant pairs are highlighted. A sketch of how these statistics can be computed follows the tables below.

Table 24 Effect Sizes for Individual Predictive Algorithms (MAE)
Table 25 Effect Sizes for Individual Predictive Algorithms (MBE)
Table 26 Effect Sizes for Individual Predictive Algorithms (RMSE)
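
The sketch below (not the authors' code) shows one way to compute Cliff's delta and the \(\hat {A_{12}}\) statistic from the per-story absolute errors of two algorithms; err_a and err_b are placeholders, the confidence intervals reported in the tables are not reproduced here, and the direction convention may differ from the tables'. Note the reversal properties mentioned above follow directly: swapping the inputs negates delta and maps the VDA value to 1 minus itself.

```python
# Sketch of Cliff's delta and the Vargha-Delaney A12 from two arrays of absolute errors.
import numpy as np

def cliffs_delta_and_vda(err_a, err_b):
    a = np.asarray(err_a)[:, None]           # errors of algorithm A (column vector)
    b = np.asarray(err_b)[None, :]           # errors of algorithm B (row vector)
    greater = np.sum(a > b)                  # pairs where A's error exceeds B's
    less = np.sum(a < b)
    ties = np.sum(a == b)
    n_pairs = a.size * b.size
    delta = (greater - less) / n_pairs       # Cliff's delta, in [-1, 1]
    vda = (greater + 0.5 * ties) / n_pairs   # A12: P(A's error > B's error), ties split
    return delta, vda
```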

Appendix C

Tables in this appendix (Tables 27, 28, and 29) provide effect sizes with Cliff’s delta, 95% confidence intervals, and the Vargha and Delaney (2000) (\(\hat {A_{12}}\)) statistic (VDA) for the ensemble algorithms. For each effect size computation, we consider the error values across the entire test dataset (N = 423). Statistically significant pairs are highlighted.

Table 27 Effect Sizes for Ensemble Algorithms (MAE)
Table 28 Effect Sizes for Ensemble Algorithms (MBE)
Table 29 Effect Sizes for Ensemble Algorithms (RMSE)

Cite this article

Malgonde, O., Chari, K. An ensemble-based model for predicting agile software development effort. Empir Software Eng 24, 1017–1055 (2019). https://doi.org/10.1007/s10664-018-9647-0
