Segmentation of Software Engineering Datasets Using the M5 Algorithm

  • D. Rodríguez
  • J. J. Cuadrado
  • M. A. Sicilia
  • R. Ruiz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3994)


This paper reports an empirical study that uses clustering techniques to derive segmented models from software engineering repositories, with the aim of improving the accuracy of effort estimates. In particular, we used two datasets obtained from the International Software Benchmarking Standards Group (ISBSG) repository and created clusters using the M5 algorithm, associating a linear model with each cluster. We then compared the accuracy of the resulting estimates with that of classical multivariate linear regression and least median squares regression. The results show that clustering improves estimation accuracy. Furthermore, these techniques help us to understand the datasets better, offering advantages to project managers while keeping the estimation process within reasonable complexity.
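The approach the abstract describes, partitioning the dataset with a tree and fitting one linear model per partition, rather than a single global regression, can be illustrated with a minimal sketch. This is not the paper's code: it uses synthetic project data, and scikit-learn's `DecisionTreeRegressor` stands in for the M5 splitting step (M5 itself, as in Weka's `M5P`, builds the model tree directly).

```python
# Hedged sketch of segmented estimation: partition with a shallow
# regression tree, fit a linear model per leaf, and compare against a
# single global linear model. Synthetic data; not the paper's setup.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic "projects": effort depends on size with two regimes,
# mimicking a heterogeneous repository such as ISBSG.
size = rng.uniform(10, 1000, 400).reshape(-1, 1)
effort = np.where(size < 300, 2.0 * size, 5.0 * size - 900).ravel()
effort += rng.normal(0, 30, 400)

# Step 1: partition the instance space with a shallow tree
# (stand-in for the M5 splitting criterion).
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30)
tree.fit(size, effort)
leaves = tree.apply(size)

# Step 2: fit one linear model per leaf -- the segmented model.
models = {leaf: LinearRegression().fit(size[leaves == leaf],
                                       effort[leaves == leaf])
          for leaf in np.unique(leaves)}

def predict_segmented(X):
    """Route each instance to its leaf and apply that leaf's model."""
    return np.array([models[l].predict(x.reshape(1, -1))[0]
                     for l, x in zip(tree.apply(X), X)])

# Step 3: compare with a single global linear regression.
global_lm = LinearRegression().fit(size, effort)
mae_seg = np.mean(np.abs(predict_segmented(size) - effort))
mae_glob = np.mean(np.abs(global_lm.predict(size) - effort))
print(f"MAE, global linear model:    {mae_glob:.1f}")
print(f"MAE, segmented (per-leaf):   {mae_seg:.1f}")
```

On data with distinct regimes, the per-leaf linear models track each regime and the segmented error comes out well below that of the global fit, which is the kind of accuracy gain the study reports for the ISBSG datasets.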


Keywords: Effort estimation, Data mining, Trees, M5



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • D. Rodríguez (1)
  • J. J. Cuadrado (2)
  • M. A. Sicilia (2)
  • R. Ruiz (3)
  1. The University of Reading, Reading, UK
  2. The University of Alcalá, Alcalá de Henares (Madrid), Spain
  3. The University of Seville, Sevilla, Spain
