Empirical Software Engineering, Volume 21, Issue 1, pp 72–103

Weighing lexical information for software clustering in the context of architecture recovery

  • Anna Corazza
  • Sergio Di Martino
  • Valerio Maggio
  • Giuseppe Scanniello

Abstract

In this paper, we present a software clustering approach that leverages the information conveyed by the zone in which each lexeme appears in the classes of object-oriented systems. We define six zones in the source code: Class Name, Attribute Name, Method Name, Parameter Name, Comment, and Source Code Statement. These zones may convey information with different levels of relevance, so their contributions should be weighted differently according to the software system under study. To this end, we define a probabilistic model of the lexeme distribution whose parameters are automatically estimated by the Expectation-Maximization algorithm. The zone weights are then used to compute similarities among source code classes, which are grouped by a k-Medoid clustering algorithm. To assess the validity of our solution for software architecture recovery, we applied our approach to 19 software systems from different application domains. We observed that using our probabilistic model and the defined zones improves the quality of the clustering results, bringing them close to a theoretical upper bound that we have proved.
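As a rough illustration of the pipeline outlined in the abstract, the following Python sketch builds zone-weighted lexeme vectors, compares classes, and groups them with a basic k-medoids loop. It is not the authors' implementation. The zone weights are hand-picked placeholders (in the paper they are estimated by Expectation-Maximization), cosine similarity is an assumed stand-in because the abstract does not name the similarity measure, and the toy classes, the names `weighted_vector`, `cosine`, `assign`, and `k_medoids`, and the initial medoids are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical zone weights; the paper estimates these per system via EM.
WEIGHTS = {
    "class_name": 3.0, "attribute_name": 1.5, "method_name": 2.0,
    "parameter_name": 1.0, "comment": 0.5, "statement": 1.0,
}

def weighted_vector(zone_lexemes, weights):
    """Merge per-zone lexeme occurrences into one weighted term vector."""
    vec = Counter()
    for zone, lexemes in zone_lexemes.items():
        for lexeme in lexemes:
            vec[lexeme] += weights[zone]
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (assumed measure)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update on a precomputed distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # New medoid: the member with the smallest total distance to the others.
            best = min(members, key=lambda m: sum(dist[m][j] for j in members)) if members else old
            updated.append(best)
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids

# Toy input: lexemes already extracted and split per zone (made-up data).
classes = {
    "OrderDao": {"class_name": ["order", "dao"], "method_name": ["save", "order"],
                 "comment": ["persist", "order"]},
    "OrderRepository": {"class_name": ["order", "repository"],
                        "method_name": ["find", "order"], "statement": ["query"]},
    "ButtonView": {"class_name": ["button", "view"], "method_name": ["paint"],
                   "comment": ["render", "button"]},
    "PanelView": {"class_name": ["panel", "view"], "method_name": ["paint", "panel"],
                  "attribute_name": ["button"]},
}
names = list(classes)
vectors = [weighted_vector(classes[n], WEIGHTS) for n in names]
dist = [[1.0 - cosine(u, v) for v in vectors] for u in vectors]
labels, medoids = k_medoids(dist, initial_medoids=[0, 2])
```

With these toy inputs, the two persistence classes share the heavily weighted lexeme "order" and the two view classes share "view" and "paint", so each pair ends up in its own cluster. Changing the entries of `WEIGHTS` changes how much each zone contributes to the similarity, which is the effect the estimated zone weights are meant to control.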

Keywords

Software understanding, Reengineering, Software clustering, Probabilistic model, Software maintenance

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Anna Corazza (1)
  • Sergio Di Martino (1)
  • Valerio Maggio (2)
  • Giuseppe Scanniello (3)

  1. Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Napoli, Italy
  2. Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, Fisciano, Italy
  3. Dipartimento di Matematica, Informatica e Economia, University of Basilicata, Potenza, Italy
