Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability

  • Anna Corazza
  • Sergio Di Martino
  • Valerio Maggio
  • Alessandro Moschitti
  • Andrea Passerini
  • Giuseppe Scanniello
  • Fabrizio Silvestri
Part of the Communications in Computer and Information Science book series (CCIS, volume 379)

Abstract

In this paper, we investigate some ideas based on Machine Learning, Natural Language Processing, and Information Retrieval to outline possible research directions in the field of software architecture recovery and clone detection. In particular, after presenting an extensive related work, we illustrate two proposals for addressing these two issues, that represent hot topics in the field of Software Maintenance. Both proposals use Kernel Methods for exploiting structural representation of source code and to automate the detection of clones and the recovery of the actually implemented architecture in a subject software system.

Keywords

Information Retrieval Natural Language Processing Machine Learning Software Maintenance and Evolution 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Anquetil, N., Fourrier, C., Lethbridge, T.C.: Experiments with clustering as a software remodularization method. In: Proceedings of the 6th Working Conference on Reverse Engineering, pp. 235–255. IEEE Computer Society, Washington, DC (1999)Google Scholar
  2. 2.
    Baker, B.: On finding duplication and near-duplication in large software systems. In: IEEE Proceedings of the Working Conference on Reverse Engineering (1995)Google Scholar
  3. 3.
    Baxter, I.D., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: Proceedings of the International Conference on Software Maintenance, pp. 368–377. IEEE Press (1998)Google Scholar
  4. 4.
    Bellon, S., Koschke, R., Antoniol, G., Krinke, J., Merlo, E.M.: Comparison and evaluation of clone detection tools. IEEE Trans. Software Eng., 577–591 (September 2007)Google Scholar
  5. 5.
    Bittencourt, R.A., Guerrero, D.D.S.: Comparison of graph clustering algorithms for recovering software architecture module views. In: Proceedings of the European Conference on Software Maintenance and Reengineering, pp. 251–254. IEEE Computer Society, Washington, DC (2009), http://portal.acm.org/citation.cfm?id=1545011.1545446 Google Scholar
  6. 6.
    Bulychev, P., Minea, M.: Duplicate code detection using anti-unification. In: Spring/Summer Young Researcher’s Colloquium (2008)Google Scholar
  7. 7.
    Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.: Investigating the use of lexical information for software system clustering. In: Proceedings of the 15th European Conference on Software Maintenance and Reengineering, CSMR 2011, pp. 35–44. IEEE Computer Society, Washington, DC (2011), http://dx.doi.org/10.1109/CSMR.2011.8 CrossRefGoogle Scholar
  8. 8.
    Corazza, A., Di Martino, S., Scanniello, G.: A probabilistic based approach towards software system clustering. In: Proceedings of the European Conference on Software Maintenance and Reengineering, pp. 88–96 (2010)Google Scholar
  9. 9.
    Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.: A tree kernel based approach for clone detection. In: Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM 2010, pp. 1–5. IEEE Computer Society, Washington, DC (2010), http://dx.doi.org/10.1109/ICSM.2010.5609715 CrossRefGoogle Scholar
  10. 10.
    Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.: Combining machine learning and information retrieval techniques for software clustering. In: Moschitti, A., Scandariato, R. (eds.) EternalS 2011. CCIS, vol. 255, pp. 42–60. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  11. 11.
    Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.49.7546 CrossRefGoogle Scholar
  12. 12.
    Doval, D., Mancoridis, S., Mitchell, B.S.: Automatic clustering of software systems using a genetic algorithm. In: Proceedings of the Software Technology and Engineering Practice, pp. 73–82. IEEE Computer Society, Washington, DC (1999), http://portal.acm.org/citation.cfm?id=829540.832036 Google Scholar
  13. 13.
    Ducasse, S., Pollet, D.: Software architecture reconstruction: A process-oriented taxonomy. IEEE Transactions on Software Engineering 35(4), 573–591 (2009)CrossRefGoogle Scholar
  14. 14.
    Ducasse, S., Rieger, M., Demeyer, S.: A language independent approach for detecting duplicated code. In: Proceedings of the International Conference on Software Maintenance, pp. 109–118 (1999)Google Scholar
  15. 15.
    Finley, T., Joachims, T.: Supervised clustering with support vector machines. In: Proceedings of the 22nd International Conference on Machine Learning, ICML 2005, pp. 217–224. ACM, New York (2005), http://doi.acm.org/10.1145/1102351.1102379 Google Scholar
  16. 16.
    Frasconi, P., Passerini, A.: Learning with kernels and logical representations. In: De Raedt, L., Frasconi, P., Kersting, K., Muggleton, S. (eds.) Probabilistic Inductive Logic Programming. LNCS (LNAI), vol. 4911, pp. 56–91. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  17. 17.
    Gabel, M., Jiang, L., Su, Z.: Scalable detection of semantic clones. In: Proceedings of the 30th International Conference on Software Engineering, ICSE 2008, pp. 321–330. ACM, New York (2008), http://doi.acm.org/10.1145/1368088.1368132 Google Scholar
  18. 18.
    Garlan, D.: Software architecture: a roadmap. In: Proceedings of the Conference on the Future of Software Engineering, ICSE 2000, pp. 91–101. ACM, New York (2000), http://doi.acm.org/10.1145/336512.336537 Google Scholar
  19. 19.
    Gönen, M., Alpaydin, E.: Multiple kernel learning algorithms. J. Mach. Learn. Res., 2211–2268 (July 2011)Google Scholar
  20. 20.
    Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. Annals of Statistics 36(3), 1171–1220 (2008), http://www.projecteuclid.org/DPubS?verb=Displayversion=1.0service=UIhandle=euclid.aos/1211819561page=record MathSciNetCrossRefMATHGoogle Scholar
  21. 21.
    Jiang, L., Misherghi, G., Su, Z., Glondu, S.: Deckard: Scalable and accurate tree-based detection of code clones. In: Proceedings of the 29th International Conference on Software Engineering, ICSE 2007, pp. 96–105. IEEE Computer Society, Washington, DC (2007), http://dx.doi.org/10.1109/ICSE.2007.30 Google Scholar
  22. 22.
    Johnson, J.H.: Identifying redundancy in source code using fingerprints. In: Proc. Conf. Centre for Advanced Studies on Collaborative Research (CASCON), pp. 171–183. IBM Press (1993)Google Scholar
  23. 23.
    Kamiya, T., Kusumoto, S., Inoue, K.: Ccfinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Software Eng. 28(7), 654–670 (2002)CrossRefGoogle Scholar
  24. 24.
    Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46, 604–632 (1999), http://doi.acm.org/10.1145/324133.324140 MathSciNetCrossRefMATHGoogle Scholar
  25. 25.
    Komondoor, R., Horwitz, S.: Using slicing to identify duplication in source code. In: Cousot, P. (ed.) SAS 2001. LNCS, vol. 2126, pp. 40–56. Springer, Heidelberg (2001)CrossRefGoogle Scholar
  26. 26.
    Koschke, R.: Atomic architectural component recovery for program understanding and evolution. Softwaretechnik-Trends (2000), http://www.iste.uni-stuttgart.de/ps/rainer/thesis
  27. 27.
    Koschke, R., Falke, R., Frenzel, P.: Clone detection using abstract syntax suffix trees. In: WCRE 2006: Proceedings of the 13th Working Conference on Reverse Engineering, pp. 253–262. IEEE Computer Society, Washington, DC (2006)CrossRefGoogle Scholar
  28. 28.
    Krinke, J.: Identifying Similar Code with Program Dependence Graphs. In: Proc. Working Conf. Reverse Engineering (WCRE), pp. 301–309. IEEE Computer Society Press (2001)Google Scholar
  29. 29.
    Kuhn, A., Ducasse, S., Gírba, T.: Semantic clustering: Identifying topics in source code. Information and Software Technology 49, 230–243 (2007), http://portal.acm.org/citation.cfm?id=1224560.1224698 CrossRefGoogle Scholar
  30. 30.
    Landwehr, N., Passerini, A., Raedt, L., Frasconi, P.: Fast learning of relational kernels. Mach. Learn. 78(3), 305–342 (2010), http://dx.doi.org/10.1007/s10994-009-5163-1 MathSciNetCrossRefGoogle Scholar
  31. 31.
    Lehman, M.M.: Programs, life cycles, and laws of software evolution. Proc. IEEE 68(9), 1060–1076 (1980)CrossRefGoogle Scholar
  32. 32.
    Leitão, A.M.: Detection of redundant code using r2d2. Software Quality Journal 12(4), 361–382 (2004)CrossRefGoogle Scholar
  33. 33.
    Maletic, J.I., Marcus, A.: Supporting program comprehension using semantic and structural information. In: Proceedings of the 23rd International Conference on Software Engineering, ICSE 2001, pp. 103–112. IEEE Computer Society, Washington, DC (2001), http://portal.acm.org/citation.cfm?id=381473.381484 CrossRefGoogle Scholar
  34. 34.
    Maqbool, O., Babri, H.: Hierarchical clustering for software architecture recovery. IEEE Transactions on Software Engineering 33(11), 759–780 (2007)CrossRefGoogle Scholar
  35. 35.
    Menchetti, S., Costa, F., Frasconi, P.: Weighted decomposition kernels. In: Proceedings of the 22nd International Conference on Machine Learning, ICML 2005, pp. 585–592. ACM, New York (2005), http://doi.acm.org/10.1145/1102351.1102425 Google Scholar
  36. 36.
    Mitchell, B.S., Mancoridis, S.: On the automatic modularization of software systems using the bunch tool. IEEE Transactions on Software Engineering 32, 193–208 (2006), http://portal.acm.org/citation.cfm?id=1128600.1128815 CrossRefGoogle Scholar
  37. 37.
    Moschitti, A., Basili, R., Pighin, D.: Tree Kernels for Semantic Role Labeling. In: Computational Linguistics, pp. 193–224. MIT Press, Cambridge (2008)Google Scholar
  38. 38.
    Risi, M., Scanniello, G., Tortora, G.: Using fold-in and fold-out in the architecture recovery of software systems. Formal Asp. Comput. 24(3), 307–330 (2012)CrossRefGoogle Scholar
  39. 39.
    Roy, C.K., Cordy, J.R.: Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In: ICPC, pp. 172–181 (2008)Google Scholar
  40. 40.
    Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Sci. Comput. Program. 74(7), 470–495 (2009)MathSciNetCrossRefMATHGoogle Scholar
  41. 41.
    Scanniello, G., D’Amico, A., D’Amico, C., D’Amico, T.: Architectural layer recovery for software system understanding and evolution. Software Practice and Experience 40, 897–916 (2010), http://dx.doi.org/10.1002/spe.v40:10 CrossRefGoogle Scholar
  42. 42.
    Scanniello, G., D’Amico, A., D’Amico, C., D’Amico, T.: Using the kleinberg algorithm and vector space model for software system clustering. In: Proceedings of the IEEE 18th International Conference on Program Comprehension, ICPC 2010, pp. 180–189. IEEE Computer Society, Washington, DC (2010), http://dx.doi.org/10.1109/ICPC.2010.17 CrossRefGoogle Scholar
  43. 43.
    Tzerpos, V., Holt, R.C.: On the stability of software clustering algorithms. In: Proceedings of the 8th International Workshop on Program Comprehension, pp. 211–218 (2000)Google Scholar
  44. 44.
    Vert, J.P.: A Tree Kernel to analyse phylogenetic profiles. Bioinformatics 18(suppl. 1), S276–S284 (2002)Google Scholar
  45. 45.
    Wahler, V., Seipel, D., von Gudenberg, J.W., Fischer, G.: Clone detection in source code by frequent itemset techniques. In: SCAM 2004: Proceedings of the Fourth IEEE International Workshop on Source Code Analysis and Manipulation, pp. 128–135. IEEE Computer Society, Washington, DC (2004)CrossRefGoogle Scholar
  46. 46.
    Wiggerts, T.A.: Using clustering algorithms in legacy systems remodularization. In: Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE 1997), pp. 33–43. IEEE Computer Society, Washington, DC (1997), http://portal.acm.org/citation.cfm?id=832304.836999 CrossRefGoogle Scholar
  47. 47.
    Wu, J., Hassan, A.E., Holt, R.C.: Comparison of clustering algotithms in the context of software evolution. In: Proceedings of the 21st IEEE International Conference on Software Maintenance, pp. 525–535. IEEE Computer Society (2005)Google Scholar
  48. 48.
    Yang, W.: Identifying syntactic differences between two programs. Software - Practice and Experience 21(7), 739–755 (1991)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Anna Corazza
    • 1
  • Sergio Di Martino
    • 1
  • Valerio Maggio
    • 1
  • Alessandro Moschitti
    • 2
  • Andrea Passerini
    • 2
  • Giuseppe Scanniello
    • 3
  • Fabrizio Silvestri
    • 4
  1. 1.University of Naples “Federico II”Italy
  2. 2.University of TrentoItaly
  3. 3.University of BasilicataItaly
  4. 4.ISTI Institute - CNRItaly

Personalised recommendations