Classification of Software Artifacts Based on Structural Information

  • Yuhanis Yusof
  • Omer F. Rana
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6279)


Classification of software artifacts, in particular source code files, is currently performed by the administrator of a repository. Although automated classification exists for such repositories, existing approaches focus on semantic analysis of keywords found in the artifact. This paper presents the use of structural information, namely software metrics, in determining the appropriate application domain for a particular artifact. Results obtained from the study show that the trend in metric values differs between files of different application domains. The results also show that the k-nearest neighbor classifier outperformed both the C4.5 decision tree and the classifier built using Discriminant Analysis in classifying files of the database and graphics domains.
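As a rough illustration of the idea, the sketch below classifies a file into an application domain by a k-nearest neighbor vote over its metric vector. It is a minimal sketch and not the authors' implementation: the three metrics (lines of code, comment ratio, cyclomatic complexity), the function name `knn_classify`, and all numeric values are hypothetical toy assumptions chosen only to make the example runnable.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query.

    train: list of (metric_vector, domain_label) pairs.
    query: metric vector of the file to classify.
    """
    # Rank training files by Euclidean distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data, NOT taken from the paper:
# (lines of code, comment ratio, cyclomatic complexity) -> application domain
TOY_TRAINING = [
    ((120.0, 0.30, 4.0), "database"),
    ((150.0, 0.25, 5.0), "database"),
    ((130.0, 0.28, 6.0), "database"),
    ((400.0, 0.10, 15.0), "graphics"),
    ((380.0, 0.12, 14.0), "graphics"),
    ((420.0, 0.08, 18.0), "graphics"),
]

if __name__ == "__main__":
    print(knn_classify(TOY_TRAINING, (140.0, 0.27, 5.0)))
    print(knn_classify(TOY_TRAINING, (390.0, 0.11, 16.0)))
```

In this toy setup the lines-of-code value dominates the distance because the metrics are not rescaled; a real experiment would presumably normalize each metric before measuring distances.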


Discriminant Function Analysis, Source Code, Application Domain, Recall Score
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Yuhanis Yusof (1)
  • Omer F. Rana (2)
  1. College of Arts and Sciences, Information Technology Building, Universiti Utara Malaysia, Sintok, Malaysia
  2. School of Computer Science, Cardiff University, Cardiff, UK
