Toward Validation of Textual Information Retrieval Techniques for Software Weaknesses

  • Jukka Ruohonen
  • Ville Leppänen
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 903)


This paper presents a preliminary validation of common textual information retrieval techniques for mapping unstructured software vulnerability information to distinct software weaknesses. The validation is carried out with a dataset compiled from four software repositories tracked in the Snyk vulnerability database. According to the results, the information retrieval techniques used perform unsatisfactorily compared to regular expression searches. Although the results vary from one repository to another, the preliminary validation indicates that explicit referencing of vulnerability and weakness identifiers is preferable for concrete vulnerability tracking. Such referencing allows the use of keyword-based searches, which currently seem to yield more consistent results than information retrieval techniques. However, further validation work is required to improve the precision of the techniques.
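
To make the idea of a regular expression search over explicit identifiers concrete, the following is a minimal sketch and not the authors' implementation. The two patterns, the function name `extract_identifiers`, and the sample advisory text are all assumptions introduced for illustration; the patterns only reflect the commonly used `CVE-<year>-<number>` and `CWE-<number>` notations.

```python
import re

# Assumed pattern: CVE identifiers such as CVE-2017-5638
# (four-digit year, then four or more digits).
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# Assumed pattern: CWE identifiers such as CWE-79 (one or more digits).
CWE_PATTERN = re.compile(r"\bCWE-\d+\b", re.IGNORECASE)


def extract_identifiers(text):
    """Return sorted, de-duplicated CVE and CWE identifiers found in text."""
    cves = sorted({match.upper() for match in CVE_PATTERN.findall(text)})
    cwes = sorted({match.upper() for match in CWE_PATTERN.findall(text)})
    return {"cve": cves, "cwe": cwes}


# Hypothetical advisory text, written for illustration only.
advisory = (
    "Cross-site scripting in the template helper (CWE-79). "
    "Tracked as CVE-2099-12345; see also cwe-79."
)
result = extract_identifiers(advisory)
print(result)
```

A search of this kind only works when the vulnerability text references the identifiers explicitly, which is the practice the abstract recommends.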


Keywords: Text mining, Software vulnerability, Snyk, LSA, CVE, CWE, NVD



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Future Technologies, University of Turku, Turku, Finland