Empirical Software Engineering, Volume 23, Issue 4, pp 2398–2425

The need for software specific natural language techniques

  • Dave Binkley
  • Dawn Lawrie
  • Christopher Morrell


For over two decades, software engineering (SE) researchers have been importing tools and techniques from information retrieval (IR). Initial results have been quite positive. For example, when applied to problems such as feature location or re-establishing traceability links, IR techniques work well on their own, and often even better in combination with more traditional source code analysis techniques such as static and dynamic analysis. Recently, however, there has been growing awareness among SE researchers that IR tools and techniques are designed to work under different assumptions than those that hold for a software system. Thus, it may be beneficial to consider IR-inspired tools and techniques that are specifically designed to work with software. One aim of this work is to provide quantitative empirical evidence in support of this observation. To do so, a new technique is introduced that captures the level of difficulty found in an information need, that is, the true, often latent, information that a searcher desires to know. The new technique is used to compare two domains: Natural Language (NL) and SE. Analysis of the data leads to three significant findings. First, the variation in the distribution of difficulty of the SE information needs differs from that of the NL information needs; second, collection age plays a role in the differences between the NL collections; and finally, the retrieval model used has little impact on the results.


Keywords: Feature location; Information retrieval; Query quality; Information need; Test collection challenge



This work was supported in part by NSF grant 1626262. We are very thankful to the original SCAM reviewers and especially to the EMSE reviewers whose comments were insightful and led to considerable improvements to the paper.



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Loyola University Maryland, Baltimore, USA
