The need for software specific natural language techniques

Empirical Software Engineering

Abstract

For over two decades, software engineering (SE) researchers have been importing tools and techniques from information retrieval (IR). Initial results have been quite positive. For example, when applied to problems such as feature location or re-establishing traceability links, IR techniques work well on their own, and often even better in combination with more traditional source code analysis techniques such as static and dynamic analysis. Recently, however, SE researchers have become increasingly aware that IR tools and techniques are designed to work under different assumptions than those that hold for a software system. Thus, it may be beneficial to consider IR-inspired tools and techniques that are specifically designed to work with software. One aim of this work is to provide quantitative empirical evidence in support of this observation. To do so, a new technique is introduced that captures the level of difficulty found in an information need, that is, the true, often latent, information that a searcher desires to know. The new technique is used to compare two domains: natural language (NL) and SE. Analysis of the data leads to three significant findings. First, the variation in the distribution of difficulty of the SE information needs differs from that of the NL information needs; second, collection age plays a role in the differences between the NL collections; and finally, the retrieval model used has little impact on the results.

Notes

  1. Distributed by the Linguistic Data Consortium

  2. http://www.cs.loyola.edu/~lawrie/papers/esme17-binkley-lawrie/english_stoplist

  3. http://www.cs.loyola.edu/~lawrie/papers/esme17-binkley-lawrie/java_reserved_words

Acknowledgements

This work was supported in part by NSF grant 1626262. We are very thankful to the original SCAM reviewers and especially to the EMSE reviewers whose comments were insightful and led to considerable improvements to the paper.

Author information

Corresponding author

Correspondence to Dave Binkley.

Additional information

Communicated by: Michaela Greiler and Gabriele Bavota

About this article

Cite this article

Binkley, D., Lawrie, D. & Morrell, C. The need for software specific natural language techniques. Empir Software Eng 23, 2398–2425 (2018). https://doi.org/10.1007/s10664-017-9566-5

  • DOI: https://doi.org/10.1007/s10664-017-9566-5
