Empirical Software Engineering, Volume 23, Issue 5, pp 2622–2654

Augmenting and structuring user queries to support efficient free-form code search

  • Raphael Sirres
  • Tegawendé F. Bissyandé
  • Dongsun Kim
  • David Lo
  • Jacques Klein
  • Kisub Kim
  • Yves Le Traon
Abstract

Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing, structural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
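
The query augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the CoCaBu or GitSearch implementation: the toy Q&A corpus, the stopword list, the regex-based entity extraction, and the names `QA_POSTS`, `tokenize`, `code_entities`, and `augment_query` are all assumptions made for illustration. The sketch matches a free-form query against question texts by token overlap, then augments the query with type and method names taken from the code snippet of the best-matching answer.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (question text, code snippet from an accepted answer).
QA_POSTS = [
    ("How do I read a text file line by line in Java?",
     "BufferedReader reader = new BufferedReader(new FileReader(path)); reader.readLine();"),
    ("How to generate a random number in Java?",
     "Random rnd = new Random(); int n = rnd.nextInt(10);"),
]

# Assumed stopword list, kept tiny for the example.
STOPWORDS = {"how", "do", "i", "a", "in", "to", "the", "by"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def code_entities(snippet):
    """Approximate structural entities with regexes (a real system would parse the code)."""
    types = re.findall(r"\b[A-Z][A-Za-z0-9]*\b", snippet)       # capitalized names ~ types
    methods = re.findall(r"\.([a-z][A-Za-z0-9]*)\(", snippet)   # ".name(" ~ method calls
    return types + methods


def augment_query(query, posts=QA_POSTS, top_k=1):
    """Return the query tokens plus code entities from the top_k most similar posts."""
    q = set(tokenize(query))
    # Rank posts by the number of tokens shared between query and question.
    ranked = sorted(posts, key=lambda p: len(q & set(tokenize(p[0]))), reverse=True)
    entities = Counter()
    for question, snippet in ranked[:top_k]:
        if q & set(tokenize(question)):  # skip posts with no overlap at all
            entities.update(code_entities(snippet))
    return {"text": sorted(q), "code_entities": sorted(entities)}


if __name__ == "__main__":
    print(augment_query("read file line by line"))
```

The augmented query could then be matched against an index of code in which type and method names are stored as separate fields, so that a query phrased in conceptual words also retrieves snippets that only mention the corresponding code entities.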

Keywords

Code search, GitHub, Free-form search, Query augmentation, StackOverflow, Vocabulary mismatch

Notes

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the Fonds National de la Recherche (FNR), Luxembourg, under projects RECOMMEND C15/IS/10449467, FIXPATTERN C15/IS/9964569, FNR-AFR PhD/11623818, and by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant, under project 16-C220-SMU-004.

Copyright information

© Springer Science+Business Media, LLC 2018

Authors and Affiliations

  • Raphael Sirres (1)
  • Tegawendé F. Bissyandé (2)
  • Dongsun Kim (2)
  • David Lo (3)
  • Jacques Klein (2)
  • Kisub Kim (2)
  • Yves Le Traon (2)

  1. National Library of Luxembourg, Roosevelt, Luxembourg
  2. Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg, Kennedy, Luxembourg
  3. School of Information Systems, Singapore Management University, Singapore, Singapore