Artifact Representation Techniques for Large-Scale Software Search Engines

  • Oliver Hummel
  • Colin Atkinson
  • Marcus Schumacher


The first generation of software retrieval systems, developed some 25 years ago, used simple bibliographic indexing techniques adapted from library science to support the retrieval of relatively small numbers of in-house software artifacts. While these techniques were sufficient at the time, they do not scale to the vast numbers of software artifacts available today. The second generation of software search engines, which represents the current state of the practice, tackles this problem by using full-text search frameworks such as Lucene to support text-based searches on large software collections. However, these frameworks typically provide no inherent support for sophisticated search use cases that exploit the structure and “meaning” of software artifacts. In this chapter we describe the core techniques used in current text-based code search engines, as well as advanced techniques that can support sophisticated forms of search that exploit the structure of software. We then survey the challenges and opportunities encountered in developing the next (third) generation of software search engines, which is based on new, currently emerging data storage platforms.


Keywords: Search Engine; Structure Search; Software Artifact; XPath Query; Relevance Ranking



The authors would like to thank Philipp Bostan, Matthias Gutheil, Werner Janjic and Dietmar Stoll from the Software Engineering Group at the University of Mannheim for their contributions to developing the tools described in this chapter.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Oliver Hummel (1)
  • Colin Atkinson (1)
  • Marcus Schumacher (1)
  1. Software Engineering Group, University of Mannheim, Mannheim, Germany
