Skip to main content

Sourcerer: mining and searching internet-scale software repositories

Abstract

Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering an software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92– roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu.

This is a preview of subscription content, access via your institution.

References

  1. Sun Java Coding Standards. http://java.sun.com/docs/codeconv/

  2. Koders web site. http://www.koders.com

  3. Krugle web site. http://www.krugle.com

  4. Codase web site. http://www.Codase.com

  5. Csourcesearch web site. http://csourcesearch.net/

  6. Google CodeSearch web site. http://www.google.com/codesearch

  7. SparsJ Search System. http://demo.spars.info/

  8. Andrzejewski D, Mulhern A, Liblit B, Zhu X (2007) Statistical debugging using latent topic models. In: Matwin S, Mladenic D (eds) 18th European conference on machine learning. Warsaw, Poland (to appear)

  9. Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE ’06: proceeding of the 28th international conference on Software engineering. ACM, New York, pp 361–370

  10. Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press/Addison-Wesley

  11. Baldi P, Frasconi P, Smyth P (2003) Modeling the internet and the web: probabilistic methods and algorithms. Wiley

  12. Baldi P, Lopes C, Linstead E, Bajracharya S (2008) A theory of aspects as latent topics. In: OOPSLA ’08: proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, Nashville, TN. ACM, New York, NY (to appear)

  13. Blei D, Lafferty J (2006) Correlated topic models. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, MA, pp 147–154

    Google Scholar 

  14. Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3: 993–1022

    MATH  Article  Google Scholar 

  15. Brill E (1994) Some advances in transformation-based part of speech tagging. In: National conference on artificial intelligence. pp 722–727

  16. Buntine W (2005) Open source search: a data mining platform. SIGIR Forum 39(1): 4–10

    Article  Google Scholar 

  17. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21: 4133–4139

    Article  Google Scholar 

  18. Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10): 687–708

    Article  Google Scholar 

  19. Cox A, Clarke C, Sim S (1999) A model independent source code repository. In: CASCON ’99: proceedings of the 1999 conference of the centre for advanced studies on collaborative research. IBM Press, p 1

  20. Deerwester S, Dumais S, Landauer T, Furnas G, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391–407

    Article  Google Scholar 

  21. Finnigan P, Holt R, Kalas I, Kerr S, Kontogiannis K, Mueller H, Mylopoulos J, Perelgut S, Stanley M, Wong K (1997) The software bookshelf. IBM Sys J 36(4): 564–593

    Article  Google Scholar 

  22. Frean GBM, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: OOPSLA ’06. ACM Press, New York, pp 397–503

  23. Gil JY, Maman I (2005) Micro patterns in java code. In: OOPSLA ’05: proceedings of the 20th annual ACM SIGPLAN conference on object oriented programming systems languages and applications. ACM Press, New York, pp 97–116

  24. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci U S A 101(Suppl 1): 5228–5235

    Article  Google Scholar 

  25. Hajiyev E, Verbaere M, de Moor O (2006) Codequest: scalable source code queries with datalog. In: Thomas D (eds) ECOOP’06: proceedings of the 20th European conference on object-oriented programming, vol 4067 of lecture notes in computer science. Springer, Berlin, pp 2–27

    Google Scholar 

  26. Hill R, Rideout J (2004) Automatic method completion. In: ASE. IEEE Computer Society, pp 228–235

  27. Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE ’05: proceedings of the 27th international conference on software engineering. ACM Press, New York, pp 117–125

  28. Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12): 952–970

    Article  Google Scholar 

  29. Inoue K, Yokomori R, Fujiwara H, Yamamoto T, Matsushita M, Kusumoto S (2003) Component rank: relative significance rank for software component search. In: ICSE ’03: proceedings of the 25th international conference on software engineering. IEEE Computer Society, Washington, DC, pp 14–24

  30. Inoue K, Yokomori R, Yamamoto T, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31(3): 213–225

    Article  Google Scholar 

  31. Kawaguchi S, Garg PK, Matsushita M, Inoue K (2004) Mudablue: an automatic categorization system for open source repositories. In: APSEC ’04: proceedings of the 11th Asia-Pacific software engineering conference (APSEC’04). IEEE Computer Society, Washington, DC, pp 184–193

  32. Kiczales G, Lamping J, Menhdhekar A, Maeda C, Lopes C, Loingtier J, Irwin J (1997) Aspect-oriented programming. In: Akşit M, Matsuoka S (eds) Proceedings European conference on object-oriented programming, vol 1241. Springer, Berlin, pp 220–242

    Google Scholar 

  33. Knuth DE (1971) An empirical study of fortran programs. Softw Pract Exp 1(2):105–133

    MATH  Article  Google Scholar 

  34. Kuhn A, Ducasse S, Gírba T (2007) Semantic clustering: identifying topics in source code. Info Softw Technol 49(3): 230–243

    Article  Google Scholar 

  35. Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007) Mining eclipse developer contributions via author-topic models. MSR 2007: proceedings of the fourth international workshop on mining software repositories. pp 30–33

  36. Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2008) Mining internet-scale software repositories. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. MIT Press, Cambridge, MA, pp 929–936

    Google Scholar 

  37. Liu C, Chen C, Han J, Yu PS (2006) Gplag: detection of software plagiarism by program dependence graph analysis. In: KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 872–881

  38. Mandelin D, Xu L, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the api jungle. In: PLDI ’05: proceedings of the 2005 ACM SIGPLAN conference on programming language design and implementation. ACM Press, New York, pp 48–61

  39. Marcus A, Sergeyev A, Rajlich V, Maletic J (2004) An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering (WCRE 2004). The Netherlands, pp 214–223

  40. McCormick E, Volder KD (2004) Jquery: finding your way through tangled code. In: OOPSLA ’04: companion to the 19th annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications. ACM. New York, pp 9–10

  41. Minto S, Murphy GC (2007) Recommending emergent teams. In: MSR ’07: proceedings of the fourth international workshop on mining software repositories. IEEE Computer Society. Washington, DC, p 5

  42. Mitzenmacher, M. (2003) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2)

  43. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88

    Article  Google Scholar 

  44. Newman D, Block S (2006) Probabilistic topic decomposition of an eighteenth-century american newspaper. J Am Soc Inf Sci Technol 57(6): 753–767

    Article  Google Scholar 

  45. Newman D, Chemudugunta C, Smyth P, Steyvers M (2006) Analyzing entities and topics in news articles using statistical topic models. In: ISI. pp 93–104

  46. Oman P, Hagemeister J (1992) Metrics for assessing a software system’s maintainability. In: Proceedings of the international conference on software maintenance 1992. IEEE Computer Society Press, pp 337–344

  47. Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: bringing order to the web. Stanford Digital Library working paper SIDL-WP-1999-0120 of 11/11/1999 (see: http://dbpubs.stanford.edu/pub/1999-66)

  48. Paul S (1992) Scruple: a reengineer’s tool for source code search. In: CASCON ’92: proceedings of the 1992 conference of the centre for advanced studies on collaborative research. IBM Press, pp 329–346

  49. Paul S, Prakash A (1994) A framework for source code search using program patterns. IEEE Trans Softw Eng 20(6): 463–475

    MATH  Article  Google Scholar 

  50. Poshyvanyk D, Marcus A, Dong Y (2006) Jiriss—an eclipse plug-in for source code exploration. ICPC 0: 252–255

    Google Scholar 

  51. Puppin D, Silvestri F (2006) The social network of java classes. In: Haddad H (ed) SAC. ACM, New York, pp 1409–1413

  52. Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: UAI ’04: proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, Arlington, pp 487–494

  53. Sahavechaphan N, Claypool K (2006) Xsnippet: mining for sample code. SIGPLAN Not 41(10): 413–430

    Article  Google Scholar 

  54. Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03: proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, New York, pp 76–85

  55. Schröter A, Zimmermann T, Premraj R, Zeller A (2006) If your bug database could talk .... In: Proceedings of the 5th international symposium on empirical software engineering, vol II: short papers and posters. Rio de Janeiro, pp 18–20

  56. Sindhgatta R (2006) Using an information retrieval system to retrieve source code samples. In: Osterweil LJ, Rombach HD, Soffa ML, (eds) ICSE. ACM, pp 905–908

  57. Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery. In: KDD ’04: proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 306–315

  58. Swamidass S, Baldi P (2007) Bounds and algorithms for exact searches of chemical fingerprints in linear and sub-linear time. J Chem Inf Model 47(2): 302–317

    Article  Google Scholar 

  59. Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Statistical Assoc 101(476): 1566–1581

    MATH  Article  MathSciNet  Google Scholar 

  60. Ugurel S, Krovetz R, Giles CL (2002) What’s the code? automatic classification of source code archives. In: KDD ’02: proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 632–638

  61. Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1): 191–211

    MATH  Article  MathSciNet  Google Scholar 

  62. Welker KD, Oman PW (1995) Software maintainability metrics models in practice. Crosstalk, J Def Softw Eng 8: 19–23

    Google Scholar 

  63. Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: International workshop on source code analysis and manipulation, pp 45–54

  64. Ye Y, Fischer G (2002) Supporting reuse by delivering task-relevant and personalized information. In: ICSE ’02: proceedings of the 24th international conference on software engineering. ACM, New York, pp 513–523

  65. Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Pierre Baldi.

Additional information

Erik Linstead, Sushil Bajracharya, and Trung Ngo have contributed equally to this work.

Responsible editor: Eamonn Keogh.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Linstead, E., Bajracharya, S., Ngo, T. et al. Sourcerer: mining and searching internet-scale software repositories. Data Min Knowl Disc 18, 300–336 (2009). https://doi.org/10.1007/s10618-008-0118-x

Download citation

Keywords

  • Mining software
  • Program understanding
  • Code search
  • Software analysis
  • Author-topic probabilistic modeling
  • Code retrieval