Data Mining and Knowledge Discovery

, Volume 18, Issue 2, pp 300–336 | Cite as

Sourcerer: mining and searching internet-scale software repositories

  • Erik Linstead
  • Sushil Bajracharya
  • Trung Ngo
  • Paul Rigor
  • Cristina Lopes
  • Pierre Baldi


Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering an software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92– roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at:


Mining software Program understanding Code search Software analysis Author-topic probabilistic modeling Code retrieval 


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. Sun Java Coding Standards.
  2. Koders web site.
  3. Krugle web site.
  4. Codase web site.
  5. Csourcesearch web site.
  6. Google CodeSearch web site.
  7. SparsJ Search System.
  8. Andrzejewski D, Mulhern A, Liblit B, Zhu X (2007) Statistical debugging using latent topic models. In: Matwin S, Mladenic D (eds) 18th European conference on machine learning. Warsaw, Poland (to appear)Google Scholar
  9. Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE ’06: proceeding of the 28th international conference on Software engineering. ACM, New York, pp 361–370Google Scholar
  10. Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press/Addison-WesleyGoogle Scholar
  11. Baldi P, Frasconi P, Smyth P (2003) Modeling the internet and the web: probabilistic methods and algorithms. WileyGoogle Scholar
  12. Baldi P, Lopes C, Linstead E, Bajracharya S (2008) A theory of aspects as latent topics. In: OOPSLA ’08: proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, Nashville, TN. ACM, New York, NY (to appear)Google Scholar
  13. Blei D, Lafferty J (2006) Correlated topic models. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, MA, pp 147–154Google Scholar
  14. Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3: 993–1022zbMATHCrossRefGoogle Scholar
  15. Brill E (1994) Some advances in transformation-based part of speech tagging. In: National conference on artificial intelligence. pp 722–727Google Scholar
  16. Buntine W (2005) Open source search: a data mining platform. SIGIR Forum 39(1): 4–10CrossRefGoogle Scholar
  17. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21: 4133–4139CrossRefGoogle Scholar
  18. Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10): 687–708CrossRefGoogle Scholar
  19. Cox A, Clarke C, Sim S (1999) A model independent source code repository. In: CASCON ’99: proceedings of the 1999 conference of the centre for advanced studies on collaborative research. IBM Press, p 1Google Scholar
  20. Deerwester S, Dumais S, Landauer T, Furnas G, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391–407CrossRefGoogle Scholar
  21. Finnigan P, Holt R, Kalas I, Kerr S, Kontogiannis K, Mueller H, Mylopoulos J, Perelgut S, Stanley M, Wong K (1997) The software bookshelf. IBM Sys J 36(4): 564–593CrossRefGoogle Scholar
  22. Frean GBM, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: OOPSLA ’06. ACM Press, New York, pp 397–503Google Scholar
  23. Gil JY, Maman I (2005) Micro patterns in java code. In: OOPSLA ’05: proceedings of the 20th annual ACM SIGPLAN conference on object oriented programming systems languages and applications. ACM Press, New York, pp 97–116Google Scholar
  24. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci U S A 101(Suppl 1): 5228–5235CrossRefGoogle Scholar
  25. Hajiyev E, Verbaere M, de Moor O (2006) Codequest: scalable source code queries with datalog. In: Thomas D (eds) ECOOP’06: proceedings of the 20th European conference on object-oriented programming, vol 4067 of lecture notes in computer science. Springer, Berlin, pp 2–27Google Scholar
  26. Hill R, Rideout J (2004) Automatic method completion. In: ASE. IEEE Computer Society, pp 228–235Google Scholar
  27. Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE ’05: proceedings of the 27th international conference on software engineering. ACM Press, New York, pp 117–125Google Scholar
  28. Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12): 952–970CrossRefGoogle Scholar
  29. Inoue K, Yokomori R, Fujiwara H, Yamamoto T, Matsushita M, Kusumoto S (2003) Component rank: relative significance rank for software component search. In: ICSE ’03: proceedings of the 25th international conference on software engineering. IEEE Computer Society, Washington, DC, pp 14–24Google Scholar
  30. Inoue K, Yokomori R, Yamamoto T, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31(3): 213–225CrossRefGoogle Scholar
  31. Kawaguchi S, Garg PK, Matsushita M, Inoue K (2004) Mudablue: an automatic categorization system for open source repositories. In: APSEC ’04: proceedings of the 11th Asia-Pacific software engineering conference (APSEC’04). IEEE Computer Society, Washington, DC, pp 184–193Google Scholar
  32. Kiczales G, Lamping J, Menhdhekar A, Maeda C, Lopes C, Loingtier J, Irwin J (1997) Aspect-oriented programming. In: Akşit M, Matsuoka S (eds) Proceedings European conference on object-oriented programming, vol 1241. Springer, Berlin, pp 220–242Google Scholar
  33. Knuth DE (1971) An empirical study of fortran programs. Softw Pract Exp 1(2):105–133zbMATHCrossRefGoogle Scholar
  34. Kuhn A, Ducasse S, Gírba T (2007) Semantic clustering: identifying topics in source code. Info Softw Technol 49(3): 230–243CrossRefGoogle Scholar
  35. Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007) Mining eclipse developer contributions via author-topic models. MSR 2007: proceedings of the fourth international workshop on mining software repositories. pp 30–33Google Scholar
  36. Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2008) Mining internet-scale software repositories. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. MIT Press, Cambridge, MA, pp 929–936Google Scholar
  37. Liu C, Chen C, Han J, Yu PS (2006) Gplag: detection of software plagiarism by program dependence graph analysis. In: KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 872–881Google Scholar
  38. Mandelin D, Xu L, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the api jungle. In: PLDI ’05: proceedings of the 2005 ACM SIGPLAN conference on programming language design and implementation. ACM Press, New York, pp 48–61Google Scholar
  39. Marcus A, Sergeyev A, Rajlich V, Maletic J (2004) An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering (WCRE 2004). The Netherlands, pp 214–223Google Scholar
  40. McCormick E, Volder KD (2004) Jquery: finding your way through tangled code. In: OOPSLA ’04: companion to the 19th annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications. ACM. New York, pp 9–10Google Scholar
  41. Minto S, Murphy GC (2007) Recommending emergent teams. In: MSR ’07: proceedings of the fourth international workshop on mining software repositories. IEEE Computer Society. Washington, DC, p 5Google Scholar
  42. Mitzenmacher, M. (2003) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2)Google Scholar
  43. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88CrossRefGoogle Scholar
  44. Newman D, Block S (2006) Probabilistic topic decomposition of an eighteenth-century american newspaper. J Am Soc Inf Sci Technol 57(6): 753–767CrossRefGoogle Scholar
  45. Newman D, Chemudugunta C, Smyth P, Steyvers M (2006) Analyzing entities and topics in news articles using statistical topic models. In: ISI. pp 93–104Google Scholar
  46. Oman P, Hagemeister J (1992) Metrics for assessing a software system’s maintainability. In: Proceedings of the international conference on software maintenance 1992. IEEE Computer Society Press, pp 337–344Google Scholar
  47. Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: bringing order to the web. Stanford Digital Library working paper SIDL-WP-1999-0120 of 11/11/1999 (see:
  48. Paul S (1992) Scruple: a reengineer’s tool for source code search. In: CASCON ’92: proceedings of the 1992 conference of the centre for advanced studies on collaborative research. IBM Press, pp 329–346Google Scholar
  49. Paul S, Prakash A (1994) A framework for source code search using program patterns. IEEE Trans Softw Eng 20(6): 463–475zbMATHCrossRefGoogle Scholar
  50. Poshyvanyk D, Marcus A, Dong Y (2006) Jiriss—an eclipse plug-in for source code exploration. ICPC 0: 252–255Google Scholar
  51. Puppin D, Silvestri F (2006) The social network of java classes. In: Haddad H (ed) SAC. ACM, New York, pp 1409–1413Google Scholar
  52. Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: UAI ’04: proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, Arlington, pp 487–494Google Scholar
  53. Sahavechaphan N, Claypool K (2006) Xsnippet: mining for sample code. SIGPLAN Not 41(10): 413–430CrossRefGoogle Scholar
  54. Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03: proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, New York, pp 76–85Google Scholar
  55. Schröter A, Zimmermann T, Premraj R, Zeller A (2006) If your bug database could talk .... In: Proceedings of the 5th international symposium on empirical software engineering, vol II: short papers and posters. Rio de Janeiro, pp 18–20Google Scholar
  56. Sindhgatta R (2006) Using an information retrieval system to retrieve source code samples. In: Osterweil LJ, Rombach HD, Soffa ML, (eds) ICSE. ACM, pp 905–908Google Scholar
  57. Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery. In: KDD ’04: proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 306–315Google Scholar
  58. Swamidass S, Baldi P (2007) Bounds and algorithms for exact searches of chemical fingerprints in linear and sub-linear time. J Chem Inf Model 47(2): 302–317CrossRefGoogle Scholar
  59. Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Statistical Assoc 101(476): 1566–1581zbMATHCrossRefMathSciNetGoogle Scholar
  60. Ugurel S, Krovetz R, Giles CL (2002) What’s the code? automatic classification of source code archives. In: KDD ’02: proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 632–638Google Scholar
  61. Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1): 191–211zbMATHCrossRefMathSciNetGoogle Scholar
  62. Welker KD, Oman PW (1995) Software maintainability metrics models in practice. Crosstalk, J Def Softw Eng 8: 19–23Google Scholar
  63. Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: International workshop on source code analysis and manipulation, pp 45–54Google Scholar
  64. Ye Y, Fischer G (2002) Supporting reuse by delivering task-relevant and personalized information. In: ICSE ’02: proceedings of the 24th international conference on software engineering. ACM, New York, pp 513–523Google Scholar
  65. Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University PressGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Erik Linstead
    • 1
  • Sushil Bajracharya
    • 1
  • Trung Ngo
    • 1
  • Paul Rigor
    • 1
  • Cristina Lopes
    • 1
  • Pierre Baldi
    • 1
  1. 1.Donald Bren School of Information and Computer SciencesUniversity of CaliforniaIrvineUSA

Personalised recommendations