Skip to main content
Log in

Sourcerer: mining and searching internet-scale software repositories

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering an software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92– roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  • Sun Java Coding Standards. http://java.sun.com/docs/codeconv/

  • Koders web site. http://www.koders.com

  • Krugle web site. http://www.krugle.com

  • Codase web site. http://www.Codase.com

  • Csourcesearch web site. http://csourcesearch.net/

  • Google CodeSearch web site. http://www.google.com/codesearch

  • SparsJ Search System. http://demo.spars.info/

  • Andrzejewski D, Mulhern A, Liblit B, Zhu X (2007) Statistical debugging using latent topic models. In: Matwin S, Mladenic D (eds) 18th European conference on machine learning. Warsaw, Poland (to appear)

  • Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE ’06: proceeding of the 28th international conference on Software engineering. ACM, New York, pp 361–370

  • Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press/Addison-Wesley

  • Baldi P, Frasconi P, Smyth P (2003) Modeling the internet and the web: probabilistic methods and algorithms. Wiley

  • Baldi P, Lopes C, Linstead E, Bajracharya S (2008) A theory of aspects as latent topics. In: OOPSLA ’08: proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, Nashville, TN. ACM, New York, NY (to appear)

  • Blei D, Lafferty J (2006) Correlated topic models. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, MA, pp 147–154

    Google Scholar 

  • Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3: 993–1022

    Article  MATH  Google Scholar 

  • Brill E (1994) Some advances in transformation-based part of speech tagging. In: National conference on artificial intelligence. pp 722–727

  • Buntine W (2005) Open source search: a data mining platform. SIGIR Forum 39(1): 4–10

    Article  Google Scholar 

  • Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21: 4133–4139

    Article  Google Scholar 

  • Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10): 687–708

    Article  Google Scholar 

  • Cox A, Clarke C, Sim S (1999) A model independent source code repository. In: CASCON ’99: proceedings of the 1999 conference of the centre for advanced studies on collaborative research. IBM Press, p 1

  • Deerwester S, Dumais S, Landauer T, Furnas G, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391–407

    Article  Google Scholar 

  • Finnigan P, Holt R, Kalas I, Kerr S, Kontogiannis K, Mueller H, Mylopoulos J, Perelgut S, Stanley M, Wong K (1997) The software bookshelf. IBM Sys J 36(4): 564–593

    Article  Google Scholar 

  • Frean GBM, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: OOPSLA ’06. ACM Press, New York, pp 397–503

  • Gil JY, Maman I (2005) Micro patterns in java code. In: OOPSLA ’05: proceedings of the 20th annual ACM SIGPLAN conference on object oriented programming systems languages and applications. ACM Press, New York, pp 97–116

  • Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci U S A 101(Suppl 1): 5228–5235

    Article  Google Scholar 

  • Hajiyev E, Verbaere M, de Moor O (2006) Codequest: scalable source code queries with datalog. In: Thomas D (eds) ECOOP’06: proceedings of the 20th European conference on object-oriented programming, vol 4067 of lecture notes in computer science. Springer, Berlin, pp 2–27

    Google Scholar 

  • Hill R, Rideout J (2004) Automatic method completion. In: ASE. IEEE Computer Society, pp 228–235

  • Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE ’05: proceedings of the 27th international conference on software engineering. ACM Press, New York, pp 117–125

  • Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12): 952–970

    Article  Google Scholar 

  • Inoue K, Yokomori R, Fujiwara H, Yamamoto T, Matsushita M, Kusumoto S (2003) Component rank: relative significance rank for software component search. In: ICSE ’03: proceedings of the 25th international conference on software engineering. IEEE Computer Society, Washington, DC, pp 14–24

  • Inoue K, Yokomori R, Yamamoto T, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31(3): 213–225

    Article  Google Scholar 

  • Kawaguchi S, Garg PK, Matsushita M, Inoue K (2004) Mudablue: an automatic categorization system for open source repositories. In: APSEC ’04: proceedings of the 11th Asia-Pacific software engineering conference (APSEC’04). IEEE Computer Society, Washington, DC, pp 184–193

  • Kiczales G, Lamping J, Menhdhekar A, Maeda C, Lopes C, Loingtier J, Irwin J (1997) Aspect-oriented programming. In: Akşit M, Matsuoka S (eds) Proceedings European conference on object-oriented programming, vol 1241. Springer, Berlin, pp 220–242

    Google Scholar 

  • Knuth DE (1971) An empirical study of fortran programs. Softw Pract Exp 1(2):105–133

    Article  MATH  Google Scholar 

  • Kuhn A, Ducasse S, Gírba T (2007) Semantic clustering: identifying topics in source code. Info Softw Technol 49(3): 230–243

    Article  Google Scholar 

  • Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007) Mining eclipse developer contributions via author-topic models. MSR 2007: proceedings of the fourth international workshop on mining software repositories. pp 30–33

  • Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2008) Mining internet-scale software repositories. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. MIT Press, Cambridge, MA, pp 929–936

    Google Scholar 

  • Liu C, Chen C, Han J, Yu PS (2006) Gplag: detection of software plagiarism by program dependence graph analysis. In: KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 872–881

  • Mandelin D, Xu L, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the api jungle. In: PLDI ’05: proceedings of the 2005 ACM SIGPLAN conference on programming language design and implementation. ACM Press, New York, pp 48–61

  • Marcus A, Sergeyev A, Rajlich V, Maletic J (2004) An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering (WCRE 2004). The Netherlands, pp 214–223

  • McCormick E, Volder KD (2004) Jquery: finding your way through tangled code. In: OOPSLA ’04: companion to the 19th annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications. ACM. New York, pp 9–10

  • Minto S, Murphy GC (2007) Recommending emergent teams. In: MSR ’07: proceedings of the fourth international workshop on mining software repositories. IEEE Computer Society. Washington, DC, p 5

  • Mitzenmacher, M. (2003) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2)

  • Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88

    Article  Google Scholar 

  • Newman D, Block S (2006) Probabilistic topic decomposition of an eighteenth-century american newspaper. J Am Soc Inf Sci Technol 57(6): 753–767

    Article  Google Scholar 

  • Newman D, Chemudugunta C, Smyth P, Steyvers M (2006) Analyzing entities and topics in news articles using statistical topic models. In: ISI. pp 93–104

  • Oman P, Hagemeister J (1992) Metrics for assessing a software system’s maintainability. In: Proceedings of the international conference on software maintenance 1992. IEEE Computer Society Press, pp 337–344

  • Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: bringing order to the web. Stanford Digital Library working paper SIDL-WP-1999-0120 of 11/11/1999 (see: http://dbpubs.stanford.edu/pub/1999-66)

  • Paul S (1992) Scruple: a reengineer’s tool for source code search. In: CASCON ’92: proceedings of the 1992 conference of the centre for advanced studies on collaborative research. IBM Press, pp 329–346

  • Paul S, Prakash A (1994) A framework for source code search using program patterns. IEEE Trans Softw Eng 20(6): 463–475

    Article  MATH  Google Scholar 

  • Poshyvanyk D, Marcus A, Dong Y (2006) Jiriss—an eclipse plug-in for source code exploration. ICPC 0: 252–255

    Google Scholar 

  • Puppin D, Silvestri F (2006) The social network of java classes. In: Haddad H (ed) SAC. ACM, New York, pp 1409–1413

  • Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: UAI ’04: proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, Arlington, pp 487–494

  • Sahavechaphan N, Claypool K (2006) Xsnippet: mining for sample code. SIGPLAN Not 41(10): 413–430

    Article  Google Scholar 

  • Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03: proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, New York, pp 76–85

  • Schröter A, Zimmermann T, Premraj R, Zeller A (2006) If your bug database could talk .... In: Proceedings of the 5th international symposium on empirical software engineering, vol II: short papers and posters. Rio de Janeiro, pp 18–20

  • Sindhgatta R (2006) Using an information retrieval system to retrieve source code samples. In: Osterweil LJ, Rombach HD, Soffa ML, (eds) ICSE. ACM, pp 905–908

  • Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery. In: KDD ’04: proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 306–315

  • Swamidass S, Baldi P (2007) Bounds and algorithms for exact searches of chemical fingerprints in linear and sub-linear time. J Chem Inf Model 47(2): 302–317

    Article  Google Scholar 

  • Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Statistical Assoc 101(476): 1566–1581

    Article  MATH  MathSciNet  Google Scholar 

  • Ugurel S, Krovetz R, Giles CL (2002) What’s the code? automatic classification of source code archives. In: KDD ’02: proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 632–638

  • Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1): 191–211

    Article  MATH  MathSciNet  Google Scholar 

  • Welker KD, Oman PW (1995) Software maintainability metrics models in practice. Crosstalk, J Def Softw Eng 8: 19–23

    Google Scholar 

  • Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: International workshop on source code analysis and manipulation, pp 45–54

  • Ye Y, Fischer G (2002) Supporting reuse by delivering task-relevant and personalized information. In: ICSE ’02: proceedings of the 24th international conference on software engineering. ACM, New York, pp 513–523

  • Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Pierre Baldi.

Additional information

Responsible editor: Eamonn Keogh.

Erik Linstead, Sushil Bajracharya, and Trung Ngo have contributed equally to this work.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Linstead, E., Bajracharya, S., Ngo, T. et al. Sourcerer: mining and searching internet-scale software repositories. Data Min Knowl Disc 18, 300–336 (2009). https://doi.org/10.1007/s10618-008-0118-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-008-0118-x

Keywords

Navigation