Knowledge and Information Systems, Volume 41, Issue 3, pp 727–759

Surfacing code in the dark: an instant clone search approach

  • Jin-woo Park
  • Mu-Woong Lee
  • Jong-Won Roh
  • Seung-won Hwang
  • Sunghun Kim
Regular Paper


In this paper, we study how to “surface” code for instant reference. The traditional way of surfacing code has been to treat code as text and apply keyword search techniques. However, many prior works observe the limitations of such an approach: (1) the semantic description of code is limited to comments, and (2) syntactic keywords are often not selective enough. In contrast, we discuss enabling techniques and scenarios for instant semantic-based surfacing. For example, developers, during a development session, may reference existing code that shares similar semantics, using the code they have written so far as a query. In addition to such semantic-based surfacing, we also enhance keyword-based surfacing with semantics, by instantly adding semantic tags to code submitted to the repository. To achieve this goal, we first propose scalable indexing structures on vector abstractions of code. Our experimental results show that our techniques outperform a state-of-the-art tool in efficiency without compromising accuracy. We then deploy our techniques in instant search and tagging scenarios. For the instant code search scenario, we demonstrate an instant clone search tool built on our techniques, supporting sub-second search over 54 million lines of code (LOC). For the instant code tagging scenario, we propose an automatic instant code tagging algorithm that mines meaningful tags from clones.
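As a rough illustration of what searching over a vector abstraction of code can look like, the following Python sketch counts syntax-tree node types per snippet and ranks candidates by Euclidean distance. It is not the indexing structure proposed in this paper: the choice of node-type counts as the vector, the linear scan in place of a scalable index, and the names `code_vector`, `distance`, and `search_clones` are all assumptions made for this example.

```python
import ast
import math
from collections import Counter


def code_vector(source):
    """Count AST node types in a snippet (an assumed, simplified vector abstraction)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def distance(a, b):
    """Euclidean distance between two sparse count vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def search_clones(query, corpus, k=1):
    """Return the k corpus entries closest to the query.

    This is a plain linear scan; a scalable index would replace it in practice.
    """
    q = code_vector(query)
    scored = [(name, distance(q, code_vector(src))) for name, src in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy repository: one summation loop and one unrelated function.
corpus = {
    "sum_list": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "greet": "def greet(name):\n    print('hello ' + name)\n",
}

# The query has the same structure as "sum_list" but different identifiers.
query = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

print(search_clones(query, corpus))
```

Because the vector ignores identifier names, the renamed loop in the query maps to the same vector as `sum_list` and is ranked first, which is the kind of match that keyword search on the raw text would miss.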


Code indexing, Instant code search, Instant code tagging, Software development



This work was supported by the Engineering Research Center of Excellence Program of Korea Ministry of Science, ICT & Future Planning (MSIP)/National Research Foundation of Korea (NRF) (Grant NRF-2008-0062609).



Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Jin-woo Park (1)
  • Mu-Woong Lee (1)
  • Jong-Won Roh (1)
  • Seung-won Hwang (1)
  • Sunghun Kim (2)

  1. Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
  2. Hong Kong University of Science and Technology (HKUST), Hong Kong, China
