SourcererCC: Scalable and Accurate Clone Detection

Code Clone Analysis

Abstract

Clone detection is an active area of research. However, there is a marked lack of clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones, where significant editing may have taken place in the cloned code. SourcererCC was developed as an attempt to fill this gap. It is a widely used token-based clone detector that targets three clone types and exploits an index to scale to large inter-project repositories on a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. In the evaluation experiments, SourcererCC demonstrated both high recall and precision, and the ability to scale to a large inter-project repository (250 MLOC) even on a standard workstation. This chapter reflects on some of the principal design decisions behind the success of SourcererCC and also presents an architecture to scale it horizontally.
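
The index-and-filter idea in the abstract can be illustrated with a minimal sketch. It assumes an overlap-based similarity on bags of tokens and a prefix-filtering rule of the kind used in set-similarity joins [2, 21]. The whitespace tokenizer, the function names, and the threshold `theta` are hypothetical choices made for this example, not the tool's actual implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(code):
    # Hypothetical tokenizer: a whitespace split stands in for a real lexer.
    return Counter(code.split())


def overlap(a, b):
    # Bag-of-tokens overlap: shared tokens counted with multiplicity.
    return sum((a & b).values())


def detect_clones(blocks, theta=0.75):
    """Return pairs of block ids whose token overlap reaches the threshold.

    theta is an illustrative similarity threshold (an assumption).
    """
    bags = {bid: tokenize(src) for bid, src in blocks.items()}

    # One global ordering for all blocks: rarest tokens first.
    freq = Counter()
    for bag in bags.values():
        freq.update(bag)

    def ordered(bag):
        return sorted(bag.elements(), key=lambda t: (freq[t], t))

    index = defaultdict(set)  # token -> ids of blocks indexed under it
    clones = set()
    for bid in sorted(bags):
        bag = bags[bid]
        tokens = ordered(bag)
        size = len(tokens)
        # Prefix filtering: only this many leading tokens are indexed/queried.
        prefix = tokens[: size - math.ceil(theta * size) + 1]

        # Query the inverted index for candidates sharing a prefix token.
        candidates = set()
        for tok in prefix:
            candidates |= index[tok]

        # Verify each candidate with the full overlap computation.
        for cid in candidates:
            other = bags[cid]
            needed = math.ceil(theta * max(size, sum(other.values())))
            if overlap(bag, other) >= needed:
                clones.add((cid, bid))

        # Index the current block so later blocks can find it.
        for tok in prefix:
            index[tok].add(bid)
    return clones
```

In this sketch only a short prefix of each block is indexed, so blocks that share no rare token are never compared at all, while every surviving candidate is still checked against the full overlap threshold.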

Notes

  1. The author was part of the team that carried out the analysis.

  2. i5 processor, 8 GB RAM, and 500 GB disk storage.

  3. https://en.wikipedia.org/wiki/Debian.

  4. Similar to the popular bag-of-words model [22] in Information Retrieval.

  5. The curve can also be represented by the quadratic function \(y = x(x-1)/2\), where x is the number of methods in a project and y is the number of candidate comparisons carried out to detect all clone pairs.

  6. In the case of a single high-performance multi-processor machine, \(N+1\) is the number of processors available on that machine.

  7. https://visualvm.java.net/.

  8. http://profiler4j.sourceforge.net/.
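
To make the quadratic growth in note 5 concrete, here is a worked instance. The project size of x = 1000 methods is an illustrative assumption chosen for this example, not a figure from the chapter.

```latex
\[
y = \frac{x(x-1)}{2} = \frac{1000 \cdot 999}{2} = 499{,}500
\]
```

Doubling the project to x = 2000 methods gives \(y = 2000 \cdot 1999 / 2 = 1{,}999{,}000\) comparisons, roughly four times as many.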

References

  1. C. Caprile, P. Tonella, Nomen est omen: analyzing the language of function identifiers, in Reverse Engineering. Proceedings. Sixth Working Conference on (1999), pp. 112–122. https://doi.org/10.1109/WCRE.1999.806952

  2. S. Chaudhuri, V. Ganti, R. Kaushik, A primitive operator for similarity joins in data cleaning, in Proceedings of the 22nd International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, ICDE ’06 (2006), p. 5. https://doi.org/10.1109/ICDE.2006.9

  3. K. Chen, P. Liu, Y. Zhang, Achieving accuracy and scalability simultaneously in detecting application clones on android markets, in Proceedings of the 36th International Conference on Software Engineering, ACM, New York, NY, USA, (ICSE 2014), pp. 175–186

  4. J.R. Cordy, Comprehending reality-practical barriers to industrial adoption of software maintenance automation, in Program Comprehension, 2003. 11th IEEE International Workshop on (IEEE, 2003), pp. 196–205

  5. J. Davies, D. German, M. Godfrey, A. Hindle, Software Bertillonage: finding the provenance of an entity, in Proceedings of MSR (2011)

  6. F. Deissenboeck, M. Pizka, Concise and consistent naming. Softw. Qual. J. 14(3), 261–282. https://doi.org/10.1007/s11219-006-9219-1

  7. L. Guerrouj, Normalizing source code vocabulary to support program comprehension and software quality, in Software Engineering (ICSE), 2013 35th International Conference on (2013), pp. 1385–1388. https://doi.org/10.1109/ICSE.2013.6606723

  8. A. Hemel, R. Koschke, Reverse engineering variability in source code using clone detection: a case study for linux variants of consumer electronic devices, in Proceedings of Working Conference on Reverse Engineering (2012), pp. 357–366

  9. A. Hindle, E. T. Barr, Z. Su, M. Gabel, P. Devanbu, On the naturalness of software, in Proceedings of the 34th International Conference on Software Engineering (IEEE Press, Piscataway, NJ, USA, ICSE ’12, 2012), pp. 837–847. http://dl.acm.org/citation.cfm?id=2337223.2337322

  10. B. Hummel, E. Juergens, L. Heinemann, M. Conradt, Index-based code clone detection: incremental, distributed, scalable, in Proceedings of ICSM (2010)

  11. T. Ishihara, K. Hotta, Y. Higo, H. Igaki, S. Kusumoto, Inter-project functional clone detection toward building libraries: an empirical study on 13,000 projects, in Reverse Engineering (WCRE), 2012 19th Working Conference on (2012), pp. 387–391. https://doi.org/10.1109/WCRE.2012.48

  12. I. Keivanloo, J. Rilling, P. Charland, Internet-scale real-time code clone search via multi-level indexing, in Proceedings of WCRE (2011)

  13. M. Kim, D. Notkin, Program element matching for multi-version program analyses, in Proceedings of the 2006 International Workshop on Mining Software Repositories (ACM, New York, NY, USA, MSR ’06, 2006), pp. 58–64. https://doi.org/10.1145/1137983.1137999

  14. R. Koschke, Large-scale inter-system clone detection using suffix trees, in Proceedings of CSMR (2012), pp. 309–318

  15. D. Lawrie, C. Morrell, H. Feild, D. Binkley, What’s in a name? a study of identifiers, in 14th IEEE International Conference on Program Comprehension (ICPC’06) (2006), pp. 3–12. https://doi.org/10.1109/ICPC.2006.51

  16. S. Livieri, Y. Higo, M. Matsushita, K. Inoue, Very-large scale code clone analysis and visualization of open source programs using distributed ccfinder: D-ccfinder, in Proceedings of ICSE (2007)

  17. C. V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, J. Vitek, Déjàvu: a map of code duplicates on GitHub. Proc. ACM Program. Lang. 1(OOPSLA) (2017). https://doi.org/10.1145/3133908

  18. C. Roy, M. Zibran, R. Koschke, The vision of software clone management: past, present, and future (keynote paper), in Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week–IEEE Conference on (2014), pp. 18–33

  19. H. Sajnani, V. Saini, J. Svajlenko, C.K. Roy, C.V. Lopes, Sourcerercc: scaling code clone detection to big-code, in Proceedings of the 38th International Conference on Software Engineering, Association for Computing Machinery (New York, NY, USA, ICSE ’16, 2016), pp. 1157–1168. https://doi.org/10.1145/2884781.2884877

  20. J. Svajlenko, I. Keivanloo, C. Roy, Scaling classical clone detection tools for ultra-large datasets: an exploratory study, in Software Clones (IWSC), 2013 7th International Workshop on (2013), pp. 16–22

  21. C. Xiao, W. Wang, X. Lin, J. X. Yu, Efficient similarity joins for near duplicate detection, in Proceedings of the 17th International Conference on World Wide Web (ACM, New York, NY, USA, WWW ’08, 2008), pp. 131–140. https://doi.org/10.1145/1367497.1367516

  22. Y. Zhang, R. Jin, Z.H. Zhou, Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybern. 1(1-4), 43–52 (2010). https://doi.org/10.1007/s13042-010-0001-0

Author information

Corresponding author

Correspondence to Hitesh Sajnani.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Sajnani, H., Saini, V., Roy, C.K., Lopes, C. (2021). SourcererCC: Scalable and Accurate Clone Detection. In: Inoue, K., Roy, C.K. (eds) Code Clone Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-16-1927-4_4

  • DOI: https://doi.org/10.1007/978-981-16-1927-4_4

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-1926-7

  • Online ISBN: 978-981-16-1927-4

  • eBook Packages: Computer Science, Computer Science (R0)
