Abstract
Clone detection is an active area of research. However, there is a marked lack of clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones, in which the cloned code may have been significantly edited. SourcererCC was developed as an attempt to fill this gap. It is a widely used token-based clone detector that targets three clone types and exploits an index to scale to large inter-project repositories on a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. In the evaluation experiments, SourcererCC demonstrated both high recall and high precision, and it scaled to a large inter-project repository (250 MLOC) even on a standard workstation. This chapter reflects on some of the principal design decisions behind the success of SourcererCC and also presents an architecture to scale it horizontally.
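The interplay between the inverted index and the token-ordering filter can be illustrated with a minimal Python sketch. It is not SourcererCC's implementation: the regular-expression tokenizer, the threshold value `THETA = 0.8`, the prefix-length formula, and the function names are all assumptions chosen for illustration. Each code block is treated as a bag of tokens, tokens are ordered rarest-first by global frequency, and only a short prefix of each block is indexed and queried; candidate pairs are then verified by counting shared tokens.

```python
import math
import re
from collections import Counter, defaultdict

THETA = 0.8  # assumed similarity threshold, not taken from the chapter


def tokenize(code):
    """Split a code block into a bag (multiset) of identifier/number tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+", code))


def ordered_tokens(bag, global_freq):
    """Expand a bag into a list sorted rarest-first by global frequency."""
    return sorted(bag.elements(), key=lambda t: (global_freq[t], t))


def prefix_length(size, theta=THETA):
    """Number of leading tokens that must be indexed for a block of `size` tokens."""
    return size - math.ceil(theta * size) + 1


def detect_clones(blocks, theta=THETA):
    """Return the set of (id, id) pairs whose token overlap meets the threshold."""
    bags = {bid: tokenize(src) for bid, src in blocks.items()}
    global_freq = Counter()
    for bag in bags.values():
        global_freq.update(bag)

    index = defaultdict(set)  # token -> ids of blocks with that token in their prefix
    clones = set()
    for bid, bag in bags.items():
        size = sum(bag.values())
        prefix = ordered_tokens(bag, global_freq)[:prefix_length(size, theta)]

        # Query: only blocks sharing a prefix token can reach the threshold.
        candidates = set()
        for tok in prefix:
            candidates |= index[tok]

        # Verify each candidate by counting shared tokens (multiset intersection).
        for other in candidates:
            other_bag = bags[other]
            overlap = sum((bag & other_bag).values())
            needed = math.ceil(theta * max(size, sum(other_bag.values())))
            if overlap >= needed:
                clones.add(tuple(sorted((bid, other))))

        # Index this block's prefix so later blocks can find it.
        for tok in prefix:
            index[tok].add(bid)
    return clones
```

In this sketch a block is queried before it is indexed, so each candidate pair is compared only once, and indexing only the rarest tokens keeps both the index and the candidate lists small.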
Notes
- 1. The author was part of the team that carried out the analysis.
- 2. i5 processor, 8 GB RAM, and 500 GB disk storage.
- 3.
- 4. Similar to the popular bag-of-words model [22] in Information Retrieval.
- 5. The curve can also be represented by the quadratic function \(y = x(x-1)/2\), where \(x\) is the number of methods in a project and \(y\) is the number of candidate comparisons carried out to detect all clone pairs.
- 6. In the case of a single high-performance multi-processor machine, \(N+1\) is the number of processors available on that machine.
- 7.
- 8.
References
C. Caprile, P. Tonella, Nomen est omen: analyzing the language of function identifiers, in Reverse Engineering. Proceedings. Sixth Working Conference on (1999), pp. 112–122. https://doi.org/10.1109/WCRE.1999.806952
S. Chaudhuri, V. Ganti, R. Kaushik, A primitive operator for similarity joins in data cleaning, in Proceedings of the 22nd International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, ICDE ’06 (2006), p. 5. https://doi.org/10.1109/ICDE.2006.9
K. Chen, P. Liu, Y. Zhang, Achieving accuracy and scalability simultaneously in detecting application clones on Android markets, in Proceedings of the 36th International Conference on Software Engineering, ACM, New York, NY, USA, (ICSE 2014), pp. 175–186
J.R. Cordy, Comprehending reality-practical barriers to industrial adoption of software maintenance automation, in Program Comprehension, 2003. 11th IEEE International Workshop on (IEEE, 2003), pp. 196–205
J. Davies, D. German, M. Godfrey, A. Hindle, Software Bertillonage: finding the provenance of an entity, in Proceedings of MSR (2011)
F. Deissenboeck, M. Pizka, Concise and consistent naming. Softw. Qual. J. 14(3), 261–282 (2006). https://doi.org/10.1007/s11219-006-9219-1
L. Guerrouj, Normalizing source code vocabulary to support program comprehension and software quality, in Software Engineering (ICSE), 2013 35th International Conference on (2013), pp. 1385–1388. https://doi.org/10.1109/ICSE.2013.6606723
A. Hemel, R. Koschke, Reverse engineering variability in source code using clone detection: a case study for Linux variants of consumer electronic devices, in Proceedings of Working Conference on Reverse Engineering (2012), pp. 357–366
A. Hindle, E. T. Barr, Z. Su, M. Gabel, P. Devanbu, On the naturalness of software, in Proceedings of the 34th International Conference on Software Engineering (IEEE Press, Piscataway, NJ, USA, ICSE ’12, 2012), pp. 837–847. http://dl.acm.org/citation.cfm?id=2337223.2337322
B. Hummel, E. Juergens, L. Heinemann, M. Conradt, Index-based code clone detection: incremental, distributed, scalable, in Proceedings of ICSM (2010)
T. Ishihara, K. Hotta, Y. Higo, H. Igaki, S. Kusumoto, Inter-project functional clone detection toward building libraries: an empirical study on 13,000 projects, in Reverse Engineering (WCRE), 2012 19th Working Conference on (2012), pp. 387–391. https://doi.org/10.1109/WCRE.2012.48
I. Keivanloo, J. Rilling, P. Charland, Internet-scale real-time code clone search via multi-level indexing, in Proceedings of WCRE (2011)
M. Kim, D. Notkin, Program element matching for multi-version program analyses, in Proceedings of the 2006 International Workshop on Mining Software Repositories (ACM, New York, NY, USA, MSR ’06, 2006), pp. 58–64. https://doi.org/10.1145/1137983.1137999
R. Koschke, Large-scale inter-system clone detection using suffix trees, in Proceedings of CSMR (2012), pp. 309–318
D. Lawrie, C. Morrell, H. Feild, D. Binkley, What’s in a name? a study of identifiers, in 14th IEEE International Conference on Program Comprehension (ICPC’06) (2006), pp. 3–12. https://doi.org/10.1109/ICPC.2006.51
S. Livieri, Y. Higo, M. Matsushita, K. Inoue, Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder, in Proceedings of ICSE (2007)
C. V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, J. Vitek, Déjà vu: a map of code duplicates on GitHub. Proc. ACM Program. Lang. 1(OOPSLA) (2017). https://doi.org/10.1145/3133908
C. Roy, M. Zibran, R. Koschke, The vision of software clone management: past, present, and future (keynote paper), in Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week–IEEE Conference on (2014), pp. 18–33
H. Sajnani, V. Saini, J. Svajlenko, C.K. Roy, C.V. Lopes, SourcererCC: scaling code clone detection to big-code, in Proceedings of the 38th International Conference on Software Engineering, Association for Computing Machinery (New York, NY, USA, ICSE ’16, 2016), pp. 1157–1168. https://doi.org/10.1145/2884781.2884877
J. Svajlenko, I. Keivanloo, C. Roy, Scaling classical clone detection tools for ultra-large datasets: an exploratory study, in Software Clones (IWSC), 2013 7th International Workshop on (2013), pp. 16–22
C. Xiao, W. Wang, X. Lin, J. X. Yu, Efficient similarity joins for near duplicate detection, in Proceedings of the 17th International Conference on World Wide Web (ACM, New York, NY, USA, WWW ’08, 2008), pp. 131–140. https://doi.org/10.1145/1367497.1367516
Y. Zhang, R. Jin, Z.H. Zhou, Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybern. 1(1-4), 43–52 (2010). https://doi.org/10.1007/s13042-010-0001-0
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Sajnani, H., Saini, V., Roy, C.K., Lopes, C. (2021). SourcererCC: Scalable and Accurate Clone Detection. In: Inoue, K., Roy, C.K. (eds) Code Clone Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-16-1927-4_4
DOI: https://doi.org/10.1007/978-981-16-1927-4_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1926-7
Online ISBN: 978-981-16-1927-4
eBook Packages: Computer Science, Computer Science (R0)