SourcererCC: Scalable and Accurate Clone Detection

Sajnani, Hitesh; Saini, Vaibhav; Roy, Chanchal K.; Lopes, Cristina

doi:10.1007/978-981-16-1927-4_4

Abstract

Clone detection is an active area of research. However, few clone detectors scale to very large repositories of source code, in particular when detecting near-miss clones, where the cloned code may have undergone significant editing. SourcererCC was developed to fill this gap. It is a widely used token-based clone detector that targets three clone types and exploits an index to scale to large inter-project repositories on a standard workstation. SourcererCC uses an optimized inverted index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, and the number of token comparisons needed to judge a potential clone. In the evaluation experiments, SourcererCC demonstrated both high recall and precision, and scaled to a large inter-project repository (250 MLOC) on a standard workstation. This chapter reflects on some of the principal design decisions behind the success of SourcererCC and also presents an architecture to scale it horizontally.
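
The index-plus-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not SourcererCC's implementation. It assumes an overlap-based similarity (two blocks are clones when they share at least ceil(theta * max(|B1|, |B2|)) tokens) and the standard prefix-filtering rule from the similarity-join literature, under which only the first |B| - ceil(theta * |B|) + 1 tokens of each block, in a global rare-first order, need to be indexed. The names `bag`, `detect_clones`, and `theta` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bag(tokens):
    """Turn a token list into a set of (token, occurrence) pairs so that
    repeated tokens count separately (a bag-of-tokens view of the block)."""
    seen = Counter()
    items = set()
    for tok in tokens:
        seen[tok] += 1
        items.add((tok, seen[tok]))
    return items


def detect_clones(blocks, theta=0.8):
    """blocks: dict mapping block id -> list of tokens.
    Returns a set of (id, id) pairs judged to be clones (assumed rule:
    shared items >= ceil(theta * size of the larger block))."""
    bags = {bid: bag(toks) for bid, toks in blocks.items()}

    # Global order: rarest items first, ties broken by the item itself.
    freq = Counter(item for b in bags.values() for item in b)
    ordered = {bid: sorted(b, key=lambda it: (freq[it], it))
               for bid, b in bags.items()}

    index = defaultdict(list)  # item -> ids of blocks whose prefix holds it
    clones = set()
    for bid, items in ordered.items():
        if not items:
            continue
        # Only this prefix is indexed and queried (assumed filtering rule).
        prefix_len = len(items) - math.ceil(theta * len(items)) + 1
        prefix = items[:prefix_len]

        # Candidates are blocks sharing at least one prefix item.
        candidates = {other for item in prefix for other in index[item]}
        for other in candidates:
            needed = math.ceil(theta * max(len(items), len(bags[other])))
            if len(bags[bid] & bags[other]) >= needed:
                clones.add(tuple(sorted((other, bid))))

        for item in prefix:
            index[item].append(bid)
    return clones


if __name__ == "__main__":
    example = {
        "A": "int x = a + b ; return x ;".split(),
        "B": "int y = a + b ; return y ;".split(),
        "C": "while ( true ) { }".split(),
    }
    print(detect_clones(example, theta=0.7))
```

In this sketch, blocks A and B differ only in a renamed variable and share 8 of their 10 tokens, so they are reported at theta = 0.7. Block C shares no prefix token with either and is never compared, which is the point of indexing only the prefixes.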

Notes

1. The author was part of the team that carried out the analysis.
2. i5 processor, 8 GB RAM, and 500 GB disk storage.
3. https://en.wikipedia.org/wiki/Debian.
4. Similar to the popular bag-of-words model [22] in Information Retrieval.
5. The curve can also be represented by the quadratic function \(y = x(x-1)/2\), where x is the number of methods in a project and y is the number of candidate comparisons carried out to detect all clone pairs.
6. In case of a single high-performance multi-processor machine, \(N+1\) is the number of processors available on that machine.
7. https://visualvm.java.net/.
8. http://profiler4j.sourceforge.net/.
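
As a worked instance of the quadratic function in note 5 (the project sizes below are illustrative and not taken from the chapter), y counts the unordered pairs among x methods:

\[
y = \frac{x(x-1)}{2} = \binom{x}{2}, \qquad
x = 1000 \;\Rightarrow\; y = \frac{1000 \cdot 999}{2} = 499{,}500, \qquad
x = 2000 \;\Rightarrow\; y = \frac{2000 \cdot 1999}{2} = 1{,}999{,}000 .
\]

Doubling the number of methods therefore roughly quadruples the number of candidate comparisons.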

References

1. C. Caprile, P. Tonella, Nomen est omen: analyzing the language of function identifiers, in Reverse Engineering. Proceedings. Sixth Working Conference on (1999), pp. 112–122. https://doi.org/10.1109/WCRE.1999.806952
2. S. Chaudhuri, V. Ganti, R. Kaushik, A primitive operator for similarity joins in data cleaning, in Proceedings of the 22nd International Conference on Data Engineering (IEEE Computer Society, Washington, DC, USA, ICDE ’06, 2006), p. 5. https://doi.org/10.1109/ICDE.2006.9
3. K. Chen, P. Liu, Y. Zhang, Achieving accuracy and scalability simultaneously in detecting application clones on Android markets, in Proceedings of the 36th International Conference on Software Engineering (ACM, New York, NY, USA, ICSE 2014), pp. 175–186
4. J.R. Cordy, Comprehending reality-practical barriers to industrial adoption of software maintenance automation, in Program Comprehension, 2003. 11th IEEE International Workshop on (IEEE, 2003), pp. 196–205
5. J. Davies, D. German, M. Godfrey, A. Hindle, Software Bertillonage: finding the provenance of an entity, in Proceedings of MSR (2011)
6. F. Deissenboeck, M. Pizka, Concise and consistent naming. Softw. Qual. J. 14(3), 261–282. https://doi.org/10.1007/s11219-006-9219-1
7. L. Guerrouj, Normalizing source code vocabulary to support program comprehension and software quality, in Software Engineering (ICSE), 2013 35th International Conference on (2013), pp. 1385–1388. https://doi.org/10.1109/ICSE.2013.6606723
8. A. Hemel, R. Koschke, Reverse engineering variability in source code using clone detection: a case study for Linux variants of consumer electronic devices, in Proceedings of Working Conference on Reverse Engineering (2012), pp. 357–366
9. A. Hindle, E.T. Barr, Z. Su, M. Gabel, P. Devanbu, On the naturalness of software, in Proceedings of the 34th International Conference on Software Engineering (IEEE Press, Piscataway, NJ, USA, ICSE ’12, 2012), pp. 837–847. http://dl.acm.org/citation.cfm?id=2337223.2337322
10. B. Hummel, E. Juergens, L. Heinemann, M. Conradt, Index-based code clone detection: incremental, distributed, scalable, in Proceedings of ICSM (2010)
11. T. Ishihara, K. Hotta, Y. Higo, H. Igaki, S. Kusumoto, Inter-project functional clone detection toward building libraries: an empirical study on 13,000 projects, in Reverse Engineering (WCRE), 2012 19th Working Conference on (2012), pp. 387–391. https://doi.org/10.1109/WCRE.2012.48
12. I. Keivanloo, J. Rilling, P. Charland, Internet-scale real-time code clone search via multi-level indexing, in Proceedings of WCRE (2011)
13. M. Kim, D. Notkin, Program element matching for multi-version program analyses, in Proceedings of the 2006 International Workshop on Mining Software Repositories (ACM, New York, NY, USA, MSR ’06, 2006), pp. 58–64. https://doi.org/10.1145/1137983.1137999
14. R. Koschke, Large-scale inter-system clone detection using suffix trees, in Proceedings of CSMR (2012), pp. 309–318
15. D. Lawrie, C. Morrell, H. Feild, D. Binkley, What’s in a name? A study of identifiers, in 14th IEEE International Conference on Program Comprehension (ICPC’06) (2006), pp. 3–12. https://doi.org/10.1109/ICPC.2006.51
16. S. Livieri, Y. Higo, M. Matsushita, K. Inoue, Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder, in Proceedings of ICSE (2007)
17. C.V. Lopes, P. Maj, P. Martins, V. Saini, D. Yang, J. Zitny, H. Sajnani, J. Vitek, Déjàvu: a map of code duplicates on GitHub, in Proceedings of the ACM Program Lang 1(OOPSLA) (2017). https://doi.org/10.1145/3133908
18. C. Roy, M. Zibran, R. Koschke, The vision of software clone management: past, present, and future (keynote paper), in Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week–IEEE Conference on (2014), pp. 18–33
19. H. Sajnani, V. Saini, J. Svajlenko, C.K. Roy, C.V. Lopes, SourcererCC: scaling code clone detection to big-code, in Proceedings of the 38th International Conference on Software Engineering (Association for Computing Machinery, New York, NY, USA, ICSE ’16, 2016), pp. 1157–1168. https://doi.org/10.1145/2884781.2884877
20. J. Svajlenko, I. Keivanloo, C. Roy, Scaling classical clone detection tools for ultra-large datasets: an exploratory study, in Software Clones (IWSC), 2013 7th International Workshop on (2013), pp. 16–22
21. C. Xiao, W. Wang, X. Lin, J.X. Yu, Efficient similarity joins for near duplicate detection, in Proceedings of the 17th International Conference on World Wide Web (ACM, New York, NY, USA, WWW ’08, 2008), pp. 131–140. https://doi.org/10.1145/1367497.1367516
22. Y. Zhang, R. Jin, Z.H. Zhou, Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybern. 1(1-4), 43–52 (2010). https://doi.org/10.1007/s13042-010-0001-0

Author information

Authors and Affiliations

Microsoft, Redmond, USA
Hitesh Sajnani & Vaibhav Saini
University of Saskatchewan, Saskatoon, Canada
Chanchal K. Roy
University of California, Irvine, USA
Cristina Lopes

Corresponding author

Correspondence to Hitesh Sajnani.

Editor information

Editors and Affiliations

Graduate School of Information Science and Technology, Osaka University, Suita, Osaka, Japan
Katsuro Inoue
Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
Chanchal K. Roy

About this chapter

Cite this chapter

Sajnani, H., Saini, V., Roy, C.K., Lopes, C. (2021). SourcererCC: Scalable and Accurate Clone Detection. In: Inoue, K., Roy, C.K. (eds) Code Clone Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-16-1927-4_4

DOI: https://doi.org/10.1007/978-981-16-1927-4_4
Published: 04 August 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1926-7
Online ISBN: 978-981-16-1927-4
eBook Packages: Computer Science, Computer Science (R0)
