Abstract
Developers sometimes maintain an internal copy of another software package or fork the development of an existing project. This practice can lead to software vulnerabilities when the embedded code is not kept up to date with upstream sources. We propose an automated solution to identify clones of packages without any prior knowledge of these relationships. We then correlate clones with vulnerability information to identify outstanding security problems. This approach motivates software maintainers to avoid using cloned packages and instead link against system-wide libraries. We propose over 30 novel features that enable us to use pattern classification to accurately identify package-level clones. To our knowledge, we are the first to treat clone detection as a classification problem. Our results show that our system, Clonewise, compares well to manually tracked databases. Based on our work, over 30 previously unknown package clones and vulnerabilities have been identified and patched.
Copyright information
© 2013 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Cesare, S., Xiang, Y., Zhang, J. (2013). Clonewise – Detecting Package-Level Clones Using Machine Learning. In: Zia, T., Zomaya, A., Varadharajan, V., Mao, M. (eds) Security and Privacy in Communication Networks. SecureComm 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 127. Springer, Cham. https://doi.org/10.1007/978-3-319-04283-1_13
DOI: https://doi.org/10.1007/978-3-319-04283-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04282-4
Online ISBN: 978-3-319-04283-1
eBook Packages: Computer Science, Computer Science (R0)