Clonewise – Detecting Package-Level Clones Using Machine Learning

  • Conference paper
Security and Privacy in Communication Networks (SecureComm 2013)

Abstract

Developers sometimes maintain an internal copy of another software package or fork the development of an existing project. This practice can lead to software vulnerabilities when the embedded code is not kept up to date with upstream sources. We propose an automated solution to identify clones of packages without any prior knowledge of these relationships. We then correlate clones with vulnerability information to identify outstanding security problems. This approach motivates software maintainers to avoid cloned packages and instead link against system-wide libraries. We propose over 30 novel features that enable us to use pattern classification to accurately identify package-level clones. To our knowledge, we are the first to consider clone detection as a classification problem. Our results show that our system, Clonewise, compares well to manually tracked databases. Based on our work, over 30 previously unknown package clones and vulnerabilities have been identified and patched.
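To illustrate what framing clone detection as a classification problem can look like, here is a minimal sketch in Python. It is not the Clonewise implementation: the two features (shared filename count and Jaccard similarity of filename sets), the one-feature threshold learner, and the toy package contents are all assumptions made for illustration, and the abstract does not say which features or classifier the authors use.

```python
from typing import List, Set, Tuple


def pair_features(files_a: Set[str], files_b: Set[str]) -> Tuple[float, float]:
    """Return (shared filename count, Jaccard similarity) for two packages.

    Hypothetical features chosen for illustration only.
    """
    shared = files_a & files_b
    union = files_a | files_b
    jaccard = len(shared) / len(union) if union else 0.0
    return float(len(shared)), jaccard


def train_stump(samples: List[Tuple[float, bool]]) -> float:
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    candidates = sorted({value for value, _ in samples})
    best_threshold, best_errors = candidates[0], len(samples) + 1
    for threshold in candidates:
        # Predict "clone" when the feature value reaches the threshold.
        errors = sum((value >= threshold) != label for value, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def is_clone(files_a: Set[str], files_b: Set[str], threshold: float) -> bool:
    """Classify a package pair using the Jaccard feature and a learned threshold."""
    return pair_features(files_a, files_b)[1] >= threshold


# Toy data (invented): a library, an application embedding it, an unrelated package.
zlib_files = {"inflate.c", "deflate.c", "adler32.c", "crc32.c"}
app_files = {"main.c", "inflate.c", "deflate.c", "adler32.c"}
unrelated_files = {"main.c", "parser.c", "lexer.c"}

# Labeled package pairs: (Jaccard similarity, is the pair a clone relationship?).
training = [
    (pair_features(zlib_files, app_files)[1], True),
    (pair_features(zlib_files, unrelated_files)[1], False),
    (pair_features(app_files, unrelated_files)[1], False),
]
threshold = train_stump(training)
```

With the threshold learned from labeled pairs, `is_clone(zlib_files, app_files, threshold)` flags the embedded copy, while the pair that shares only `main.c` is rejected. A realistic system would use many more features and a stronger classifier; the point here is only the shape of the problem: one feature vector per package pair, one binary label.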

Copyright information

© 2013 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Cesare, S., Xiang, Y., Zhang, J. (2013). Clonewise – Detecting Package-Level Clones Using Machine Learning. In: Zia, T., Zomaya, A., Varadharajan, V., Mao, M. (eds) Security and Privacy in Communication Networks. SecureComm 2013. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 127. Springer, Cham. https://doi.org/10.1007/978-3-319-04283-1_13

  • DOI: https://doi.org/10.1007/978-3-319-04283-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-04282-4

  • Online ISBN: 978-3-319-04283-1

  • eBook Packages: Computer Science, Computer Science (R0)
