Advertisement

Source Code Authorship Attribution Using Long Short-Term Memory Based Networks

  • Bander Alsulami
  • Edwin Dauber
  • Richard Harang
  • Spiros Mancoridis
  • Rachel Greenstadt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10492)

Abstract

Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST) of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code. However, these AST-based approaches rely on hand-constructed features derived from such trees, and often include ancillary information such as function and variable names that may be obfuscated or manipulated.

In this work, we provide novel contributions to AST-based source code authorship attribution using deep neural networks. We implement Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models to automatically extract relevant features from the AST representation of programmers’ source code. We show that our models can automatically learn efficient representations of AST-based features without needing hand-constructed ancillary information used by previous methods. Our empirical study on multiple datasets with different programming languages shows that our proposed approach achieves the state-of-the-art performance for source code authorship attribution on AST-based features, despite not leveraging information that was previously thought to be required for high-confidence classification.

Keywords

Source code authorship attribution Code stylometry Long short-term memory Abstract syntax tree Security Privacy 

Notes

Acknowledgments

This work is supported by a Fellowship from the Isaac L. Auerbach Cybersecurity Institute at Drexel University, and by an appointment to the Student Research Participation Program at the U.S Army Research Laboratory administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and USARL.

References

  1. 1.
    Antoniol, G., Fiutem, R., Cristoforetti, L.: Using metrics to identify design patterns in object-oriented software. In: Proceedings of Fifth International Symposium on Software Metrics, 1998, pp. 23–34. IEEE (1998)Google Scholar
  2. 2.
    Balachandran, V., Tan, D.J., Thing, V.L., et al.: Control flow obfuscation for android applications. Comput. Secur. 61, 72–93 (2016)CrossRefGoogle Scholar
  3. 3.
    Barford, P., Yegneswaran, V.: An inside look at botnets. In: Christodorescu, M., Jha, S., Maughan, D., Song, D., Wang, C. (eds.) Malware Detection. Advances in Information Security, vol. 27, pp. 171–191. Springer, Boston, MA (2007). doi: 10.1007/978-0-387-44599-1_8 CrossRefGoogle Scholar
  4. 4.
    Baxter, I.D., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using abstract syntax trees. In: 1998 Proceedings of International Conference on Software Maintenance, pp. 368–377. IEEE (1998)Google Scholar
  5. 5.
    Bischofberger, W.R.: Sniff (abstract): a pragmatic approach to a c++ programming environment. ACM SIGPLAN OOPS Messenger 4(2), 229 (1993)CrossRefGoogle Scholar
  6. 6.
    Burrows, S., Uitdenbogerd, A.L., Turpin, A.: Application of information retrieval techniques for source code authorship attribution. In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds.) DASFAA 2009. LNCS, vol. 5463, pp. 699–713. Springer, Heidelberg (2009). doi: 10.1007/978-3-642-00887-0_61 CrossRefGoogle Scholar
  7. 7.
    Caliskan-Islam, A., Harang, R., Liu, A., Narayanan, A., Voss, C., Yamaguchi, F., Greenstadt, R.: De-anonymizing programmers via code stylometry. In: 24th USENIX Security Symposium (USENIX Security), Washington, DC (2015)Google Scholar
  8. 8.
    Caliskan-Islam, A., Yamaguchi, F., Dauber, E., Harang, R., Rieck, K., Greenstadt, R., Narayanan, A.: When coding style survives compilation: De-anonymizing programmers from executable binaries. arXiv preprint (2015). arXiv:1512.08546
  9. 9.
    Calliss, F.W.: Problems with automatic restructurers. ACM SIGPLAN Notices 23(3), 13–21 (1988)CrossRefGoogle Scholar
  10. 10.
    Caruana, R., Lawrence, S., Giles, L.: Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In: NIPS, pp. 402–408 (2000)Google Scholar
  11. 11.
    Chilowicz, M., Duris, E., Roussel, G.: Syntax tree fingerprinting for source code similarity detection. In: 2009 IEEE 17th International Conference on Program Comprehension, ICPC 2009, pp. 243–247. IEEE (2009)Google Scholar
  12. 12.
    Dauber, E., Caliskan-Islam, A., Harang, R., Greenstadt, R.: Git blame who?: stylistic authorship attribution of small, incomplete source code fragments. arXiv preprint (2017). arXiv:1701.05681
  13. 13.
    Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231 (2012)Google Scholar
  14. 14.
    Elenbogen, B.S., Seliya, N.: Detecting outsourced student programming assignments. J. Comput. Sci. Coll. 23(3), 50–57 (2008)Google Scholar
  15. 15.
    Frantzeskou, G., Stamatatos, E., Gritzalis, S., Chaski, C.E., Howald, B.S.: Identifying authorship by byte-level n-grams: the source code author profile (SCAP) method. Int. J. Dig. Evid. 6(1), 1–18 (2007)Google Scholar
  16. 16.
    Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)CrossRefGoogle Scholar
  17. 17.
    Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks architectures. Neural Comput. 7(2), 219–269 (1995)CrossRefGoogle Scholar
  18. 18.
    Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, vol. 9, pp. 249–256 (2010)Google Scholar
  19. 19.
    Goller, C., Kuchler, A.: Learning task-dependent distributed representations by backpropagation through structure. In: 1996 IEEE International Conference on Neural Networks, vol. 1, pp. 347–352. IEEE (1996)Google Scholar
  20. 20.
    Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)zbMATHGoogle Scholar
  21. 21.
    Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5), 602–610 (2005)CrossRefGoogle Scholar
  22. 22.
    Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. (2016)Google Scholar
  23. 23.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)Google Scholar
  24. 24.
    Heuzeroth, D., Holl, T., Hogstrom, G., Lowe, W.: Automatic design pattern detection. In: 2003 11th IEEE International Workshop on Program Comprehension, pp. 94–103. IEEE (2003)Google Scholar
  25. 25.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)CrossRefGoogle Scholar
  26. 26.
    Kim, M., Notkin, D., Grossman, D.: Automatic inference of structural changes for matching across program versions. In: ICSE, vol. 7, pp. 333–343 (2007)Google Scholar
  27. 27.
    Koschke, R., Falke, R., Frenzel, P.: Clone detection using abstract syntax suffix trees. In: 2006 13th Working Conference on Reverse Engineering, WCRE 2006, pp. 253–262. IEEE (2006)Google Scholar
  28. 28.
    Kothari, J., Shevertalov, M., Stehle, E., Mancoridis, S.: A probabilistic approach to source code authorship identification. In: 2007 Fourth International Conference on Information Technology, ITNG 2007, pp. 243–248. IEEE (2007)Google Scholar
  29. 29.
    Lazar, F.M., Banias, O.: Clone detection algorithm based on the abstract syntax tree approach. In: 2014 IEEE 9th International Symposium on Applied Computational Intelligence and Informatics (SACI), pp. 73–78. IEEE (2014)Google Scholar
  30. 30.
    Li, J., Luong, M.T., Jurafsky, D., Hovy, E.: When are tree structures necessary for deep learning of representations? arXiv preprint (2015). arXiv:1503.00185
  31. 31.
    Marquis-Boire, M., Marschalek, M., Guarnieri, C.: Big Game Hunting: The Peculiarities in Nation-State Malware Research. Black Hat, Las Vegas (2015)Google Scholar
  32. 32.
    Meng, X.: Fine-grained binary code authorship identification. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 1097–1099. ACM (2016)Google Scholar
  33. 33.
    Mens, T., Tourwé, T.: A survey of software refactoring. IEEE Trans. Softw. Eng. 30(2), 126–139 (2004)CrossRefGoogle Scholar
  34. 34.
    Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint (2013). arXiv:1301.3781
  35. 35.
    Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for malware detection. In: 2007 Twenty-Third Annual Computer security Applications Conference, ACSAC 2007, pp. 421–430. IEEE (2007)Google Scholar
  36. 36.
    Neamtiu, I., Foster, J.S., Hicks, M.: Understanding source code evolution using abstract syntax tree matching. ACM SIGSOFT Softw. Eng. Notes 30(4), 1–5 (2005)CrossRefGoogle Scholar
  37. 37.
    Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. ICML 3(28), 1310–1318 (2013)Google Scholar
  38. 38.
    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetzbMATHGoogle Scholar
  39. 39.
    Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, vol. 14, pp. 1532–1543 (2014)Google Scholar
  40. 40.
    Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Cogn. Model. 5(3), 1 (1988)zbMATHGoogle Scholar
  41. 41.
    Russell, S., Norvig, P., Intelligence, A.: A Modern Approach. Artificial Intelligence. Prentice-Hall, Egnlewood Cliffs (1995). pp. 25, 27zbMATHGoogle Scholar
  42. 42.
    Sak, H., Senior, A.W., Beaufays, F.: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Interspeech, pp. 338–342 (2014)Google Scholar
  43. 43.
    Schrittwieser, S., Katzenbeisser, S., Kinder, J., Merzdovnik, G., Weippl, E.: Protecting software through obfuscation: can it keep pace with progress in code analysis? ACM Comput. Surv. (CSUR) 49(1), 4 (2016)CrossRefGoogle Scholar
  44. 44.
    Socher, R., Bauer, J., Manning, C.D., Ng, A.Y.: Parsing with compositional vector grammars. In: ACL, vol. 1, pp. 455–465 (2013)Google Scholar
  45. 45.
    Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. In: ICML (3), vol. 28, pp. 1139–1147 (2013)Google Scholar
  46. 46.
    Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)Google Scholar
  47. 47.
    Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint (2015). arXiv:1503.00075
  48. 48.
    Targ, S., Almeida, D., Lyman, K.: Resnet in resnet: generalizing residual architectures. arXiv preprint (2016). arXiv:1603.08029
  49. 49.
    Tennyson, M.F.: A replicated comparative study of source code authorship attribution. In: 2013 3rd International Workshop on Replication in Empirical Software Engineering Research (RESER), pp. 76–83. IEEE (2013)Google Scholar
  50. 50.
    Tokui, S., Oono, K., Hido, S., Clayton, J.: Chainer: a next-generation open source framework for deep learning. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in the Twenty-Ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015)Google Scholar
  51. 51.
    Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)Google Scholar
  52. 52.
    Wisse, W., Veenman, C.: Scripting DNA: identifying the javascript programmer. Dig. Invest. 15, 61–71 (2015)CrossRefGoogle Scholar
  53. 53.
    Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint (2016). arXiv:1609.08144
  54. 54.
    Yamaguchi, F., Golde, N., Arp, D., Rieck, K.: Modeling and discovering vulnerabilities with code property graphs. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P) (2014)Google Scholar
  55. 55.
    Yu, D.Q., Peng, X., Zhao, W.Y.: Automatic refactoring method of cloned code using abstract syntax tree and static analysis. J. Chin. Comput. Syst. 30(9), 1752–1760 (2009)Google Scholar
  56. 56.
    Zaremba, W.: An empirical exploration of recurrent network architectures (2015)Google Scholar
  57. 57.
    Zhang, F., Jhi, Y.C., Wu, D., Liu, P., Zhu, S.: A first step towards algorithm plagiarism detection. In: Proceedings of the 2012 International Symposium on Software Testing and Analysis, pp. 111–121. ACM (2012)Google Scholar
  58. 58.
    Zhou, C., Sun, C., Liu, Z., Lau, F.: A C-LSTM neural network for text classification. arXiv preprint (2015). arXiv:1511.08630
  59. 59.
    Zhu, X.D., Sobhani, P., Guo, H.: Long short-term memory over recursive structures. In: ICML, pp. 1604–1612 (2015)Google Scholar

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. 1.Drexel UniversityPhiladelphiaUSA
  2. 2.SophosAbingdonUK

Personalised recommendations