Who Wrote This Code? Identifying the Authors of Program Binaries

  • Nathan Rosenblum
  • Xiaojin Zhu
  • Barton P. Miller
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6879)

Abstract

Program authorship attribution (identifying a programmer based on stylistic characteristics of code) has practical implications for detecting software theft, digital forensics, and malware analysis. Authorship attribution is challenging in these domains, where usually only binary code is available; existing source code-based approaches to attribution have left unclear whether and to what extent programmer style survives the compilation process. Casting authorship attribution as a machine learning problem, we present a novel program representation and techniques that automatically detect the stylistic features of binary code. We apply these techniques to two attribution problems: identifying the precise author of a program, and finding stylistic similarities between programs by unknown authors. Our experiments provide strong evidence that programmer style is preserved in program binaries.

Keywords

Binary Code; Parallel Corpus; Stylistic Characteristic; Programmer Style; Feature Template
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Nathan Rosenblum (1)
  • Xiaojin Zhu (1)
  • Barton P. Miller (1)
  1. University of Wisconsin, Madison, USA
