Identifying Multiple Authors in a Binary Program

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10493)

Abstract

Knowing the authors of a binary program has significant applications in forensics of malicious software (malware), software supply chain risk management, and software plagiarism detection. Existing techniques assume that a binary is written by a single author, which does not hold true in the real world because most modern software, including malware, contains code from multiple authors. In this paper, we take the first step toward identifying multiple authors in a binary. We present new fine-grained techniques to address the tougher problem of determining the author of each basic block. The decision to attribute authors at the basic block level is based on an empirical study of three large open source software projects, in which we find that a large fraction of basic blocks can be well attributed to a single author. We present new code features that capture programming style at the basic block level, our approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary. Our experiments show strong evidence that programming styles can be recovered at the basic block level and that it is practical to identify multiple authors in a binary.

Keywords

Binary code authorship, Code features, Software forensics

Notes

Acknowledgments

This work is supported in part by Department of Energy grant DE-SC0010474, National Science Foundation Cyber Infrastructure grants ACI-1547272 and ACI-1449918, Department of Homeland Security under AFRL Contract FA8750-12-2-0289, and a grant from Intel Corporation. This research was performed using the compute resources and assistance of the UW-Madison Center for High Throughput Computing (CHTC) in the Department of Computer Sciences.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Computer Sciences Department, University of Wisconsin - Madison, Madison, USA
  2. Wisconsin Institutes for Discovery, University of Wisconsin - Madison, Madison, USA
