Jackdaw: Towards Automatic Reverse Engineering of Large Datasets of Binaries

  • Mario Polino
  • Andrea Scorti
  • Federico Maggi
  • Stefano Zanero
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9148)


When analyzing an untrusted binary, reverse engineers usually rely on ad hoc collections of interesting dynamic patterns (known as behaviors in the malware-analysis community) and static patterns (known as signatures in the antivirus community). Such patterns are often part of the analyst's skill set, and are sometimes implemented in manually created post-processing scripts. It would be desirable to find such behaviors automatically, present them to analysts, and create a systematic catalog of matching rules and relevant implementations. We propose Jackdaw, a system that finds interesting dynamic patterns and ranks them to unveil potentially interesting behaviors. It then annotates them with static information, capturing the distinct implementations of each behavior across different malware families. Finally, Jackdaw associates semantic information with the behaviors, so as to create a descriptive summary that helps analysts query the catalog of behaviors by type. To do this, it leverages the dynamic information and indexed Web-based knowledge databases.
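To make the first step of this pipeline concrete, the sketch below mines recurring API-call subsequences from execution traces and ranks them by how many distinct samples exhibit them. It is only an illustration under stated assumptions, not Jackdaw's actual algorithm: the abstract does not specify how patterns are found or ranked, so fixed-length n-grams and a sample-count ranking are stand-ins chosen for simplicity. The function name `mine_candidate_behaviors`, its parameters, and the toy traces are hypothetical. The static annotation and semantic tagging steps are not covered.

```python
from collections import defaultdict


def mine_candidate_behaviors(traces, n=3, min_samples=2):
    """Rank API-call n-grams by the number of distinct samples that contain them.

    traces: dict mapping a sample name to its ordered list of API call names.
    NOTE: n-gram mining is an assumption made for illustration; it is not
    taken from the paper.
    """
    samples_by_ngram = defaultdict(set)
    for sample, calls in traces.items():
        # Slide a window of length n over the trace.
        for i in range(len(calls) - n + 1):
            samples_by_ngram[tuple(calls[i:i + n])].add(sample)

    ranked = [
        (ngram, len(samples))
        for ngram, samples in samples_by_ngram.items()
        if len(samples) >= min_samples
    ]
    # Most widespread patterns first; ties are broken alphabetically.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Toy traces (hypothetical data, not from the paper's dataset).
traces = {
    "sample_a": ["CreateFileW", "WriteFile", "CloseHandle", "RegOpenKeyExW"],
    "sample_b": ["Sleep", "CreateFileW", "WriteFile", "CloseHandle"],
    "sample_c": ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
}
candidates = mine_candidate_behaviors(traces)
```

With these toy traces, only the file-writing sequence shared by `sample_a` and `sample_b` survives the `min_samples` threshold. A real system would need a more robust notion of pattern than contiguous n-grams, since unrelated calls can be interleaved between the calls that make up a behavior.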

We implement and demonstrate Jackdaw on the Win32 API, although the technique can be generalized to any OS. On a dataset of 2,136 distinct binaries, including both malicious and benign libraries and executables, we compared the automatically extracted behaviors against a ground truth of 44 behaviors created manually by expert analysts. Jackdaw found 77.3% of them and was able to exclude spurious behaviors in 99.6% of cases. We also discovered 466 novel behaviors, among which manual exploration and review by expert reverse engineers revealed interesting findings and confirmed the correctness of the semantic tagging.


Keywords: Reverse Engineering, Control Flow Graph, Reverse Engineer, Candidate Behavior, Virtual Machine Introspection



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Mario Polino (1)
  • Andrea Scorti (1)
  • Federico Maggi (1)
  • Stefano Zanero (1)
  1. DEIB, Politecnico di Milano, Milan, Italy
