An Adaptable Infrastructure to Generate Training Datasets for Decompilation Issues

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 276)


The conventional approach to decompilation combines heuristics with pattern matching. It depends on the processor architecture, the code generation templates used by the compiler, and the optimization level. Moreover, in some scenarios heuristics and pattern matching fail to infer high-level information such as the return type of a function. Since AI has previously been applied in similar scenarios, we have designed an adaptable infrastructure that facilitates the use of AI techniques to overcome these decompilation issues. The infrastructure automatically generates training datasets. Its architecture follows the Pipes and Filters architectural pattern, which makes it easier both to adapt the infrastructure to different kinds of decompilation scenarios and to parallelize the implementation. The generated datasets can be processed by any AI engine, which trains a predictive model that is then added to the decompiler as a plug-in.
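The Pipes and Filters arrangement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the sample data, the filter names (`drop_empty`, `extract_features`), and the two features are hypothetical, chosen only to show how independent filters can be chained to turn labeled binary code into a training dataset for return-type inference.

```python
import csv
import io
from functools import reduce
from typing import Callable, Iterable, Iterator

# A filter consumes a stream of items and yields a transformed stream.
Filter = Callable[[Iterable], Iterator]


def pipeline(source: Iterable, *filters: Filter) -> Iterator:
    """Connect filters with pipes: the output of each filter feeds the next."""
    return reduce(lambda stream, f: f(stream), filters, iter(source))


# Hypothetical samples: (function name, disassembly lines, known return type).
SAMPLES = [
    ("get_count", ["mov eax, [ebp+8]", "ret"], "int"),
    ("get_ratio", ["fld qword [ebp+8]", "ret"], "double"),
    ("reset", ["xor ecx, ecx", "ret"], "void"),
    ("stub", [], "void"),
]


def drop_empty(stream):
    """Filter: discard functions for which no instructions were recovered."""
    for name, asm, label in stream:
        if asm:
            yield name, asm, label


def extract_features(stream):
    """Filter: map each function to a feature row (crude, illustrative features)."""
    for name, asm, label in stream:
        text = " ".join(asm)
        yield {
            "function": name,
            "writes_eax": int("eax" in text),
            "uses_fpu": int("fld" in text),
            "return_type": label,
        }


def to_csv(rows) -> str:
    """Sink: serialize feature rows as CSV text that an AI engine could load."""
    rows = list(rows)
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


dataset = to_csv(pipeline(SAMPLES, drop_empty, extract_features))
print(dataset)
```

Because each filter only reads from and writes to a stream, a filter can be replaced to target a different decompilation scenario without touching the others, and independent inputs could be split across workers, which is the adaptability and parallelization benefit the abstract attributes to the pattern.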


Keywords: decompilation, automatic pattern extraction, automatic dataset generation





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Computer Science Department, University of Oviedo, Oviedo, Spain
