Skip to main content

Tree2tree Structural Language Modeling for Compiler Fuzzing

  • Conference paper
  • First Online:
Algorithms and Architectures for Parallel Processing (ICA3PP 2020)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12452))

Abstract

Compiler fuzzing requires well-formed test cases. Only syntactically correct programs can pass the parsing stage of a compiler. Recently, advanced compiler fuzzers produce test cases by learning a generative language model of regular programs. They treat programs as natural language texts without leveraging any syntactic structure, making them hard to produce syntactically correct programs when programs get long. In this paper, we propose a novel tree-to-tree (tree2tree) model to leverage the structural information for a robust test case generation. We adopt an encoder-decoder architecture to map a partial abstract syntax tree (AST) to a complete AST. We introduce a C compiler fuzzing framework, named TSmith. Specifically, TSmith employs a tree-based encoder to encode the input partial AST to capture the hierarchical structure information. It then adopts a tree decoder to generate a complete AST by expanding the non-terminals in a top-down manner. Finally, the output ASTs are converted into corresponding test programs. Experimental results show that our generation strategies have a maximum parsing pass rate of 83%, which is about 21% higher than sequential models. Besides, TSmith significantly improves the code coverage of the compiler. Benefiting from the high pass rate and broad code coverage, TSmith has found 14 new bugs in GCC.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://gcc.gnu.org/onlinedocs/gcc/Gcov.html.

References

  1. Chakraborty, S., Allamanis, M., Ray, B.: Tree2tree neural translation model for learning source code changes. arXiv preprint arXiv:1810.00314 (2018)

  2. Chen, P., Chen, H.: Angora: efficient fuzzing by principled search. In: 2018 IEEE Symposium on Security and Privacy (SP), pp. 711–725. IEEE (2018)

    Google Scholar 

  3. Chen, X., Liu, C., Song, D.: Tree-to-tree neural networks for program translation. In: Advances in Neural Information Processing Systems, pp. 2547–2557 (2018)

    Google Scholar 

  4. Cummins, C., Petoumenos, P., Murray, A., Leather, H.: Compiler fuzzing through deep learning. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 95–105. ACM (2018)

    Google Scholar 

  5. Dong, L., Lapata, M.: Language to logical form with neural attention. arXiv preprint arXiv:1601.01280 (2016)

  6. Godefroid, P., Peleg, H., Singh, R.: Learn&fuzz: machine learning for input fuzzing. In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pp. 50–59. IEEE Press (2017)

    Google Scholar 

  7. Google: Honggfuzz (2016). https://github.com/google/honggfuzz

  8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

    Article  Google Scholar 

  9. Karpathy, A., Johnson, J., Fei-Fei, L.: Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078 (2015)

  10. Liu, X., Li, X., Prajapati, R., Wu, D.: Deepfuzz: automatic generation of syntax valid c programs for fuzz testing. In: Proceedings of the AAAI Conference on Artificial Intelligence (2019)

    Google Scholar 

  11. LLVM: libfuzzer: a library for coverage-guided fuzz testing (2017). https://llvm.org/docs/LibFuzzer.html

  12. Patra, J., Pradel, M.: Learning to fuzz: Application-independent fuzz testing with probabilistic, generative models of input data. TU Darmstadt, Department of Computer Science, Technical report, TUD-CS-2016-14664 (2016)

    Google Scholar 

  13. Rabinovich, M., Stern, M., Klein, D.: Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535 (2017)

  14. Rawat, S., Jain, V., Kumar, A., Cojocar, L., Giuffrida, C., Bos, H.: Vuzzer: application-aware evolutionary fuzzing. In: NDSS, vol. 17, pp. 1–14 (2017)

    Google Scholar 

  15. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015)

  16. Wang, J., Chen, B., Wei, L., Liu, Y.: Skyfire: data-driven seed generation for fuzzing. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 579–594. IEEE (2017)

    Google Scholar 

  17. Wei, H., Li, M.: Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: IJCAI, pp. 3034–3040 (2017)

    Google Scholar 

  18. White, M., Tufano, M., Vendome, C., Poshyvanyk, D.: Deep learning code fragments for code clone detection. In: 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 87–98. IEEE (2016)

    Google Scholar 

  19. Yang, X., Chen, Y., Eide, E., Regehr, J.: Finding and understanding bugs in c compilers. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 283–294 (2011)

    Google Scholar 

  20. Yin, P., Neubig, G.: A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696 (2017)

  21. Zalewski, M.: American fuzzy lop (2017). http://lcamtuf.coredump.cx/afl/

Download references

Acknowledgments

This work is supported by National Key Research and Development Program of China (No. 2018YFB0204301), and the National Natural Science Foundation of China (No. 61472439).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yongjun Wang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Xu, H., Fan, S., Wang, Y., Huang, Z., Xu, H., Xie, P. (2020). Tree2tree Structural Language Modeling for Compiler Fuzzing. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science(), vol 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_38

Download citation

Publish with us

Policies and ethics