Tree2tree Structural Language Modeling for Compiler Fuzzing

Xu, Haoran; Fan, Shuhui; Wang, Yongjun; Huang, Zhijian; Xu, Hongzuo; Xie, Peidai

doi:10.1007/978-3-030-60245-1_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12452)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Compiler fuzzing requires well-formed test cases, because only syntactically correct programs can pass the parsing stage of a compiler. Recent advanced compiler fuzzers produce test cases by learning a generative language model of regular programs. They treat programs as natural-language text without leveraging any syntactic structure, which makes it hard for them to produce syntactically correct programs as programs get long. In this paper, we propose a novel tree-to-tree (tree2tree) model that leverages structural information for robust test case generation. We adopt an encoder-decoder architecture to map a partial abstract syntax tree (AST) to a complete AST, and we introduce a C compiler fuzzing framework named TSmith. Specifically, TSmith employs a tree-based encoder to encode the input partial AST and capture its hierarchical structure. It then adopts a tree decoder to generate a complete AST by expanding the non-terminals in a top-down manner. Finally, the output ASTs are converted into the corresponding test programs. Experimental results show that our generation strategies achieve a maximum parsing pass rate of 83%, which is about 21% higher than that of sequential models. In addition, TSmith significantly improves the code coverage of the compiler. Benefiting from the high pass rate and broad code coverage, TSmith has found 14 new bugs in GCC.
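
To make the partial-AST-to-complete-AST idea concrete, the following is a minimal sketch of top-down non-terminal expansion followed by conversion to program text. It is not TSmith's model: the toy grammar, the `Node` class, and the functions `expand`, `count_open`, and `to_source` are all hypothetical names introduced for illustration, and a seeded random choice stands in for the learned tree decoder described in the abstract.

```python
import random
from dataclasses import dataclass, field

# Toy grammar (an assumption for illustration, not the C grammar used by TSmith).
# Each non-terminal maps to its alternative productions; the first alternative
# is always one that leads to termination.
GRAMMAR = {
    "stmt": [["assign"], ["if_stmt"]],
    "assign": [["ID", "=", "expr", ";"]],
    "if_stmt": [["if", "(", "expr", ")", "{", "stmt", "}"]],
    "expr": [["ID"], ["NUM"], ["expr", "+", "expr"]],
}
# Terminal placeholders and the concrete lexemes they may become.
LEXEMES = {"ID": ["x", "y"], "NUM": ["0", "1", "42"]}


@dataclass
class Node:
    symbol: str
    children: list = field(default_factory=list)


def is_open(node):
    """True for a non-terminal that has not been expanded yet."""
    return node.symbol in GRAMMAR and not node.children


def count_open(node):
    """Number of unexpanded non-terminals in the tree rooted at `node`."""
    return int(is_open(node)) + sum(count_open(c) for c in node.children)


def expand(node, rng, depth=0, max_depth=6):
    """Expand every open non-terminal below `node`, top-down, in place.

    A learned decoder would score the productions; here we pick one at random,
    and fall back to the first (terminating) production past `max_depth`.
    """
    if is_open(node):
        options = GRAMMAR[node.symbol]
        production = options[0] if depth >= max_depth else rng.choice(options)
        node.children = [
            Node(rng.choice(LEXEMES[s])) if s in LEXEMES else Node(s)
            for s in production
        ]
    for child in node.children:
        expand(child, rng, depth + 1, max_depth)
    return node


def to_source(node):
    """Convert an AST to program text by joining its leaves left to right."""
    if not node.children:
        return node.symbol
    return " ".join(to_source(c) for c in node.children)


# Partial AST: an `if` statement whose condition and body are still open.
partial = Node("stmt", [Node("if_stmt", [
    Node("if"), Node("("), Node("expr"), Node(")"),
    Node("{"), Node("stmt"), Node("}"),
])])
complete = expand(partial, random.Random(0))
print(to_source(complete))
```

Because every production comes from the grammar, any completed tree renders to a string that the same grammar accepts, which is the intuition behind the higher parsing pass rate claimed for structure-aware generation over plain sequence models.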

Notes

1. https://gcc.gnu.org/onlinedocs/gcc/Gcov.html

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2018YFB0204301) and the National Natural Science Foundation of China (No. 61472439).

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, China
Haoran Xu, Shuhui Fan, Yongjun Wang, Hongzuo Xu & Peidai Xie
National Key Laboratory of Science and Technology on Information System Security, Institute of System Engineering, Chinese Academy of Military Science, Beijing, China
Zhijian Huang

Corresponding author

Correspondence to Yongjun Wang.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Meikang Qiu

About this paper

Cite this paper

Xu, H., Fan, S., Wang, Y., Huang, Z., Xu, H., Xie, P. (2020). Tree2tree Structural Language Modeling for Compiler Fuzzing. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12452. Springer, Cham. https://doi.org/10.1007/978-3-030-60245-1_38

DOI: https://doi.org/10.1007/978-3-030-60245-1_38
Published: 29 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60244-4
Online ISBN: 978-3-030-60245-1
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)