SparrowHawk: Memory Safety Flaw Detection via Data-Driven Source Code Annotation

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13007)

Abstract

Detecting code flaws in programs is a vital aspect of software maintenance and security. Classic code flaw detection techniques rely on program analysis to check whether the code logic violates certain predefined rules. In many cases, however, program analysis falls short of understanding the semantics of a function (e.g., the functionality of an API), and thus has difficulty judging whether the function and its related behaviors would lead to a security bug. In response, we propose an automated data-driven annotation strategy to enhance the understanding of function semantics during flaw detection. Our SparrowHawk source code analysis system uses a programming-language-aware text similarity comparison to efficiently annotate the attributes of functions. With the annotation results, SparrowHawk uses the Clang static analyzer to guide security analyses.

To evaluate SparrowHawk, we applied it to memory corruption detection, which relies on the annotation of customized memory allocation/release functions. The experimental results show that by introducing function annotation to the original source code analysis, SparrowHawk achieves more effective and efficient flaw detection, and successfully discovers 51 new memory corruption vulnerabilities in popular open source projects such as FFmpeg and the kernel of the OpenHarmony IoT operating system.
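As a rough illustration of annotating allocation/release functions by name similarity, the sketch below splits an identifier into subwords and compares each subword against a small keyword list. It is not the SparrowHawk model: the keyword lists, the threshold, and the helper names `split_identifier` and `annotate` are assumptions made for this example, and Python's `difflib.SequenceMatcher` stands in for whatever similarity measure the system actually uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword lists, chosen for illustration only.
KEYWORDS = {
    "alloc": ["malloc", "calloc", "alloc", "new", "create"],
    "release": ["free", "release", "destroy", "delete"],
}


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase subwords."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def similarity(a, b):
    """Character-level similarity in [0, 1] from difflib (a stand-in metric)."""
    return SequenceMatcher(None, a, b).ratio()


def annotate(name, threshold=0.8):
    """Return 'alloc', 'release', or None for a function name.

    The label of the best-matching keyword wins, provided its score
    reaches the (assumed) threshold.
    """
    best_label, best_score = None, 0.0
    for label, keywords in KEYWORDS.items():
        for word in split_identifier(name):
            for keyword in keywords:
                score = similarity(word, keyword)
                if score > best_score:
                    best_label, best_score = label, score
    return best_label if best_score >= threshold else None


if __name__ == "__main__":
    for fn in ["av_mallocz", "xmlFreeNode", "strlen"]:
        print(fn, "->", annotate(fn))
```

A checker could then treat functions labeled `alloc` or `release` like `malloc` or `free` when looking for use-after-free or double-free patterns. A name-only heuristic like this one will mislabel functions whose names are misleading, which is one reason a learned, data-driven comparison is attractive.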

Keywords

  • Objective function recognition
  • Programming language understanding
  • Neural network
  • Vulnerability discovery


Notes

  1. https://www.varonis.com/blog/cybersecurity-statistics/


Acknowledgment

We would like to thank the anonymous reviewers for their helpful comments. This work was partially supported by the National Natural Science Foundation of China (Grant No. U19B2023), the National Key Research and Development Program of China (Grant No. 2020AAA0107800), and the National Natural Science Foundation of China (Grant No. 62002222).

Author information

Correspondence to Yunlong Lyu or Juanru Li.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Lyu, Y., Gao, W., Ma, S., Sun, Q., Li, J. (2021). SparrowHawk: Memory Safety Flaw Detection via Data-Driven Source Code Annotation. In: Yu, Y., Yung, M. (eds) Information Security and Cryptology. Inscrypt 2021. Lecture Notes in Computer Science, vol 13007. Springer, Cham. https://doi.org/10.1007/978-3-030-88323-2_7

  • DOI: https://doi.org/10.1007/978-3-030-88323-2_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88322-5

  • Online ISBN: 978-3-030-88323-2

  • eBook Packages: Computer Science, Computer Science (R0)