Malware Detection with Limited Supervised Information via Contrastive Learning on API Call Sequences

  • Conference paper
Information and Communications Security (ICICS 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13407)

Abstract

Malware is software capable of causing damage to computer systems. Conventional malware detection methods require either feature engineering to extract specific features or a large amount of labeled data to train an end-to-end deep learning model. Both feature engineering and labeling are laborious. In this paper, we propose SCLMD, a semi-supervised contrastive learning method for malware detection based on API call sequences with limited label information. Specifically, a heterogeneous graph is constructed from API behavior to express the rich relationships among labeled and unlabeled software. After extracting the structural and sequential features of software with two encoders, we adopt cross-view contrastive learning to obtain shared and consistent features of the software. A hybrid positive selection strategy, guided by the limited label information, selects the positive pairs for contrastive learning. Experimental results on two real-world datasets show that SCLMD outperforms the baseline methods, especially when the supervised information is limited.
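The cross-view objective summarized above can be illustrated with a small sketch. It assumes an InfoNCE-style loss in which each sample's embedding in one view is pulled toward its selected positives in the other view. The function names, the cosine similarity, and the way the temperature `tau` and balance coefficient `lam` enter the loss are illustrative assumptions, not the formulation confirmed by the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_direction_loss(anchor_view, other_view, positives, tau):
    """Mean over anchors of -log(positive mass / total mass).

    positives[i] lists the indices in other_view treated as positives
    for anchor i; each list is assumed to be non-empty.
    """
    total = 0.0
    for i, z in enumerate(anchor_view):
        scores = [math.exp(cosine(z, w) / tau) for w in other_view]
        pos_mass = sum(scores[j] for j in positives[i])
        total += -math.log(pos_mass / sum(scores))
    return total / len(anchor_view)


def cross_view_loss(struct_view, seq_view, positives, tau=0.5, lam=0.5):
    """Assumed weighting: lam balances the two contrast directions."""
    s2q = one_direction_loss(struct_view, seq_view, positives, tau)
    q2s = one_direction_loss(seq_view, struct_view, positives, tau)
    return lam * s2q + (1.0 - lam) * q2s


# Toy embeddings: two samples whose views agree perfectly.
struct_view = [[1.0, 0.0], [0.0, 1.0]]
seq_view = [[1.0, 0.0], [0.0, 1.0]]

aligned = cross_view_loss(struct_view, seq_view, [[0], [1]])
swapped = cross_view_loss(struct_view, seq_view, [[1], [0]])
```

In this toy case the loss is lower when each sample's positive is its own counterpart in the other view (`aligned`) than when the positives are swapped (`swapped`), which is the behavior a contrastive objective is meant to reward.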

Notes

  1. https://tianchi.aliyun.com/competition/entrance/231694/information?lang=en-us.

  2. https://tianchi.aliyun.com/competition/entrance/231668/information?lang=en-us.

  3. https://github.com/Noctilux-M/SCLMD.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 62002219, 62172278), the Shanghai Sailing Program (No. 19YF1424700), and the Startup Fund for Youngman Research at SJTU (SFYR at SJTU).

Author information

Corresponding author

Correspondence to Peng Wu.

Appendices

Appendix

A Implementation Details

For the proposed SCLMD, we use Glorot initialization and the Adam optimizer. With a training ratio of 70%, we manually tune the hyperparameters: the learning rate is set to 0.01, the temperature parameter \(\tau \) to 0.5, the number of attention heads H to 3, the number k of positive pairs for each sample to 32, and the balance coefficient \(\lambda \) to 0.5. The maximum length of the API call sequence is set to 6000.
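To make the role of k concrete, the sketch below shows one plausible way to pick k positives per sample when only some samples are labeled. The rule used here is an assumption for illustration only: labeled anchors prefer same-label samples and fall back to unlabeled ones, while unlabeled anchors take their top-k most similar samples. The paper's actual hybrid strategy may differ, and `select_positives` and its arguments are hypothetical names.

```python
def select_positives(similarity, labels, k):
    """Return, for each sample, the indices of its k chosen positives.

    similarity: n x n list of pairwise similarity scores.
    labels: list with a class label per sample, or None when unlabeled.
    """
    n = len(similarity)
    positives = []
    for i in range(n):
        # Candidates ordered from most to least similar to sample i.
        ranked = sorted(range(n), key=lambda j: similarity[i][j], reverse=True)
        if labels[i] is not None:
            # Labeled anchor (assumed rule): same-label samples first,
            # then unlabeled ones; differently labeled samples are excluded.
            same = [j for j in ranked if labels[j] == labels[i]]
            unlabeled = [j for j in ranked if labels[j] is None]
            chosen = (same + unlabeled)[:k]
        else:
            # Unlabeled anchor (assumed rule): plain top-k by similarity.
            chosen = ranked[:k]
        positives.append(chosen)
    return positives


# Toy example with four samples; sample 2 is unlabeled.
similarity = [
    [1.0, 0.9, 0.2, 0.8],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.7],
    [0.8, 0.1, 0.7, 1.0],
]
labels = [1, 0, None, 1]
chosen = select_positives(similarity, labels, k=2)
```

Sample 0 is most similar to sample 1, but their labels differ, so the label information overrides the similarity ranking and sample 3 is chosen instead. The setting above uses k = 32; the toy example uses k = 2 only to keep the data small.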

For all methods, we set the input dimension to 128, the hidden dimension to 60, and the representation dimension to 64. The source code of SCLMD is publicly available on GitHub (see Note 3).
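The settings reported in this appendix can be gathered into a single configuration object, as in the sketch below. The class name, field names, and the `truncate_sequence` helper are hypothetical and are not taken from the released source code. Keeping the first 6000 calls when truncating is also an assumption, since the appendix states only the maximum length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SCLMDConfig:
    """Hyperparameter values as reported in the appendix (names are illustrative)."""
    train_ratio: float = 0.7
    learning_rate: float = 0.01
    temperature: float = 0.5          # tau
    num_attention_heads: int = 3      # H
    num_positives: int = 32           # k
    balance_coefficient: float = 0.5  # lambda
    max_sequence_length: int = 6000
    input_dim: int = 128
    hidden_dim: int = 60
    representation_dim: int = 64


def truncate_sequence(api_calls, config):
    """Keep at most max_sequence_length calls (assumed: from the start)."""
    return list(api_calls[:config.max_sequence_length])


config = SCLMDConfig()
long_trace = ["NtCreateFile"] * 7000  # placeholder API call trace
short_trace = truncate_sequence(long_trace, config)
```

Freezing the dataclass keeps the reported values from being changed by accident during an experiment; a variant run can construct a new instance, for example `SCLMDConfig(num_positives=16)`.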

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Gao, M., Wu, P., Pan, L. (2022). Malware Detection with Limited Supervised Information via Contrastive Learning on API Call Sequences. In: Alcaraz, C., Chen, L., Li, S., Samarati, P. (eds) Information and Communications Security. ICICS 2022. Lecture Notes in Computer Science, vol 13407. Springer, Cham. https://doi.org/10.1007/978-3-031-15777-6_27

  • DOI: https://doi.org/10.1007/978-3-031-15777-6_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15776-9

  • Online ISBN: 978-3-031-15777-6

  • eBook Packages: Computer Science, Computer Science (R0)