Abstract
The Unix Shell is a powerful tool for system developers and engineers, but it poses serious security risks when cybercriminals use it to execute malicious scripts. These scripts can compromise servers, steal confidential data, or crash systems, so detecting and preventing them is an important task for intrusion detection systems. In this paper, we propose SIFAST, a novel framework for embedding and detecting malicious Unix Shell scripts. The framework combines Smooth Inverse Frequency (SIF) and Abstract Syntax Tree (AST) techniques to rapidly convert Unix Shell commands and scripts into vectors that capture their semantic and syntactic features. These vectors can then be fed to various downstream machine learning models for classification or anomaly detection. In comparisons against other embedding methods across multiple downstream detection models, we demonstrate that SIFAST significantly improves both accuracy and efficiency. We also provide a labeled dataset of normal and abnormal Unix commands and scripts, collected from various open-source data. We hope this work contributes to the field of intrusion detection systems by offering a solution for identifying malicious scripts in the Unix Shell.
References
Different linux Commands and Utilities Commonly Used by Attackers. https://www.uptycs.com/blog/linux-commands-and-utilities-commonly-used-by-attackers
Evasive techniques used by malicious shell scripts on different unix systems. https://www.uptycs.com/blog/evasive-techniques-used-by-malicious-linux-shell-scripts
Tree-sitter Using Parsers. https://tree-sitter.github.io/tree-sitter/using-parsers
What Is a Reverse Shell | Examples & Prevention Techniques | Imperva
GTFOBins (2022). https://gtfobins.github.io/
Living Off the Land: How to Defend Against Malicious Use of Legitimate Utilities (2022). https://threatpost.com/living-off-the-land-malicious-use-legitimate-utilities/177762/
Al-Janabi, M., Altamimi, A.M.: A comparative analysis of machine learning techniques for classification and detection of Malware. In: 2020 21st International Arab Conference on Information Technology (ACIT), pp. 1–9 (2020). https://doi.org/10.1109/ACIT50332.2020.9300081
Alahmadi, A., Alkhraan, N., BinSaeedan, W.: MPSAutodetect: a malicious powershell script detection model based on stacked denoising auto-encoder. Comput. Secur. 116, 102658 (2022). https://doi.org/10.1016/j.cose.2022.102658
Andrew, Y., Lim, C., Budiarto, E.: Mapping Linux shell commands to MITRE ATT&CK using NLP-based approach. In: 2022 International Conference on Electrical Engineering and Informatics (ICELTICs), pp. 37–42 (2022). https://doi.org/10.1109/ICELTICs56128.2022.9932097
Boffa, M., Milan, G., Vassio, L., Drago, I., Mellia, M., Ben Houidi, Z.: Towards NLP-based processing of honeypot logs. In: 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pp. 314–321 (2022). https://doi.org/10.1109/EuroSPW55150.2022.00038
Bohannon, D., Holmes, L.: Revoke-Obfuscation: PowerShell Obfuscation Detection Using Science (2017)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information (2017)
Chai, H., Ying, L., Duan, H., Zha, D.: Invoke-Deobfuscation: AST-based and semantics-preserving deobfuscation for powershell scripts. In: 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 295–306 (2022). https://doi.org/10.1109/DSN53405.2022.00039
Elmasry, W., Akbulut, A., Zaim, A.H.: Deep learning approaches for predictive masquerade detection. Secur. Commun. Netw. 2018, e9327215 (2018). https://doi.org/10.1155/2018/9327215
Fang, Y., Huang, C., Zeng, M., Zhao, Z., Huang, C.: JStrong: malicious JavaScript detection based on code semantic representation and graph neural network. Comput. Secur. 118, 102715 (2022). https://doi.org/10.1016/j.cose.2022.102715
Fang, Y., Zhou, X., Huang, C.: Effective method for detecting malicious PowerShell scripts based on hybrid features. Neurocomputing 448, 30–39 (2021). https://doi.org/10.1016/j.neucom.2021.03.117
Feng, Z., et al.: CodeBERT: a pre-trained model for programming and natural languages (2020). https://doi.org/10.48550/arXiv.2002.08155
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.emnlp-main.552
Goudie, M.: The Rise of “Living off the Land” Attacks | CrowdStrike (2019). https://www.crowdstrike.com/blog/going-beyond-malware-the-rise-of-living-off-the-land-attacks/
Hendler, D., Kels, S., Rubin, A.: Detecting malicious powershell commands using deep neural networks. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 187–197. ASIACCS ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3196494.3196511
Hendler, D., Kels, S., Rubin, A.: AMSI-based detection of malicious powershell code using contextual embeddings. In: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 679–693. ASIA CCS ’20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3320269.3384742
Hussain, Z., Nurminen, J., Mikkonen, T., Kowiel, M.: Command Similarity Measurement Using NLP (2021). https://doi.org/10.4230/OASIcs.SLATE.2021.13
Kidwai, A., et al.: A comparative study on shells in Linux: a review. Mater. Today Proc. 37, 2612–2616 (2021). https://doi.org/10.1016/j.matpr.2020.08.508
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on International Conference on Machine Learning, vol. 32, pp. II-1188-II-1196. ICML’14, JMLR.org, Beijing, China (2014)
Lin, X.V., Wang, C., Zettlemoyer, L., Ernst, M.D.: NL2Bash: a corpus and semantic parser for natural language interface to the Linux operating system (2018). arXiv:1802.08979 [cs]
Liu, C., et al.: Code execution with pre-trained language models (2023). https://doi.org/10.48550/arXiv.2305.05383
Liu, W., Mao, Y., Ci, L., Zhang, F.: A new approach of user-level intrusion detection with command sequence-to-sequence model. J. Intell. Fuzzy Syst. 38(5), 5707–5716 (2020). https://doi.org/10.3233/JIFS-179659
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv:1301.3781 [cs]
Mimura, M., Tajiri, Y.: Static detection of malicious PowerShell based on word embeddings. Internet Things 15, 100404 (2021). https://doi.org/10.1016/j.iot.2021.100404
Ongun, T., et al.: Living-off-the-land command detection using active learning. In: Proceedings of the 24th International Symposium on Research in Attacks, Intrusions and Defenses, pp. 442–455. RAID ’21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3471621.3471858
Peng, H., Mou, L., Li, G., Liu, Y., Zhang, L., Jin, Z.: Building program vector representations for deep learning. In: Zhang, S., Wirsing, M., Zhang, Z. (eds.) KSEM 2015. LNCS (LNAI), vol. 9403, pp. 547–553. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25159-2_49
Rathore, H., Agarwal, S., Sahay, S.K., Sewak, M.: Malware detection using machine learning and deep learning. In: Mondal, A., Gupta, H., Srivastava, J., Reddy, P.K., Somayajulu, D.V.L.N. (eds.) BDA 2018. LNCS, vol. 11297, pp. 402–411. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04780-1_28
Rebootuser: LinEnum (2023)
Rousseau, A.: Hijacking.NET to Defend PowerShell (2017). https://doi.org/10.48550/arXiv.1709.07508
Song, J., Kim, J., Choi, S., Kim, J., Kim, I.: Evaluations of AI-based malicious PowerShell detection with feature optimizations. ETRI J. 43(3), 549–560 (2021). https://doi.org/10.4218/etrij.2020-0215
Swissky: Payloads All The Things (2023)
Trizna, D.: Shell language processing: Unix command parsing for machine learning (2021). arXiv:2107.02438 [cs]
Tsai, M.H., Lin, C.C., He, Z.G., Yang, W.C., Lei, C.L.: PowerDP: de-obfuscating and profiling malicious PowerShell commands with multi-label classifiers. IEEE Access 11, 256–270 (2023). https://doi.org/10.1109/ACCESS.2022.3232505
Zhai, H., et al.: Masquerade detection based on temporal convolutional network. In: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 305–310 (2022). https://doi.org/10.1109/CSCWD54268.2022.9776088
Acknowledgements
This work is supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDC02030400) and the Scaling Program of the Institute of Information Engineering, CAS (Grant Nos. E3Z0041101 and E3Z0191101).
Appendix
- TF-IDF: The most basic sentence vector generation method in natural language processing, and the sentence embedding method mentioned in Trizna. It first generates a TF-IDF representation of each word in a sentence, and then sums these word representations to form a TF-IDF representation of the whole sentence.
- Doc2Vec [25]: A method that trains a sentence vector jointly with the words in the sentence, so that the sentence vector is produced directly. It is similar to Word2Vec, but adds sentence vectors to the joint training.
- MPSAutodetect [9]: A deep learning framework for detecting malicious PowerShell scripts. It uses a character-based embedding, feeds the result into a denoising autoencoder to extract features, and finally passes those features to a classifier.
- SimCSE [19]: An advanced pretrained sentence embedding model based on contrastive learning. We employ the unsupervised learning part of SimCSE to learn code representations for shell scripts. Although the powerful hardware SimCSE requires makes it impractical to embed in Unix systems, we still include it as a comparison baseline to illustrate the gap between our model and conventional deep learning models.
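To make the TF-IDF baseline concrete, the following is a minimal sketch of a TF-IDF sentence vector for shell commands, not the implementation evaluated in the paper. The whitespace tokenizer, the smoothed IDF formula, the helper names (`tokenize`, `fit_idf`, `tfidf_vector`), and the four sample commands are all assumptions introduced for illustration.

```python
import math
from collections import Counter


def tokenize(command):
    # Hypothetical tokenizer: a plain whitespace split.
    # The tokenization used in the paper is not shown here.
    return command.split()


def fit_idf(corpus):
    # Count in how many commands each token appears at least once.
    n_docs = len(corpus)
    doc_freq = Counter()
    for command in corpus:
        doc_freq.update(set(tokenize(command)))
    # Smoothed IDF (an assumed variant; other definitions exist).
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def tfidf_vector(command, idf, vocab):
    # Sum the per-token TF-IDF contributions into one vector
    # with one dimension per vocabulary token.
    counts = Counter(tokenize(command))
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    if total == 0:
        return vec
    for tok, count in counts.items():
        if tok in vocab:  # tokens unseen during fitting are ignored
            vec[vocab[tok]] += (count / total) * idf[tok]
    return vec


# Illustrative commands only; they are not taken from the paper's dataset.
corpus = [
    "ls -la /tmp",
    "cat /etc/passwd",
    "curl http://example.com/x.sh | sh",
    "ls /etc",
]
idf = fit_idf(corpus)
vocab = {tok: i for i, tok in enumerate(sorted(idf))}
vectors = [tfidf_vector(c, idf, vocab) for c in corpus]
```

Each resulting vector has one dimension per vocabulary token, so its size grows with the vocabulary and it carries no information about token order or syntax. Those are the gaps that dense and structure-aware embeddings aim to close.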
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, S. et al. (2023). SIFAST: An Efficient Unix Shell Embedding Framework for Malicious Detection. In: Athanasopoulos, E., Mennink, B. (eds) Information Security. ISC 2023. Lecture Notes in Computer Science, vol 14411. Springer, Cham. https://doi.org/10.1007/978-3-031-49187-0_4
DOI: https://doi.org/10.1007/978-3-031-49187-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49186-3
Online ISBN: 978-3-031-49187-0
eBook Packages: Computer Science, Computer Science (R0)