Abstract
A WebShell is a web page backdoor. After compromising a website, attackers typically place backdoor files among the normal web page files in the site's web directory, then access the backdoor through a browser to obtain a command execution environment and control the web server. WebShell detection is demanding because of the flexibility of the PHP language and the growing number of concealment techniques used by attackers. The term frequency–inverse document frequency (TF-IDF) weighting used in the existing random forest–gradient boosting decision tree (RF-GBDT) algorithm does not consider how feature words are distributed among classes or how well they discriminate between classes, and the algorithm does not balance the false negative rate against the false positive rate. This work proposes RF-AdaCost, a PHP WebShell detection model based on RF-GBDT; the name stands for random forest–misclassification cost-sensitive AdaBoost. We use statistical characteristics of PHP source files, including information entropy and index of coincidence, and extract the opcode sequences of the PHP source files, merging the statistical features with the opcode sequences to improve the efficiency of WebShell detection. Experimental results show that the RF-AdaCost algorithm performs better than the RF-GBDT algorithm.
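To make the two statistical features named in the abstract concrete, the following is a minimal Python sketch of information entropy and index of coincidence computed over the raw bytes of a source file. The function names are hypothetical, and working at the byte level without normalization is an assumption; the paper may define these features over characters or with different scaling.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def index_of_coincidence(data: bytes) -> float:
    """Probability that two distinct positions hold the same byte value."""
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def statistical_features(source: bytes) -> list[float]:
    """Hypothetical helper: pack both features into one feature vector."""
    return [shannon_entropy(source), index_of_coincidence(source)]
```

Encoded or encrypted payloads tend to spread byte values more evenly than hand-written PHP, which would generally raise the entropy and lower the index of coincidence. In a pipeline like the one the abstract describes, such values would be combined with opcode-sequence features before classification.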
Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant 61972187, the Scientific Research Project of Science and Education Park Development Center of Fuzhou University, Jinjiang under Grant 2019-JJFDKY-53 and the Tianjin University-Fuzhou University Joint Fund under Grant TF2020-6.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kang, W., Zhong, S., Chen, K., Lai, J., Xu, G. (2020). RF-AdaCost: WebShell Detection Method that Combines Statistical Features and Opcode. In: Xu, G., Liang, K., Su, C. (eds) Frontiers in Cyber Security. FCS 2020. Communications in Computer and Information Science, vol 1286. Springer, Singapore. https://doi.org/10.1007/978-981-15-9739-8_49
DOI: https://doi.org/10.1007/978-981-15-9739-8_49
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9738-1
Online ISBN: 978-981-15-9739-8
eBook Packages: Computer Science (R0)