RF-AdaCost: WebShell Detection Method that Combines Statistical Features and Opcode

Kang, Wenzhuang; Zhong, Shangping; Chen, Kaizhi; Lai, Jianhua; Xu, Guangquan

doi:10.1007/978-981-15-9739-8_49

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1286)

Included in the following conference series:

International Conference on Frontiers in Cyber Security

Abstract

A WebShell is a webpage backdoor. After hackers compromise a website, they usually mix backdoor files with normal webpage files in the web directory of the server. They then access the backdoor through a browser to obtain a command execution environment and control the website server. WebShell detection is demanding because the PHP language is flexible and hackers use a growing number of concealment techniques. The term frequency–inverse document frequency (TF-IDF) weighting used in the existing random forest–gradient boosting decision tree (RF-GBDT) algorithm does not consider how feature words are distributed among classes or how well they discriminate between classes, and the algorithm does not balance the false negative and false positive rates. This work proposes RF-AdaCost (random forest–misclassification cost-sensitive AdaBoost), a PHP WebShell detection model based on RF-GBDT. We use statistical characteristics of PHP source files, including information entropy and index of coincidence, and extract the opcode sequences of PHP source files; merging the statistical features with the opcode sequences improves the detection efficiency of the WebShell detector. Experimental results show that RF-AdaCost performs better than RF-GBDT.
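The abstract names information entropy and index of coincidence as statistical features but does not give their formulas. The following is a minimal sketch, assuming byte-level Shannon entropy and the standard (unnormalized) index of coincidence computed over byte values. The function names and the sample string are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def information_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte.

    Assumption: the feature is computed over raw bytes of the source file.
    """
    n = len(data)
    if n == 0:
        return 0.0
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal.

    Assumption: the unnormalized form sum(c * (c - 1)) / (n * (n - 1)).
    """
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    # Hypothetical sample; real input would be the bytes of a PHP source file.
    sample = b"<?php echo 'hello'; ?>"
    print(information_entropy(sample), index_of_coincidence(sample))
```

Under these definitions, heavily encoded or encrypted content tends toward higher entropy and a lower index of coincidence than ordinary source text, which is why such statistics are commonly used as obfuscation indicators.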

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant 61972187, the Scientific Research Project of Science and Education Park Development Center of Fuzhou University, Jinjiang under Grant 2019-JJFDKY-53 and the Tianjin University-Fuzhou University Joint Fund under Grant TF2020-6.

Author information

Authors and Affiliations

College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350100, Fujian, China
Wenzhuang Kang, Shangping Zhong & Kaizhi Chen
Fujian Institute of Scientific and Technological Information, Fuzhou, 350003, Fujian, China
Jianhua Lai
Tianjin Key Laboratory of Advanced Networking (TANK), College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China
Guangquan Xu

Corresponding author

Correspondence to Wenzhuang Kang.

Editor information

Editors and Affiliations

Tianjin University, Tianjin, China
Guangquan Xu
Delft University of Technology, Delft, The Netherlands
Kaitai Liang
University of Aizu, Aizuwakamatsu, Japan
Chunhua Su

About this paper

Cite this paper

Kang, W., Zhong, S., Chen, K., Lai, J., Xu, G. (2020). RF-AdaCost: WebShell Detection Method that Combines Statistical Features and Opcode. In: Xu, G., Liang, K., Su, C. (eds) Frontiers in Cyber Security. FCS 2020. Communications in Computer and Information Science, vol 1286. Springer, Singapore. https://doi.org/10.1007/978-981-15-9739-8_49

DOI: https://doi.org/10.1007/978-981-15-9739-8_49
Published: 04 November 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9738-1
Online ISBN: 978-981-15-9739-8
eBook Packages: Computer Science, Computer Science (R0)
