RF-AdaCost: WebShell Detection Method that Combines Statistical Features and Opcode

  • Conference paper
Frontiers in Cyber Security (FCS 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1286)

Abstract

A WebShell is a webpage backdoor. After hackers invade a website, they usually mix backdoor files with normal webpage files in the web directory of the server. They then access the backdoor through a browser to obtain a command execution environment and control the website server. WebShell detection is demanding because of the flexibility of the PHP language and the growing number of concealment techniques used by hackers. The term frequency–inverse document frequency (TF-IDF) weighting used in the existing random forest–gradient boosting decision tree (RF-GBDT) algorithm does not consider how feature words are distributed among classes or how well they discriminate between classes, and the algorithm does not balance the false negative and false positive rates. This work proposes a PHP WebShell detection model based on RF-GBDT called RF-AdaCost, which stands for random forest–misclassification cost-sensitive AdaBoost. We use statistical characteristics of PHP source files, including information entropy and index of coincidence, and extract the opcode sequences of the same files, merging statistical features and opcode sequences to improve the efficiency of WebShell detection. Experimental results show that the RF-AdaCost algorithm performs better than the RF-GBDT algorithm.
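The two statistical features named in the abstract can be illustrated with a short sketch. The abstract does not give the paper's exact formulas, so this assumes the standard textbook definitions computed over raw bytes: Shannon entropy in bits per byte, and the index of coincidence as the probability that two distinct positions hold the same byte. The function names `information_entropy` and `index_of_coincidence` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def information_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte.

    Assumption: byte-level entropy; the paper may tokenise differently.
    """
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def index_of_coincidence(data: bytes) -> float:
    """Probability that two distinct positions hold the same byte.

    Assumption: the unnormalised form sum(c * (c - 1)) / (n * (n - 1)).
    """
    n = len(data)
    if n < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


if __name__ == "__main__":
    sample = b"<?php echo 'hello world'; ?>"
    print(information_entropy(sample), index_of_coincidence(sample))
```

Encoded or encrypted payloads tend to have a flatter byte distribution than hand-written PHP, which raises the entropy and lowers the index of coincidence; this is the usual motivation for such features, though the thresholds that matter in practice depend on the data set.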

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant 61972187, the Scientific Research Project of Science and Education Park Development Center of Fuzhou University, Jinjiang under Grant 2019-JJFDKY-53 and the Tianjin University-Fuzhou University Joint Fund under Grant TF2020-6.

Author information

Corresponding author

Correspondence to Wenzhuang Kang.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kang, W., Zhong, S., Chen, K., Lai, J., Xu, G. (2020). RF-AdaCost: WebShell Detection Method that Combines Statistical Features and Opcode. In: Xu, G., Liang, K., Su, C. (eds) Frontiers in Cyber Security. FCS 2020. Communications in Computer and Information Science, vol 1286. Springer, Singapore. https://doi.org/10.1007/978-981-15-9739-8_49

  • DOI: https://doi.org/10.1007/978-981-15-9739-8_49

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-9738-1

  • Online ISBN: 978-981-15-9739-8

  • eBook Packages: Computer Science, Computer Science (R0)