Abstract
With the rapid development of the Internet, a growing number of text steganography methods have emerged. These methods are easily abused on public networks for malicious purposes, which poses a serious threat to cyberspace security. Many text steganalysis methods have been proposed to counter text steganography, but existing methods typically assume a balanced class distribution. In reality, stego texts are far fewer than cover texts, so accurately detecting stego texts among massive volumes of text remains a challenge. In this paper, we propose a text steganalysis method for imbalanced scenarios based on under-sampling and ensemble learning. Specifically, we use clustering to under-sample the majority class (cover texts) according to the detection difficulty of each sample, so as to select informative samples. Ensemble learning is then used to combine the detection results of multiple base classifiers and to guide the sampling process. We designed several experiments to evaluate the detection performance of the proposed model. Experimental results show that the proposed model effectively compensates for the deficiencies of existing methods: even on highly imbalanced datasets, it can still detect stego texts effectively.
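The sampling-and-ensemble loop described above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: texts are taken to be already encoded as numeric feature vectors, the base classifier is a simple nearest-centroid scorer, "detection difficulty" is the current ensemble's stego score for a cover sample, and the clustering step is approximated by equal-width difficulty bins. All function names and parameters are hypothetical.

```python
import math
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_base(cover, stego):
    """Hypothetical base classifier: nearest-centroid scorer.
    Returns f(x) in [0, 1]; higher means more stego-like."""
    c0, c1 = centroid(cover), centroid(stego)
    def score(x):
        d0, d1 = math.dist(x, c0), math.dist(x, c1)
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)
    return score

def ensemble_score(models, x):
    """Average the scores of all base classifiers trained so far."""
    return sum(m(x) for m in models) / len(models)

def undersample_by_difficulty(cover, difficulty, n_keep, n_bins, rng):
    """Assumed stand-in for clustering: group covers into equal-width
    difficulty bins and draw roughly evenly from each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for x, d in zip(cover, difficulty):
        bins[min(int(d * n_bins), n_bins - 1)].append(x)
    non_empty = [b for b in bins if b]
    per_bin = max(1, n_keep // len(non_empty))
    picked = []
    for b in non_empty:
        picked.extend(rng.sample(b, min(per_bin, len(b))))
    return picked

def fit_ensemble(cover, stego, n_models=5, n_bins=4, seed=0):
    """Train base classifiers one by one; each new one sees all stego
    samples plus covers chosen using the current ensemble's output."""
    rng = random.Random(seed)
    # The first classifier uses plain random under-sampling.
    first = rng.sample(cover, min(len(stego), len(cover)))
    models = [fit_base(first, stego)]
    for _ in range(n_models - 1):
        # A cover the ensemble scores as stego-like is "hard" to detect.
        difficulty = [ensemble_score(models, x) for x in cover]
        subset = undersample_by_difficulty(
            cover, difficulty, len(stego), n_bins, rng)
        models.append(fit_base(subset, stego))
    return models

def predict(models, x, threshold=0.5):
    """Return 1 (stego) or 0 (cover) from the averaged ensemble score."""
    return int(ensemble_score(models, x) >= threshold)

# Toy imbalanced data (20:1), purely synthetic and not from the paper.
data_rng = random.Random(1)
cover = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)]
         for _ in range(500)]
stego = [[data_rng.gauss(3.0, 1.0), data_rng.gauss(3.0, 1.0)]
         for _ in range(25)]
models = fit_ensemble(cover, stego)
```

Calling `predict(models, x)` on a new feature vector returns the ensemble decision. In a realistic setting the centroid scorer would be replaced by a neural text classifier and the bins by a proper clustering step; the sketch only shows how the ensemble's output can feed back into the choice of cover samples.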
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFC3303301 and in part by the National Natural Science Foundation of China under Grant 62172053 and Grant 62302059.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Guo, S., Chen, X., Wang, Z., Yang, Z., Zhou, L. (2024). Linguistic Steganalysis Based on Clustering and Ensemble Learning in Imbalanced Scenario. In: Ma, B., Li, J., Li, Q. (eds) Digital Forensics and Watermarking. IWDW 2023. Lecture Notes in Computer Science, vol 14511. Springer, Singapore. https://doi.org/10.1007/978-981-97-2585-4_22
DOI: https://doi.org/10.1007/978-981-97-2585-4_22
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2584-7
Online ISBN: 978-981-97-2585-4
eBook Packages: Computer Science, Computer Science (R0)