Abstract
Training machine learning (ML) models on synthetic data has attracted significant attention from researchers as a way to balance privacy protection against data utilization. This paper introduces a novel approach to generating synthetic data that addresses this challenge. Unlike existing methods, which aim to replicate real data distributions closely, our approach generates synthetic data without directly using real data, while still enabling ML models trained on it to extract specific pieces of information. It does so by leveraging only general knowledge about the problem domain, acquired without accessing real data. We applied this approach to the task of malware detection and conducted experiments to evaluate its effectiveness. The results not only validated the efficacy of the approach but also led to a significant discovery in the field of malware detection.
Acknowledgment
This work was supported by JSPS KAKENHI Grant Number JP21H05052.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, C., Maeda, K., Takai, J., Murota, K., Shin, K. (2024). Synthetic Data Generation Without Real Data: Uncovering Insights in Malware Detection. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 920. Springer, Cham. https://doi.org/10.1007/978-3-031-53963-3_17
DOI: https://doi.org/10.1007/978-3-031-53963-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53962-6
Online ISBN: 978-3-031-53963-3
eBook Packages: Intelligent Technologies and Robotics (R0)