Synthetic Data Generation Without Real Data: Uncovering Insights in Malware Detection

Liu, Chris; Maeda, Katsuyuki; Takai, Junnosuke; Murota, Keisuke; Shin, Kilho

doi:10.1007/978-3-031-53963-3_17

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 920)

Included in the following conference series:

Future of Information and Communication Conference

Abstract

The use of synthetic data for training machine learning (ML) models has garnered significant attention among researchers as a potential solution to the challenge of balancing privacy protection and data utilization. This paper introduces a novel approach for generating synthetic data that specifically addresses this challenge. Unlike existing methods that focus on closely replicating real data distributions, our proposed approach aims to generate synthetic data without directly using real data, while still enabling the training of ML models to extract specific bits of information. This can be achieved by leveraging only general knowledge about the problem domain, acquired without accessing real data. We applied this approach to the task of malware detection and conducted experiments to evaluate its effectiveness. The results not only validated the efficacy of our proposed approach but also led to a significant discovery in the field of malware detection.
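The general idea in the abstract, training on data drawn purely from domain knowledge, can be illustrated with a minimal sketch. This is not the authors' method: the single feature (jitter between network requests), the numeric ranges, and the function names `sample_synthetic` and `fit_threshold` are all hypothetical assumptions chosen only to make the example runnable.

```python
import random


def sample_synthetic(n, seed=0):
    """Draw labelled samples from hand-written rules; no real data is read.

    Hypothetical domain assumption (not from the paper): malware-like
    traffic contacts its server at more regular intervals, i.e. with
    lower jitter, than benign traffic.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_malware = rng.random() < 0.5
        # Assumed jitter ranges in seconds; purely illustrative numbers.
        jitter = rng.uniform(0.0, 2.0) if is_malware else rng.uniform(1.0, 10.0)
        rows.append((jitter, is_malware))
    return rows


def fit_threshold(rows):
    """Pick the jitter cut-off that maximises accuracy on the given rows."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in rows:
        acc = sum((j <= t) == y for j, y in rows) / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Train on one synthetic sample, then check on an independent one.
train = sample_synthetic(500, seed=1)
threshold, train_acc = fit_threshold(train)
test = sample_synthetic(500, seed=2)
test_acc = sum((j <= threshold) == y for j, y in test) / len(test)
print(f"threshold={threshold:.2f} train_acc={train_acc:.3f} test_acc={test_acc:.3f}")
```

In this toy setting the held-out set is itself synthetic, so the accuracy only shows that the classifier recovers the assumed rule. Whether such a model transfers to real traffic is the question the paper's experiments address.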

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP21H05052.

Author information

Authors and Affiliations

Deloitte Tohmatsu Cyber LLC., Tokyo, 1000005, Japan
Chris Liu
Gakushuin University, Tokyo, 1718588, Japan
Katsuyuki Maeda, Junnosuke Takai & Kilho Shin
University of Tokyo, Tokyo, Japan
Keisuke Murota

Corresponding author

Correspondence to Chris Liu.

Editor information

Editors and Affiliations

Saga University, Saga, Japan
Kohei Arai

About this paper

Cite this paper

Liu, C., Maeda, K., Takai, J., Murota, K., Shin, K. (2024). Synthetic Data Generation Without Real Data: Uncovering Insights in Malware Detection. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 920. Springer, Cham. https://doi.org/10.1007/978-3-031-53963-3_17

DOI: https://doi.org/10.1007/978-3-031-53963-3_17
Published: 17 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53962-6
Online ISBN: 978-3-031-53963-3
eBook Packages: Intelligent Technologies and Robotics (R0)
