Synthetic Data Generation Without Real Data: Uncovering Insights in Malware Detection

  • Conference paper
Advances in Information and Communication (FICC 2024)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 920))

Abstract

The use of synthetic data for training machine learning (ML) models has garnered significant attention among researchers as a potential solution to the challenge of balancing privacy protection and data utilization. This paper introduces a novel approach for generating synthetic data that specifically addresses this challenge. Unlike existing methods that focus on closely replicating real data distributions, our proposed approach aims to generate synthetic data without directly using real data, while still enabling the training of ML models to extract specific bits of information. This can be achieved by leveraging only general knowledge about the problem domain, acquired without accessing real data. We applied this approach to the task of malware detection and conducted experiments to evaluate its effectiveness. The results not only validated the efficacy of our proposed approach but also led to a significant discovery in the field of malware detection.

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP21H05052.

Author information

Corresponding author

Correspondence to Chris Liu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, C., Maeda, K., Takai, J., Murota, K., Shin, K. (2024). Synthetic Data Generation Without Real Data: Uncovering Insights in Malware Detection. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 920. Springer, Cham. https://doi.org/10.1007/978-3-031-53963-3_17
