Feature Extraction and Malware Detection on Large HTTPS Data Using MapReduce

  • Přemysl Čech
  • Jan Kohout
  • Jakub Lokoč
  • Tomáš Komárek
  • Jakub Maroušek
  • Tomáš Pevný
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9939)


Secure HTTP network traffic is an immense and challenging data source for machine learning tasks. These tasks usually aim to identify infected network nodes, given only the limited traffic features available for secure HTTP data. In this paper, we investigate the performance of grid histograms, which aggregate the traffic features of network nodes over snapshots consisting of just 5-minute batches. We compare the representation using linear and k-NN classifiers. We also demonstrate that all presented feature extraction and classification tasks can be implemented in a scalable way using the MapReduce approach.
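The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two per-flow features, the 4 x 4 grid, the log-scaled binning, the `MAX_LOG` bound, and all function names are assumptions made for illustration; only the 5-minute window comes from the abstract. The map step keys each record by (node, window) and the reduce step counts grid cells into a normalized histogram.

```python
from collections import defaultdict
import math

WINDOW_SECONDS = 300  # 5-minute batches, as stated in the abstract
BINS = 4              # assumed grid resolution per axis (hypothetical)
MAX_LOG = 8.0         # assumed upper bound on log10 of a feature value


def bin_index(value):
    """Map a non-negative value to a log-scaled bin in [0, BINS - 1]."""
    scaled = math.log10(1.0 + value) / MAX_LOG * BINS
    return min(BINS - 1, int(scaled))


def map_record(record):
    """Map step: emit ((node, window), (bin_x, bin_y)) for one flow record."""
    node, timestamp, feat_x, feat_y = record
    window = int(timestamp // WINDOW_SECONDS)
    return (node, window), (bin_index(feat_x), bin_index(feat_y))


def reduce_histogram(cells):
    """Reduce step: count cells into a flattened, normalized grid histogram."""
    hist = [0.0] * (BINS * BINS)
    for bin_x, bin_y in cells:
        hist[bin_x * BINS + bin_y] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def grid_histograms(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        key, cell = map_record(record)
        groups[key].append(cell)
    return {key: reduce_histogram(cells) for key, cells in groups.items()}
```

Each resulting vector describes one node in one 5-minute window and could then be passed to a linear or k-NN classifier. On a real cluster, the grouping done by `grid_histograms` would be handled by the framework's shuffle phase rather than by an in-memory dictionary.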


Hadoop, MapReduce, HTTPS data, Intrusion detection, Approximate similarity join



This project was supported by the GAČR 15-08916S and GAUK 201515 grants.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Přemysl Čech (1)
  • Jan Kohout (2)
  • Jakub Lokoč (1)
  • Tomáš Komárek (2)
  • Jakub Maroušek (1)
  • Tomáš Pevný (2)

  1. SIRET Research Group, Faculty of Mathematics and Physics, Department of Software Engineering, Charles University in Prague, Prague, Czech Republic
  2. FEE, Cognitive Research Center in Prague, Czech Technical University in Prague, Cisco Systems, Inc., Prague, Czech Republic
