EnCoD: Distinguishing Compressed and Encrypted File Fragments

Part of the Lecture Notes in Computer Science book series (LNCS, volume 12570)

Abstract

Reliable identification of encrypted file fragments is a requirement for several security applications, including ransomware detection, digital forensics, and traffic analysis. A popular approach consists of estimating high entropy as a proxy for randomness. However, many modern content types (e.g., office documents and media files) are highly compressed for storage and transmission efficiency. Compression algorithms also output high-entropy data, thus reducing the accuracy of entropy-based encryption detectors.

Over the years, a variety of approaches have been proposed to distinguish encrypted file fragments from high-entropy compressed fragments. However, these approaches are typically evaluated over only a few selected data types and fragment sizes, which makes a fair assessment of their practical applicability impossible. This paper aims to close this gap by comparing existing statistical tests on a large, standardized dataset. Our results show that current approaches cannot reliably tell apart encryption and compression, even for large fragment sizes. To address this issue, we design EnCoD, a learning-based classifier that can reliably distinguish compressed and encrypted data, starting with fragments as small as 512 bytes. We evaluate EnCoD against current approaches over a large dataset of different data types, showing that it outperforms the current state of the art for most considered fragment sizes and data types.


Acknowledgments

We would like to thank Daniele Venturi and Guinevere Gilman for their useful insights and comments. This work was supported by Gen4olive, a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101000427, and in part by the Italian MIUR through the Dipartimento di Informatica, Sapienza University of Rome, under Grant Dipartimenti di eccellenza 2018–2022.

Author information

Corresponding author

Correspondence to Dorjan Hitaj.

Appendix

A Entropy Analysis Results

Full results for the entropy analysis discussed in Sect. 2.4:

Chunk size: 512B
Format  Min    Q1     Median  Q3     Max
enc     7.427  7.569  7.591   7.613  7.709
zip     7.163  7.560  7.584   7.607  7.695
gzip    7.154  7.560  7.585   7.607  7.703
rar     7.381  7.563  7.587   7.610  7.692
jpeg    3.820  7.512  7.548   7.576  7.676
mp3     0.000  7.451  7.527   7.565  7.680
png     0.000  1.070  2.605   4.549  7.572
pdf     0.000  7.453  7.534   7.574  7.676

Chunk size: 2048B
Format  Min    Q1     Median  Q3     Max
enc     7.873  7.903  7.908   7.914  7.938
zip     7.816  7.898  7.904   7.910  7.935
gzip    7.847  7.898  7.904   7.910  7.933
rar     7.795  7.900  7.905   7.911  7.933
jpeg    5.123  7.856  7.873   7.884  7.917
mp3     0.379  7.703  7.838   7.871  7.916
png     0.000  1.312  2.815   4.752  7.808
pdf     0.000  7.820  7.875   7.893  7.930

Chunk size: 8192B
Format  Min    Q1     Median  Q3     Max
enc     7.969  7.976  7.978   7.979  7.984
zip     7.955  7.973  7.975   7.976  7.983
gzip    7.955  7.973  7.975   7.976  7.983
rar     7.960  7.974  7.976   7.977  7.983
jpeg    5.646  7.930  7.945   7.952  7.967
mp3     0.497  7.789  7.918   7.942  7.971
png     0.014  1.451  2.963   4.852  7.914
pdf     0.010  7.903  7.953   7.968  7.981
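
The kind of per-chunk statistic tabulated above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the entropy is the Shannon entropy of the byte-value distribution in bits per byte, it uses `os.urandom` as a stand-in for encrypted data, and the function names (`byte_entropy`, `five_number_summary`, `random_chunk_summary`) are hypothetical.

```python
import math
import os
import statistics
from collections import Counter


def byte_entropy(chunk: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    if not chunk:
        return 0.0
    n = len(chunk)
    # Sum p * log2(1/p) over the byte values that actually occur in the chunk.
    return sum((c / n) * math.log2(n / c) for c in Counter(chunk).values())


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def random_chunk_summary(chunk_size: int, num_chunks: int = 2000):
    """Summarize the entropy of random chunks (assumed proxy for encrypted data)."""
    entropies = [byte_entropy(os.urandom(chunk_size)) for _ in range(num_chunks)]
    return five_number_summary(entropies)


if __name__ == "__main__":
    print("Size   Min    Q1     Median Q3     Max")
    for size in (512, 2048, 8192):
        row = random_chunk_summary(size)
        print(f"{size:<6} " + " ".join(f"{v:.3f}" for v in row))
```

Under these assumptions, small chunks score well below the 8-bit ceiling even when the bytes are uniformly random, because 512 samples cannot populate all 256 byte values evenly. This is consistent with the enc rows above, where the median rises toward 8 as the chunk size grows.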

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

De Gaspari, F., Hitaj, D., Pagnotta, G., De Carli, L., Mancini, L.V. (2020). EnCoD: Distinguishing Compressed and Encrypted File Fragments. In: Kutyłowski, M., Zhang, J., Chen, C. (eds) Network and System Security. NSS 2020. Lecture Notes in Computer Science, vol 12570. Springer, Cham. https://doi.org/10.1007/978-3-030-65745-1_3

  • DOI: https://doi.org/10.1007/978-3-030-65745-1_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65744-4

  • Online ISBN: 978-3-030-65745-1

  • eBook Packages: Computer Science, Computer Science (R0)