The duplication issue within the Drebin dataset

  • Paul Irolla
  • Alexandre Dey


The Drebin dataset (NDSS, 2014) is the most widely distributed academic dataset of Android malware, and therefore the most used in research papers on Android malware detection, where the research community relies on it to evaluate and compare algorithms. We discovered that 49.35% of the samples in this dataset have at least one other sample that is a repackaged version containing exactly the same sequence of opcodes. In all cases, the only differences between the original malware and its duplicates are the embedded resources and some strings in the code. To assess the performance of malware detectors or classifiers, part of the dataset is held out as a testing set. As a result, a major part of the testing set ends up being the same samples as those used in the training set. This situation can lead us, the research community, to overrate the performance of the algorithms we design. In the worst case, it leads to wrong conclusions and wrong directions for future research. We then conduct an experiment in which we test several classification algorithms on the Drebin dataset with and without the duplicates. Our results show that, depending on the classifier, evaluating on the full dataset leads to a moderately (124%) to strongly (172%) underrated inaccuracy, and that the performance ranking of the algorithms changes. Finally, we provide the list of unique malware samples from the Drebin dataset, available on GitHub.
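The duplicate criterion described above (identical opcode sequences, with resources and strings ignored) can be illustrated with a minimal sketch. This is not the authors' tooling: it assumes the opcode sequence of each sample has already been extracted by some disassembler, and the function names, the toy sample names, and the choice of SHA-256 as a fingerprint are all hypothetical.

```python
import hashlib
from collections import defaultdict


def opcode_fingerprint(opcodes):
    """Hash an opcode sequence (hypothetical helper).

    Only opcode mnemonics are hashed, so two samples that differ in
    resources or string constants get the same fingerprint.
    """
    joined = "\n".join(opcodes).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def group_by_opcodes(samples):
    """Map each fingerprint to the names of the samples sharing it."""
    groups = defaultdict(list)
    for name, opcodes in samples.items():
        groups[opcode_fingerprint(opcodes)].append(name)
    return dict(groups)


def duplicated_fraction(groups):
    """Fraction of samples that have at least one opcode-identical twin."""
    total = sum(len(members) for members in groups.values())
    duplicated = sum(len(m) for m in groups.values() if len(m) > 1)
    return duplicated / total if total else 0.0


def unique_samples(groups):
    """Keep one representative per group (here: the smallest name)."""
    return sorted(min(members) for members in groups.values())


# Toy data (assumption): pre-extracted opcode sequences per sample.
samples = {
    "a.apk": ["const-string", "invoke-virtual", "return-void"],
    "b.apk": ["const-string", "invoke-virtual", "return-void"],
    "c.apk": ["new-instance", "invoke-direct", "return-void"],
    "d.apk": ["const/4", "return"],
}

groups = group_by_opcodes(samples)
print(duplicated_fraction(groups))
print(unique_samples(groups))
```

Splitting into training and testing sets only after reducing the dataset to one representative per group is one way to keep opcode-identical samples from appearing on both sides of the split.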


Keywords: Android, Malware detection, Machine learning, Dataset

Supplementary material

11416_2018_316_MOESM1_ESM.txt (193 kb)
Supplementary material 1 (txt 192 KB)


  1. Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., Siemens, C.: Drebin: effective and explainable detection of android malware in your pocket. In: NDSS (2014)
  2. Bell, C.: Mutual information and maximal correlation as measures of dependence. Ann. Math. Stat. pp. 587–595 (1962)
  3. Dimjašević, M., Atzeni, S., Ugrina, I., Rakamaric, Z.: Evaluation of android malware detection based on system calls. In: Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics, pp. 1–8. ACM (2016)
  4. Gonzalez, H., Stakhanova, N., Ghorbani, A.A.: Droidkin: Lightweight detection of android apps similarity. In: International Conference on Security and Privacy in Communication Systems, pp. 436–453. Springer (2014)
  5. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  6. McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Safaei, Y., Trickel, E., Zhao, Z., Doupe, A., et al.: Deep android malware detection. In: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pp. 301–308. ACM (2017)

Copyright information

© Springer-Verlag France SAS, part of Springer Nature 2018

Authors and Affiliations

  1. Laboratoire de cryptologie et virologie opérationnelles (CVOLab), École d’ingénieurs du monde numérique (ESIEA), Laval, France
  2. École d’ingénieurs du monde numérique (ESIEA), Laval, France
