
Explainable Multiple Instance Learning with Instance Selection Randomized Trees

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12976)

Abstract

Multiple Instance Learning (MIL) aims at extracting patterns from a collection of samples, where individual samples (called bags) are represented by a group of multiple feature vectors (called instances) instead of a single feature vector. Grouping instances into bags not only helps to formulate some learning problems more naturally, but also significantly reduces label acquisition costs, as labels are needed only for the bags, not for the inner instances. However, in application domains where inference transparency is demanded, such as network security, the sample attribution requirements are often asymmetric with respect to the training/application phase. While in the training phase it is very convenient to supply labels only for bags, in the application phase it is generally not enough to provide decisions on the bag level, because the inferred verdicts need to be explained on the level of individual instances. Unfortunately, the majority of recent MIL classifiers do not address this real-world need. In this paper, we address this problem and propose a new tree-based MIL classifier able to identify the instances responsible for positive bag predictions. Results from an empirical evaluation on a large-scale network security dataset also show that the classifier achieves superior performance compared with prior-art methods.
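The bag/instance layout described in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration: the linear scorer below is a stand-in for any instance-level model, not the paper's ISRT classifier, and all names and numbers are invented for illustration.

```python
import numpy as np

# Each bag is a variable-size set of instance feature vectors;
# a training label exists only on the bag level.
rng = np.random.default_rng(0)

bags = [rng.normal(size=(n, 4)) for n in (3, 5, 2)]  # three bags, 4 features each
bag_labels = [0, 1, 0]                               # no instance-level labels

def instance_scores(bag, w):
    """Score every instance in a bag; a positive bag verdict can then be
    explained by pointing at its highest-scoring instances."""
    return bag @ w

w = rng.normal(size=4)
for bag, y in zip(bags, bag_labels):
    scores = instance_scores(bag, w)
    culprit = int(np.argmax(scores))  # instance that would explain a positive bag
```

The asymmetry from the abstract shows up here directly: training consumes only `bag_labels`, while explanation at application time has to come from per-instance scores.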


Notes

  1.

    For example, a seemingly legitimate request to google.com might in reality be related to malicious activity when it is issued by malware checking Internet connectivity. Similarly, requesting ad servers in low volumes is considered legitimate behavior, but a higher volume might indicate a click-fraud infection.

  2.

    The term extremely in Extremely Randomized Trees [11] corresponds to setting \(T=1\).

  3.

    We used the implementation from https://github.com/komartom/BLRT.jl.

  4.

    We used the implementation from https://github.com/CTUAvastLab/Mill.jl.

  5.

    MI-SVM is trained with Algorithm 1 on the complete feature space (\(\mathbf {s}\) is a vector of ones).

  6.

    36 virtual Intel Xeon CPUs @ 2.9 GHz and 60 GB of memory.

  7.

    It was shown in the BLRT work [14], and we confirm it for ISRT in Sect. 4.2, that tuning these parameters usually does not bring any additional performance.

  8.

    While precision answers the question “How large a share of the raised alarms will be false alarms that network administrators have to deal with?”, the false positive rate answers “How large a share of clean users will be bothered?”.

  9.

    This way of identifying malicious communications is less effective in production, since new threats are not yet on the deny list and first need to be discovered.

  10.

    Datasets are accessible at https://doi.org/10.6084/m9.figshare.6633983.v1.

  11.

    AUC is agnostic to class imbalance and to the classifier’s decision threshold value.

  12.

    The best model is assigned the lowest rank (i.e. one).

  13.

    The performance of any two classifiers is significantly different if their average ranks differ by at least the critical difference, which is approximately 1.35 for 12 datasets, four methods, and \(\alpha =0.05\).

  14.

    Source codes are available at https://github.com/komartom/ISRT.jl.
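The distinction drawn in footnote 8 between precision and false positive rate is easy to make concrete numerically. The confusion-matrix counts below are invented for illustration, not taken from the paper:

```python
# Illustrative counts for a detector in a setting with many clean users.
tp, fp, fn, tn = 90, 10, 30, 9_870

precision = tp / (tp + fp)          # share of raised alarms that are genuine
false_alarm_share = 1 - precision   # what administrators deal with in vain
fpr = fp / (fp + tn)                # share of clean users that get bothered
```

With a large clean population, the false positive rate can look tiny (here about 0.1%) while the share of false alarms among raised alarms is still 10% — which is why the two questions in footnote 8 need separate metrics.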
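The critical difference quoted in footnote 13 follows from the Nemenyi post-hoc test used with the Friedman ranking of [8]; the value can be checked by plugging in the numbers (the critical value \(q_\alpha \approx 2.569\) for four methods at \(\alpha = 0.05\) is taken from the standard Nemenyi table):

```python
from math import sqrt

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
# for k compared methods over N datasets.
q_alpha = 2.569
k, N = 4, 12

cd = q_alpha * sqrt(k * (k + 1) / (6 * N))  # ~1.35, matching the footnote
```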

References

  1. Amores, J.: Multiple instance classification: review, taxonomy and comparative study. Artif. Intell. 201, 81–105 (2013). http://dx.doi.org/10.1016/j.artint.2013.06.003

  2. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Proceedings of the 15th International Conference on Neural Information Processing Systems, pp. 577–584. NIPS 2002. MIT Press, Cambridge, MA, USA (2002). http://dl.acm.org/citation.cfm?id=2968618.2968690

  3. Brabec, J., Komárek, T., Franc, V., Machlica, L.: On model evaluation under non-constant class imbalance. In: Krzhizhanovskaya, V.V., et al. (eds.) ICCS 2020. LNCS, vol. 12140, pp. 74–87. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50423-6_6


  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). http://dx.doi.org/10.1023/A:1010933404324

  5. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)


  6. Carbonneau, M.A., Cheplygina, V., Granger, E., Gagnon, G.: Multiple instance learning: a survey of problem characteristics and applications. Pattern Recogn. 77, 329–353 (2018). https://www.sciencedirect.com/science/article/pii/S0031320317304065

  7. Cheplygina, V., Tax, D.M.J.: Characterizing multiple instance datasets. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 15–27. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_2


  8. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006). http://dl.acm.org/citation.cfm?id=1248547.1248548

  9. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1), 31–71 (1997). http://www.sciencedirect.com/science/article/pii/S0004370296000343

  10. Franc, V., Sofka, M., Bartos, K.: Learning detector of malicious network traffic from weak labels. In: Bifet, A., et al. (eds.) ECML PKDD 2015. LNCS (LNAI), vol. 9286, pp. 85–99. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23461-8_6


  11. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1

  12. Ho, T.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844 (1998)


  13. Kohout, J., Komárek, T., Čech, P., Bodnár, J., Lokoč, J.: Learning communication patterns for malware discovery in https data. Expert Syst. Appl. 101, 129–142 (2018). http://www.sciencedirect.com/science/article/pii/S0957417418300794

  14. Komárek, T., Somol, P.: Multiple instance learning with bag-level randomized trees. In: Berlingerio, M., Bonchi, F., Gärtner, T., Hurley, N., Ifrim, G. (eds.) ECML PKDD 2018. LNCS (LNAI), vol. 11051, pp. 259–272. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10925-7_16


  15. Li, K., Chen, R., Gu, L., Liu, C., Yin, J.: A method based on statistical characteristics for detection malware requests in network traffic. In: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 527–532 (2018). https://doi.org/10.1109/DSC.2018.00084

  16. Machlica, L., Bartos, K., Sofka, M.: Learning detectors of malicious web requests for intrusion detection in network traffic (2017)


  17. Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L.: TESSERACT: eliminating experimental bias in malware classification across space and time. In: 28th USENIX Security Symposium (USENIX Security 19), pp. 729–746. USENIX Association, Santa Clara, CA, August 2019. https://www.usenix.org/conference/usenixsecurity19/presentation/pendlebury

  18. Pevny, T., Somol, P.: Discriminative models for multi-instance problems with tree structure. In: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, pp. 83–91. AISec 2016. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2996758.2996761

  19. Pevný, T., Somol, P.: Using neural network formalism to solve multiple-instance problems. In: Cong, F., Leung, A., Wei, Q. (eds.) ISNN 2017. LNCS, vol. 10261, pp. 135–142. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59072-1_17


  20. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986). http://dx.doi.org/10.1023/A:1022643204877

  21. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, pp. 807–814. ICML 2007. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1273496.1273598

  22. Stiborek, J., Pevný, T., Rehák, M.: Multiple instance learning for malware classification. Expert Syst. Appl. 93, 346–357 (2018). http://www.sciencedirect.com/science/article/pii/S0957417417307170


Author information


Corresponding author

Correspondence to Tomáš Komárek.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Komárek, T., Brabec, J., Somol, P. (2021). Explainable Multiple Instance Learning with Instance Selection Randomized Trees. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol. 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_44


  • DOI: https://doi.org/10.1007/978-3-030-86520-7_44


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86519-1

  • Online ISBN: 978-3-030-86520-7

  • eBook Packages: Computer Science (R0)
