
Explainable Multiple Instance Learning with Instance Selection Randomized Trees

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12976)

Abstract

Multiple Instance Learning (MIL) aims at extracting patterns from a collection of samples, where individual samples (called bags) are represented by a group of multiple feature vectors (called instances) instead of a single feature vector. Grouping instances into bags not only helps to formulate some learning problems more naturally, but also significantly reduces label acquisition costs, as labels are needed only for the bags, not for the inner instances. However, in application domains where inference transparency is demanded, such as network security, the sample attribution requirements are often asymmetric with respect to the training/application phase. While in the training phase it is very convenient to supply labels only for bags, in the application phase it is generally not enough to provide decisions on the bag level, because the inferred verdicts need to be explained on the level of individual instances. Unfortunately, the majority of recent MIL classifiers do not address this real-world need. In this paper, we address this problem and propose a new tree-based MIL classifier able to identify the instances responsible for positive bag predictions. Results from an empirical evaluation on a large-scale network security dataset also show that the classifier achieves superior performance compared with prior-art methods.
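The bag/instance layout described in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration: the linear scorer below is a stand-in for any instance-level model, not the paper's ISRT classifier, and all names and numbers are invented for illustration.

```python
import numpy as np

# Each bag is a variable-size set of instance feature vectors;
# a training label exists only on the bag level.
rng = np.random.default_rng(0)

bags = [rng.normal(size=(n, 4)) for n in (3, 5, 2)]  # three bags, 4 features each
bag_labels = [0, 1, 0]                               # no instance-level labels

def instance_scores(bag, w):
    """Score every instance in a bag; a positive bag verdict can then be
    explained by pointing at its highest-scoring instances."""
    return bag @ w

w = rng.normal(size=4)
for bag, y in zip(bags, bag_labels):
    scores = instance_scores(bag, w)
    culprit = int(np.argmax(scores))  # instance that would explain a positive bag
```

The asymmetry from the abstract shows up here directly: training consumes only `bag_labels`, while explanation at application time has to come from per-instance scores.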


Notes

  1.

    For example, a seemingly legitimate request to google.com might in reality be related to malicious activity when it is issued by malware checking Internet connectivity. Similarly, requesting ad servers in low volumes is considered legitimate behavior, but a higher volume might indicate a click-fraud infection.

  2.

    The term extremely in Extremely Randomized Trees [11] corresponds to setting \(T=1\).

  3.

    We used the implementation from https://github.com/komartom/BLRT.jl.

  4.

    We used the implementation from https://github.com/CTUAvastLab/Mill.jl.

  5.

    MI-SVM is trained with Algorithm 1 on the complete feature space (\(\mathbf {s}\) is a vector of ones).

  6.

    36 virtual Intel Xeon CPUs @ 2.9 GHz and 60 GB of memory.

  7.

    It was shown in the BLRT work [14], and we confirm it for ISRT in Sect. 4.2, that tuning these parameters usually does not bring any additional performance.

  8.

    While precision answers the question “How large a share of the raised alarms will be false alarms that network administrators have to deal with?”, the false positive rate answers “How large a share of clean users will be bothered?”.

  9.

    This way of identifying malicious communications is less effective in production, since new threats are not yet on the deny list and first need to be discovered.

  10.

    Datasets are accessible at https://doi.org/10.6084/m9.figshare.6633983.v1.

  11.

    AUC is agnostic to class imbalance and to the classifier’s decision threshold value.

  12.

    The best model is assigned the lowest rank (i.e. one).

  13.

    The performance of any two classifiers is significantly different if their average ranks differ by at least the critical difference, which is approximately 1.35 for 12 datasets, four methods, and \(\alpha =0.05\).

  14.

    Source codes are available at https://github.com/komartom/ISRT.jl.
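The distinction drawn in footnote 8 between precision and false positive rate is easy to make concrete numerically. The confusion-matrix counts below are invented for illustration, not taken from the paper:

```python
# Illustrative counts for a detector in a setting with many clean users.
tp, fp, fn, tn = 90, 10, 30, 9_870

precision = tp / (tp + fp)          # share of raised alarms that are genuine
false_alarm_share = 1 - precision   # what administrators deal with in vain
fpr = fp / (fp + tn)                # share of clean users that get bothered
```

With a large clean population, the false positive rate can look tiny (here about 0.1%) while the share of false alarms among raised alarms is still 10% — which is why the two questions in footnote 8 need separate metrics.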
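The critical difference quoted in footnote 13 follows from the Nemenyi post-hoc test used with the Friedman ranking of [8]; the value can be checked by plugging in the numbers (the critical value \(q_\alpha \approx 2.569\) for four methods at \(\alpha = 0.05\) is taken from the standard Nemenyi table):

```python
from math import sqrt

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
# for k compared methods over N datasets.
q_alpha = 2.569
k, N = 4, 12

cd = q_alpha * sqrt(k * (k + 1) / (6 * N))  # ~1.35, matching the footnote
```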

References

  1. Amores, J.: Multiple instance classification: review, taxonomy and comparative study. Artif. Intell. 201, 81–105 (2013). http://dx.doi.org/10.1016/j.artint.2013.06.003

  2. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Proceedings of the 15th International Conference on Neural Information Processing Systems, pp. 577–584. NIPS 2002. MIT Press, Cambridge, MA, USA (2002). http://dl.acm.org/citation.cfm?id=2968618.2968690

  3. Brabec, J., Komárek, T., Franc, V., Machlica, L.: On model evaluation under non-constant class imbalance. In: Krzhizhanovskaya, V.V., et al. (eds.) ICCS 2020. LNCS, vol. 12140, pp. 74–87. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50423-6_6


  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). http://dx.doi.org/10.1023/A:1010933404324

  5. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC Press, Boca Raton (1984)


  6. Carbonneau, M.A., Cheplygina, V., Granger, E., Gagnon, G.: Multiple instance learning: a survey of problem characteristics and applications. Pattern Recogn. 77, 329–353 (2018). https://www.sciencedirect.com/science/article/pii/S0031320317304065

  7. Cheplygina, V., Tax, D.M.J.: Characterizing multiple instance datasets. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 15–27. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_2


  8. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006). http://dl.acm.org/citation.cfm?id=1248547.1248548

  9. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell. 89(1), 31–71 (1997). http://www.sciencedirect.com/science/article/pii/S0004370296000343

  10. Franc, V., Sofka, M., Bartos, K.: Learning detector of malicious network traffic from weak labels. In: Bifet, A., et al. (eds.) ECML PKDD 2015. LNCS (LNAI), vol. 9286, pp. 85–99. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23461-8_6


  11. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006). https://doi.org/10.1007/s10994-006-6226-1

  12. Ho, T.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844 (1998)


  13. Kohout, J., Komárek, T., Čech, P., Bodnár, J., Lokoč, J.: Learning communication patterns for malware discovery in https data. Expert Syst. Appl. 101, 129–142 (2018). http://www.sciencedirect.com/science/article/pii/S0957417418300794

  14. Komárek, T., Somol, P.: Multiple instance learning with bag-level randomized trees. In: Berlingerio, M., Bonchi, F., Gärtner, T., Hurley, N., Ifrim, G. (eds.) ECML PKDD 2018. LNCS (LNAI), vol. 11051, pp. 259–272. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10925-7_16


  15. Li, K., Chen, R., Gu, L., Liu, C., Yin, J.: A method based on statistical characteristics for detection malware requests in network traffic. In: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 527–532 (2018). https://doi.org/10.1109/DSC.2018.00084

  16. Machlica, L., Bartos, K., Sofka, M.: Learning detectors of malicious web requests for intrusion detection in network traffic (2017)


  17. Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., Cavallaro, L.: TESSERACT: eliminating experimental bias in malware classification across space and time. In: 28th USENIX Security Symposium (USENIX Security 19), pp. 729–746. USENIX Association, Santa Clara, CA, August 2019. https://www.usenix.org/conference/usenixsecurity19/presentation/pendlebury

  18. Pevny, T., Somol, P.: Discriminative models for multi-instance problems with tree structure. In: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, pp. 83–91. AISec 2016. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2996758.2996761

  19. Pevný, T., Somol, P.: Using neural network formalism to solve multiple-instance problems. In: Cong, F., Leung, A., Wei, Q. (eds.) ISNN 2017. LNCS, vol. 10261, pp. 135–142. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59072-1_17


  20. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986). http://dx.doi.org/10.1023/A:1022643204877

  21. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, pp. 807–814. ICML 2007. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1273496.1273598

  22. Stiborek, J., Pevný, T., Rehák, M.: Multiple instance learning for malware classification. Expert Syst. Appl. 93, 346–357 (2018). http://www.sciencedirect.com/science/article/pii/S0957417417307170


Author information


Corresponding author

Correspondence to Tomáš Komárek.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Komárek, T., Brabec, J., Somol, P. (2021). Explainable Multiple Instance Learning with Instance Selection Randomized Trees. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol. 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_44


  • DOI: https://doi.org/10.1007/978-3-030-86520-7_44


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86519-1

  • Online ISBN: 978-3-030-86520-7

  • eBook Packages: Computer Science (R0)
