Abstract
This paper presents a technique that aims to increase human understanding of trace clustering solutions. The clustering techniques under scrutiny stem from the process mining domain, where clustering process instances is considered a useful way to analyse process data that exhibit a large variety of behaviour. Until now, clustering solutions in this domain have mostly been assessed by visual inspection of the clustering results. This paper proposes a more thorough approach based on the post hoc application of supervised learning with support vector machines to the cluster results. Our approach learns concise rules that describe why a specific instance is included in a certain cluster, based on control-flow-based feature variables. An extensive experimental evaluation shows that our technique outperforms alternatives. In addition, we are able to identify features that lead to shorter and more accurate explanations.
Notes
Consider \({\textit{Exists}}(a)\) in a case where each trace starts with activity a. This feature would then correspond to a column of ones in the constructed data set; it can contain no discriminating information and is therefore a redundant feature. In the same data set, with activity a at the start of each trace, \({\textit{Exists}}(b)\) and \({\textit{SometimesWeaklyFollows}}(a,b)\) will be perfectly correlated, making one of these features redundant.
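The redundancy described in this note can be illustrated with a small sketch. It is not the plugin's implementation: the toy traces, the helper names, and the assumed semantics of SometimesWeaklyFollows(a,b) (some b occurs after some a in the trace) are chosen for illustration only. Identical columns are used here as the simplest case of perfect correlation.

```python
def exists(trace, a):
    """1 if activity a occurs in the trace, else 0."""
    return int(a in trace)


def sometimes_weakly_follows(trace, a, b):
    """Assumed semantics: 1 if some b occurs after some a in the trace."""
    if a not in trace:
        return 0
    first_a = trace.index(a)
    return int(b in trace[first_a + 1:])


# Hypothetical event log in which every trace starts with activity "a".
traces = [
    ["a", "b", "c"],
    ["a", "c"],
    ["a", "c", "b"],
    ["a", "d"],
]

# One column per feature, one entry per trace.
features = {
    "Exists(a)": [exists(t, "a") for t in traces],
    "Exists(b)": [exists(t, "b") for t in traces],
    "SometimesWeaklyFollows(a,b)": [
        sometimes_weakly_follows(t, "a", "b") for t in traces
    ],
}


def redundant_features(columns):
    """Return constant columns and pairs of identical non-constant columns."""
    constant = [name for name, col in columns.items() if len(set(col)) == 1]
    seen = {}
    duplicates = []
    for name, col in columns.items():
        if name in constant:
            continue
        key = tuple(col)
        if key in seen:
            duplicates.append((seen[key], name))
        else:
            seen[key] = name
    return constant, duplicates


constant, duplicates = redundant_features(features)
print(constant)
print(duplicates)
```

In this toy log, Exists(a) is a column of ones, and Exists(b) coincides with SometimesWeaklyFollows(a,b), so one feature of that pair could be dropped without losing information.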
The plugin itself, screen captures and further explanation can be retrieved from: http://www.processmining.be/svmexplainer.
The event logs are available for download on http://www.processmining.be/svmexplainer/datasets.
Consider for example the explanation “\({\textit{SometimesDirectlyFollows}}(a,b) = 0\) AND \({\textit{SometimesDirectlyFollows}}(b,d) = 0\) AND \({\textit{SometimesDirectlyFollows}}(f,g) = 0\)”, which denotes that this instance would leave the cluster if the three attributes corresponding to the sometimes-directly-follows relations listed above were set to zero. The length of this explanation is thus equal to 3.
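To make the notion of explanation length concrete, the following sketch searches for the shortest set of active features that, once set to zero, moves an instance out of its cluster. It assumes a linear scoring function in which a positive score means "in the cluster"; the weights, bias, fourth feature, and exhaustive search strategy are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def score(weights, bias, x):
    """Linear decision value; a positive value means 'in the cluster' (assumed)."""
    return bias + sum(weights[f] * v for f, v in x.items())


def shortest_explanation(weights, bias, x):
    """Smallest set of active features whose zeroing makes the score non-positive."""
    if score(weights, bias, x) <= 0:
        return None  # the instance is not assigned to the cluster to begin with
    active = [f for f, v in x.items() if v == 1]
    for size in range(1, len(active) + 1):
        for subset in combinations(active, size):
            changed = dict(x)
            for f in subset:
                changed[f] = 0
            if score(weights, bias, changed) <= 0:
                return subset
    return None  # no subset of active features flips the assignment


# Hypothetical weights of a linear classifier for one cluster.
weights = {
    "SometimesDirectlyFollows(a,b)": 4,
    "SometimesDirectlyFollows(b,d)": 4,
    "SometimesDirectlyFollows(f,g)": 4,
    "SometimesDirectlyFollows(c,e)": 1,
}
bias = -2

# Hypothetical instance in which all four relations occur.
instance = {name: 1 for name in weights}

explanation = shortest_explanation(weights, bias, instance)
print(" AND ".join(f"{f} = 0" for f in explanation))
print("length:", len(explanation))
```

With these made-up weights, no one or two features suffice to flip the assignment, so the shortest explanation found consists of the three relations from the note and has length 3. Exhaustive enumeration is exponential in the number of active features; it is used here only because it keeps the sketch short.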
Additional information
Responsible editor: Toon Calders.
About this article
Cite this article
De Koninck, P., De Weerdt, J. & vanden Broucke, S.K.L.M. Explaining clusterings of process instances. Data Min Knowl Disc 31, 774–808 (2017). https://doi.org/10.1007/s10618-016-0488-4
DOI: https://doi.org/10.1007/s10618-016-0488-4