Explaining clusterings of process instances

Data Mining and Knowledge Discovery

Abstract

This paper presents a technique that aims to increase human understanding of trace clustering solutions. The clustering techniques under scrutiny stem from the process mining domain, where the clustering of process instances is deemed a useful technique to analyse process data with a large variety of behaviour. Until now, the most commonly used method for inspecting clustering solutions in this domain has been visual inspection of the clustering results. This paper proposes a more thorough approach based on the post hoc application of supervised learning with support vector machines on cluster results. Our approach learns concise rules that describe why a specific instance is included in a certain cluster, based on specific control-flow-based feature variables. An extensive experimental evaluation is presented showing that our technique outperforms alternatives. In addition, we are able to identify features that lead to shorter and more accurate explanations.
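To make the idea of an instance-level explanation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the classifier trained on the cluster labels has a linear decision function over binary features, and that an explanation is a smallest set of active features which, when set to zero, makes the instance leave its cluster. The weights, the bias, the feature names and the exhaustive search are all hypothetical choices made for illustration.

```python
from itertools import combinations


def in_cluster(weights, bias, x):
    """Assumed linear decision rule: the instance belongs to the cluster if the score is positive."""
    return sum(w * v for w, v in zip(weights, x)) + bias > 0


def shortest_explanation(weights, bias, x, names):
    """Return a smallest set of active features whose removal flips the decision.

    Exhaustive search by increasing subset size; fine for a toy example,
    but exponential in the number of active features.
    """
    active = [i for i, v in enumerate(x) if v == 1]
    for size in range(1, len(active) + 1):
        for subset in combinations(active, size):
            changed = list(x)
            for i in subset:
                changed[i] = 0  # pretend this behaviour was absent from the trace
            if not in_cluster(weights, bias, changed):
                return [names[i] for i in subset]
    return None  # no subset of active features moves the instance out of the cluster


# Hypothetical feature names, weights and bias (not taken from the paper).
names = [
    "SometimesDirectlyFollows(a,b)",
    "SometimesDirectlyFollows(b,d)",
    "SometimesDirectlyFollows(f,g)",
    "Exists(c)",
]
weights = [1.0, 1.0, 1.0, 0.5]
bias = -0.75
instance = [1, 1, 1, 1]

explanation = shortest_explanation(weights, bias, instance, names)
print(" AND ".join(f"{name} = 0" for name in explanation))
print("explanation length:", len(explanation))
```

In this toy setting no single feature or pair of features is enough to change the decision, so the search returns a conjunction of three conditions.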

Notes

  1. Consider \({\textit{Exists}}(a)\) in a case where each trace starts with activity a: this feature would correspond to a column of ones in the constructed data set. Hence, it contains no discriminating information and is redundant. In the same data set, with activity a at the start of each trace, \({\textit{Exists}}(b)\) and \({\textit{SometimesWeaklyFollows}}(a,b)\) will be perfectly correlated, making one of these features redundant.

  2. http://www.promtools.org/prom6/.

  3. The plugin itself, screen captures and further explanation can be retrieved from: http://www.processmining.be/svmexplainer.

  4. The event logs are available for download on http://www.processmining.be/svmexplainer/datasets.

  5. Consider for example the explanation “\({\textit{SometimesDirectlyFollows}}(a,b) = 0\) AND \({\textit{SometimesDirectlyFollows}}(b,d) = 0\) AND \({\textit{SometimesDirectlyFollows}}(f,g) = 0\)”, denoting that this instance would leave the cluster if the three attributes corresponding to the sometimes directly follows relations listed above were set to zero. The length of this explanation is thus equal to 3.
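The redundancy argument in note 1 can be checked on a toy event log. The sketch below rests on assumptions: the feature definitions are one plausible reading of the feature names (the exact definitions are not given here), the traces are invented, and "perfectly correlated" is simplified to "identical binary columns".

```python
from itertools import combinations


def exists(trace, a):
    """Assumed reading of Exists(a): 1 if activity a occurs in the trace."""
    return int(a in trace)


def sometimes_weakly_follows(trace, a, b):
    """Assumed reading: 1 if some b occurs anywhere after some a in the trace."""
    if a not in trace:
        return 0
    return int(b in trace[trace.index(a) + 1:])


def redundant_features(columns):
    """Find constant columns and pairs of identical non-constant columns."""
    constant = [name for name, col in columns.items() if len(set(col)) == 1]
    duplicates = [
        (n1, n2)
        for (n1, c1), (n2, c2) in combinations(columns.items(), 2)
        if c1 == c2 and len(set(c1)) > 1
    ]
    return constant, duplicates


# Invented toy log in which every trace starts with activity a.
traces = [("a", "b", "c"), ("a", "c"), ("a", "c", "b"), ("a", "d")]

columns = {
    "Exists(a)": [exists(t, "a") for t in traces],
    "Exists(b)": [exists(t, "b") for t in traces],
    "Exists(c)": [exists(t, "c") for t in traces],
    "SometimesWeaklyFollows(a,b)": [sometimes_weakly_follows(t, "a", "b") for t in traces],
}

constant, duplicates = redundant_features(columns)
print("constant columns:", constant)
print("identical columns:", duplicates)
```

Because every trace starts with a, the Exists(a) column is all ones, and b occurs exactly when it weakly follows a, so one feature of that pair can be dropped without losing information.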

Author information

Corresponding author

Correspondence to Pieter De Koninck.

Additional information

Responsible editor: Toon Calders.

Appendix

This appendix contains Tables 9 and 10 and Figs. 7–15.

Table 9 Results of the experimental evaluation comparing SECPI with C4.5 and RIPPER averaged over clustering techniques and datasets, replicated with a cluster number of 6
Table 10 Results of the experimental evaluation comparing SECPI with C4.5 and RIPPER averaged over clustering techniques and datasets, replicated with a cluster number of 8
Fig. 7 Average accuracy (higher is better) and explanation length results (lower is better) for telecom log

Fig. 8 Average accuracy (higher is better) and explanation length results (lower is better) for purchase log

Fig. 9 Average accuracy (higher is better) and explanation length results (lower is better) for admin log

Fig. 10 Average accuracy (higher is better) and explanation length results (lower is better) for tender log

Fig. 11 Average accuracy (higher is better) and explanation length results (lower is better) for incident log

Fig. 12 Average accuracy (higher is better) and explanation length results (lower is better) for cProblemVolvo log

Fig. 13 Average accuracy (higher is better) and explanation length results (lower is better) for environment log

Fig. 14 Average accuracy (higher is better) and explanation length results (lower is better) for incman log

Fig. 15 Average accuracy (higher is better) and explanation length results (lower is better) for reviewing log

About this article

Cite this article

De Koninck, P., De Weerdt, J. & vanden Broucke, S.K.L.M. Explaining clusterings of process instances. Data Min Knowl Disc 31, 774–808 (2017). https://doi.org/10.1007/s10618-016-0488-4

  • DOI: https://doi.org/10.1007/s10618-016-0488-4
