Similarity-Based Approaches for Determining the Number of Trace Clusters in Process Discovery

  • Pieter De Koninck
  • Jochen De Weerdt
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10470)


Given the complexity of real-life event logs, several trace clustering techniques have been proposed to partition an event log into subsets with a lower degree of variation. In general, these techniques assume that the number of clusters is known in advance, which is rarely the case in practice. This paper therefore presents approaches to determine the appropriate number of clusters in a trace clustering context. Two similarity-based approaches are proposed: a stability-based method and a separation-based method. The stability-based method iteratively calculates the similarity between clustered versions of perturbed and unperturbed event logs. The separation-based method instead relies on between-cluster dissimilarity, or separation. Both approaches are tested on multiple real-life datasets to investigate the complementarity of the different components. Our results suggest that both methods are successful in identifying an appropriate number of trace clusters.
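
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy log, the Jaccard distance over activity sets, the farthest-first medoid clusterer, the event-dropping perturbation, and the Rand index as the similarity between clusterings are all assumptions chosen only to keep the example small and self-contained.

```python
import random
from itertools import combinations

# Hypothetical toy event log: each trace is a tuple of activity labels.
LOG = [("a", "b", "c"), ("a", "c", "b"), ("a", "b", "c", "c"),
       ("x", "y", "z"), ("x", "z", "y"), ("x", "y", "y", "z")]

def jaccard_distance(t1, t2):
    """Assumed trace distance: 1 - Jaccard similarity of the activity sets."""
    a, b = set(t1), set(t2)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def cluster(log, k):
    """Stand-in clusterer (assumption): farthest-first medoids, nearest-medoid assignment."""
    medoids = [0]
    while len(medoids) < k:
        # Pick the trace that is farthest from its closest current medoid.
        nxt = max(range(len(log)),
                  key=lambda i: min(jaccard_distance(log[i], log[m]) for m in medoids))
        medoids.append(nxt)
    return [min(range(k), key=lambda c: jaccard_distance(t, log[medoids[c]]))
            for t in log]

def perturb(log, p, rng):
    """Assumed perturbation: drop each event independently with probability p."""
    return [tuple(e for e in t if rng.random() >= p) for t in log]

def rand_index(a, b):
    """Fraction of trace pairs on which two clusterings agree (needs >= 2 traces)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def stability(log, k, p=0.1, iterations=20, seed=0):
    """Mean similarity between the clustering of the original log and of perturbed copies."""
    rng = random.Random(seed)
    base = cluster(log, k)
    scores = [rand_index(base, cluster(perturb(log, p, rng), k))
              for _ in range(iterations)]
    return sum(scores) / len(scores)

def separation(log, labels):
    """Mean distance between traces assigned to different clusters."""
    cross = [jaccard_distance(log[i], log[j])
             for i, j in combinations(range(len(log)), 2)
             if labels[i] != labels[j]]
    return sum(cross) / len(cross) if cross else 0.0

if __name__ == "__main__":
    for k in range(2, 5):
        labels = cluster(LOG, k)
        print(k, round(stability(LOG, k), 3), round(separation(LOG, labels), 3))
```

Under these assumptions, one would compare the stability and separation scores across candidate values of k and favour a value where both are high. The paper's own perturbation strategy, clustering techniques, and similarity metrics may differ from the stand-ins used here.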


Keywords: Stability; Trace clustering; Validity; Log perturbation; Process discovery; Separation



Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. Research Center for Management Informatics, Faculty of Economics and Business, KU Leuven, Leuven, Belgium
