Creating Unbiased Public Benchmark Datasets with Data Leakage Prevention for Predictive Process Monitoring

Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 436)

Abstract

Advances in AI, and especially machine learning, are increasingly drawing research interest and efforts towards predictive process monitoring, the subfield of process mining (PM) that concerns predicting next events, process outcomes and remaining execution times. Unfortunately, researchers use a variety of datasets and ways to split them into training and test sets. The documentation of these preprocessing steps is not always complete. Consequently, research results are hard or even impossible to reproduce and to compare between papers. At times, the use of non-public domain knowledge further hampers the fair competition of ideas. Often the training and test sets are not completely separated, a data leakage problem particular to predictive process monitoring. Moreover, test sets usually suffer from bias in terms of both the mix of case durations and the number of running cases. These obstacles pose a challenge to the field’s progress. The contribution of this paper is to identify and demonstrate the importance of these obstacles and to propose preprocessing steps to arrive at unbiased benchmark datasets in a principled way, thus creating representative test sets without data leakage with the aim of levelling the playing field, promoting open science and contributing to more rapid progress in predictive process monitoring.
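The leakage problem described above arises when a case that is still running at the split point contributes events to both the training and the test set. As a minimal sketch of one way to keep the two sets fully separated, the following assumes a strict temporal split in which any case overlapping the split time is discarded. The function name `strict_temporal_split`, the in-memory log layout and the toy timestamps are illustrative assumptions, not the preprocessing procedure proposed in the paper.

```python
from datetime import datetime


def strict_temporal_split(cases, split_time):
    """Split cases into train/test around split_time without overlap.

    cases: dict mapping a case id to the list of its event timestamps.
    Hypothetical helper for illustration only.
    """
    train, test, dropped = [], [], []
    for case_id, timestamps in cases.items():
        start, end = min(timestamps), max(timestamps)
        if end < split_time:
            # Case finished before the split: safe for training.
            train.append(case_id)
        elif start >= split_time:
            # Case started at or after the split: safe for testing.
            test.append(case_id)
        else:
            # Case spans the split point and would leak information.
            dropped.append(case_id)
    return train, test, dropped


# Toy event log (assumed data, three cases).
cases = {
    "A": [datetime(2021, 1, 1), datetime(2021, 1, 10)],
    "B": [datetime(2021, 1, 20), datetime(2021, 2, 5)],
    "C": [datetime(2021, 2, 3), datetime(2021, 2, 9)],
}
train, test, dropped = strict_temporal_split(cases, datetime(2021, 2, 1))
print(train, test, dropped)
```

Discarding the overlapping cases removes the leakage but, taken alone, does nothing about the bias in case durations and in the number of running cases that the abstract also mentions; a test set built this way tends to under-represent long cases.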

Keywords

  • Predictive process monitoring
  • Remaining time prediction
  • Bias
  • Benchmarking
  • Reproducibility
  • Datasets
  • Preprocessing

Notes

  1. https://gluebenchmark.com/

  2. https://arxiv.org/abs/2004.07219

  3. http://yann.lecun.com/exdb/mnist/

  4. https://www.cs.toronto.edu/~kriz/cifar.html

  5. https://image-net.org/

  6. https://github.com/hansweytjens/predictive-process-monitoring-benchmarks

  7. https://data.4tu.nl/articles/dataset/BPI_Challenge_2012/12689204

  8. https://data.4tu.nl/collections/BPI_Challenge_2015/5065424

  9. https://data.4tu.nl/articles/dataset/BPI_Challenge_2017/12696884

  10. https://data.4tu.nl/articles/dataset/BPI_Challenge_2019/12715853

  11. https://data.4tu.nl/collections/BPI_Challenge_2020/5065541

Author information

Corresponding author

Correspondence to Hans Weytjens.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Weytjens, H., De Weerdt, J. (2022). Creating Unbiased Public Benchmark Datasets with Data Leakage Prevention for Predictive Process Monitoring. In: Marrella, A., Weber, B. (eds) Business Process Management Workshops. BPM 2021. Lecture Notes in Business Information Processing, vol 436. Springer, Cham. https://doi.org/10.1007/978-3-030-94343-1_2

  • DOI: https://doi.org/10.1007/978-3-030-94343-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-94342-4

  • Online ISBN: 978-3-030-94343-1

  • eBook Packages: Computer Science, Computer Science (R0)