Automated Business Process Discovery from Unstructured Natural-Language Documents

  • Conference paper
Business Process Management Workshops (BPM 2020)

Abstract

Understanding the processes followed by organizations is important to ensure business outcomes are achieved in an optimal, efficient and compliant manner. Process mining techniques rely on the existence of structured event logs captured by process management systems. These systems are not always employed and may not capture all process steps, leaving out those that occur through emails and chat software or edits to documents and knowledge-management systems.

Here we present an algorithm for the automated extraction of processes from unstructured natural-language documents. Action and topic analysis is used to generate an event log, from which process models are mined using standard techniques. We show the algorithm is capable of generating consistent software-development processes from an Apache Camel email dataset.
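The pipeline described above (messages, then an event log, then a mined model) can be illustrated with a minimal sketch. This is not the paper's algorithm: the toy messages, the case identifiers, the hand-picked verb list and the helper names `extract_action`, `build_event_log` and `directly_follows` are all assumptions made for illustration, and keyword matching stands in for the action and topic analysis. The final step counts directly-follows relations, one common basis for process discovery.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical toy messages: (case_id, timestamp, text). In a real pipeline the
# case (topic) and the action would come from topic and action analysis; here
# the case is given and the action is matched against an assumed verb list.
MESSAGES = [
    ("CAMEL-1", "2019-01-01T09:00", "Please review the attached patch"),
    ("CAMEL-1", "2019-01-01T08:00", "I will create an issue for the bug"),
    ("CAMEL-1", "2019-01-02T10:00", "I will merge the patch now"),
    ("CAMEL-2", "2019-01-03T09:00", "We should create a ticket"),
    ("CAMEL-2", "2019-01-04T09:00", "Can someone review this change"),
]
ACTION_VERBS = ("create", "review", "merge")  # assumed vocabulary


def extract_action(text):
    """Return the first known action verb found in the text, or None."""
    for word in text.lower().split():
        if word in ACTION_VERBS:
            return word
    return None


def build_event_log(messages):
    """Map each case to its activities, ordered by timestamp."""
    events_by_case = defaultdict(list)
    for case_id, stamp, text in messages:
        action = extract_action(text)
        if action is not None:
            events_by_case[case_id].append((datetime.fromisoformat(stamp), action))
    return {
        case: [activity for _, activity in sorted(events)]
        for case, events in events_by_case.items()
    }


def directly_follows(event_log):
    """Count how often one activity is immediately followed by another."""
    counts = Counter()
    for trace in event_log.values():
        counts.update(zip(trace, trace[1:]))
    return counts


if __name__ == "__main__":
    log = build_event_log(MESSAGES)
    for pair, count in directly_follows(log).items():
        print(pair, count)
```

In practice, an event log of this shape (case, activity, timestamp) is what standard process-mining tools consume, so the hand-written counting step could be replaced by an off-the-shelf miner.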

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

Author information

Corresponding author

Correspondence to Alexander J. Chambers.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Chambers, A.J. et al. (2020). Automated Business Process Discovery from Unstructured Natural-Language Documents. In: Del Río Ortega, A., Leopold, H., Santoro, F.M. (eds) Business Process Management Workshops. BPM 2020. Lecture Notes in Business Information Processing, vol 397. Springer, Cham. https://doi.org/10.1007/978-3-030-66498-5_18

  • DOI: https://doi.org/10.1007/978-3-030-66498-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66497-8

  • Online ISBN: 978-3-030-66498-5

  • eBook Packages: Computer Science (R0)
