Abstract
Understanding the processes followed by organizations is important to ensure business outcomes are achieved in an optimal, efficient and compliant manner. Process mining techniques rely on the existence of structured event logs captured by process management systems. These systems are not always employed and may not capture all process steps, leaving out those that occur through emails and chat software or edits to documents and knowledge-management systems.
Here we present an algorithm for the automated extraction of processes from unstructured natural-language documents. Action and topic analysis is used to generate an event log, from which process models are mined using standard techniques. We show that the algorithm can generate consistent software-development processes from an Apache Camel email dataset.
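To make the event-log step concrete, the following is a minimal sketch of how a directly-follows graph, a common starting point for standard process-mining techniques, could be counted from an event log. It is an illustration under stated assumptions and not the authors' implementation: the log rows, the activity labels, and the `directly_follows` helper are all hypothetical, and each case is assumed to correspond to one email thread.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case_id, timestamp, activity).
# Assumption: a case is an email thread and an activity is a label derived
# from action and topic analysis. These rows are invented for illustration.
EVENT_LOG = [
    ("thread-1", 1, "report bug"),
    ("thread-1", 2, "submit patch"),
    ("thread-1", 3, "review patch"),
    ("thread-2", 1, "report bug"),
    ("thread-2", 2, "submit patch"),
    ("thread-2", 3, "review patch"),
    ("thread-2", 4, "merge patch"),
    ("thread-3", 2, "submit patch"),  # deliberately out of order
    ("thread-3", 1, "report bug"),
]


def directly_follows(event_log):
    """Count how often one activity is immediately followed by another
    within the same case, after ordering each case by timestamp."""
    traces = defaultdict(list)
    for case_id, timestamp, activity in event_log:
        traces[case_id].append((timestamp, activity))

    counts = Counter()
    for events in traces.values():
        ordered = [activity for _, activity in sorted(events)]
        # Pair each activity with its immediate successor in the trace.
        counts.update(zip(ordered, ordered[1:]))
    return counts


dfg = directly_follows(EVENT_LOG)
```

Here `dfg` maps each (predecessor, successor) pair of activities to the number of times it occurs across cases. A process-mining library would typically take an event log of this shape and derive a process model from such counts.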
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Chambers, A.J. et al. (2020). Automated Business Process Discovery from Unstructured Natural-Language Documents. In: Del Río Ortega, A., Leopold, H., Santoro, F.M. (eds) Business Process Management Workshops. BPM 2020. Lecture Notes in Business Information Processing, vol 397. Springer, Cham. https://doi.org/10.1007/978-3-030-66498-5_18
DOI: https://doi.org/10.1007/978-3-030-66498-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66497-8
Online ISBN: 978-3-030-66498-5
eBook Packages: Computer Science, Computer Science (R0)