Automated Business Process Discovery from Unstructured Natural-Language Documents

Chambers, Alexander J.; Stringfellow, Amy M.; Luo, Ben B.; Underwood, Sophie J.; Allard, Tony G.; Johnston, Ian A.; Brockman, Sarah; Shing, Leslie; Wollaber, Allan; VanDam, Courtland

doi:10.1007/978-3-030-66498-5_18

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 397)

Included in the following conference series:

International Conference on Business Process Management

Abstract

Understanding the processes followed by organizations is important to ensure business outcomes are achieved in an optimal, efficient and compliant manner. Process mining techniques rely on the existence of structured event logs captured by process management systems. These systems are not always employed and may not capture all process steps, leaving out those that occur through emails and chat software or edits to documents and knowledge-management systems.

Here we present an algorithm for the automated extraction of processes from unstructured natural-language documents. Action and topic analysis is used to generate an event log, from which process models are mined using standard techniques. We show the algorithm is capable of generating consistent software-development processes from an Apache Camel email dataset.
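
The pipeline described above (label each document, assemble an event log, mine a process model) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword lookups are hypothetical stand-ins for the paper's action and topic analysis, treating an email thread as a process case is an assumption, and directly-follows counting is used only as one simple example of a standard mining step.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical keyword tables; a real system would use proper action and
# topic analysis rather than a word lookup.
ACTION_KEYWORDS = {"report": "report", "fix": "fix", "review": "review"}
TOPIC_KEYWORDS = {"bug": "bug", "patch": "patch", "release": "release"}


def label(text, keywords, default):
    """Return the label of the first keyword found in the text, else default."""
    for word in text.lower().split():
        if word in keywords:
            return keywords[word]
    return default


def build_event_log(messages):
    """Turn (thread_id, timestamp, text) tuples into event-log rows.

    Assumption: each email thread is one process case, and the activity
    name is the pair "action:topic".
    """
    log = []
    for thread_id, timestamp, text in messages:
        action = label(text, ACTION_KEYWORDS, "other")
        topic = label(text, TOPIC_KEYWORDS, "general")
        log.append({"case": thread_id, "activity": f"{action}:{topic}", "time": timestamp})
    return log


def directly_follows(log):
    """Count how often one activity directly follows another within a case."""
    traces = defaultdict(list)
    for event in sorted(log, key=lambda e: e["time"]):
        traces[event["case"]].append(event["activity"])
    counts = Counter()
    for activities in traces.values():
        counts.update(zip(activities, activities[1:]))
    return counts


# Invented example messages, for illustration only.
messages = [
    ("t1", datetime(2019, 1, 1, 9), "I want to report a bug in the router"),
    ("t1", datetime(2019, 1, 1, 10), "Here is a fix for the bug"),
    ("t1", datetime(2019, 1, 2, 9), "Please review the patch"),
    ("t2", datetime(2019, 1, 3, 9), "report another bug"),
    ("t2", datetime(2019, 1, 3, 11), "fix attached for that bug"),
]
event_log = build_event_log(messages)
dfg = directly_follows(event_log)
for (source, target), count in dfg.items():
    print(f"{source} -> {target}: {count}")
```

The resulting counts form a directly-follows graph, which is a common starting point for process discovery; an event log in this case/activity/timestamp shape could equally be handed to an existing process-mining toolkit.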

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

Author information

Authors and Affiliations

Defence Science and Technology, Edinburgh, SA, 5111, Australia
Alexander J. Chambers, Amy M. Stringfellow, Ben B. Luo, Sophie J. Underwood, Tony G. Allard & Ian A. Johnston
MIT Lincoln Laboratory, Lexington, MA, 02421, USA
Sarah Brockman, Leslie Shing, Allan Wollaber & Courtland VanDam

Corresponding author

Correspondence to Alexander J. Chambers.

Editor information

Editors and Affiliations

Universidad de Sevilla, Seville, Spain
Adela Del Río Ortega
Kühne Logistics University, Hamburg, Germany
Henrik Leopold
Universidade do Estado do Rio de Janeiro – UERJ, Rio de Janeiro, Brazil
Flávia Maria Santoro

About this paper

Cite this paper

Chambers, A.J. et al. (2020). Automated Business Process Discovery from Unstructured Natural-Language Documents. In: Del Río Ortega, A., Leopold, H., Santoro, F.M. (eds) Business Process Management Workshops. BPM 2020. Lecture Notes in Business Information Processing, vol 397. Springer, Cham. https://doi.org/10.1007/978-3-030-66498-5_18

DOI: https://doi.org/10.1007/978-3-030-66498-5_18
Published: 19 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66497-8
Online ISBN: 978-3-030-66498-5
eBook Packages: Computer Science, Computer Science (R0)
