Empirical Software Engineering, Volume 23, Issue 6, pp 3630–3683

Semi-automatic rule-based domain terminology and software feature-relevant information extraction from natural language user manuals

An approach and evaluation at Roche Diagnostics GmbH
  • Thomas Quirchmayr
  • Barbara Paech
  • Roland Kohl
  • Hannes Karey
  • Gunar Kasdepke


Mature software systems comprise a vast number of heterogeneous system capabilities, which are usually requested by different groups of stakeholders and which evolve over time. Software features describe and logically bundle low-level capabilities on an abstract level and thus provide a structured and comprehensive overview of the capabilities of a software system. Software features are often not explicitly managed. On the contrary, feature-relevant information is often spread across several software engineering artifacts (e.g., user manuals, issue tracking systems). Identifying and extracting feature-relevant information from these artifacts in order to make feature knowledge explicit requires considerable manual effort. In this paper we present a two-step approach to extract feature-relevant information from a user manual: First, we semi-automatically extract a domain terminology from a natural language user manual based on linguistic patterns. Then, we apply natural language processing techniques based on the extracted domain terminology and structural sentence information. Our approach is able to extract atomic feature-relevant information with an F1-score of at least 92.00%. We describe the implementation of the approach as well as evaluations based on example sections of a user manual taken from industry.
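To make the idea of pattern-based terminology extraction and F1-based evaluation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes tokens that are already tagged with Penn Treebank part-of-speech tags, uses one hypothetical linguistic pattern (runs of adjectives and nouns that end in a noun), and scores the result against a made-up gold list. The example sentence, the gold terms, and the function names are illustrative assumptions only.

```python
from typing import List, Set, Tuple

Tagged = Tuple[str, str]  # (token, Penn Treebank POS tag)


def extract_term_candidates(tagged: List[Tagged]) -> List[str]:
    """Collect maximal runs of adjectives/nouns that end in a noun.

    This single pattern is an assumption for illustration; the paper's
    actual linguistic patterns are not reproduced here.
    """
    candidates: List[str] = []
    run: List[Tagged] = []
    # A sentinel with a non-matching tag flushes the final run.
    for token, tag in tagged + [("", "END")]:
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((token, tag))
            continue
        # Drop trailing adjectives so every candidate ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            candidates.append(" ".join(tok for tok, _ in run))
        run = []
    return candidates


def f1_score(extracted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over two sets of terms."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, hand-tagged sentence in the style of a user manual.
sentence: List[Tagged] = [
    ("The", "DT"), ("user", "NN"), ("selects", "VBZ"), ("the", "DT"),
    ("sample", "NN"), ("rack", "NN"), ("in", "IN"), ("the", "DT"),
    ("main", "JJ"), ("menu", "NN"), (".", "."),
]
terms = extract_term_candidates(sentence)

# Hypothetical gold terminology, as a domain expert might provide it.
gold_terms = {"sample rack", "main menu", "rack position"}
score = f1_score(set(terms), gold_terms)
```

In a semi-automatic setting such as the one described above, the candidate list would be reviewed by a domain expert before it is used in the second step; the F1 computation only illustrates how extracted items can be compared with an expert-provided reference.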


Software feature, Terminology extraction, Atomic information extraction, NLP



We would like to thank Roche Diagnostics GmbH for the financial support of this research project. Many thanks also to the GDC experts for their participation in the case study and valuable discussions of the results.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Institute for Computer Science, University of Heidelberg, Heidelberg, Germany
  2. Roche Diagnostics GmbH, Mannheim, Germany
