Automated Software Engineering, Volume 23, Issue 4, pp 649–686

Concept extraction from business documents for software engineering projects

  • Pierre André Ménard
  • Sylvie Ratté


Acquiring relevant business concepts is a crucial first step for any software project in which the software experts are not domain experts. The wealth of information buried within an organization’s written documentation is a precious source of concepts, relationships and attributes that can be used to model the enterprise’s domain. Without targeted extraction tools, perusing this type of resource can be a lengthy and costly process. We propose a domain-model-focused extraction process aimed at the rapid discovery of knowledge relevant to the software expert. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters that are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from the source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. We also introduce a new metric that assesses the discovery speed of relevant concepts, and we present the annotation of a gold-standard definition of software-engineering-oriented concepts for knowledge extraction tasks.
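The filter-then-reorder idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the filters or the weight propagation algorithm, so everything here is a simplified, hypothetical stand-in. The names `positive_filter`, `negative_filter`, `rank_candidates`, the stop list `NEGATIVE_TERMS` and the `title_boost` parameter are assumptions made for illustration. The only "structural hint" used is whether a term also appears in its section title.

```python
from collections import defaultdict

# Hypothetical stop list standing in for a negative base filter.
NEGATIVE_TERMS = {"the", "and", "for", "page", "figure"}


def positive_filter(tokens):
    """Keep tokens that look like candidate concepts.

    Assumption: a candidate is purely alphabetic and longer than two characters.
    """
    return [t for t in tokens if t.isalpha() and len(t) > 2]


def negative_filter(candidates):
    """Drop candidates that appear in the (hypothetical) stop list."""
    return [c for c in candidates if c.lower() not in NEGATIVE_TERMS]


def rank_candidates(sections, title_boost=2.0):
    """Order candidates by weight, highest first.

    sections: list of (title, body) string pairs.
    Each occurrence in a body adds 1.0; an occurrence whose term also
    appears in the section title adds an extra title_boost. This is a
    crude substitute for weight propagation from structural hints.
    """
    weights = defaultdict(float)
    for title, body in sections:
        title_terms = {
            t.lower() for t in negative_filter(positive_filter(title.split()))
        }
        for term in negative_filter(positive_filter(body.split())):
            key = term.lower()
            weights[key] += 1.0
            if key in title_terms:
                weights[key] += title_boost
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(weights, key=lambda k: (-weights[k], k))
```

With a single section titled "Invoice processing", a term such as "invoice" that occurs in both the title and the body is pushed ahead of terms that occur only in the body, which is the reordering effect the abstract describes in general terms.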


Automated extraction; Conceptual model; Domain model; Relevance evaluation; Software project; Knowledge modeling



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. École de technologie supérieure, Montreal, Canada
