Generalizability of Document Features for Identifying Rationale

Abstract

One of the challenges in using statistical machine learning for text mining is choosing the right set of text features. We have developed a system that uses genetic algorithms (GAs) to evaluate candidate feature sets for classifying sentences in a document. We applied this tool to find design rationale (the reasons behind design decisions) in two different datasets, Chrome bug reports and transcripts of design sessions, both to evaluate our approach for finding rationale and to see how features might differ for the same classification target in different types of data. We found that a smaller set of features, those common to the sets optimized for each document type, gave results with less overfitting.
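As a rough illustration of the idea in the abstract, the sketch below evolves a bitmask over candidate features with a simple GA and then keeps only the features selected for both document types. It is a minimal sketch under stated assumptions, not the authors' system: the abstract does not specify the fitness function, GA operators, or feature names, so `toy_fitness`, `USEFUL`, the tournament selection, the one-point crossover, and all parameter values here are hypothetical stand-ins. In practice the fitness would presumably be some measure of classifier performance on annotated sentences.

```python
import random

# Hypothetical indices of informative features (an assumption for the demo).
USEFUL = {0, 2, 5}


def toy_fitness(mask):
    """Stand-in fitness: reward useful features, penalize feature count."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    return hits - 0.25 * sum(mask)


def tournament(population, fitness, rng):
    """Pick the fitter of two distinct, randomly chosen individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve_feature_mask(n_features, fitness, generations=30, pop_size=20,
                        mutation_rate=0.05, seed=0):
    """Return the best feature bitmask found (requires n_features >= 2)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=fitness)
    return best


def common_features(names, mask_a, mask_b):
    """Names of features selected in both masks (set intersection)."""
    return [n for n, a, b in zip(names, mask_a, mask_b) if a and b]


if __name__ == "__main__":
    # Hypothetical feature names; the same toy fitness is reused for both
    # "document types" purely to keep the demo short.
    names = [f"feature_{i}" for i in range(8)]
    bug_report_mask = evolve_feature_mask(8, toy_fitness, seed=1)
    transcript_mask = evolve_feature_mask(8, toy_fitness, seed=2)
    print(common_features(names, bug_report_mask, transcript_mask))
```

Because the best mask is carried into each new generation, the best fitness never decreases from one generation to the next. Taking the intersection of the masks optimized per document type mirrors the abstract's observation that the shared, smaller feature set overfit less.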

Keywords

  • Information Gain
  • Text Mining
  • Linguistic Feature
  • Sentence Length
  • Machine Learning Classifier

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Acknowledgements

We would like to thank Miami graduate students John Malloy and Jennifer Flowers for their work in annotating the SPSD data. The design sessions that produced the SPSD data were funded by the National Science Foundation (Award CCF-0845840). We would like to thank the workshop organizers, André van der Hoek, Marian Petre, and Alex Baker, for granting access to the transcripts. We would also like to thank Dr. Mike Zmuda for suggesting that we move the information gain calculation outside of the GA. This work was supported by NSF CAREER Award CCF-0844638 (Burge). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Author information

Correspondence to Janet E. Burge.

Copyright information

© 2017 Springer International Publishing Switzerland

About this paper

Cite this paper

Rogers, B., Justice, C., Mathur, T., Burge, J.E. (2017). Generalizability of Document Features for Identifying Rationale. In: Gero, J. (eds) Design Computing and Cognition '16. Springer, Cham. https://doi.org/10.1007/978-3-319-44989-0_34

  • DOI: https://doi.org/10.1007/978-3-319-44989-0_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44988-3

  • Online ISBN: 978-3-319-44989-0

  • eBook Packages: Engineering, Engineering (R0)