Abstract
One of the challenges in using statistical machine learning for text mining is choosing the right set of text features. We have developed a system that uses genetic algorithms (GAs) to evaluate candidate feature sets for classifying sentences in a document. We applied this tool to find design rationale (the reasons behind design decisions) in two datasets, Chrome bug reports and transcripts of design sessions, both to evaluate our approach for finding rationale and to see how features might differ for the same classification target in different types of data. We found that a smaller set of features, those common to the sets optimized for each document type, gave results with less overfitting.
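The general idea of GA-based feature-set search, followed by keeping only the features common to the sets found for each document type, can be sketched roughly as follows. This is a minimal illustration and not the system described in the paper: the names `evolve_feature_mask`, `common_features`, and `toy_fitness`, the tournament selection, the one-point crossover, and all parameter values are assumptions made for the example. In a real setting the fitness function would presumably score a classifier trained on the selected features; here a toy stand-in is used.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Mask = List[bool]  # Mask[i] is True when feature i is selected


def evolve_feature_mask(
    n_features: int,
    fitness: Callable[[Sequence[bool]], float],
    population_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Tuple[Mask, List[float]]:
    """Search for a high-fitness feature mask with a simple GA.

    Assumes n_features >= 2 and population_size >= 3.
    Returns the best mask and the best fitness seen at each step.
    """
    rng = random.Random(seed)
    population = [
        [rng.random() < 0.5 for _ in range(n_features)]
        for _ in range(population_size)
    ]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        # Elitism: carry the current best mask over unchanged.
        next_population = [list(best)]
        while len(next_population) < population_size:
            # Tournament selection of two parents (tournament size 3).
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # One-point crossover.
            cut = rng.randrange(1, n_features)
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation.
            child = [
                (not bit) if rng.random() < mutation_rate else bit
                for bit in child
            ]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


def common_features(*masks: Sequence[bool]) -> Set[int]:
    """Indices of features selected in every mask (needs at least one mask)."""
    selections = [{i for i, bit in enumerate(m) if bit} for m in masks]
    return set.intersection(*selections)


def toy_fitness(mask: Sequence[bool]) -> float:
    """Hypothetical stand-in: reward features 0, 2, 5; penalize mask size."""
    useful = {0, 2, 5}
    hits = sum(1.0 for i, bit in enumerate(mask) if bit and i in useful)
    return hits - 0.1 * sum(mask)


if __name__ == "__main__":
    # Two runs stand in for two document types (here only the seed differs).
    mask_a, _ = evolve_feature_mask(8, toy_fitness, seed=1)
    mask_b, _ = evolve_feature_mask(8, toy_fitness, seed=2)
    print(sorted(common_features(mask_a, mask_b)))
```

Because the best mask is copied into each new generation, the recorded best fitness never decreases. Intersecting the masks found for each dataset mirrors the abstract's observation that a smaller, shared feature set overfits less, although the actual selection procedure in the paper may differ.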
Keywords
- Information Gain
- Text Mining
- Linguistic Feature
- Sentence Length
- Machine Learning Classifier
Acknowledgements
We would like to thank Miami graduate students John Malloy and Jennifer Flowers for their work in annotating the SPSD data. The design sessions that produced the SPSD data were funded by the National Science Foundation (Award CCF-0845840). We would like to thank the workshop organizers, André van der Hoek, Marian Petre, and Alex Baker for granting access to the transcripts. We would also like to thank Dr. Mike Zmuda for suggesting we move the information gain calculation outside of the GA. This work was supported by NSF CAREER Award CCF-0844638 (Burge). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).
Copyright information
© 2017 Springer International Publishing Switzerland
About this paper
Cite this paper
Rogers, B., Justice, C., Mathur, T., Burge, J.E. (2017). Generalizability of Document Features for Identifying Rationale. In: Gero, J. (eds) Design Computing and Cognition '16. Springer, Cham. https://doi.org/10.1007/978-3-319-44989-0_34
DOI: https://doi.org/10.1007/978-3-319-44989-0_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44988-3
Online ISBN: 978-3-319-44989-0
eBook Packages: Engineering, Engineering (R0)