Application of mutual information-based sequential feature selection to ISBSG mixed data
Few studies in the Software Development Effort Estimation (SDEE) literature address feature selection (FS) techniques that handle both categorical and continuous features. This paper addresses the problem of selecting the most relevant features from the ISBSG (International Software Benchmarking Standards Group) dataset for use in SDEE. The aim is to show the usefulness of splitting the ranked list of features produced by a mutual information-based sequential FS approach into two lists, one for categorical and one for continuous features. These lists are later recombined according to the accuracy of a case-based reasoning model. Four FS algorithms are compared using a complete dataset with 621 projects and 12 features from ISBSG. The algorithms differ in two respects. First, two of them consider relevance only, while the other two maximize relevance and also minimize redundancy between each candidate feature and the already selected features. Second, the algorithms that do not discriminate between continuous and categorical features use a single list, whereas those that differentiate them use two lists that are later combined. The algorithms that use two lists perform better than those that use one. It is therefore worthwhile to consider two separate lists of features so that categorical features may be selected more frequently. We also suggest preferring the Application Group, Project Elapsed Time, and First Data Base System features over the more frequently used Development Type, Language Type, and Development Platform.
Keywords: Feature selection; Mutual information; ISBSG; Software development effort estimation; k-nearest neighbor
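The ranking step described in the abstract can be illustrated with a minimal sketch. It assumes all features are already discretized, uses a plain empirical mutual information estimate, and scores candidates as relevance minus mean redundancy; the abstract does not specify the estimator, the discretization, or the exact criterion, so these are assumptions. The function names and the toy values are hypothetical and are not taken from the ISBSG data.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_features(features, target, use_redundancy=True):
    """Greedy sequential ranking of features (dict: name -> discrete values).

    With use_redundancy=False the score is relevance only, I(f; target).
    With use_redundancy=True the score is relevance minus the mean mutual
    information with the features already ranked (an mRMR-style criterion,
    assumed here for illustration).
    """
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    remaining, ranked = set(features), []

    def score(f):
        if not use_redundancy or not ranked:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in ranked
        ) / len(ranked)
        return relevance[f] - redundancy

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def split_ranking(ranked, categorical):
    """Split one ranked list into categorical and continuous lists, keeping order."""
    cat = [f for f in ranked if f in categorical]
    cont = [f for f in ranked if f not in categorical]
    return cat, cont


# Invented toy data: 8 projects, effort already binned into two classes.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "app_group": [0, 0, 0, 0, 1, 1, 1, 0],  # categorical
    "language":  [0, 0, 0, 1, 1, 1, 1, 0],  # categorical
    "size_bin":  [0, 0, 1, 0, 1, 1, 0, 1],  # continuous, pre-discretized
    "noise":     [0, 1, 0, 1, 0, 1, 0, 1],  # continuous, pre-discretized
}
categorical = {"app_group", "language"}

relevance_only = rank_features(features, target, use_redundancy=False)
mrmr_style = rank_features(features, target, use_redundancy=True)
cat_list, cont_list = split_ranking(mrmr_style, categorical)
```

The recombination of the two lists according to the accuracy of a case-based reasoning model is not shown; it would require an estimator and an evaluation protocol that the abstract does not detail.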