Instance Ranking Using Data Complexity Measures for Training Set Selection

  • Junaid Alam
  • T. Sobha Rani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11941)


A classifier’s performance depends on the training set provided for training, so training set selection holds an important place in the classification task: it can both improve the classifier’s performance and reduce training time. Selection can be carried out in various ways, such as instance-selection algorithms, data-handling techniques, cost-sensitive methods, ensembles and so on. In this work, one of the data complexity measures, the maximum Fisher’s discriminant ratio (F1), is used to identify good training instances. This measure discriminates any two classes along a single feature by comparing the class means and variances, and in particular quantifies the overlap between the classes. In the first phase, F1 of the whole data set is calculated. Then F1 is recomputed with each instance left out, using the leave-one-out method, to rank the instances. Finally, the instances that lower the F1 value are removed from the data set as a batch. A small F1 value indicates strong overlap between the classes, so removing the instances that cause the most overlap reduces the overlap further. The efficacy of the proposed reduction algorithm (DRF1) is demonstrated empirically using 4 different classifiers (Random Forest, Decision Tree C5.0, SVM and kNN) and 6 data sets (Pima, Musk, Sonar, Winequality (R and W) and Wisconsin). The results confirm that DRF1 leads to a promising improvement in kappa statistics and classification accuracy with training set selection using a data complexity measure. Approximately 18–50% reduction of the training set is achieved, along with a large reduction in training time.
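The two-phase procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ code: it assumes a two-class data set with labels 0 and 1, population variances, and the convention that an instance whose removal raises F1 is one that contributes to class overlap; the names `fisher_f1` and `drf1_select` are hypothetical.

```python
import numpy as np

def fisher_f1(X, y):
    """Maximum Fisher's discriminant ratio over all features,
    for a two-class problem with labels 0 and 1."""
    c0, c1 = X[y == 0], X[y == 1]
    num = (c0.mean(axis=0) - c1.mean(axis=0)) ** 2
    den = c0.var(axis=0) + c1.var(axis=0)  # population variances
    # Guard against zero-variance features to avoid division by zero.
    safe_den = np.where(den > 0, den, 1.0)
    ratios = np.where(den > 0, num / safe_den, 0.0)
    return ratios.max()

def drf1_select(X, y):
    """Rank instances by leave-one-out F1 and batch-remove those
    whose removal raises F1 (i.e. instances that cause overlap)."""
    base = fisher_f1(X, y)
    keep = []
    for i in range(len(X)):
        mask = np.ones(len(X), dtype=bool)
        mask[i] = False
        loo = fisher_f1(X[mask], y[mask])
        # If leaving instance i out does not raise F1, keep it.
        if loo <= base:
            keep.append(i)
    return np.array(keep)
```

On a toy one-feature data set with one class-0 point sitting near class 1, that stray point’s removal raises F1, so the batch removal drops it while the well-separated points are kept.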


Maximum Fisher’s discriminant ratio · Classification · Batch removal · Kappa statistics · Instance ranking



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
